[BioC] multiple locations for probeset in hgu133plus2CHRLOC vs. UCSC PSL data

Thu Nov 20 21:03:07 CET 2008

Hi Peter,

You can find the genome coordinates of a set of probes by using the
matchPDict()/countPDict() tool from Biostrings + the genome for hg18:

  library(hgu133plus2probe)
  library(BSgenome.Hsapiens.UCSC.hg18)
  probes <- DNAStringSet(
              hgu133plus2probe$sequence[hgu133plus2probe$Probe.Set.Name 
== "201268_at"]
            )

  pdict1 <- PDict(probes)
  nhits1 <- t(sapply(seqnames(Hsapiens),
                     function(name) countPDict(pdict1, Hsapiens[[name]])))

The 'nhits1' matrix summarizes the number of hits per chromosome (plus
strand only). See which chromosomes have hits with:

  > which(rowSums(nhits1) != 0)
  chr12 chr17
     12    17

For completeness, we should also search the minus strands of all 
chromosomes.
This can easily be done by taking the reverse complement sequences of the
probes with pdict2 <- PDict(reverseComplement(probes)) and then using the
same code as above.

Then use matchPDict() to get the coordinates of the hits in chr17:

  chr17_hits <- unlist(matchPDict(pdict1, Hsapiens$chr17))

If you want to know the genes where those hits occur, you can use
biomaRt as explained by Steffen on this list a few days ago:

  https://stat.ethz.ch/pipermail/bioconductor/2008-November/025208.html

Once you've retrieved all the genes coordinates, you can do the
following:

  # Extract the genes that belong to chr17 (plus strand only):
  chr17_genes <- geneLocs[geneLocs$chromosome_name == "17" & 
geneLocs$strand == 1, ]

  # Use overlap() from the IRanges package to find the genes
  # where the hits occur:
  > tree <- IntervalTree(IRanges(start=chr17_genes$start_position, 
end=chr17_genes$end_position))
  > overlap(tree, chr17_hits, multiple = FALSE)
  [1] 33 33 33 33 33 33 33 33 33 33

So the gene hit by probe 201268_at is:

  > chr17_genes[33, ]
      ensembl_gene_id strand chromosome_name start_position end_position
  954 ENSG00000011052      1              17       46585919     46604104

which is what is reported by the hgu133plus2ENSEMBL map:

  > hgu133plus2ENSEMBL[['201268_at']]
  [1] "ENSG00000011052"

Note that the same approach shows that the hits in chr12 are also within
a gene boundaries (gene ENSG00000123009) but I'll let more qualified people
comment about this.

Cheers,
H.

Bazeley, Peter wrote:
> Hello,
>
> R version: 2.8.0
>
> I just installed the hgu133plus2.db package, and am looking at the hgu133plus2CHRLOC environment. I've noticed that some of the probeset entries (e.g. "201268_at") have multiple locations compared to Affy's annotation file. I'd like to figure out if these multiple locations are current, in which case it is some sort of overlapping/repeating duplication. For example:
>
>> as.list(hgu133plus2CHRLOC)$'201268_at'
>       17       17       17       17 
> 46598879 46597889 46598637 46599081 
>
> indicates that the probeset maps to 4 locations. Compare this to the alignments info in the Affy's annotation file (from 7/8/08, http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-U133_Plus_2.na26.annot.csv.zip): 
>
> chr12:119204403-119205041 (+) // 91.49 // q24.31 /// chr17:46598810-46604103 (+) // 96.87 // q21.33
>
> which suggests one location on chromosome 17 (I'm ignoring chromosome 12 for now). This is a "_at" probeset, so it should map uniquely to a sequence, according to Affy's "Data Analysis Fundamentals" document (and speaking to a rep).
>
> >From the information provided by "?hgu133plus2CHRLOC", I downloaded 
> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/affyU133Plus2.txt.gz 
> from UCSC to see how this occured, but it is not clear. Actually, the file:
> http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-U133_Plus_2.link.psl.zip
> from Affy's support page has the same alignment info. Here's the relevant PSL info:
>
> Target sequence name: chr17
> Alignment start position in target: 46598810
> Alignment end position in target: 46604103
> Number of blocks in the alignment (a block contains no gaps): 5
> Comma-separated list of sizes of each block: 47,130,102,113,257,
> Comma-separated list of starting positions of each block in target: 46598810,46599186,46600601,46602296,46603846,
>
>
> The second location provided by CHRLOC (46597889) occurs before the start of the alignment in the PSL info, so perhaps this one CHRLOC location corresponds to the PSL alignment? The mappings were obtained from UCSC on 2006-Apr14, so perhaps additional alignments existed at the time, which have since been removed.
>
>
> Thank you for any help. Hopefully I'm just missing something obvious (well, non-obvious for me).
>
> Peter Bazeley
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319