[BioC] Mapping microarray probes to the genome using findOverlaps

Pages, Herve hpages at fhcrc.org
Tue Mar 1 10:20:01 CET 2011


Hi Ravi,

Sean's suggestion to map the probes directly to the transcriptome can
easily be achieved with the Biostrings/BSgenome/GenomicFeatures
infrastructure.

The code below uses vwhichPDict (a variant of matchPDict) and the danRer6
genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just
released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome package
for it yet.

Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression
chip so I'm using the zebrafish chip from Affymetrix instead.

Depending on your machine and the quality of your internet connection,
this code should run in 8-15 minutes. Less than 800 Mb of RAM is used.

  ## Load the probes:
  library(zebrafishprobe)
  library(Biostrings)
  probes <- DNAStringSet(zebrafishprobe)

  ## Compute the transcriptome (32992 transcripts):
  library(GenomicFeatures)
  txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene")
  library(BSgenome.Drerio.UCSC.danRer6)
  txseqs <- extractTranscriptsFromGenome(Drerio, txdb)
  ## Note that saving 'txseqs' with write.XStringSet() generates
  ## a FASTA file that is 51M.

  ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
  pdict2 <- PDict(probes, max.mismatch=2)
  m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2)
  makeMap <- function(m, probeids, txnames)
  {
    probeid <- probeids[unlist(m)]
    txname <- rep.int(txnames, elementLengths(m))
    ans <- unique(data.frame(probeid, txname))
    rownames(ans) <- NULL
    ans
  }
  map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs))

Note that the longest step is not the call to vwhichPDict() but the
extraction of the transcriptome.

However, the final result doesn't look that good. A quick look at the
mapping shows that only 62.4% of the probe ids are mapped and that some
probe ids are mapped to more than 1 transcript:

  > head(map)
             probeid             txname
  1 Dr.19494.1.S1_at ENSDART00000048610
  2 Dr.19494.1.S1_at ENSDART00000103479
  3 Dr.19494.1.S1_at ENSDART00000103478
  4 Dr.10343.1.S1_at ENSDART00000006449
  5 Dr.12057.1.A1_at ENSDART00000054814
  6 Dr.22271.1.A1_at ENSDART00000054814
  > length(unique(zebrafishprobe$Probe.Set.Name))
  [1] 15617
  > length(unique(map$probeid))
  [1] 9746
  > any(duplicated(map$probeid))
  [1] TRUE

But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K
Expression chip.

One more thing. While testing the above code, I discovered 2 problems
that were preventing it to work: one with the extractTranscriptsFromGenome()
function and one with the vwhichPDict() function. Those problems are now
fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & 1.3.13)
and Biostrings (v 2.18.3 & 2.19.12). These new versions will become available
via biocLite() in the next 24-36 hours.

Hope this helps.
H.


----- Original Message -----
From: "Ravi Karra" <ravi.karra at gmail.com>
To: "Sean Davis" <sdavis2 at mail.nih.gov>
Cc: bioconductor at r-project.org
Sent: Saturday, February 26, 2011 2:56:02 PM
Subject: Re: [BioC] Mapping microarray probes to the genome using	findOverlaps

Hi Sean, 
I could do that, but am not sure how to.  The annotated zebrafish genome is the Tubingen strain and the some of the probes on the array are from the EK and AB strains.   This means that I need to allow for SNP's in the alignment.  I originally tried to align the probes to the Ensembl Transcripts using matchPDict but when I allowed for 2 mismatches (max.mismatch = 2) across the probe sequences, my computer never stopped running the program!  I found TopHat to be much faster (8 min) and TopHat allows for a few nucleotide wobble by default!. 
Do you have a suggestion(s) on another way to align the array to the Ensembl Transcripts?
Thanks, 
Ravi




On Feb 26, 2011, at 5:45 PM, Sean Davis wrote:

> 
> 
> On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra at gmail.com> wrote:
> Hello,
> 
> I am still trying to map probes on the Nimblegen Zebrafish 12 x135K Expression to the Zv9 version of the zebrafish genome available from Ensembl!  I am very reluctantly pursuing an alignment approach to annotation as the original annotation provided with the array is quite outdated.
> 
> I performed a gapped alignment using the individual probe sequences (60-mers) from the array using TopHat and loaded the results into Bioconductor as a GappedAlignments object.  I have made a TranscriptDb object using the Zv9 genome from Ensembl.  Next, I plan to use findOverlaps for the annotation. What is the best way to get the overlaps (by exon, cds, or by transcript)?  I am a little concerned that using transcriptsByOverlaps might be a bit too broad and result in mapping reads to multiple genes (for example transcripts in the genome that have overlapping genomic coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps might be too restrictive and miss information in the UTR's.  My gut feeling is that annotating by cds is the "in-between" approach.  What is the recommended approach for RNA-seq?  As you can tell, I am quite new to this!
> 
> 
> Hi, Ravi.  Sorry to answer a question with more questions, but why not just map the probes against Ensembl Transcripts or refseq?  What is the advantage of mapping to the genome and then going back to the transcripts?  
> 
> Sean 


	[[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list