[BioC] Mapping microarray probes to the genome using findOverlaps

Hervé Pagès hpages at fhcrc.org
Wed Mar 2 01:00:22 CET 2011


Hi Sean, Ravi,

On 03/01/2011 04:08 AM, Sean Davis wrote:
>
>
> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Ravi,
>
>     Sean's suggestion to map the probes directly to the transcriptome can
>     easily be achieved with the Biostrings/BSgenome/GenomicFeatures
>     infrastructure.
>
>     The code below uses vwhichPDict (a variant of matchPDict) and the
>     danRer6
>     genome from UCSC (release name: Sanger Institute Zv8). Note that
>     UCSC just
>     released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome
>     package
>     for it yet.
>
>     Also I don't think we support the Nimblegen Zebrafish 12 x135K
>     Expression
>     chip so I'm using the zebrafish chip from Affymetrix instead.
>
>     Depending on your machine and the quality of your internet connection,
>     this code should run in 8-15 minutes. Less than 800 Mb of RAM is used.
>
>       ## Load the probes:
>       library(zebrafishprobe)
>       library(Biostrings)
>       probes <- DNAStringSet(zebrafishprobe)
>
>       ## Compute the transcriptome (32992 transcripts):
>       library(GenomicFeatures)
>       txdb <- makeTranscriptDbFromUCSC(genome="danRer6",
>     tablename="ensGene")
>       library(BSgenome.Drerio.UCSC.danRer6)
>       txseqs <- extractTranscriptsFromGenome(Drerio, txdb)
>       ## Note that saving 'txseqs' with write.XStringSet() generates
>       ## a FASTA file that is 51M.
>
>       ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
>       pdict2 <- PDict(probes, max.mismatch=2)
>       m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2)
>       makeMap <- function(m, probeids, txnames)
>       {
>         probeid <- probeids[unlist(m)]
>         txname <- rep.int <http://rep.int>(txnames, elementLengths(m))
>         ans <- unique(data.frame(probeid, txname))
>         rownames(ans) <- NULL
>         ans
>       }
>       map <- makeMap(m2, zebrafishprobe$Probe.Set.Name
>     <http://Probe.Set.Name>, names(txseqs))
>
>     Note that the longest step is not the call to vwhichPDict() but the
>     extraction of the transcriptome.
>
>     However, the final result doesn't look that good. A quick look at the
>     mapping shows that only 62.4% of the probe ids are mapped and that some
>     probe ids are mapped to more than 1 transcript:
>
>
> Nice example, Herve!
>
> Mapping to multiple transcripts is to be expected, so one needs to come
> up with ways of dealing with the situation.  Missing probes is also
> typical.  These can be dealt with by mapping to the unmatched probes to
> other transcript databases.  For example, one could map to Ensembl
> first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next.
>
>      > head(map)
>                  probeid             txname
>       1 Dr.19494.1.S1_at ENSDART00000048610
>       2 Dr.19494.1.S1_at ENSDART00000103479
>       3 Dr.19494.1.S1_at ENSDART00000103478
>       4 Dr.10343.1.S1_at ENSDART00000006449
>       5 Dr.12057.1.A1_at ENSDART00000054814
>       6 Dr.22271.1.A1_at ENSDART00000054814
>      > length(unique(zebrafishprobe$Probe.Set.Name <http://Probe.Set.Name>))
>       [1] 15617
>      > length(unique(map$probeid))
>       [1] 9746
>      > any(duplicated(map$probeid))
>       [1] TRUE
>
>     But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K
>     Expression chip.
>
>
> Keep in mind that with 60mer probes, there is potentially a need to
> allow gaps in the alignment

Good point. I should mention here that, even though at the moment
matchPDict() and family cannot handle indels when the dictionary
is preprocessed, now they do support them when the dictionary is
not preprocessed (I just added this feature). So one can still use
this to try to align the probes that didn't align with 2 mismatches:

   mapped_probes_idx <- sort(unique(unlist(m2)))
   unmapped_probes <- probes[-mapped_probes_idx]  # 115789 probes

   m2_indels <- vwhichPDict(unmapped_probes, txseqs,
                            max.mismatch=2, with.indels=TRUE)

Unfortunately, since no preprocessing is used, this is *very* slow
(can take up to 3 or 4 days to align the 115789 unmapped probes).

Anyway, for the zebrafish probes used in this example, allowing 2
indels doesn't seem to help (I tried to align the first 100 unmapped
probes only and none of them would align), so, as you said Sean, one
might want to try using other transcript databases.



> and that consecutive matches of even 40 or
> 50 base pairs are likely to generate a signal on a microarray.

Unfortunately, I can't think of any easy/efficient way to achieve
this kind of partial mapping with the tools I'm trying to promote
here. The kind of indels supported by the matchPattern and matchPDict
families of functions are simple indels (1 indel = 1 insertion or
deletion of 1 nucleotide). For more complex types of alignments, the
pairwiseAlignment() function could be used, but, since it is basically
an implementation of the Smith-Waterman algorithm, it would be much
much slower than any of the external tools Sean mentioned (blat, gmap,
etc...)

Cheers,
H.

PS: Support for a small number of indels to countPDict(), whichPDict(),
vcountPDict() and vwhichPDict(), when the input dictionary is not
preprocessed, was added to Biostrings 2.18.4 (release) and 2.19.13
(devel).

>
> Sean
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list