[BioC] Mapping microarray probes to the genome using findOverlaps

Wed Mar 9 02:29:26 CET 2011

Hi Ravi,

FWIW, I just made a BSgenome package for Zv9. It will become available
to BioC devel users in a couple of hours (source package only for now).

Cheers,
H.

On 03/07/2011 07:11 PM, Ravi Karra wrote:
> Hi Sean and Herve,
>
> Thanks a ton for the suggestions! I modified Herve's approach to use
> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many
> of the VEGA transcripts, many of which are supposed to be on this array.
> I ended up mapping to the cdna's from zv9 and like Herve only match
> about 60% of probes to the ensembl transcripts. I too found many probes
> that map to multiple transcripts. Like the brainarray group, I chose to
> use the probes as a part of transcript-specific probesets... i.e. a
> given probe can contribute to the expression of multiple transcripts.
> Before trying to extend mapping to other databases as suggested by Sean,
> I wanted to try and create a custom ndf file and annotation package to
> give the analysis a go. However, I keep getting the error below using
> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file
> incorrectly? I included my code and session info below. Sorry for the
> inefficient code... I am learning on the fly!
>
> Thanks everyone! This list has been immensely helpful!
>
> Ravi
>
> =======================================================================
> Building annotation package for Nimblegen Expression Array
> NDF: newndf.ndf
> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys
> =======================================================================
> Parsing file: newndf.ndf... OK
> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK
> Merging NDF and XYS files... OK
> Preparing contents for featureSet table... OK
> Preparing contents for bgfeature table... OK
> Preparing contents for pmfeature table... OK
> Creating package in ./pd.newndf
> Inserting 27649 rows into table featureSet... OK
> Inserting 200368 rows into table pmfeature... Error in
> sqliteExecStatement(con, statement, bind.data) :
> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed)
>
> My code:
>
> #1. Load the probes from the ndf file:
> library(Biostrings)
> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration
> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf"
> probe_info= read.table(array_file, header= TRUE, sep= '\t')
> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE))
> probeid= probe_info$PROBE_DESIGN_ID
> probeid= as.character(probeid)
> names(probes) = probeid
> real_probes= which(width(probes) == 60) #don't align the control, empty,
> or the crosshyb probes
> probes= probes[real_probes]
>
> #2. Compute the transcriptome (51569 transcripts with sequences as of
> March 7, .2011):
> library(biomaRt)
> ensembl= useMart("ensembl")
> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl)
> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1]
> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType=
> "cdna", mart= ensemblzv9)
> #make the DNAStringSet object
> zv9txDict= DNAStringSet(txSeqs$cdna)
> names(zv9txDict) = txSeqs$ensembl_transcript_id
>
> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches):
> pdict2<- PDict(probes, max.mismatch=2)
> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2)
> makeMap<- function(m, probeids, txnames) {
> probeid<- probeids[unlist(m)]
> txname<- rep.int(txnames, elementLengths(m))
> ans<- unique(data.frame(probeid, txname))
> rownames(ans) <- NULL
> ans
> }
> map= makeMap(m2, names(probes), names(zv9txDict))
>
> #4. Remove transcripts with less than 3 probes
> txcounts= data.frame(table(map$txname))
> dim(txcounts) #start with 32961 different transcripts
> usefultxs= txcounts[txcounts$Freq> 2,][,1] #end with 27647 transcripts
> map= map[map$txname%in% usefultxs, ]
>
> #5: Make a new ndf file with the updated transcript information
> #probe_info is the ndf file I originally loaded
> rownames(probe_info) = probe_info[,1]
> ndf= probe_info[as.character(map$probeid),]
> ndf$SEQ_ID= map$txname
>
> #append the information for the control probes to the new ndf
> control= probe_info[probe_info$PROBE_CLASS!= "experimental",]
> ndf= rbind(ndf, control)
>
> # change all the probes with unmapped transcripts to random but keep
> controls intact
> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in%
> as.character(ndf$PROBE_DESIGN_ID))
> noalignprobes= probe_info[noalign,]
> noalignprobes$SEQ_ID= "noalign"
> ndf= rbind(ndf, noalignprobes)
> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE)
>
> #6. Make the new annotation package
> #make custom chip descriptor file for Nimblegen
> library(pdInfoBuilder)
> filename_ndf<- list.files('.',pattern='.ndf',full.names=T)
> filename_ndf= new("ScalarCharacter", filename_ndf)
> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1]
> filename_xys= new("ScalarCharacter", filename_xys)
> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf,
> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu
> <mailto:ravi.karra at duke.edu>", biocViews="AnnotationData", chipName=
> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7",
> manufacturer= "Nimblegen", url=
> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism=
> "Danio rerio", version= "1.1")
> makePdInfoPackage(seed, destDir= ".")
>
>>  sessionInfo()
> R version 2.12.2 (2011-02-25)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
> [7] base
>
> other attached packages:
> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0
> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2
> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5
> [10] Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3
> [4] splines_2.12.2 XML_3.2-0
>
>
> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote:
>
>>
>>
>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>> wrote:
>>
>>     Hi Ravi,
>>
>>     Sean's suggestion to map the probes directly to the transcriptome can
>>     easily be achieved with the Biostrings/BSgenome/GenomicFeatures
>>     infrastructure.
>>
>>     The code below uses vwhichPDict (a variant of matchPDict) and the
>>     danRer6
>>     genome from UCSC (release name: Sanger Institute Zv8). Note that
>>     UCSC just
>>     released danRer7 (Sanger Institute Zv9) but we don't have a
>>     BSgenome package
>>     for it yet.
>>
>>     Also I don't think we support the Nimblegen Zebrafish 12 x135K
>>     Expression
>>     chip so I'm using the zebrafish chip from Affymetrix instead.
>>
>>     Depending on your machine and the quality of your internet connection,
>>     this code should run in 8-15 minutes. Less than 800 Mb of RAM is used.
>>
>>     ## Load the probes:
>>     library(zebrafishprobe)
>>     library(Biostrings)
>>     probes <- DNAStringSet(zebrafishprobe)
>>
>>     ## Compute the transcriptome (32992 transcripts):
>>     library(GenomicFeatures)
>>     txdb <- makeTranscriptDbFromUCSC(genome="danRer6",
>>     tablename="ensGene")
>>     library(BSgenome.Drerio.UCSC.danRer6)
>>     txseqs <- extractTranscriptsFromGenome(Drerio, txdb)
>>     ## Note that saving 'txseqs' with write.XStringSet() generates
>>     ## a FASTA file that is 51M.
>>
>>     ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>     pdict2 <- PDict(probes, max.mismatch=2)
>>     m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2)
>>     makeMap <- function(m, probeids, txnames)
>>     {
>>     probeid <- probeids[unlist(m)]
>>     txname <- rep.int <http://rep.int/>(txnames, elementLengths(m))
>>     ans <- unique(data.frame(probeid, txname))
>>     rownames(ans) <- NULL
>>     ans
>>     }
>>     map <- makeMap(m2, zebrafishprobe$Probe.Set.Name
>>     <http://Probe.Set.Name/>, names(txseqs))
>>
>>     Note that the longest step is not the call to vwhichPDict() but the
>>     extraction of the transcriptome.
>>
>>     However, the final result doesn't look that good. A quick look at the
>>     mapping shows that only 62.4% of the probe ids are mapped and that
>>     some
>>     probe ids are mapped to more than 1 transcript:
>>
>>
>> Nice example, Herve!
>>
>> Mapping to multiple transcripts is to be expected, so one needs to
>> come up with ways of dealing with the situation. Missing probes is
>> also typical. These can be dealt with by mapping to the unmatched
>> probes to other transcript databases. For example, one could map to
>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish
>> ESTs next.
>>
>>     > head(map)
>>     probeid txname
>>     1 Dr.19494.1.S1_at ENSDART00000048610
>>     2 Dr.19494.1.S1_at ENSDART00000103479
>>     3 Dr.19494.1.S1_at ENSDART00000103478
>>     4 Dr.10343.1.S1_at ENSDART00000006449
>>     5 Dr.12057.1.A1_at ENSDART00000054814
>>     6 Dr.22271.1.A1_at ENSDART00000054814
>>     > length(unique(zebrafishprobe$Probe.Set.Name
>>     <http://Probe.Set.Name/>))
>>     [1] 15617
>>     > length(unique(map$probeid))
>>     [1] 9746
>>     > any(duplicated(map$probeid))
>>     [1] TRUE
>>
>>     But maybe you will have more luck with the Nimblegen Zebrafish 12
>>     x135K
>>     Expression chip.
>>
>>
>> Keep in mind that with 60mer probes, there is potentially a need to
>> allow gaps in the alignment and that consecutive matches of even 40 or
>> 50 base pairs are likely to generate a signal on a microarray.
>>
>> Sean
>>
>>     One more thing. While testing the above code, I discovered 2 problems
>>     that were preventing it to work: one with the
>>     extractTranscriptsFromGenome()
>>     function and one with the vwhichPDict() function. Those problems
>>     are now
>>     fixed in the release and devel versions of GenomicFeatures (v
>>     1.2.4 & 1.3.13)
>>     and Biostrings (v 2.18.3 & 2.19.12). These new versions will
>>     become available
>>     via biocLite() in the next 24-36 hours.
>>
>>     Hope this helps.
>>     H.
>>
>>
>>     ----- Original Message -----
>>     From: "Ravi Karra" <ravi.karra at gmail.com
>>     <mailto:ravi.karra at gmail.com>>
>>     To: "Sean Davis" <sdavis2 at mail.nih.gov <mailto:sdavis2 at mail.nih.gov>>
>>     Cc: bioconductor at r-project.org <mailto:bioconductor at r-project.org>
>>     Sent: Saturday, February 26, 2011 2:56:02 PM
>>     Subject: Re: [BioC] Mapping microarray probes to the genome using
>>     findOverlaps
>>
>>     Hi Sean,
>>     I could do that, but am not sure how to. The annotated zebrafish
>>     genome is the Tubingen strain and the some of the probes on the
>>     array are from the EK and AB strains. This means that I need to
>>     allow for SNP's in the alignment. I originally tried to align the
>>     probes to the Ensembl Transcripts using matchPDict but when I
>>     allowed for 2 mismatches (max.mismatch = 2) across the probe
>>     sequences, my computer never stopped running the program! I found
>>     TopHat to be much faster (8 min) and TopHat allows for a few
>>     nucleotide wobble by default!.
>>     Do you have a suggestion(s) on another way to align the array to
>>     the Ensembl Transcripts?
>>     Thanks,
>>     Ravi
>>
>>
>>
>>
>>     On Feb 26, 2011, at 5:45 PM, Sean Davis wrote:
>>
>>     >
>>     >
>>     > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra
>>     <ravi.karra at gmail.com <mailto:ravi.karra at gmail.com>> wrote:
>>     > Hello,
>>     >
>>     > I am still trying to map probes on the Nimblegen Zebrafish 12
>>     x135K Expression to the Zv9 version of the zebrafish genome
>>     available from Ensembl! I am very reluctantly pursuing an
>>     alignment approach to annotation as the original annotation
>>     provided with the array is quite outdated.
>>     >
>>     > I performed a gapped alignment using the individual probe
>>     sequences (60-mers) from the array using TopHat and loaded the
>>     results into Bioconductor as a GappedAlignments object. I have
>>     made a TranscriptDb object using the Zv9 genome from Ensembl.
>>     Next, I plan to use findOverlaps for the annotation. What is the
>>     best way to get the overlaps (by exon, cds, or by transcript)? I
>>     am a little concerned that using transcriptsByOverlaps might be a
>>     bit too broad and result in mapping reads to multiple genes (for
>>     example transcripts in the genome that have overlapping genomic
>>     coordinates). By contrast, mapping with exonsByOverlaps and
>>     cdsByOverlaps might be too restrictive and miss information in the
>>     UTR's. My gut feeling is that annotating by cds is the
>>     "in-between" approach. What is the recommended approach for
>>     RNA-seq? As you can tell, I am quite new to this!
>>     >
>>     >
>>     > Hi, Ravi. Sorry to answer a question with more questions, but
>>     why not just map the probes against Ensembl Transcripts or refseq?
>>     What is the advantage of mapping to the genome and then going back
>>     to the transcripts?
>>     >
>>     > Sean
>>
>>
>>     [[alternative HTML version deleted]]
>>
>>     _______________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>     _______________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319