[BioC] Mapping microarray probes to the genome using findOverlaps

Mon Mar 21 18:47:16 CET 2011

Sorry Ravi, I've no idea how to do this.  H.

On 03/13/2011 07:03 PM, Ravi Karra wrote:
> Thanks for the suppprt Herve!
> Do you or Benilton (or anyone else!) have any ideas on how to redesign the .ndf file to allow PdInfoBuilder to use the updated probeset information from these alignments?
> Thanks again for all of the help,
> Ravi
>
> On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote:
>
>> Hi Ravi,
>>
>> FWIW, I just made a BSgenome package for Zv9. It will become available
>> to BioC devel users in a couple of hours (source package only for now).
>>
>> Cheers,
>> H.
>>
>>
>> On 03/07/2011 07:11 PM, Ravi Karra wrote:
>>> Hi Sean and Herve,
>>>
>>> Thanks a ton for the suggestions! I modified Herve's approach to use
>>> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many
>>> of the VEGA transcripts, many of which are supposed to be on this array.
>>> I ended up mapping to the cdna's from zv9 and like Herve only match
>>> about 60% of probes to the ensembl transcripts. I too found many probes
>>> that map to multiple transcripts. Like the brainarray group, I chose to
>>> use the probes as a part of transcript-specific probesets... i.e. a
>>> given probe can contribute to the expression of multiple transcripts.
>>> Before trying to extend mapping to other databases as suggested by Sean,
>>> I wanted to try and create a custom ndf file and annotation package to
>>> give the analysis a go. However, I keep getting the error below using
>>> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file
>>> incorrectly? I included my code and session info below. Sorry for the
>>> inefficient code... I am learning on the fly!
>>>
>>> Thanks everyone! This list has been immensely helpful!
>>>
>>> Ravi
>>>
>>> =======================================================================
>>> Building annotation package for Nimblegen Expression Array
>>> NDF: newndf.ndf
>>> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys
>>> =======================================================================
>>> Parsing file: newndf.ndf... OK
>>> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK
>>> Merging NDF and XYS files... OK
>>> Preparing contents for featureSet table... OK
>>> Preparing contents for bgfeature table... OK
>>> Preparing contents for pmfeature table... OK
>>> Creating package in ./pd.newndf
>>> Inserting 27649 rows into table featureSet... OK
>>> Inserting 200368 rows into table pmfeature... Error in
>>> sqliteExecStatement(con, statement, bind.data) :
>>> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed)
>>>
>>> My code:
>>>
>>> #1. Load the probes from the ndf file:
>>> library(Biostrings)
>>> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration
>>> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf"
>>> probe_info= read.table(array_file, header= TRUE, sep= '\t')
>>> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE))
>>> probeid= probe_info$PROBE_DESIGN_ID
>>> probeid= as.character(probeid)
>>> names(probes) = probeid
>>> real_probes= which(width(probes) == 60) #don't align the control, empty,
>>> or the crosshyb probes
>>> probes= probes[real_probes]
>>>
>>> #2. Compute the transcriptome (51569 transcripts with sequences as of
>>> March 7, .2011):
>>> library(biomaRt)
>>> ensembl= useMart("ensembl")
>>> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl)
>>> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1]
>>> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType=
>>> "cdna", mart= ensemblzv9)
>>> #make the DNAStringSet object
>>> zv9txDict= DNAStringSet(txSeqs$cdna)
>>> names(zv9txDict) = txSeqs$ensembl_transcript_id
>>>
>>> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>> pdict2<- PDict(probes, max.mismatch=2)
>>> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2)
>>> makeMap<- function(m, probeids, txnames) {
>>> probeid<- probeids[unlist(m)]
>>> txname<- rep.int(txnames, elementLengths(m))
>>> ans<- unique(data.frame(probeid, txname))
>>> rownames(ans)<- NULL
>>> ans
>>> }
>>> map= makeMap(m2, names(probes), names(zv9txDict))
>>>
>>> #4. Remove transcripts with less than 3 probes
>>> txcounts= data.frame(table(map$txname))
>>> dim(txcounts) #start with 32961 different transcripts
>>> usefultxs= txcounts[txcounts$Freq>  2,][,1] #end with 27647 transcripts
>>> map= map[map$txname%in% usefultxs, ]
>>>
>>> #5: Make a new ndf file with the updated transcript information
>>> #probe_info is the ndf file I originally loaded
>>> rownames(probe_info) = probe_info[,1]
>>> ndf= probe_info[as.character(map$probeid),]
>>> ndf$SEQ_ID= map$txname
>>>
>>> #append the information for the control probes to the new ndf
>>> control= probe_info[probe_info$PROBE_CLASS!= "experimental",]
>>> ndf= rbind(ndf, control)
>>>
>>> # change all the probes with unmapped transcripts to random but keep
>>> controls intact
>>> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in%
>>> as.character(ndf$PROBE_DESIGN_ID))
>>> noalignprobes= probe_info[noalign,]
>>> noalignprobes$SEQ_ID= "noalign"
>>> ndf= rbind(ndf, noalignprobes)
>>> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE)
>>>
>>> #6. Make the new annotation package
>>> #make custom chip descriptor file for Nimblegen
>>> library(pdInfoBuilder)
>>> filename_ndf<- list.files('.',pattern='.ndf',full.names=T)
>>> filename_ndf= new("ScalarCharacter", filename_ndf)
>>> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1]
>>> filename_xys= new("ScalarCharacter", filename_xys)
>>> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf,
>>> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu
>>> <mailto:ravi.karra at duke.edu>", biocViews="AnnotationData", chipName=
>>> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7",
>>> manufacturer= "Nimblegen", url=
>>> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism=
>>> "Danio rerio", version= "1.1")
>>> makePdInfoPackage(seed, destDir= ".")
>>>
>>>> sessionInfo()
>>> R version 2.12.2 (2011-02-25)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods
>>> [7] base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0
>>> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2
>>> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5
>>> [10] Biobase_2.10.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3
>>> [4] splines_2.12.2 XML_3.2-0
>>>
>>>
>>> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote:
>>>
>>>>
>>>>
>>>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve<hpages at fhcrc.org
>>>> <mailto:hpages at fhcrc.org>>  wrote:
>>>>
>>>>     Hi Ravi,
>>>>
>>>>     Sean's suggestion to map the probes directly to the transcriptome can
>>>>     easily be achieved with the Biostrings/BSgenome/GenomicFeatures
>>>>     infrastructure.
>>>>
>>>>     The code below uses vwhichPDict (a variant of matchPDict) and the
>>>>     danRer6
>>>>     genome from UCSC (release name: Sanger Institute Zv8). Note that
>>>>     UCSC just
>>>>     released danRer7 (Sanger Institute Zv9) but we don't have a
>>>>     BSgenome package
>>>>     for it yet.
>>>>
>>>>     Also I don't think we support the Nimblegen Zebrafish 12 x135K
>>>>     Expression
>>>>     chip so I'm using the zebrafish chip from Affymetrix instead.
>>>>
>>>>     Depending on your machine and the quality of your internet connection,
>>>>     this code should run in 8-15 minutes. Less than 800 Mb of RAM is used.
>>>>
>>>>     ## Load the probes:
>>>>     library(zebrafishprobe)
>>>>     library(Biostrings)
>>>>     probes<- DNAStringSet(zebrafishprobe)
>>>>
>>>>     ## Compute the transcriptome (32992 transcripts):
>>>>     library(GenomicFeatures)
>>>>     txdb<- makeTranscriptDbFromUCSC(genome="danRer6",
>>>>     tablename="ensGene")
>>>>     library(BSgenome.Drerio.UCSC.danRer6)
>>>>     txseqs<- extractTranscriptsFromGenome(Drerio, txdb)
>>>>     ## Note that saving 'txseqs' with write.XStringSet() generates
>>>>     ## a FASTA file that is 51M.
>>>>
>>>>     ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>>>     pdict2<- PDict(probes, max.mismatch=2)
>>>>     m2<- vwhichPDict(pdict2, txseqs, max.mismatch=2)
>>>>     makeMap<- function(m, probeids, txnames)
>>>>     {
>>>>     probeid<- probeids[unlist(m)]
>>>>     txname<- rep.int<http://rep.int/>(txnames, elementLengths(m))
>>>>     ans<- unique(data.frame(probeid, txname))
>>>>     rownames(ans)<- NULL
>>>>     ans
>>>>     }
>>>>     map<- makeMap(m2, zebrafishprobe$Probe.Set.Name
>>>>     <http://Probe.Set.Name/>, names(txseqs))
>>>>
>>>>     Note that the longest step is not the call to vwhichPDict() but the
>>>>     extraction of the transcriptome.
>>>>
>>>>     However, the final result doesn't look that good. A quick look at the
>>>>     mapping shows that only 62.4% of the probe ids are mapped and that
>>>>     some
>>>>     probe ids are mapped to more than 1 transcript:
>>>>
>>>>
>>>> Nice example, Herve!
>>>>
>>>> Mapping to multiple transcripts is to be expected, so one needs to
>>>> come up with ways of dealing with the situation. Missing probes is
>>>> also typical. These can be dealt with by mapping to the unmatched
>>>> probes to other transcript databases. For example, one could map to
>>>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish
>>>> ESTs next.
>>>>
>>>>     >  head(map)
>>>>     probeid txname
>>>>     1 Dr.19494.1.S1_at ENSDART00000048610
>>>>     2 Dr.19494.1.S1_at ENSDART00000103479
>>>>     3 Dr.19494.1.S1_at ENSDART00000103478
>>>>     4 Dr.10343.1.S1_at ENSDART00000006449
>>>>     5 Dr.12057.1.A1_at ENSDART00000054814
>>>>     6 Dr.22271.1.A1_at ENSDART00000054814
>>>>     >  length(unique(zebrafishprobe$Probe.Set.Name
>>>>     <http://Probe.Set.Name/>))
>>>>     [1] 15617
>>>>     >  length(unique(map$probeid))
>>>>     [1] 9746
>>>>     >  any(duplicated(map$probeid))
>>>>     [1] TRUE
>>>>
>>>>     But maybe you will have more luck with the Nimblegen Zebrafish 12
>>>>     x135K
>>>>     Expression chip.
>>>>
>>>>
>>>> Keep in mind that with 60mer probes, there is potentially a need to
>>>> allow gaps in the alignment and that consecutive matches of even 40 or
>>>> 50 base pairs are likely to generate a signal on a microarray.
>>>>
>>>> Sean
>>>>
>>>>     One more thing. While testing the above code, I discovered 2 problems
>>>>     that were preventing it to work: one with the
>>>>     extractTranscriptsFromGenome()
>>>>     function and one with the vwhichPDict() function. Those problems
>>>>     are now
>>>>     fixed in the release and devel versions of GenomicFeatures (v
>>>>     1.2.4&  1.3.13)
>>>>     and Biostrings (v 2.18.3&  2.19.12). These new versions will
>>>>     become available
>>>>     via biocLite() in the next 24-36 hours.
>>>>
>>>>     Hope this helps.
>>>>     H.
>>>>
>>>>
>>>>     ----- Original Message -----
>>>>     From: "Ravi Karra"<ravi.karra at gmail.com
>>>>     <mailto:ravi.karra at gmail.com>>
>>>>     To: "Sean Davis"<sdavis2 at mail.nih.gov<mailto:sdavis2 at mail.nih.gov>>
>>>>     Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org>
>>>>     Sent: Saturday, February 26, 2011 2:56:02 PM
>>>>     Subject: Re: [BioC] Mapping microarray probes to the genome using
>>>>     findOverlaps
>>>>
>>>>     Hi Sean,
>>>>     I could do that, but am not sure how to. The annotated zebrafish
>>>>     genome is the Tubingen strain and the some of the probes on the
>>>>     array are from the EK and AB strains. This means that I need to
>>>>     allow for SNP's in the alignment. I originally tried to align the
>>>>     probes to the Ensembl Transcripts using matchPDict but when I
>>>>     allowed for 2 mismatches (max.mismatch = 2) across the probe
>>>>     sequences, my computer never stopped running the program! I found
>>>>     TopHat to be much faster (8 min) and TopHat allows for a few
>>>>     nucleotide wobble by default!.
>>>>     Do you have a suggestion(s) on another way to align the array to
>>>>     the Ensembl Transcripts?
>>>>     Thanks,
>>>>     Ravi
>>>>
>>>>
>>>>
>>>>
>>>>     On Feb 26, 2011, at 5:45 PM, Sean Davis wrote:
>>>>
>>>>     >
>>>>     >
>>>>     >  On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra
>>>>     <ravi.karra at gmail.com<mailto:ravi.karra at gmail.com>>  wrote:
>>>>     >  Hello,
>>>>     >
>>>>     >  I am still trying to map probes on the Nimblegen Zebrafish 12
>>>>     x135K Expression to the Zv9 version of the zebrafish genome
>>>>     available from Ensembl! I am very reluctantly pursuing an
>>>>     alignment approach to annotation as the original annotation
>>>>     provided with the array is quite outdated.
>>>>     >
>>>>     >  I performed a gapped alignment using the individual probe
>>>>     sequences (60-mers) from the array using TopHat and loaded the
>>>>     results into Bioconductor as a GappedAlignments object. I have
>>>>     made a TranscriptDb object using the Zv9 genome from Ensembl.
>>>>     Next, I plan to use findOverlaps for the annotation. What is the
>>>>     best way to get the overlaps (by exon, cds, or by transcript)? I
>>>>     am a little concerned that using transcriptsByOverlaps might be a
>>>>     bit too broad and result in mapping reads to multiple genes (for
>>>>     example transcripts in the genome that have overlapping genomic
>>>>     coordinates). By contrast, mapping with exonsByOverlaps and
>>>>     cdsByOverlaps might be too restrictive and miss information in the
>>>>     UTR's. My gut feeling is that annotating by cds is the
>>>>     "in-between" approach. What is the recommended approach for
>>>>     RNA-seq? As you can tell, I am quite new to this!
>>>>     >
>>>>     >
>>>>     >  Hi, Ravi. Sorry to answer a question with more questions, but
>>>>     why not just map the probes against Ensembl Transcripts or refseq?
>>>>     What is the advantage of mapping to the genome and then going back
>>>>     to the transcripts?
>>>>     >
>>>>     >  Sean
>>>>
>>>>
>>>>     [[alternative HTML version deleted]]
>>>>
>>>>     _______________________________________________
>>>>     Bioconductor mailing list
>>>>     Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>     Search the archives:
>>>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>     _______________________________________________
>>>>     Bioconductor mailing list
>>>>     Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>     Search the archives:
>>>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319