[BioC] Mapping microarray probes to the genome using findOverlaps

Mon Mar 21 18:54:28 CET 2011

We're working on this offline... will get back to the ML once it's
solved... but, essentially, the issue is that after his modification,
the NDF has multiple rows per X/Y coordinates...
b

2011/3/21 Hervé Pagès <hpages at fhcrc.org>:
> Sorry Ravi, I've no idea how to do this.  H.
>
> On 03/13/2011 07:03 PM, Ravi Karra wrote:
>>
>> Thanks for the suppprt Herve!
>> Do you or Benilton (or anyone else!) have any ideas on how to redesign the
>> .ndf file to allow PdInfoBuilder to use the updated probeset information
>> from these alignments?
>> Thanks again for all of the help,
>> Ravi
>>
>> On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote:
>>
>>> Hi Ravi,
>>>
>>> FWIW, I just made a BSgenome package for Zv9. It will become available
>>> to BioC devel users in a couple of hours (source package only for now).
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> On 03/07/2011 07:11 PM, Ravi Karra wrote:
>>>>
>>>> Hi Sean and Herve,
>>>>
>>>> Thanks a ton for the suggestions! I modified Herve's approach to use
>>>> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many
>>>> of the VEGA transcripts, many of which are supposed to be on this array.
>>>> I ended up mapping to the cdna's from zv9 and like Herve only match
>>>> about 60% of probes to the ensembl transcripts. I too found many probes
>>>> that map to multiple transcripts. Like the brainarray group, I chose to
>>>> use the probes as a part of transcript-specific probesets... i.e. a
>>>> given probe can contribute to the expression of multiple transcripts.
>>>> Before trying to extend mapping to other databases as suggested by Sean,
>>>> I wanted to try and create a custom ndf file and annotation package to
>>>> give the analysis a go. However, I keep getting the error below using
>>>> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file
>>>> incorrectly? I included my code and session info below. Sorry for the
>>>> inefficient code... I am learning on the fly!
>>>>
>>>> Thanks everyone! This list has been immensely helpful!
>>>>
>>>> Ravi
>>>>
>>>> =======================================================================
>>>> Building annotation package for Nimblegen Expression Array
>>>> NDF: newndf.ndf
>>>> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys
>>>> =======================================================================
>>>> Parsing file: newndf.ndf... OK
>>>> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK
>>>> Merging NDF and XYS files... OK
>>>> Preparing contents for featureSet table... OK
>>>> Preparing contents for bgfeature table... OK
>>>> Preparing contents for pmfeature table... OK
>>>> Creating package in ./pd.newndf
>>>> Inserting 27649 rows into table featureSet... OK
>>>> Inserting 200368 rows into table pmfeature... Error in
>>>> sqliteExecStatement(con, statement, bind.data) :
>>>> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed)
>>>>
>>>> My code:
>>>>
>>>> #1. Load the probes from the ndf file:
>>>> library(Biostrings)
>>>> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration
>>>> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf"
>>>> probe_info= read.table(array_file, header= TRUE, sep= '\t')
>>>> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE))
>>>> probeid= probe_info$PROBE_DESIGN_ID
>>>> probeid= as.character(probeid)
>>>> names(probes) = probeid
>>>> real_probes= which(width(probes) == 60) #don't align the control, empty,
>>>> or the crosshyb probes
>>>> probes= probes[real_probes]
>>>>
>>>> #2. Compute the transcriptome (51569 transcripts with sequences as of
>>>> March 7, .2011):
>>>> library(biomaRt)
>>>> ensembl= useMart("ensembl")
>>>> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl)
>>>> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart=
>>>> ensemblzv9)[,1]
>>>> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType=
>>>> "cdna", mart= ensemblzv9)
>>>> #make the DNAStringSet object
>>>> zv9txDict= DNAStringSet(txSeqs$cdna)
>>>> names(zv9txDict) = txSeqs$ensembl_transcript_id
>>>>
>>>> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>>> pdict2<- PDict(probes, max.mismatch=2)
>>>> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2)
>>>> makeMap<- function(m, probeids, txnames) {
>>>> probeid<- probeids[unlist(m)]
>>>> txname<- rep.int(txnames, elementLengths(m))
>>>> ans<- unique(data.frame(probeid, txname))
>>>> rownames(ans)<- NULL
>>>> ans
>>>> }
>>>> map= makeMap(m2, names(probes), names(zv9txDict))
>>>>
>>>> #4. Remove transcripts with less than 3 probes
>>>> txcounts= data.frame(table(map$txname))
>>>> dim(txcounts) #start with 32961 different transcripts
>>>> usefultxs= txcounts[txcounts$Freq>  2,][,1] #end with 27647 transcripts
>>>> map= map[map$txname%in% usefultxs, ]
>>>>
>>>> #5: Make a new ndf file with the updated transcript information
>>>> #probe_info is the ndf file I originally loaded
>>>> rownames(probe_info) = probe_info[,1]
>>>> ndf= probe_info[as.character(map$probeid),]
>>>> ndf$SEQ_ID= map$txname
>>>>
>>>> #append the information for the control probes to the new ndf
>>>> control= probe_info[probe_info$PROBE_CLASS!= "experimental",]
>>>> ndf= rbind(ndf, control)
>>>>
>>>> # change all the probes with unmapped transcripts to random but keep
>>>> controls intact
>>>> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in%
>>>> as.character(ndf$PROBE_DESIGN_ID))
>>>> noalignprobes= probe_info[noalign,]
>>>> noalignprobes$SEQ_ID= "noalign"
>>>> ndf= rbind(ndf, noalignprobes)
>>>> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE)
>>>>
>>>> #6. Make the new annotation package
>>>> #make custom chip descriptor file for Nimblegen
>>>> library(pdInfoBuilder)
>>>> filename_ndf<- list.files('.',pattern='.ndf',full.names=T)
>>>> filename_ndf= new("ScalarCharacter", filename_ndf)
>>>> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1]
>>>> filename_xys= new("ScalarCharacter", filename_xys)
>>>> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf,
>>>> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu
>>>> <mailto:ravi.karra at duke.edu>", biocViews="AnnotationData", chipName=
>>>> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7",
>>>> manufacturer= "Nimblegen", url=
>>>> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism=
>>>> "Danio rerio", version= "1.1")
>>>> makePdInfoPackage(seed, destDir= ".")
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.12.2 (2011-02-25)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods
>>>> [7] base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0
>>>> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2
>>>> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5
>>>> [10] Biobase_2.10.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3
>>>> [4] splines_2.12.2 XML_3.2-0
>>>>
>>>>
>>>> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve<hpages at fhcrc.org
>>>>> <mailto:hpages at fhcrc.org>>  wrote:
>>>>>
>>>>>    Hi Ravi,
>>>>>
>>>>>    Sean's suggestion to map the probes directly to the transcriptome
>>>>> can
>>>>>    easily be achieved with the Biostrings/BSgenome/GenomicFeatures
>>>>>    infrastructure.
>>>>>
>>>>>    The code below uses vwhichPDict (a variant of matchPDict) and the
>>>>>    danRer6
>>>>>    genome from UCSC (release name: Sanger Institute Zv8). Note that
>>>>>    UCSC just
>>>>>    released danRer7 (Sanger Institute Zv9) but we don't have a
>>>>>    BSgenome package
>>>>>    for it yet.
>>>>>
>>>>>    Also I don't think we support the Nimblegen Zebrafish 12 x135K
>>>>>    Expression
>>>>>    chip so I'm using the zebrafish chip from Affymetrix instead.
>>>>>
>>>>>    Depending on your machine and the quality of your internet
>>>>> connection,
>>>>>    this code should run in 8-15 minutes. Less than 800 Mb of RAM is
>>>>> used.
>>>>>
>>>>>    ## Load the probes:
>>>>>    library(zebrafishprobe)
>>>>>    library(Biostrings)
>>>>>    probes<- DNAStringSet(zebrafishprobe)
>>>>>
>>>>>    ## Compute the transcriptome (32992 transcripts):
>>>>>    library(GenomicFeatures)
>>>>>    txdb<- makeTranscriptDbFromUCSC(genome="danRer6",
>>>>>    tablename="ensGene")
>>>>>    library(BSgenome.Drerio.UCSC.danRer6)
>>>>>    txseqs<- extractTranscriptsFromGenome(Drerio, txdb)
>>>>>    ## Note that saving 'txseqs' with write.XStringSet() generates
>>>>>    ## a FASTA file that is 51M.
>>>>>
>>>>>    ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>>>>    pdict2<- PDict(probes, max.mismatch=2)
>>>>>    m2<- vwhichPDict(pdict2, txseqs, max.mismatch=2)
>>>>>    makeMap<- function(m, probeids, txnames)
>>>>>    {
>>>>>    probeid<- probeids[unlist(m)]
>>>>>    txname<- rep.int<http://rep.int/>(txnames, elementLengths(m))
>>>>>    ans<- unique(data.frame(probeid, txname))
>>>>>    rownames(ans)<- NULL
>>>>>    ans
>>>>>    }
>>>>>    map<- makeMap(m2, zebrafishprobe$Probe.Set.Name
>>>>>    <http://Probe.Set.Name/>, names(txseqs))
>>>>>
>>>>>    Note that the longest step is not the call to vwhichPDict() but the
>>>>>    extraction of the transcriptome.
>>>>>
>>>>>    However, the final result doesn't look that good. A quick look at
>>>>> the
>>>>>    mapping shows that only 62.4% of the probe ids are mapped and that
>>>>>    some
>>>>>    probe ids are mapped to more than 1 transcript:
>>>>>
>>>>>
>>>>> Nice example, Herve!
>>>>>
>>>>> Mapping to multiple transcripts is to be expected, so one needs to
>>>>> come up with ways of dealing with the situation. Missing probes is
>>>>> also typical. These can be dealt with by mapping to the unmatched
>>>>> probes to other transcript databases. For example, one could map to
>>>>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish
>>>>> ESTs next.
>>>>>
>>>>>    >  head(map)
>>>>>    probeid txname
>>>>>    1 Dr.19494.1.S1_at ENSDART00000048610
>>>>>    2 Dr.19494.1.S1_at ENSDART00000103479
>>>>>    3 Dr.19494.1.S1_at ENSDART00000103478
>>>>>    4 Dr.10343.1.S1_at ENSDART00000006449
>>>>>    5 Dr.12057.1.A1_at ENSDART00000054814
>>>>>    6 Dr.22271.1.A1_at ENSDART00000054814
>>>>>    >  length(unique(zebrafishprobe$Probe.Set.Name
>>>>>    <http://Probe.Set.Name/>))
>>>>>    [1] 15617
>>>>>    >  length(unique(map$probeid))
>>>>>    [1] 9746
>>>>>    >  any(duplicated(map$probeid))
>>>>>    [1] TRUE
>>>>>
>>>>>    But maybe you will have more luck with the Nimblegen Zebrafish 12
>>>>>    x135K
>>>>>    Expression chip.
>>>>>
>>>>>
>>>>> Keep in mind that with 60mer probes, there is potentially a need to
>>>>> allow gaps in the alignment and that consecutive matches of even 40 or
>>>>> 50 base pairs are likely to generate a signal on a microarray.
>>>>>
>>>>> Sean
>>>>>
>>>>>    One more thing. While testing the above code, I discovered 2
>>>>> problems
>>>>>    that were preventing it to work: one with the
>>>>>    extractTranscriptsFromGenome()
>>>>>    function and one with the vwhichPDict() function. Those problems
>>>>>    are now
>>>>>    fixed in the release and devel versions of GenomicFeatures (v
>>>>>    1.2.4&  1.3.13)
>>>>>    and Biostrings (v 2.18.3&  2.19.12). These new versions will
>>>>>    become available
>>>>>    via biocLite() in the next 24-36 hours.
>>>>>
>>>>>    Hope this helps.
>>>>>    H.
>>>>>
>>>>>
>>>>>    ----- Original Message -----
>>>>>    From: "Ravi Karra"<ravi.karra at gmail.com
>>>>>    <mailto:ravi.karra at gmail.com>>
>>>>>    To: "Sean Davis"<sdavis2 at mail.nih.gov<mailto:sdavis2 at mail.nih.gov>>
>>>>>    Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org>
>>>>>    Sent: Saturday, February 26, 2011 2:56:02 PM
>>>>>    Subject: Re: [BioC] Mapping microarray probes to the genome using
>>>>>    findOverlaps
>>>>>
>>>>>    Hi Sean,
>>>>>    I could do that, but am not sure how to. The annotated zebrafish
>>>>>    genome is the Tubingen strain and the some of the probes on the
>>>>>    array are from the EK and AB strains. This means that I need to
>>>>>    allow for SNP's in the alignment. I originally tried to align the
>>>>>    probes to the Ensembl Transcripts using matchPDict but when I
>>>>>    allowed for 2 mismatches (max.mismatch = 2) across the probe
>>>>>    sequences, my computer never stopped running the program! I found
>>>>>    TopHat to be much faster (8 min) and TopHat allows for a few
>>>>>    nucleotide wobble by default!.
>>>>>    Do you have a suggestion(s) on another way to align the array to
>>>>>    the Ensembl Transcripts?
>>>>>    Thanks,
>>>>>    Ravi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>    On Feb 26, 2011, at 5:45 PM, Sean Davis wrote:
>>>>>
>>>>>    >
>>>>>    >
>>>>>    >  On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra
>>>>>    <ravi.karra at gmail.com<mailto:ravi.karra at gmail.com>>  wrote:
>>>>>    >  Hello,
>>>>>    >
>>>>>    >  I am still trying to map probes on the Nimblegen Zebrafish 12
>>>>>    x135K Expression to the Zv9 version of the zebrafish genome
>>>>>    available from Ensembl! I am very reluctantly pursuing an
>>>>>    alignment approach to annotation as the original annotation
>>>>>    provided with the array is quite outdated.
>>>>>    >
>>>>>    >  I performed a gapped alignment using the individual probe
>>>>>    sequences (60-mers) from the array using TopHat and loaded the
>>>>>    results into Bioconductor as a GappedAlignments object. I have
>>>>>    made a TranscriptDb object using the Zv9 genome from Ensembl.
>>>>>    Next, I plan to use findOverlaps for the annotation. What is the
>>>>>    best way to get the overlaps (by exon, cds, or by transcript)? I
>>>>>    am a little concerned that using transcriptsByOverlaps might be a
>>>>>    bit too broad and result in mapping reads to multiple genes (for
>>>>>    example transcripts in the genome that have overlapping genomic
>>>>>    coordinates). By contrast, mapping with exonsByOverlaps and
>>>>>    cdsByOverlaps might be too restrictive and miss information in the
>>>>>    UTR's. My gut feeling is that annotating by cds is the
>>>>>    "in-between" approach. What is the recommended approach for
>>>>>    RNA-seq? As you can tell, I am quite new to this!
>>>>>    >
>>>>>    >
>>>>>    >  Hi, Ravi. Sorry to answer a question with more questions, but
>>>>>    why not just map the probes against Ensembl Transcripts or refseq?
>>>>>    What is the advantage of mapping to the genome and then going back
>>>>>    to the transcripts?
>>>>>    >
>>>>>    >  Sean
>>>>>
>>>>>
>>>>>    [[alternative HTML version deleted]]
>>>>>
>>>>>    _______________________________________________
>>>>>    Bioconductor mailing list
>>>>>    Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>>    https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>    Search the archives:
>>>>>    http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>    _______________________________________________
>>>>>    Bioconductor mailing list
>>>>>    Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>>    https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>    Search the archives:
>>>>>    http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>