[BioC] Fwd: Probe matching using vmatchPDict

Wed Aug 3 10:19:53 CEST 2011

Hi Ian,

I'm putting this discussion back on the mailing list as it might
be of interest to others.

On 11-08-01 02:00 AM, Ian Henry wrote:
> Hi Herve,
>
> I did some extra digging and the UCSC refGene track for NM_001320 CSNK2B
> shows:
>
> chr6 	31633656 	31637843 	CSNK2B
> chr6_cox_hap2 	3143276 	3147463 	CSNK2B
> chr6_dbb_hap3 	2919235 	2923423 	CSNK2B
> chr6_mcf_hap5 	3013336 	3017523 	CSNK2B
> chr6_qbl_hap6 	2927299 	2931486 	CSNK2B
> chr6_mann_hap4 	2976546 	2980733 	CSNK2B
> chr6_ssto_hap7 	2964461 	2968648 	CSNK2B
>
>
> Which seems to be the results in my table. So the answer appears to be
> that the transcripts are on 'unusual chromosomes' could I filter these
> away an just keep chr6 easily?
>
> Thanks so much for your help,
>
> Ian
>
>
> Begin forwarded message:
>
>> *From: *Ian Henry <henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>>
>> *Date: *August 1, 2011 10:31:54 AM GMT+02:00
>> *To: *Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>> *Subject: **Re: [BioC] Probe matching using vmatchPDict*
>>
>> Hello Herve,
>>
>> Thanks for your help,
>>
>> I got the code running by first using vwhichPDict and then running
>> matchPDict on those positive matches. It takes around 15 minutes to
>> run on the around 6000 transcripts hit by my probes which isn't bad.
>>
>>
>> getSequenceMatches<-function(transcript)
>> {
>> probeset_pdict <- PDict(probeset)
>> txmatches <- matchPDict(probeset_pdict, transcript[[1]])
>> match <-extractAllMatches(transcript[[1]], txmatches)
>> if( length(match)>0 ){
>> matchdf <- as.data.frame(match)
>> matchdf$Transcript = names(transcript)
>> matchdf$Sequence <- as.character(subseq(rep(transcript[1],
>> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end))
>> return(matchdf)
>> }
>> }
>>
>> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx)
>> tx_match_indexes <- which(elementLengths(tx_matches)>0)
>>
>> f <- function(f) getSequenceMatches(hg19_tx[f])
>> a <- lapply(tx_match_indexes,f)
>> outputTable<-do.call(rbind,a)
>>
>>
>> I have one small additional question which may or may not be best
>> directed at you. If I extract the human hg19 refGene transcript
>> database as follows:
>>
>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>> "refGene")
>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>
>> This works, but if I explore the object produced to look at transcript
>> names:
>>
>> > t1 <- names(hg19_tx)
>> > t1<- data.frame(t1)
>> > subset(t1, t1=="NM_001320")
>>
>> this gives:
>> t1
>> 12387 NM_001320
>> 38438 NM_001320
>> 38976 NM_001320
>> 39122 NM_001320
>> 39645 NM_001320
>> 39963 NM_001320
>> 40204 NM_001320
>>
>> So there are multiple entries for one transcript. Do you know why this
>> is, I guess it is a 'feature' of the UCSC refGene? I suspect they may
>> be versions of the same transcript e.g. NM_001320.1, NM_001320.2
>> etc.... but the version number is missing and I guess I'd want the
>> current transcript for my purposes, as in my probe position matches
>> often I get multiple positions with the same transcript ID. Not sure
>> if you can help, but any advice would be greatly appreciated.

Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC
knownGene track does not contain duplicated transcript IDs.
Note that the names you see on the DNAStringSet object returned
by extractTranscriptsFromGenome() are exactly the names you get
on the GRangesList object returned by exonsBy() when you group
the exons by transcripts and use 'use.names=TRUE':

   > exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE)
   Warning message:
   In .set.group.names(ans, use.names, txdb, by) :
     some group names are NAs or duplicated

   > identical(names(exbytx), names(hg19_tx))
   [1] TRUE

   > table(duplicated(names(exbytx)))

   FALSE  TRUE
   37308  3231

Also note the warning about some group names being NAs or duplicated.
Here the group names are the transcript names i.e. the top-level
names of 'exbytx'. You also get this warning when calling
extractTranscriptsFromGenome() on 'hg19txdb', which is not
surprising because 'exonsBy( , by="tx", use.names=TRUE)' is
called internally.

To keep only the transcripts that belong to a pre-defined subset
of genomic sequences of interest, you can do:

## Let's say my sequences of interest are the main chromosomes:
 > mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="")

## 'exbytx' contains the sequence that each exon belongs to.
## We use this to infer the sequence that each transcript belongs to.
 > tx_seqnames <- lapply(seqnames(exbytx), runValue)
## Just to make sure that no transcript contains exons from more
## than 1 sequence:
 > all(elementLengths(tx_seqnames) == 1)
[1] TRUE
 > tx_seqnames <- unlist(tx_seqnames)
 > idx <- which(tx_seqnames %in% mychroms)

## Note that, even transcripts on the main chromosomes have
## duplicated names:
 > table(duplicated(names(exbytx)[idx]))

FALSE  TRUE
37285  1015

## Finally, subset 'hg19_tx':
 > mytx <- hg19_tx[idx]

Cheers,
H.

>>
>> Thanks,
>>
>> Ian
>>
>>
>>
>> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote:
>>
>>> Hi Ian,
>>>
>>> On 11-07-25 08:41 AM, Ian Henry wrote:
>>>> Hello,
>>>>
>>>> I'm trying to match a list of 60mer probes against a transcriptome to
>>>> see which probes hit which transcripts.
>>>>
>>>> I have my 60mer probe list as a DNAStringSet and also as a "PDict"
>>>> > probeset <- DNAStringSet(probelist$ProbeSeq)
>>>> > probeset_pdict <- PDict(probeset)
>>>>
>>>> My transcriptome was created as follows:
>>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7", tablename =
>>>> "ensGene")
>>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb)
>>>>
>>>>
>>>> To find which transcripts are hit by the probes I've used vwhichPDict:
>>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx)
>>>> which works brilliantly!
>>>>
>>>> However, I also would like the locations of the matches and so tried:
>>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx)
>>>> This doesn't work and errors to say:
>>>> Error in .local(pdict, subject, max.mismatch, min.mismatch,
>>>> with.indels, :
>>>> vmatchPDict() is not ready yet, sorry
>>>>
>>>> Does this just mean it's not yet implemented and is there a
>>>> solution/workaround?
>>>
>>> It is not yet implemented and has been on my TODO list for a long time,
>>> sorry...
>>>
>>> One workaround I can think of: use matchPDict in a loop (you loop over
>>> the transcripts). The Drerio transcriptome is not be that big so it
>>> might run in descent time. It might help a little bit to reduce the
>>> size of the problem by getting rid of the probes and transcripts that
>>> don't have hit (based on the result of vwhichPDict).
>>>
>>> Let me know if that's not fast enough or if you need further help with
>>> this.
>>>
>>> Cheers,
>>> H.
>>>
>>>>
>>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to match
>>>> to the transcriptome rather than the genome.
>>>>
>>>> Thanks for any advice in advance,
>>>>
>>>> Ian
>>>>
>>>> Ian Henry
>>>> MPI-CBG Dresden
>>>> Pfotenhauerstrasse 108
>>>> 01307 Dresden
>>>> Germany
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>
>> Dr. Ian Henry
>> Bioinformatics Service Leader
>> MPI-CBG Dresden
>> Pfotenhauerstrasse 108
>> 01307 Dresden
>> Germany
>> phone: +49 351 210 2638
>> email: henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>
>> web: http://people.mpi-cbg.de/henry
>>
>> Check out the Bioinformatics Service wiki site:
>> https://wiki.mpi-cbg.de/wiki/bioinfo/
>>
>
> Dr. Ian Henry
> Bioinformatics Service Leader
> MPI-CBG Dresden
> Pfotenhauerstrasse 108
> 01307 Dresden
> Germany
> phone: +49 351 210 2638
> email: henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>
> web: http://people.mpi-cbg.de/henry
>
> Check out the Bioinformatics Service wiki site:
> https://wiki.mpi-cbg.de/wiki/bioinfo/
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319