[BioC] Probe matching using vmatchPDict

Wed Aug 3 21:29:42 CEST 2011

Hi Ian,

On 11-08-03 01:58 AM, Ian Henry wrote:
>> Hi Ian,
>>
>> I'm putting this discussion back on the mailing list as it might
>> be of interest to others.
>
> OK sounds good! I'll report some more of my findings too, to make things
> more clear.
>
> This issue is really due to the way the refGene track is created by the
> UCSC. They use BLAT and can take multiple strong BLAT alignments to
> their assembly, which includes these 'unusual chromosomes'. However, it
> can also be that a transcript maps multiple times to a 'standard chromsome'.
>
> If you create a transcriptDB object:
>  > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
> "refGene")
>
> then you can explore it:
>  > txdb_data <-as.list(hg19txdb)
>  > txList <-txdb_data$transcripts
>
> This gives a transcript table where it is possible to see that one
> tx_name (NM_????????) is actually not unique but has a tx_id associated
> with it that is unique.
>
> tx_id tx_name tx_chrom tx_strand tx_start tx_end
> 613 NM_001008221 chr1 + 104198141 104207173
> 2324 NM_001008221 chr1 + 104292279 104301311
> 2323 NM_001008221 chr1 - 104230040 104239073

Yes the tx_id col is guaranteed to contain distinct values. Note
that those ids are not coming from UCSC: they were assigned to
each transcript when the TranscriptDb object was created.

This is just standard db design practice: the main tables in
the underlying SQLite db of a TranscriptDb object are the
transcript, exon and cds tables. And they all have a column
that serves as a primary key: tx_id, exon_id and cds_id,
respectively. Those "TranscriptDb internal ids" have no meaning
outside the TranscriptDb object e.g. you cannot use them to
connect transcripts from 2 different TranscriptDb objects. Even
creating the same TranscriptDb object twice (with 2 identical
calls to makeTranscriptDbFromUCSC()) is not guaranteed to generate
the same internal ids. The only purpose of this "TranscriptDb
internal id" is to unambiguously identify rows in the 3 main
tables, and, in particular, to unambiguously link exons and cds
to their corresponding transcript.

Note that you can control what names are put on the object
returned by most TranscriptDb extractors (e.g. transcripts(),
transcriptsBy(), exons(), exonsBy(), extractTranscriptsFromGenome()
etc...) with the use.names arg: when 'use.names=FALSE', then the
internal ids are used for the names.

>
> In this case NM_001008221 is a salivary amylase alpha 1A precursor
>
> I looked at these positions in the UCSC genome browser to get an idea of
> what other annotation tracks had to say:
>
> chr1:104198141-104207173 has AMY1A,B,C,D mapping in the refseq track but
> is described as AMY1A by VEGA
> chr1:104230040 104239073 has AMY1A,B,C,D mapping in the refseq track but
> is described as AMY1B by VEGA
> chr1:104292279 104301311 has AMY1A,B,C,D mapping in the refseq track but
> is described as AMY1C by VEGA
>
> I guess the multiple mapping is due to this BLAT sequence similarity.

Some details on how the RefSeq RNAs were aligned against the reference
genome are provided on the track description page:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=205474499&c=chr21&g=refGene

>
> It is also worth pointing out that if you want to get these transcripts
> out of the TranscriptDB object you would normally do the following:
>  >hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>
> The object returned though, does not appear to keep the unique tx_ids
> and so contains multiple copies of some NM_???????s without a reference
> to their origin (ie tx_id).

If you want the tx_id ("TranscriptDb internal ids") instead of the
UCSC tx names, call extractTranscriptsFromGenome() with
'use.names=FALSE'. Unlike for the other TranscriptDb extractors,
the default here is 'use.names=TRUE'.
See ?extractTranscriptsFromGenome for the details.

>
> Anyway, for my purposes I didn't want these multiple mappings and so I
> decided to just take the transcript with the longest sequence.
> There is most likely a better way to do this, but this is what I did:
>
>  > test_tx<-hg19_tx
>  > names<-names(test_tx)
>  > tmpdf<-as.data.frame(names)
>  > tmpdf$width<-width(test_tx)
>  > tmpdf$seq<-as.character(subseq(test_tx))
>  > tx_unique_df<-ddply(tmpdf, .(names), summarize, width=max(width),
> seq=seq[which.max(width)])
>  > tx_unique<-DNAStringSet(tx_unique_df$seq)
>  > names(tx_unique)<-tx_unique_df$names
>
> This creates a new DNAStringSet containing just the longest mapping
> transcript. (although I've not used it for anything yet)

Note that in the case of NM_001008221 (the salivary amylase alpha 1A
precursor above), all the transcripts have the same length:

   > idx <- which(!is.na(match(names(hg19_tx), "NM_001008221")))

   > width(hg19_tx)[idx]
   [1] 1781 1781 1781

Furthermore, the sequences are the same:

   > hg19_tx[idx]
     A DNAStringSet instance of length 3
       width seq                                               names 

   [1]  1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221
   [2]  1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221
   [3]  1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221

   > duplicated(hg19_tx[idx])
   [1] FALSE  TRUE  TRUE

I would actually not be surprised that you see something like this for
almost all those transcripts with repeated IDs since they are the
result of mapping the same RefSeq RNA.

When you map your probes against the transcriptome, it seems unavoidable
that you will end up with probes that have multiple hits: some of those
ambiguous probes will hit more than 1 transcript *name*, but even probes
that hit a single transcript name can hit the reference genome in
multiple places just because a single transcript name/sequence can occur
at more than 1 location in the reference genome.

Cheers,
H.

>
> I thought I'd include this info for clarification and to hopefully help
> others.
>
> Ian
>
>
>
> On Aug 3, 2011, at 10:19 AM, Hervé Pagès wrote:
>
>> Hi Ian,
>>
>> I'm putting this discussion back on the mailing list as it might
>> be of interest to others.
>>
>> On 11-08-01 02:00 AM, Ian Henry wrote:
>>> Hi Herve,
>>>
>>> I did some extra digging and the UCSC refGene track for NM_001320 CSNK2B
>>> shows:
>>>
>>> chr6 31633656 31637843 CSNK2B
>>> chr6_cox_hap2 3143276 3147463 CSNK2B
>>> chr6_dbb_hap3 2919235 2923423 CSNK2B
>>> chr6_mcf_hap5 3013336 3017523 CSNK2B
>>> chr6_qbl_hap6 2927299 2931486 CSNK2B
>>> chr6_mann_hap4 2976546 2980733 CSNK2B
>>> chr6_ssto_hap7 2964461 2968648 CSNK2B
>>>
>>>
>>> Which seems to be the results in my table. So the answer appears to be
>>> that the transcripts are on 'unusual chromosomes' could I filter these
>>> away an just keep chr6 easily?
>>>
>>> Thanks so much for your help,
>>>
>>> Ian
>>>
>>>
>>> Begin forwarded message:
>>>
>>>> *From: *Ian Henry <henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>>
>>>> *Date: *August 1, 2011 10:31:54 AM GMT+02:00
>>>> *To: *Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>>>> *Subject: **Re: [BioC] Probe matching using vmatchPDict*
>>>>
>>>> Hello Herve,
>>>>
>>>> Thanks for your help,
>>>>
>>>> I got the code running by first using vwhichPDict and then running
>>>> matchPDict on those positive matches. It takes around 15 minutes to
>>>> run on the around 6000 transcripts hit by my probes which isn't bad.
>>>>
>>>>
>>>> getSequenceMatches<-function(transcript)
>>>> {
>>>> probeset_pdict <- PDict(probeset)
>>>> txmatches <- matchPDict(probeset_pdict, transcript[[1]])
>>>> match <-extractAllMatches(transcript[[1]], txmatches)
>>>> if( length(match)>0 ){
>>>> matchdf <- as.data.frame(match)
>>>> matchdf$Transcript = names(transcript)
>>>> matchdf$Sequence <- as.character(subseq(rep(transcript[1],
>>>> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end))
>>>> return(matchdf)
>>>> }
>>>> }
>>>>
>>>> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx)
>>>> tx_match_indexes <- which(elementLengths(tx_matches)>0)
>>>>
>>>> f <- function(f) getSequenceMatches(hg19_tx[f])
>>>> a <- lapply(tx_match_indexes,f)
>>>> outputTable<-do.call(rbind,a)
>>>>
>>>>
>>>> I have one small additional question which may or may not be best
>>>> directed at you. If I extract the human hg19 refGene transcript
>>>> database as follows:
>>>>
>>>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>>>> "refGene")
>>>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>>>
>>>> This works, but if I explore the object produced to look at transcript
>>>> names:
>>>>
>>>> > t1 <- names(hg19_tx)
>>>> > t1<- data.frame(t1)
>>>> > subset(t1, t1=="NM_001320")
>>>>
>>>> this gives:
>>>> t1
>>>> 12387 NM_001320
>>>> 38438 NM_001320
>>>> 38976 NM_001320
>>>> 39122 NM_001320
>>>> 39645 NM_001320
>>>> 39963 NM_001320
>>>> 40204 NM_001320
>>>>
>>>> So there are multiple entries for one transcript. Do you know why this
>>>> is, I guess it is a 'feature' of the UCSC refGene? I suspect they may
>>>> be versions of the same transcript e.g. NM_001320.1, NM_001320.2
>>>> etc.... but the version number is missing and I guess I'd want the
>>>> current transcript for my purposes, as in my probe position matches
>>>> often I get multiple positions with the same transcript ID. Not sure
>>>> if you can help, but any advice would be greatly appreciated.
>>
>> Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC
>> knownGene track does not contain duplicated transcript IDs.
>> Note that the names you see on the DNAStringSet object returned
>> by extractTranscriptsFromGenome() are exactly the names you get
>> on the GRangesList object returned by exonsBy() when you group
>> the exons by transcripts and use 'use.names=TRUE':
>>
>> > exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE)
>> Warning message:
>> In .set.group.names(ans, use.names, txdb, by) :
>> some group names are NAs or duplicated
>>
>> > identical(names(exbytx), names(hg19_tx))
>> [1] TRUE
>>
>> > table(duplicated(names(exbytx)))
>>
>> FALSE TRUE
>> 37308 3231
>>
>> Also note the warning about some group names being NAs or duplicated.
>> Here the group names are the transcript names i.e. the top-level
>> names of 'exbytx'. You also get this warning when calling
>> extractTranscriptsFromGenome() on 'hg19txdb', which is not
>> surprising because 'exonsBy( , by="tx", use.names=TRUE)' is
>> called internally.
>>
>> To keep only the transcripts that belong to a pre-defined subset
>> of genomic sequences of interest, you can do:
>>
>> ## Let's say my sequences of interest are the main chromosomes:
>> > mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="")
>>
>> ## 'exbytx' contains the sequence that each exon belongs to.
>> ## We use this to infer the sequence that each transcript belongs to.
>> > tx_seqnames <- lapply(seqnames(exbytx), runValue)
>> ## Just to make sure that no transcript contains exons from more
>> ## than 1 sequence:
>> > all(elementLengths(tx_seqnames) == 1)
>> [1] TRUE
>> > tx_seqnames <- unlist(tx_seqnames)
>> > idx <- which(tx_seqnames %in% mychroms)
>>
>> ## Note that, even transcripts on the main chromosomes have
>> ## duplicated names:
>> > table(duplicated(names(exbytx)[idx]))
>>
>> FALSE TRUE
>> 37285 1015
>>
>> ## Finally, subset 'hg19_tx':
>> > mytx <- hg19_tx[idx]
>>
>> Cheers,
>> H.
>>
>>>>
>>>> Thanks,
>>>>
>>>> Ian
>>>>
>>>>
>>>>
>>>> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote:
>>>>
>>>>> Hi Ian,
>>>>>
>>>>> On 11-07-25 08:41 AM, Ian Henry wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm trying to match a list of 60mer probes against a transcriptome to
>>>>>> see which probes hit which transcripts.
>>>>>>
>>>>>> I have my 60mer probe list as a DNAStringSet and also as a "PDict"
>>>>>> > probeset <- DNAStringSet(probelist$ProbeSeq)
>>>>>> > probeset_pdict <- PDict(probeset)
>>>>>>
>>>>>> My transcriptome was created as follows:
>>>>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7", tablename =
>>>>>> "ensGene")
>>>>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb)
>>>>>>
>>>>>>
>>>>>> To find which transcripts are hit by the probes I've used vwhichPDict:
>>>>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx)
>>>>>> which works brilliantly!
>>>>>>
>>>>>> However, I also would like the locations of the matches and so tried:
>>>>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx)
>>>>>> This doesn't work and errors to say:
>>>>>> Error in .local(pdict, subject, max.mismatch, min.mismatch,
>>>>>> with.indels, :
>>>>>> vmatchPDict() is not ready yet, sorry
>>>>>>
>>>>>> Does this just mean it's not yet implemented and is there a
>>>>>> solution/workaround?
>>>>>
>>>>> It is not yet implemented and has been on my TODO list for a long time,
>>>>> sorry...
>>>>>
>>>>> One workaround I can think of: use matchPDict in a loop (you loop over
>>>>> the transcripts). The Drerio transcriptome is not be that big so it
>>>>> might run in descent time. It might help a little bit to reduce the
>>>>> size of the problem by getting rid of the probes and transcripts that
>>>>> don't have hit (based on the result of vwhichPDict).
>>>>>
>>>>> Let me know if that's not fast enough or if you need further help with
>>>>> this.
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>>
>>>>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to match
>>>>>> to the transcriptome rather than the genome.
>>>>>>
>>>>>> Thanks for any advice in advance,
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> Ian Henry
>>>>>> MPI-CBG Dresden
>>>>>> Pfotenhauerstrasse 108
>>>>>> 01307 Dresden
>>>>>> Germany
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>>
>>>> Dr. Ian Henry
>>>> Bioinformatics Service Leader
>>>> MPI-CBG Dresden
>>>> Pfotenhauerstrasse 108
>>>> 01307 Dresden
>>>> Germany
>>>> phone: +49 351 210 2638
>>>> email: henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>
>>>> web: http://people.mpi-cbg.de/henry
>>>>
>>>> Check out the Bioinformatics Service wiki site:
>>>> https://wiki.mpi-cbg.de/wiki/bioinfo/
>>>>
>>>
>>> Dr. Ian Henry
>>> Bioinformatics Service Leader
>>> MPI-CBG Dresden
>>> Pfotenhauerstrasse 108
>>> 01307 Dresden
>>> Germany
>>> phone: +49 351 210 2638
>>> email: henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>
>>> web: http://people.mpi-cbg.de/henry
>>>
>>> Check out the Bioinformatics Service wiki site:
>>> https://wiki.mpi-cbg.de/wiki/bioinfo/
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
> Dr. Ian Henry
> Bioinformatics Service Leader
> MPI-CBG Dresden
> Pfotenhauerstrasse 108
> 01307 Dresden
> Germany
> phone: +49 351 210 2638
> email: henry at mpi-cbg.de <mailto:henry at mpi-cbg.de>
> web: http://people.mpi-cbg.de/henry
>
> Check out the Bioinformatics Service wiki site:
> https://wiki.mpi-cbg.de/wiki/bioinfo/
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319