[BioC] matchPDict with mismatches allowed appears to drop names

Hervé Pagès hpages at fhcrc.org
Wed Aug 3 03:04:03 CEST 2011


Hi Ian, Harris,

On 11-08-02 08:45 AM, Harris A. Jaffee wrote:
> The short answer for the fuzzy matching case appears to be that, even
> though your names are maintained through most of the function
>
> matchPDict.R:.match.MTB_PDict,
>
> and in fact they are still present in the list 'ans_compons', they get
> dropped in this call:
>
> ans <- ByPos_MIndex.combine(ans_compons)
>
> By contrast, for exact matching, the function
>
> matchPDict.R:.match.TB_PDict
>
> returns your names here:
>
> new("ByPos_MIndex", width0=width(pdict), NAMES=names(pdict), ends=C_ans)
>
> So my naive guess is that
>
> MIndex-class.R:ByPos_MIndex.combine
>
> should do that also, something like
>
> ans_names <- mi_list[[1]]@NAMES
> new("ByPos_MIndex", width0=ans_width0, ends=ans_ends, NAMES=ans_names)

Exactly :-) This is now fixed in Biostrings 2.20.2 (release) and
will be soon in Biostrings devel. With Biostrings 2.20.2:

   library(Biostrings)
   probes <- DNAStringSet(c(probe1="aaaaaa", probe2="accacc"))
   pdict1 <- PDict(probes, max.mismatch=1)
   m1 <- matchPDict(pdict1, DNAString("ggaaaaaccccc"), max.mismatch=1)

Then:

   > names(m1)
   [1] "probe1" "probe2"

   > unlist(m1)
   IRanges of length 3
       start end width  names
   [1]     2   7     6 probe1
   [2]     3   8     6 probe1
   [3]     7  12     6 probe2

Thanks guys for the report and for the fix!
H.

>
> On Aug 2, 2011, at 5:24 AM, Ian Henry wrote:
>
>> Hi,
>>
>> I have a question regarding the inheritance of the names attribute
>> when using matchPDict.
>>
>> If I use matchPDict as follows:
>>
>> #Get transcript information
>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>> "refGene")
>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>
>> #Create DNAStringSet with names associated with each probe
>> > probeset <- DNAStringSet(probelist$sequence)
>> > names(probeset)<-probelist$probenames
>>
>> #Create PDict object and match against human transcript 14 (I know it
>> should match)
>> > ps_pdict<-PDict(probeset)
>> > txmatches <- matchPDict(ps_pdict, hg19_tx[[14]])
>>
>> this compares the probes in ps_pdict to transcript 14 in hg19 and gives:
>> >unlist(txmatches):
>>
>> start end width names
>> [1] 749 773 25 HW:6
>> [2] 569 593 25 HW:16
>> [3] 804 828 25 HW:26
>> [4] 757 781 25 HW:36
>>
>> which works :)
>>
>> However, if I search allowing for mismatches then the names appear to
>> be lost:
>>
>> > ps_pdict1<-PDict(probeset, max.mismatch=1)
>> > txmatches1 <- matchPDict(ps_pdict1, hg19_tx[[14]], max.mismatch=1,
>> min.mismatch=0)
>> > unlist(txmatches1)
>>
>> IRanges of length 4
>> start end width
>> [1] 749 773 25
>> [2] 569 593 25
>> [3] 804 828 25
>> [4] 757 781 25
>>
>> The result of matchPDict is a MIndex object that I named txmatches
>> with exact matches, and txmatches1 with 1 mismatch
>> > names(txmatches) #gives character vector containing probe names
>> > names(txmatches1) #returns NULL
>>
>> So it appears the names are not inherited. I tried to added them
>> manually to my MIndex object
>> >names(txmatches1)<-names(probeset)
>>
>> but I get Error:
>> attempt to modify the names of a ByPos_MIndex instance
>>
>> Therefore I'm not sure how to keep my probe names associated with the
>> Transcript match, which is important for inexact matching.
>>
>> Any help would be greatly appreciated,
>>
>> Thanks,
>>
>> Ian
>>
>>
>> >sessionInfo()
>>
>> R version 2.13.0 beta (2011-03-31 r55221)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] C/UTF-8/C/C/C/C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] plyr_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.3.17
>> [3] BSgenome_1.19.5 Biostrings_2.19.17
>> [5] GenomicFeatures_1.3.15 GenomicRanges_1.3.31
>> [7] IRanges_1.9.28
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.11.10 DBI_0.2-5 RCurl_1.5-0
>> [4] RSQLite_0.9-4 XML_3.2-0 biomaRt_2.7.1
>> [7] rtracklayer_1.11.12 tools_2.13.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list