[BioC] consensusString Function

Patrick Aboyoun paboyoun at fhcrc.org
Wed Apr 7 18:41:34 CEST 2010


Erik,
The consensusString weights each input string equally and assigns an 
equal probability to each of the base letters represented by an 
ambiguity letter. So to use the examples you mentioned the pseudo 
arithmetic results in

G + R => R
0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G = 
0.75 G + 0.25 A => R

G + R + G => G
2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G = 
5/6 G + 1/6 A => G

A + N => A
0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) = 0.625 
A + 0.125 C + 0.125 G + 0.125 T => A

There a little room for growth in the consensusString function. For 
example, it could accept a non-negative vector of weights so the input 
strings are not all equally weighted. Anything more complicated than 
that, however, and I would recommend representing each string in PWM 
form and then performing weighted/unweighted averages across each of the 
cells in the PWMs.


Patrick


On 4/7/10 9:06 AM, Erik Wright wrote:
> Hi Patrick,
>
> Thanks for closing the loop on this issue.  I have a question about the outputs you show below:
>
> If two strings are ACAG and ACAR where R can be A or G, then it makes sense to me that the consensus is ACAR.
>
> Why then is an N treated differently than an R?  If two strings are NNNN and ACTG, where N can be A, C, T, or G, then based on the prior example the output should be NNNN, but the output you show is ACTG.
>
> If there are three sequences, ACAG, ACAR, and ACAG, and the threshold is set at the default 25% (for a DNAStringSet) then why is the output ACAG?  If an R is a C or G, and the other two codons in the final position are G's then the sum is two G's and one C.  One in three (33.33...%) being C is greater than the default threshold of 25%, so I would expect the consensus character to be R.  Therefore the output should be ACAR.
>
> Thanks for helping me understand the output.
>
> Smiles,
> Erik
>
>
>
> On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote:
>
>    
>> Erik, Heidi, and Wolfgang,
>> To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC<= 2.5 (R<= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters:
>>
>>      
>>>    consensusString(DNAStringSet(c("NNNN","ACTG")))
>>>        
>> [1] "ACTG"
>>      
>>>    consensusString(DNAStringSet(c("AANN","ACTG")))
>>>        
>> [1] "AMTG"
>>      
>>>    consensusString(DNAStringSet(c("ACAG","ACAR")))
>>>        
>> [1] "ACAR"
>>      
>>>    consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
>>>        
>> [1] "ACAG"
>>      
>>> sessionInfo()
>>>        
>> R version 2.11.0 alpha (2010-04-04 r51591)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Biostrings_2.15.27 IRanges_1.5.74
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.5
>>
>>
>>
>> On 4/6/10 2:36 PM, Wolfgang Huber wrote:
>>      
>>> Hi Erik, Herv'e
>>>
>>> please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package).
>>>
>>> With a more recent version of Biostrings, I get:
>>>
>>> library("Biostrings")
>>> consensusString( DNAStringSet(c("AAAA","ACTG")) )
>>> # [1] "AMWR"
>>>
>>> consensusString( DNAStringSet(c("AAAB","ACTG")) )
>>> # Error in FUN(newX[, i], ...) :
>>>   'ambiguityMap' is missing some combinations of row names
>>>
>>> And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression
>>>
>>> i<- paste(all_letters[col>= threshold], collapse = "")
>>>
>>> is evaluated where
>>> Browse[1]>  all_letters
>>> [1] "A" "C" "G" "T" "B"    ## length is 5
>>> Browse[1]>  col
>>> [1] 1 0 0 0                ## length is 4
>>> Browse[1]>  i
>>> [1] "AB"                   ## recycling rule was applied
>>>
>>> which seems unintended and with some more insight will probably lead to the fixing of a bug.
>>>
>>>     Best wishes
>>>     Wolfgang
>>>
>>>
>>>        
>>>> sessionInfo()
>>>>          
>>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] LC_CTYPE=C              LC_NUMERIC=C            LC_TIME=C
>>> [4] LC_COLLATE=C            LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8
>>> [7] LC_PAPER=C              LC_NAME=C               LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C          LC_MEASUREMENT=C        LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.15.26 IRanges_1.5.74     fortunes_1.3-7
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.7.5
>>>
>>> Heidi Dvinge ha scritto:
>>>        
>>>> Hello Erik,
>>>>
>>>> unfortunately I'm not familiar with the Biostrings package, so I can't
>>>> tell you why this doesn't work, but until someone else can answer, this
>>>> seems to be a work-around.
>>>>
>>>> Apparently, consensusString doesn't handle Ns.
>>>>
>>>>          
>>>>> test<- DNAStringSet(c("AANN","ACTG"))
>>>>> consensusString(test)
>>>>>            
>>>> Error in .local(x, ...) :
>>>>   'threshold' must be a numeric in (0, 1/sum(rowSums(x)>  0)]
>>>>
>>>> If there are no Ns things are okay though.
>>>>
>>>>          
>>>>> test2<- DNAStringSet(c("AAAA","ACTG"))
>>>>> consensusString(test2)
>>>>>            
>>>> [1] "AMWR"
>>>>
>>>> However, Ns seem acceptable if the consensus matrix is calculated first,
>>>> although they might result in ?s where no consensus could be found.
>>>>
>>>>          
>>>>> test3<- consensusMatrix(test)
>>>>> consensusString(test3)
>>>>>            
>>>> [1] "A???"
>>>>
>>>> HTH
>>>> \Heidi
>>>>
>>>>
>>>>          
>>>>> Hello,
>>>>>
>>>>> I am trying to get a consensus string for a DNAStringSet, but I am getting
>>>>> an error.  The documentation for consensusString says the argument "x" is
>>>>> either a consensus matrix or an XStringSet.  So this should work, right?:
>>>>>            
>>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG"))
>>>>>> consensusString(myDNAStringSet)
>>>>>>              
>>>>> Error in .local(x, ...) :
>>>>>   'threshold' must be a numeric in (0, 1/sum(rowSums(x)>  0)]
>>>>>
>>>>> Specifying a threshold in the arguments doesn't seem to make a difference.
>>>>>
>>>>> Thanks!,
>>>>> Erik
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>            
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>          
>>>        
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>      
>



More information about the Bioconductor mailing list