[BioC] Count differences between sequences

Hervé Pagès hpages at fhcrc.org
Tue Mar 30 02:23:52 CEST 2010


Erik,

This is much faster:

   countLeadingLetter2 <- function(x, letter="-")
   {
     ans <- which(!isMatchingAt(letter, x, seq_len(length(x))))[1L]
     if (is.na(ans))
         return(length(x))
     ans - 1L
   }

   countTrailingLetter2 <- function(x, letter="-")
     countLeadingLetter2(reverse(x), letter=letter)

Cheers,
H.

Hervé Pagès wrote:
> Erik Wright wrote:
>> Hi everyone,
>>
>> Since I don't want to include terminal gaps into my distance 
>> calculation, is there a function that will quickly return the number 
>> of leading gaps and/or trailing gaps?
>> For example,
>> seq1 <- DNAString("--AT-CG-")
>> number of leading terminal gaps = 2
>> number of trailing terminal gaps = 1
> 
> Maybe not the most efficient way to do this but the trimLRPatterns()
> function can be used for this:
> 
>   countLeadingLetter <- function(x, letter="-")
>   {
>     Lpattern <- DNAString(paste(rep.int("-", length(x)), collapse=""))
>     rg <- trimLRPatterns(Lpattern=Lpattern , subject=x, ranges=TRUE)
>     start(rg) - 1L
>   }
> 
>   countTrailingLetter <- function(x, letter="-")
>   {
>     Rpattern <- DNAString(paste(rep.int("-", length(x)), collapse=""))
>     rg <- trimLRPatterns(Rpattern=Rpattern , subject=x, ranges=TRUE)
>     length(x) - end(rg)
>   }
> 
> Note that when using this in the context of your distance
> calculation you need to be careful that adding
> countLeadingLetter() + countTrailingLetter()
> won't be correct when the string contains only gaps (very
> unlikely to happen but still).
> 
> Also in this context, since the left and right fake patterns
> are always going to be the same, it would be better to generate
> them once for all before starting the big "for" loop.
> 
> I didn't do any timing so please let us know if this is too slow.
> 
> Cheers,
> H.
> 
>>
>> Thanks again!,
>> Erik
>>
>>
>> On Mar 29, 2010, at 5:39 PM, Hervé Pagès wrote:
>>
>>> Hi Erik,
>>>
>>> To address your "gap problem", one solution is to replace the "-" in
>>> one of the 2 strings to compare with a letter that is guaranteed not
>>> to be in the other string, e.g. the "+" letter (yes, this is part of
>>> the DNA alphabet, and if it was not, you could always work with
>>> BStringSet objects). This can be done with chartr():
>>>
>>>  numF <- length(myDNAStringSet)
>>>
>>>  distMatrix <- matrix(integer(numF*numF), nrow=numF)
>>>
>>>  for (i in 1:(numF-1)) {
>>>    pattern <- chartr("-", "+", myDNAStringSet[[i]])
>>>    # use IUPAC_CODE_MAP with fixed=FALSE
>>>    distMatrix[i, (i+1):numF] <- neditStartingAt(
>>>                                   pattern,
>>>                                   myDNAStringSet[(i+1):numF],
>>>                                   fixed=FALSE)
>>>    distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF]
>>>  }
>>>
>>>  diag(distMatrix) <- 0L  # I think you want 0 here, not 1
>>>
>>> Takes about 15 sec. on my machine for your 500x8000 DNA rectangle.
>>>
>>> Note that I'm using an integer matrix which is twice smaller than a
>>> numeric/double matrix.
>>>
>>> You could maybe get a small speed improvement by doing this replacement
>>> once for all at the DNAStringSet level before you start the loop with:
>>>
>>>  myDNAStringSet2 <- chartr("-", "+", myDNAStringSet)
>>>
>>> and by adapting the code in the loop but I don't expect this to make a
>>> big difference.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> erikwright at comcast.net wrote:
>>>> Hi Patrick, Michael, Hervé, and Martin, Wow! Thanks very much 
>>>> everyone for putting so much effort into answering my question. 
>>>> While we wait for Patrick's update of stringDist to become 
>>>> available, I will describe my solution and a related problem I have 
>>>> run into. I am using this piece of code to do most of the work: numF 
>>>> <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use 
>>>> IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <- 
>>>> neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ): 
>>>> numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <- 
>>>> distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This 
>>>> will populate a matrix with the number of differences between 
>>>> strings. The code works reasonably fast - about 15 seconds for a 
>>>> DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 
>>>> 8,000 sequences of length 8,000 is exponentially slower - about 45 
>>>> minutes. Obviously, if Patrick's update of stringDist improves the spe
> ed it will greatly help. In playing around with this I have run into a 
> related problem that you all might be able to help me with. As I stated 
> in my original email, I am trying to calculate a Distance Matrix with a 
> large aligned DNAStringSet. My DNAStringSet does not contain ideal data, 
> meaning there are many gap characters ("-"). If I find the edit 
> difference between two strings that have gaps in the same places then 
> the gaps are not counted towards the edit distance. This makes complete 
> sense because "-" == "-". My problem is that I want to include any gap 
> characters in my edit distance. For example:
>>>>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y) 
>>>> [1] 0 The distance between these two strings for my distance 
>>>> measurement should be 2. That is because the gap character ("-") 
>>>> gives me no data, so I cannot say that the distance between the two 
>>>> strings is 0. Can anyone think of a good method of counting common 
>>>> gap characters toward the edit distance? Unfortunately I cannot 
>>>> simply count up the number of gap characters in my pattern and add 
>>>> it to my edit distance. For example:
>>>>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y) 
>>>> [1] 1 The distance between these two strings for my distance 
>>>> measurement should be 2. If I were to add the number of gap 
>>>> characters in the pattern to the edit distance I would get 2 + 1 = 
>>>> 3. So basically I want to count any gap character in the pattern or 
>>>> subject toward the edit distance because it gives me no information. 
>>>> Does this make sense, and does anyone have an idea for how to do 
>>>> this? Thanks again, Erik ----- Original Message ----- From: "Patrick 
>>>> Aboyoun" <paboyoun at fhcrc.org> To: "Michael Lawrence" 
>>>> <lawrence.michael at gene.com> Cc: "BioC list" 
>>>> <bioconductor at stat.math.ethz.ch> Sent: Saturday, March 27, 2010 
>>>> 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count 
>>>> differences between sequences I just added a Hamming distance metric 
>>>> to the stringDist function in BioC 2.6, which should be available on 
>>>> bioconductor.org in 36 hours. This uses the code underlying the 
>>>> neditStartingAt functions and loops over the lower triangle of the 
>>>> distance matrix in C so it is f
> aster than the sapply method Herve provided. To calculate this newly 
> added distance measure, specify stringDist(x, method = "hamming") where 
> x is an XStringSet object containing equal length strings. 
> library(Biostrings) library(drosophila2probe) ch0 <- 
> drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- 
> unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- 
> DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- 
> stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 
> 5.541 sessionInfo() R version 2.11.0 Under development (unstable) 
> (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] 
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base 
> packages: [1] stats gra
>>> phics grDevices utils datasets methods base other attached packages: 
>>> [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] 
>>> Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not 
>>> attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 
>>> 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv� Pag�s 
>>> > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> 
>>> Erik, >>> Judging from your data, I would gather that you are not 
>>> interested in >>> indels. Is that correct? You should look at the 
>>> neditStartingAt function. >>> Something like the following may meet 
>>> your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, 
>>> nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> 
>>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> 
>>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of 
>>> the fact that neditStartingAt() is >> vectorized with respect to its 
>>> second argument: >> >> N<- length(myStrings) >> 
> myD
>>> ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], 
>>> myStrings))) >> >> That will make things hundred times faster with a 
>>> big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( 
>>> >>> >> myDists<- sapply(1:N, >> function(i) 
>>> neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed 
>>> >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] 
>>> [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 
>>> 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 
>>> 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 
>>> >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 
>>> 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 
>>> 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 
>>> 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 
>>> >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 
>>> 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> 
>> W
>>> ow, nice. It sounds to me like this would be a common use case; how 
>>> hard > would it be to vectorize both arguments? > > Michael > > > >> 
>>> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, 
>>> erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all 
>>> of length 8000. I need the entire N x N >>>> distance matrix. >>>> 
>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message 
>>> ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net 
>>> >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March 
>>> 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: 
>>> [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could 
>>> you provide more details on your data? How long are each of the >>>> 
>>> strings and how many strings do you have? Also, do you need the 
>>> entire N x N >>>> distance matrix for downstream analysis or are you 
>>> just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> 
>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, 
> eri
>>> kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> 
>>> >>>>> I have a large DNAStringSet and I am trying to calculate its 
>>> >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and 
>>> they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the 
>>> stringDist() function, but it is very slow >>>>> >>>>> >>>> for large 
>>> DNAStringSets. Is there a way to quickly calculate the number >>>> of 
>>> differences between two DNAString instances? >>>> >>>> >>>>> For 
>>> example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> 
>>> >>>>> >>>> would like to know if their is a function other than 
>>> stringDist() that >>>> will tell me the distance between them is 1. 
>>> >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - 
>>> Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> 
>>> _______________________________________________ >>>>> Bioconductor 
>>> mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> 
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Se
> arc
>>> h the archives: >>>>> >>>>> >>>> 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor 
>>> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> 
>>> _______________________________________________ >>> Bioconductor 
>>> mailing list >>> Bioconductor at stat.math.ethz.ch >>> 
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the 
>>> archives: >>> 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> 
>>> >>> >> -- >> Herv� Pag�s >> >> Program in Computational Biology 
>>> >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer 
>>> Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> 
>>> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 
>>> 667-5791 >> Fax: (206) 667-1319 >> >> >> 
>>> _______________________________________________ >> Bioconductor 
>>> mailing list >> Bioconductor at stat.math.ethz.ch >> 
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the 
>>> archives: >> http://news.gmane.org/gmane.science.biology.inform
> ati
>>> cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > 
>>> _______________________________________________ > Bioconductor 
>>> mailing list > Bioconductor at stat.math.ethz.ch > 
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the 
>>> archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor 
>>> [[alternative HTML version deleted]]
>>>> _______________________________________________ Bioconductor mailing 
>>>> list Bioconductor at stat.math.ethz.ch 
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the 
>>>> archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>     [[alternative HTML version deleted]]
>>>> ------------------------------------------------------------------------ 
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> -- 
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list