[R] how to group a large list of strings into categories based on string similarity?

Sun Jun 27 16:10:28 CEST 2010

On 06/26/2010 11:42 AM, G FANG wrote:
> Hi Martin,
> 
> Thanks a lot for your advice.
> 
> I tried the process you suggested as below, it worked, but in a
> different way that I planned.
> 
> library(Biostrings)
> x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
> names(x) <- seq_along(x)
> dna <- DNAStringSet(x)
> while (!all(width(dna) == width(dna <- trimLRPatterns("N", "N", dna)))) {}
> names(dna)[order(dna)[rank(dna, ties.method="min")]]
> 
> The output is,
> "1" "2" "3" "4" "4" "6" "4", this is the right answer after trimining
> N's, i.e. without considering N, which strings are the same.
> 
> But actually, the match I planned is position-to-position match, i.e.
> 1st and 2nd strings are the same except for the N's
> 
> So, the expected output is 1 1 2 2 3 2 4
> 
> Please advice.

what would you expect of

 ACTG
 ACTC
 ACTN

? I'd guess an elegant solution would require a fancy data structure /
algorithm (suffix arrays, I guess), and that you'll end up doing
something ad-hoc, along the lines of

  x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
  names(x) = seq_along(x)
  dna = DNAStringSet(x)
  wd <- unique(width(dna))
  stopifnot(1 == length(wd))

  nm <- names(dna)[order(dna)[rank(dna)]]
  for (i in rev(seq_len(wd)[-1])) {
      seq = narrow(dna, 1, i-1)
      n = narrow(dna, i, len)
      allN = alphabetFrequency(n, baseOnly=TRUE)[,"other"] == width(n)
      nm[allN] <- names(seq)[order(seq)[rank(seq)]][allN]
  }

which gives

> as.integer(factor(nm))
[1] 1 1 2 2 3 2 4

(this doesn't deal with the leading N's, but the logic might be the same).

>>> If you go this route you'll want to address
>>> questions to the Bioconductor mailing list
>>>
>>>   http://bioconductor.org/docs/mailList.html

Martin

> 
> Thanks!
> 
> --gang
> 
> On Wed, Jun 23, 2010 at 7:55 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 06/23/2010 07:46 PM, Martin Morgan wrote:
>>> On 06/23/2010 06:55 PM, G FANG wrote:
>>>> Hi,
>>>>
>>>> I want to group a large list (20 million) of strings into categories
>>>> based on string similarity?
>>>>
>>>> The specific problem is: given a list of DNA sequence as below
>>>>
>>>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
>>>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
>>>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
>>>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
>>>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
>>>> ......
>>>> .....
>>>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
>>>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
>>>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>>>>
>>>> 'N' is the missing letter
>>>>
>>>> It can be seen that some strings are the same except for those N's
>>>> (i.e. N can match with any base)
>>>>
>>>> given this list of string, I want to have
>>>>
>>>> 1) a vector corresponding to each row (string), for each string assign
>>>> an id, such that similar strings (those only differ at N's) have the
>>>> same id
>>>> 2) also get a mapping list from unique strings ('unique' in term of
>>>> the same similarity defined above) to the ids
>>>>
>>>> I am a matlab user shifting to R. Please advice on efficient ways to do this.
>>>
>>> The Bioconductor Biostrings package has many tools for this sort of
>>> operation. See http://bioconductor.org/packages/release/Software.html
>>>
>>> Maybe a one-time install
>>>
>>>    source('http://bioconductor.org/biocLite.R')
>>>    biocLite('Biostrings')
>>>
>>> then
>>>
>>>   library(Biostrings)
>>>   x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>>>         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>>>         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>>>         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>>>         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>>>         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>>>         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
>>>   names(x) <- seq_along(x)
>>>   dna <- DNAStringSet(x)
>>>   while (!all(width(dna) ==
>>>               width(dna <- trimLRPatterns("N", "N", dna)))) {}
>>>   names(dna)[rank(dna)]
>>
>> oops, maybe closer to
>>
>>   names(dna)[order(dna)[rank(dna, ties.method="min")]]
>>
>>> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
>>> your sequences likely come from a fasta file (Biostrings::readFASTA) or
>>> a text file with a column of sequences (ShortRead::readXStringColumns)
>>> or from alignment software (ShortRead::readAligned /
>>> ShortRead::readFastq). If you go this route you'll want to address
>>> questions to the Bioconductor mailing list
>>>
>>>   http://bioconductor.org/docs/mailList.html
>>>
>>> Martin
>>>
>>>> Thanks!
>>>>
>>>> Gang
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793