[BioC] Why does a call to "unique" removes a DNAStringSet names?

Tue Dec 18 21:02:40 CET 2012

Hi Nico, Val,

Just to let you know that this is addressed in the latest
IRanges/Biostrings (devel):

   dna <- DNAStringSet(c(a="AAA", b="CC", c="ATTA", d="CC", e="CG"))
   mcols(dna) <- DataFrame(score=1:5)

Then:

   > unique(dna)
     A DNAStringSet instance of length 4
       width seq                                               names 

   [1]     3 AAA                                               a
   [2]     2 CC                                                b
   [3]     4 ATTA                                              c
   [4]     2 CG                                                e

   > mcols(unique(dna))
   DataFrame with 4 rows and 1 column
         score
     <integer>
   1         1
   2         2
   3         3
   4         5

This new behavior is consistent with what unique() does on a
GRanges object or a Vector object in general.

Cheers,
H.

On 11/15/2012 03:27 PM, Hervé Pagès wrote:
> Hi Nico, Val,
>
> Yes sorry for taking so long Nico, I didn't notice your email
> before.
>
> 2 additional issues I didn't realize:
>
> (1) unique() does not drop the metadata columns of a DNAStringSet:
>
>    > dset
>      A DNAStringSet instance of length 2
>        width seq                                               names
>    [1]     1 A                                                 a
>    [2]     1 C                                                 a
>    > mcols(dset)
>    DataFrame with 2 rows and 1 column
>          score
>      <integer>
>    1         1
>    2         2
>
>    > mcols(unique(dset))
>    DataFrame with 2 rows and 1 column
>          score
>      <integer>
>    1         1
>    2         2
>
> (2) unique() doesn't treat DNAStringSet consistently with GRanges:
>
>    > gr
>    GRanges with 5 ranges and 2 metadata columns:
>        seqnames    ranges strand |     score                GC
>           <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>      a     chr1   [1, 10]      - |         1                 1
>      b     chr2   [2, 10]      + |         2 0.888888888888889
>      c     chr2   [3, 10]      + |         3 0.777777777777778
>      d     chr2   [2, 10]      + |         2 0.888888888888889
>      e     chr2   [4, 10]      * |         4 0.666666666666667
>      ---
>      seqlengths:
>       chr1 chr2 chr3
>       1000 2000 1500
>    > unique(gr)
>    GRanges with 4 ranges and 2 metadata columns:
>        seqnames    ranges strand |     score                GC
>           <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>      a     chr1   [1, 10]      - |         1                 1
>      b     chr2   [2, 10]      + |         2 0.888888888888889
>      c     chr2   [3, 10]      + |         3 0.777777777777778
>      e     chr2   [4, 10]      * |         4 0.666666666666667
>      ---
>      seqlengths:
>       chr1 chr2 chr3
>       1000 2000 1500
>
> On a GRanges, it just does x[!duplicated(x)], so not only the names
> are propagated but also the metadata columns.
>
> So the choices are:
>    (a) we do the same for DNAStringSet, even if that's not what
>        base::unique() does,
>    (b) we choose to have unique() drop the names and metadata columns
>        of any Vector object (DNAStringSet, GRanges, etc...),
>    (c) we add the 'use.names' and 'use.mcols' args to unique(), with
>        defaults to FALSE? or to TRUE?
>    (d) ?
>
> I have a small preference for (a) even though I'm not really sure what
> the use cases are. Whatever we do, we should have unique() behave
> consistently on any member of the Vector family and also treat
> names and metadata columns the same way.
>
> Thanks,
> H.
>
>
> On 11/15/2012 09:11 AM, Valerie Obenchain wrote:
>> Hi Nico,
>>
>> Sorry it's taken awhile to get back to you. I wanted to ask about what
>> behavior you'd expect from a call to unique() on a DNAStringSet, i.e.,
>> what is your use case?
>>
>>
>> unique() on a named character vector drops names:
>> chr <- c(a="A", c="C", aa="A", c="CC")
>>  > unique(chr)
>> [1] "A"  "C"  "CC"
>>
>>
>> Same for a named list:
>> lst <- list(a="A", c="C", aa="A", c="CC")
>>  > unique(lst)
>> [[1]]
>> [1] "A"
>>
>> [[2]]
>> [1] "C"
>>
>> [[3]]
>> [1] "CC"
>>
>>
>> unique() on a DNAStringSet was patterned after this behavior. If names
>> were kept, would it be useful to retain only the name of the first
>> duplicate? In the data above there are two "A"'s. Would you want 'a'
>> kept and 'aa' dropped?
>>
>> Valerie
>>
>>
>>
>>
>> On 07/26/2012 08:36 AM, Nicolas Delhomme wrote:
>>> Hi,
>>>
>>> I've just realized that a call to unique on a DNAStringSet would
>>> result in the names slot to disappear. There's nothing about this in
>>> the documentation, but if that's the desired effect, warning about it
>>> would be good :-)
>>>
>>> Here is how to reproduce it:
>>>
>>> library(Biostrings)
>>> dset<-DNAStringSet(c("A","C"))
>>> names(dset)<- c("a","a")
>>> dset
>>> unique(dset)
>>>
>>>
>>> It gives:
>>>
>>>> dset
>>>    A DNAStringSet instance of length 2
>>>      width seq                                               names
>>> [1]     1 A                                                 a
>>> [2]     1 C                                                 a
>>>> unique(dset)
>>>    A DNAStringSet instance of length 2
>>>      width seq
>>> [1]     1 A
>>> [2]     1 C
>>>
>>> My sessionInfo():
>>>
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] C/UTF-8/C/C/C/C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.25.8  IRanges_1.15.24    BiocGenerics_0.3.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_2.15.1 tools_2.15.1
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Nathaniel Street Lab
>>> Department of Plant Physiology
>>> Umeå Plant Science Center
>>>
>>> Tel: +46 90 786 7989
>>> Email: nicolas.delhomme at plantphys.umu.se
>>> SLU - Umeå universitet
>>> Umeå S-901 87 Sweden
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319