[BioC] [Bioc-sig-seq] as.data.frame on GRanges object with DNAStringSet in values

Fri Oct 7 19:21:14 CEST 2011

Hi Michael,

On 11-09-29 02:17 PM, Michael Lawrence wrote:
> I saw that all coercions to atomic vectors from AtomicList are now
> deprecated. You had proposed deprecating as.vector(), because it should
> not unlist, and I agreed. Really as.vector() should return an ordinary R
> list. However, as.character(), as.numeric(), etc, in base R will unlist.

They don't seem to do that:

   > as.integer(list(a=1:3, b=4:-2))
   Error: (list) object cannot be coerced to type 'integer'

   > as.character(list(a=1:3, b=4:-2))
   [1] "1:3"                      "c(4, 3, 2, 1, 0, -1, -2)"

So they either refuse to do the coercion or they do it in a strange
way (each list element is deparsed into a single string). Note that in
the latter case they honor the strong expectation that the output of
the as.<atomic_type> coercion functions must have the same length as
the input (with positions of the elements being preserved). unlist()
would not honor this.
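
To make the length argument concrete, here is a minimal base-R sketch.
The variable names (x, chr, vec, flat, int_result) are only for
illustration and are not part of any package:

```r
# Illustrative input: an ordinary named list with two elements.
x <- list(a = 1:3, b = 4:-2)

# as.character() returns one string per list element,
# so the result has the same length as the input.
chr <- as.character(x)
length(chr) == length(x)

# as.vector() on an ordinary list gives back a list; nothing is flattened.
vec <- as.vector(x)
is.list(vec)

# unlist() flattens the elements, so the length changes (3 + 7 values).
flat <- unlist(x)
length(flat)

# as.integer() refuses the coercion; capture the error instead of stopping.
int_result <- tryCatch(as.integer(x), error = function(e) "error")
int_result
```

Only unlist() changes the length here, which is why it is not a good
model for what as.vector() or as.<atomic_type> should do on a list-like
object.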

H.

> I'd like to keep consistency with base R. Do we really need to deprecate
> those, as well?
>
> Michael
>
> 2011/6/15 Michael Lawrence <michafla at gene.com>
>
>
>
>     2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
>         On 11-06-15 03:38 PM, Michael Lawrence wrote:
>
>
>
>             2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
>
>                 Hi Michael, Janet,
>
>                 I just added an "as.vector" method for XStringSet objects to
>                 Biostrings 2.21.6:
>
>              > library(Biostrings)
>              > x <- DNAStringSet(c("aaatg", "gt"))
>              > as.vector(x)
>                   [1] "AAATG" "GT"
>
>                 But that doesn't solve Janet's problem:
>
>              > df <- DataFrame(id=c("ID1", "ID2"), seqs=x)
>              > df
>                   DataFrame with 2 rows and 2 columns
>                             id           seqs
>                    <character> <DNAStringSet>
>                   1         ID1          AAATG
>                   2         ID2             GT
>              > as.data.frame(df)
>
>                   Error in as.data.frame.default(y, optional = TRUE, ...) :
>                     cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
>
>                 Michael?
>
>
>             Well, sorry for that. I just added a coercion from Vector to
>             data.frame through as.vector, so this works.
>
>
>         Thanks!
>
>
>             But someone might add a coercion from List to data.frame that
>             would treat the elements as columns. Would this make sense?
>
>
>         Hard to tell. Maybe sometimes it would make sense, but sometimes it
>         definitely does not (e.g. DNAStringSet).
>
>
>             AtomicList to data.frame does something even stranger: it
>             creates a two-column data frame with the unlisted values and
>             names/indices rep'd out as a factor. Actually, that's kind of
>             cool, since usually one does not have a list with equal element
>             lengths, but it's somewhat unintuitive. But why does it apply
>             only to AtomicList?
>
>
>         Glad you brought this up.
>
>         For the record, "as.vector" also unrolls an AtomicList:
>
>          > as.vector(IntegerList(1:4, 0:-2))
>           [1]  1  2  3  4  0 -1 -2
>
>         IMO, we should not do things like that. Because:
>
>           1) The same can be achieved with unlist():
>
>          > unlist(IntegerList(1:4, 0:-2))
>             [1]  1  2  3  4  0 -1 -2
>
>           2) It's totally unintuitive to use as.vector for unlisting
>              a list (as.vector on a standard list does not do that).
>
>           3) There is a strong expectation that as.vector() will preserve
>              the length of its input.
>
>         So I propose to deprecate those "as.vector" and "as.data.frame"
>         methods for AtomicList objects.
>
>
>     Sounds good to me. In fact, the stack method on List is almost
>     identical to as.data.frame on AtomicList (and the stack method
>     actually makes sense). You could make as.vector return an ordinary
>     list, since list is a vector.
>
>         H.
>
>
>             Anyway, given the special correspondence between an XStringSet
>             and a character vector, we could always add an as.data.frame
>             method for XStringSet, just to make sure stuff behaves as
>             expected.
>
>                 Thanks,
>                 H.
>
>
>              > sessionInfo()
>                 R version 2.14.0 Under development (unstable) (2011-05-30 r56024)
>                 Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                 locale:
>                   [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>                   [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>                   [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>                   [7] LC_PAPER=C                 LC_NAME=C
>                   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>                 [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
>                 attached base packages:
>                 [1] stats     graphics  grDevices utils     datasets  methods   base
>
>                 other attached packages:
>                 [1] Biostrings_2.21.6 IRanges_1.11.10
>
>
>
>                 On 11-06-15 12:49 PM, Janet Young wrote:
>
>                     yes - as.character seems a good choice, I think
>
>                     thanks,
>
>                     Janet
>
>                     On Jun 15, 2011, at 12:46 PM, Michael Lawrence wrote:
>
>                         So you would expect that the DNAStringSet is converted
>                         to a character vector? DNAStringSet (technically
>                         XStringSet) then just needs an as.vector method that
>                         delegates to as.character.
>
>                         Michael
>
>
>                         On Wed, Jun 15, 2011 at 12:37 PM, Janet Young
>                         <jayoung at fhcrc.org> wrote:
>
>                         Hi there,
>
>                         I'm trying to use as.data.frame on a GRanges object.
>                         On regular GRanges objects it works fine, but I have
>                         some objects that contain a DNAStringSet in the values
>                         column, which the as.data.frame method does not
>                         support. Is it possible to add the ability to coerce
>                         the DNAStringSet too, please?
>
>                         Here's some code that demonstrates the issue:
>
>                         ################
>                         library(GenomicRanges)
>                         library(Biostrings)
>
>                         gr1 <- GRanges(seqnames=rep("chr1", 3),
>                                        ranges=IRanges(start=c(1, 101, 201), width=50),
>                                        strand=c("+", "-", "+"),
>                                        genenames=c("seq1", "seq2", "seq3"))
>
>                         as.data.frame(gr1)
>                         # works
>
>                         gr2 <- gr1
>                         values(gr2)[, "myseqs"] <- DNAStringSet(c("AACGTG", "ACGGTGGTGTT", "GAGGCTG"))
>
>                         as.data.frame(gr2)
>                         # Error in as.data.frame.default(y, optional = TRUE, ...) :
>                         #   cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
>                         ################
>
>                         and here's   sessionInfo() output:
>
>                         R version 2.13.0 (2011-04-13)
>                         Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
>                         locale:
>                         [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
>                         attached base packages:
>                         [1] stats     graphics  grDevices utils     datasets  methods   base
>
>                         other attached packages:
>                         [1] Biostrings_2.20.1   GenomicRanges_1.4.6 IRanges_1.10.4
>
>                         ################
>
>
>                         You might wonder why I'm storing sequences in the
>                         GRanges values. In my real data they're sequencing
>                         reads that have mapped back to that region, but I
>                         still want to keep the sequence itself (for the
>                         moment) because it's not always identical to the
>                         underlying genomic sequence of that region (I'm
>                         investigating mapping issues).
>
>                         (and my desire to use as.data.frame relates to a
>                         suggestion from Herve to let me work around some
>                         issues with the identical function)
>
>                         thanks,
>
>                         Janet
>
>                         _______________________________________________
>                         Bioc-sig-sequencing mailing list
>                         Bioc-sig-sequencing at r-project.org
>                         https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>
>
>                 --
>                 Hervé Pagès
>
>                 Program in Computational Biology
>                 Division of Public Health Sciences
>                 Fred Hutchinson Cancer Research Center
>                 1100 Fairview Ave. N, M1-B514
>                 P.O. Box 19024
>                 Seattle, WA 98109-1024
>
>                 E-mail: hpages at fhcrc.org
>                 Phone: (206) 667-5791
>                 Fax: (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319