[BioC] Easy way to convert CharacterList to character, collapsing each element?

Hervé Pagès hpages at fhcrc.org
Tue Dec 17 01:16:18 CET 2013


Hi Ryan,

Here is one way to do this using Biostrings:

   library(Biostrings)

   strunsplit <- function(x, sep=",")
   {
     if (!is(x, "XStringSetList"))
         x <- Biostrings:::XStringSetList("B", x)
     if (!isSingleString(sep))
         stop("'sep' must be a single character string")

     ## unlist twice.
     unlisted_x <- unlist(x, use.names=FALSE)
     unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)

     ## insert 'sep' after every string that is not the last one of its
     ## list element.
     unlisted_x_width <- width(unlisted_x)
     x_partitioning <- PartitioningByEnd(x)
     at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
     unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)

     ## relist.
     ans_width <- sum(relist(unlisted_x_width, x_partitioning))
     x_eltlens <- width(x_partitioning)
     idx <- which(x_eltlens >= 2L)
     ans_width[idx] <- ans_width[idx] + (x_eltlens[idx] - 1L) * nchar(sep)
     relist(unlisted_ans, PartitioningByWidth(ans_width))
   }

Then:

   > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL, C="id4",
   +                    D=c("id2", "id4"))
   > strunsplit(x)
     A BStringSet instance of length 4
       width seq                                               names
   [1]    13 id35,id2,id18                                     A
   [2]     0                                                   B
   [3]     3 id4                                               C
   [4]     7 id2,id4                                           D
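
In case the index arithmetic is hard to follow, here is a rough base-R
sketch of the same "unlist, add separators, cut back into pieces" idea.
It is only an illustration: strunsplit_sketch is a made-up name, it takes
a plain list of character vectors rather than a CharacterList, it assumes
there are no NA strings, and it is not the code that goes into Biostrings.

```r
## Hypothetical base-R stand-in for strunsplit(); no Biostrings involved.
strunsplit_sketch <- function(x, sep=",")
{
  if (length(x) == 0L)
      return(character(0))

  ## unlist once, remembering where each list element ends.
  x_eltlens <- lengths(x)
  x_ends <- cumsum(x_eltlens)
  unlisted_x <- as.character(unlist(x, use.names=FALSE))
  unlisted_x_width <- nchar(unlisted_x)

  ## append 'sep' to every string that is not the last one of its list
  ## element, then glue everything into one big string.
  suffix <- rep(sep, length(unlisted_x))
  suffix[x_ends] <- ""
  unlisted_ans <- paste0(unlisted_x, suffix, collapse="")

  ## width of each collapsed element: total width of its strings plus
  ## one separator per gap between them.
  cs <- c(0L, cumsum(unlisted_x_width))
  x_starts <- c(0L, x_ends[-length(x_ends)])
  ans_width <- cs[x_ends + 1L] - cs[x_starts + 1L] +
               pmax(x_eltlens - 1L, 0L) * nchar(sep)

  ## "relist": cut the big string into one piece per list element.
  ans_end <- cumsum(ans_width)
  ans <- substring(unlisted_ans, ans_end - ans_width + 1L, ans_end)
  names(ans) <- names(x)
  ans
}

x <- list(A=c("id35", "id2", "id18"), B=NULL, C="id4", D=c("id2", "id4"))
strunsplit_sketch(x)
```

For this x the widths come out as 13, 0, 3 and 7, the same as in the
BStringSet above. The point of doing it this way is that nothing loops
over the list elements, so the cost depends on the total number of
strings rather than on the number of elements.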

I'll add this to Biostrings.

Cheers,
H.


On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
> Hi all,
>
> I have some annotation data in a DataFrame, and of course since
> annotations are not one-to-one, some of the columns are CharacterList or
> similar classes. I would like to know if there is an efficient way to
> collapse a CharacterList to a character vector of the same length, such
> that for elements of length > 1, those elements are collapsed with a
> given separator. The following is what I came up with, but it is very
> slow for large CharacterLists:
>
> library(stringr)
> library(plyr)
> flatten.CharacterList <- function(x, sep=",") {
>    if (is.list(x)) {
>      x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
>                            .parallel=TRUE)
>      x <- as(x, "character")
>    }
>    x
> }
>
> -Ryan

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


