[R] readr to generate tibble from a character matrix

Ben Tupper btupper at bigelow.org
Fri Apr 7 15:08:27 CEST 2017


Thanks!

I made up a little test of five ways to convert a character matrix to a tibble: dumping to a file and reading it back, pasting up one big string, using pipes, using data.frame(), and using a pipeless version.  Far and away it is worth using Ulrik's or your solution rather than dumping the matrix to a file and reading it back OR pasting the matrix into one honking big string: on the flights data the type.convert() approaches take about 0.5 seconds, versus about 2.4 seconds via a file and about 5.6 seconds via paste.

The methods differ in how they handle date-time inputs: the two readr routes parse time_hour as POSIXct, while the type.convert() routes leave it as character. Otherwise the results are all identical.

Cheers,
Ben

#### START
library(nycflights13)
library(tibble)
library(magrittr)
library(readr)
library(microbenchmark)


m <- as.matrix(flights)

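# Write the matrix to a temporary CSV file and read it back with readr.
# quote = FALSE assumes that no value contains a comma.
via_file <- function(m){
    filename <- tempfile(fileext = '.csv')
    write.csv(m, file = filename, row.names = FALSE, quote = FALSE)
    readr::read_csv(filename)
}

# Paste the header and rows into one CSV string and let readr parse it.
# Also assumes that no value contains a comma.
via_paste <- function(m){
    s <- paste(
        c(paste(colnames(m), collapse = ","), apply(m, 1, paste, collapse = ",")), 
        collapse = "\n")
    readr::read_csv(s)
}

# Ulrik's solution with as.is = TRUE (per David) to keep characters, not factors
via_pipes <- function(m){
    m %>%
        tibble::as_tibble() %>% 
        lapply(type.convert, as.is = TRUE) %>% 
        tibble::as_tibble()
}

# David's variant: go through data.frame() before type.convert()
via_dataframe <- function(m){
    mm <- lapply(data.frame(m, stringsAsFactors=FALSE), type.convert, as.is=TRUE)
    tibble::as_tibble(mm)
}

# Same as via_pipes() but written as nested calls
via_pipeless <- function(m){
    tibble::as_tibble(lapply(tibble::as_tibble(m), type.convert, as.is=TRUE))
}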

X <- list(
    file=via_file(m),
    paste=via_paste(m),
    pipes=via_pipes(m),
    dataframe=via_dataframe(m),
    pipeless=via_pipeless(m))
    
sapply(names(X), function(n) all.equal(X[[n]], X[[1]]))

# $file
# [1] TRUE

# $paste
# [1] TRUE

# $pipes
# [1] "Incompatible type for column time_hour1: x character, y POSIXct" "Incompatible type for column time_hour2: x character, y POSIXt" 

# $dataframe
# [1] "Incompatible type for column time_hour1: x character, y POSIXct" "Incompatible type for column time_hour2: x character, y POSIXt" 

# $pipeless
# [1] "Incompatible type for column time_hour1: x character, y POSIXct" "Incompatible type for column time_hour2: x character, y POSIXt" 
    
microbenchmark(
    via_file(m),
    via_paste(m),
    via_pipes(m),
    via_dataframe(m),
    via_pipeless(m),
    times = 5
)

#Unit: milliseconds
#             expr       min        lq      mean    median        uq       max neval
#      via_file(m) 2362.7778 2396.2277 2415.9207 2413.0772 2439.5752 2467.9457     5
#     via_paste(m) 5287.8176 5305.6228 5622.1432 5666.0165 5919.3568 5931.9023     5
#     via_pipes(m)  461.4782  464.5656  506.4157  509.5532  542.1091  554.3726     5
# via_dataframe(m)  507.4674  514.2550  553.1791  515.9132  518.0807  710.1794     5
#  via_pipeless(m)  448.9529  470.1074  499.4392  470.6874  500.6027  606.8459     5

sessionInfo()
# R version 3.3.1 (2016-06-21)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.11.6 (El Capitan)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] microbenchmark_1.4-2.1 readr_1.0.0            magrittr_1.5           tibble_1.2             nycflights13_0.2.0    

# loaded via a namespace (and not attached):
# [1] colorspace_1.2-6 scales_0.4.1     plyr_1.8.4       assertthat_0.1   tools_3.3.1      gtable_0.2.0     Rcpp_0.12.9      ggplot2_2.1.0    grid_3.3.1       munsell_0.4.3  


### END
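
One more route that is not benchmarked above, offered only as a sketch: readr exports type_convert(), which re-guesses column types for a data frame whose columns are already character. Assuming it uses the same guesser as read_csv(), it should give readr's treatment of date-times without the trip through a file or a pasted string. The small matrix and its column names below are made up for illustration.

```r
library(tibble)
library(readr)

# Small character matrix in the spirit of the thread's example, plus a
# date-time column (the column names are illustrative only).
m <- cbind(
    A = c("a", "b", "c", "d"),
    D = c("1", "2", "3", "4"),
    E = c("11.2", "12.2", "13.2", "14.2"),
    when = c("2013-01-01 05:00:00", "2013-01-01 06:00:00",
             "2013-01-01 07:00:00", "2013-01-01 08:00:00")
)

# readr's guesser applied to character columns that are already in memory
x_readr <- readr::type_convert(tibble::as_tibble(m))

# base R's type.convert(), as in via_pipeless() above
x_base <- tibble::as_tibble(lapply(tibble::as_tibble(m), type.convert, as.is = TRUE))

# Compare the column classes produced by the two routes
sapply(x_readr, function(col) class(col)[1])
sapply(x_base, function(col) class(col)[1])
```

The two routes should agree on the character and numeric columns and differ on the date-time column, which type.convert() leaves as character. Whether whole numbers come back as integer or double from type_convert() depends on the readr version, and I have not timed this against the functions above.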
 



> On Apr 6, 2017, at 3:34 PM, David L Carlson <dcarlson at tamu.edu> wrote:
> 
> Ulrik's solution gives you factors. To get them as characters, add as.is=TRUE:
> 
>> m %>%
> +    as_tibble() %>% 
> +    lapply(type.convert, as.is=TRUE) %>% 
> +    as_tibble()
> # A tibble: 4 × 5
>      A     B     C     D     E
>  <chr> <chr> <chr> <int> <dbl>
> 1     a     e     i     1  11.2
> 2     b     f     j     2  12.2
> 3     c     g     k     3  13.2
> 4     d     h     l     4  14.2
> 
> Other possibilities:
> 
>> mm <- lapply(data.frame(m, stringsAsFactors=FALSE), type.convert, as.is=TRUE)
>> as_tibble(mm)
> # Your solution simplified by converting to a data.frame
> 
>> as_tibble(lapply(as_tibble(m), type.convert, as.is=TRUE))
> # Ulrik's solution but without the pipes. Shows why you need 2 as_tibbles()
> 
> -------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> 
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Ben Tupper
> Sent: Thursday, April 6, 2017 11:42 AM
> To: Ulrik Stervbo <ulrik.stervbo at gmail.com>
> Cc: R-help Mailing List <r-help at r-project.org>
> Subject: Re: [R] readr to generate tibble from a character matrix
> 
> Hi,
> 
> Thanks for this solution!  Very slick! 
> 
> I see what you mean about the two calls to as_tibble(). I suppose I could do the following, but I doubt it is a gain...
> 
> mm <- lapply(colnames(m), function(nm, m) type.convert(m[,nm], as.is = TRUE), m=m)
> names(mm) <- colnames(m)
> as_tibble(mm)
> 
> # # A tibble: 4 × 5
> #       A     B     C     D     E
> #   <chr> <chr> <chr> <int> <dbl>
> # 1     a     e     i     1  11.2
> # 2     b     f     j     2  12.2
> # 3     c     g     k     3  13.2
> # 4     d     h     l     4  14.2
> 
> I'll benchmark these with writing to a temporary file and pasting together a string.
> 
> Cheers and thanks,
> Ben
> 
> On Apr 6, 2017, at 11:15 AM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
>> 
>> Hi Ben,
>> 
>> type.convert should do the trick:
>> 
>> m %>%
>>  as_tibble() %>% 
>>  lapply(type.convert) %>% 
>>  as_tibble()
>> 
>> I am not too happy about the double 'as_tibble' but it gets the job done.
>> 
>> HTH
>> Ulrik
>> 
>> On Thu, 6 Apr 2017 at 16:41 Ben Tupper <btupper at bigelow.org> wrote:
>> Hello,
>> 
>> I have a workflow that yields a character matrix that I convert to a tibble. Here is a simple example.
>> 
>> library(tibble)
>> library(readr)
>> 
>> m <- matrix(c(letters[1:12], 1:4, (11:14 + 0.2)), ncol = 5)
>> colnames(m) <- LETTERS[1:5]
>> 
>> x <- as_tibble(m)
>> 
>> # # A tibble: 4 × 5
>> #       A     B     C     D     E
>> #   <chr> <chr> <chr> <chr> <chr>
>> # 1     a     e     i     1  11.2
>> # 2     b     f     j     2  12.2
>> # 3     c     g     k     3  13.2
>> # 4     d     h     l     4  14.2
>> 
>> The workflow output columns can be a mix drawn from a known set of column outputs.  Some of the columns really should be converted to non-character types before I proceed. Right now I explicitly set the column classes with something like this...
>> 
>> mode(x[['D']]) <- 'integer'
>> mode(x[['E']]) <- 'numeric'
>> 
>> # # A tibble: 4 × 5
>> #       A     B     C     D     E
>> #   <chr> <chr> <chr> <int> <dbl>
>> # 1     a     e     i     1  11.2
>> # 2     b     f     j     2  12.2
>> # 3     c     g     k     3  13.2
>> # 4     d     h     l     4  14.2
>> 
>> 
>> I wonder if there is a way to use the read_* functions in the readr package to read the character matrix into a tibble directly, which would leverage readr's excellent column class guessing. I can see in the vignette ( https://cran.r-project.org/web/packages/readr/vignettes/readr.html ) that I'm not too far off in thinking this could be done (step 1 tantalizingly says 'The flat file is parsed into a rectangular matrix of strings.')
>> 
>> I know that I could either write the matrix to a file or paste it all into a character vector and then use read_* functions, but I confess I am looking for a straighter path by simply passing the matrix to a function like readr::read_matrix() or the like.
>> 
>> Thanks!
>> Ben
>> 
>> Ben Tupper
>> Bigelow Laboratory for Ocean Sciences
>> 60 Bigelow Drive, P.O. Box 380
>> East Boothbay, Maine 04544
>> http://www.bigelow.org
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> 

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org
