[Rd] write.table with row.names=FALSE unnecessarily slow?

Martin Morgan mtmorgan at fhcrc.org
Mon Mar 10 19:07:54 CET 2008


I neglected to include my test case:

> df <- data.frame(x=1:(10^7))
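
A scaled-down sketch of the same test case may help show where the time goes. It uses 10^5 rows rather than 10^7, and the names `small` and `rn` are illustrative only. It assumes the data frame has automatic row names, which `.row_names_info()` reports as a negative count, and shows that `dimnames()` expands them into one string per row.

```r
# Scaled-down version of the test case (10^5 rows instead of 10^7).
small <- data.frame(x = 1:(10^5))

# Automatic row names are stored compactly; a negative value here
# means no character row names have been created yet.
info <- .row_names_info(small)

# dimnames() materialises the row names as a character vector,
# one string per row, whether or not they will be written out.
rn <- dimnames(small)[[1]]

# Compare the size of the expanded row names with the data itself.
print(object.size(rn))
print(object.size(small))
```

On the full 10^7-row data frame the same expansion happens inside write.table, even when row.names=FALSE.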

Martin

Martin Morgan <mtmorgan at fhcrc.org> writes:

> write.table on a large data frame takes quite a long time:
>
>> system.time({
> +     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
> + }, gcFirst=TRUE)
>    user  system elapsed 
>  97.302   1.532  98.837 
>
> One reason is that dimnames() is always called, which causes the
> 'anonymous' (automatic) row names to be expanded into a character
> vector even when row.names=FALSE. Avoiding this in
> src/library/utils, along the lines of
>
> Index: write.table.R
> ===================================================================
> --- write.table.R	(revision 44717)
> +++ write.table.R	(working copy)
> @@ -27,13 +27,18 @@
>  
>      if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
>  
> +    makeRownames <- is.logical(row.names) && !is.na(row.names) &&
> +                    row.names==TRUE
> +    makeColnames <- is.logical(col.names) && !is.na(col.names) &&
> +                    col.names==TRUE
>      if(is.matrix(x)) {
>          ## fix up dimnames as as.data.frame would
>          p <- ncol(x)
>          d <- dimnames(x)
>          if(is.null(d)) d <- list(NULL, NULL)
> -        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
> -        if(is.null(d[[2]]) && p > 0) d[[2]] <-  paste("V", 1:p, sep="")
> +        if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
> +        if(is.null(d[[2]]) && p > 0 && makeColnames)
> +            d[[2]] <-  paste("V", 1:p, sep="")
>          if(is.logical(quote) && quote)
>              quote <- if(is.character(x)) seq_len(p) else numeric(0)
>      } else {
> @@ -53,8 +58,8 @@
>                  quote <- ord[quote]; quote <- quote[quote > 0]
>              }
>          }
> -        d <- dimnames(x)
> -        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
> +        d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
> +                  if (makeColnames==TRUE) names(x) else NULL)
>          p <- ncol(x)
>      }
>      nocols <- p==0
>
> improves performance at least in proportion to nrow(x):
>
>> system.time({
> +     write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
> + }, gcFirst=TRUE)
>    user  system elapsed 
>   8.132   0.608   8.899 
>
> Martin
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
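
The data-frame branch of the patch quoted above can be sketched as a standalone function. This is only an illustration: `needed_dimnames` is a hypothetical helper, not part of utils, and `isTRUE()` is swapped in as shorthand for the patch's three-part test, to which it is equivalent for a length-one argument.

```r
# Hypothetical helper mirroring the data-frame branch of the patch:
# row and column names are only computed when they will be written.
needed_dimnames <- function(x, row.names = TRUE, col.names = TRUE) {
    makeRownames <- isTRUE(row.names)
    makeColnames <- isTRUE(col.names)
    list(if (makeRownames) row.names(x) else NULL,
         if (makeColnames) names(x) else NULL)
}

df5 <- data.frame(x = 1:5)

# row.names=FALSE: no character row names are created.
d <- needed_dimnames(df5, row.names = FALSE)

# Defaults: both components are filled in, as dimnames() would do.
full <- needed_dimnames(df5)
```

With row.names=FALSE the first component stays NULL, so no per-row strings are allocated. Any other value of row.names or col.names (for example a character vector, or col.names=NA) also yields NULL here and would have to be handled by the rest of write.table.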
