[Rd] write.table with row.names=FALSE unnecessarily slow?

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Mar 11 11:28:26 CET 2008


This is a pretty extreme case: why not use write() to write a single 
column?  (It's a bit faster than your patched timing.)
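
A minimal sketch of that alternative, assuming the single-column data frame
from the test case quoted below (shrunk here to 10^5 rows) and a temporary
file; the name `out` is illustrative. write() emits no header, so the column
name is written by hand first:

```r
# Smaller stand-in for the single-column test case (the original used 10^7 rows)
df <- data.frame(x = 1:(10^5))
out <- tempfile(fileext = ".txt")

# Header line, quoted the way write.table quotes column names by default
writeLines("\"x\"", out)

# write() is a thin wrapper around cat(); ncolumns = 1 gives one value per line
write(df$x, file = out, ncolumns = 1, append = TRUE)

# Read the file back to confirm it holds the same values
back <- read.table(out, header = TRUE)
stopifnot(nrow(back) == nrow(df), all(back$x == df$x))
```

This only covers the one-column case; for several columns of mixed type,
write.table is still the natural tool.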

In a more realistic test of 10 columns by 1 million rows, I see the time drop 
from 12.2 to 9.7 seconds.
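
A hedged sketch of how such a test might be set up. The column contents
(uniform random doubles), the names `wide` and `out`, and the reduced row
count are all assumptions, since the data behind the timings above is not
shown:

```r
# Hypothetical 10-column test; the timings quoted above used 10^6 rows
n <- 10^4
wide <- as.data.frame(matrix(runif(10 * n), nrow = n, ncol = 10))
out <- tempfile(fileext = ".txt")

# Time the write with row names suppressed, as in the original report
timing <- system.time(write.table(wide, out, row.names = FALSE))
print(timing)
```

Running this once with and once without the patch, at the full row count,
would give the comparison; absolute numbers will differ by machine and R
version.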

So I'll add the patch, but I think that significant speedups will be quite 
rare.
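
The cost the patch avoids can be seen on a small example. This is a sketch
only, using a reduced row count; `d` is an illustrative name:

```r
df <- data.frame(x = 1:(10^5))

# Automatic row names are stored compactly; .row_names_info() returns a
# negative count when the row names are 'automatic'
stopifnot(.row_names_info(df) < 0)

# dimnames() on a data frame expands them into a full character vector,
# one string per row
d <- dimnames(df)
stopifnot(is.character(d[[1]]), length(d[[1]]) == nrow(df))

# names() returns the column names without touching the row names
stopifnot(identical(names(df), d[[2]]))
```

With row.names=FALSE those nrow(x) strings are never written, so creating
them is pure overhead.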

BTW, this seems to be one of the places where we are paying the price of 
the CHARSXP cache: system.time(as.character(1:1e7)) has got a lot slower.
Maybe some further tuning is called for.
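
A minimal sketch for reproducing that measurement, scaled down to 10^6
elements so that it runs quickly; the names `ints`, `chr` and `t_chr` are
illustrative, and the timing will vary by machine and R version:

```r
n <- 10^6  # the call quoted above uses 10^7
ints <- seq_len(n)

# Every element becomes a distinct string, so each one is a new entry
# to be hashed into the global CHARSXP cache
t_chr <- system.time(chr <- as.character(ints))
print(t_chr)

stopifnot(length(chr) == n, identical(chr[n], "1000000"))
```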

On Mon, 10 Mar 2008, Martin Morgan wrote:

> I neglected to include my test case,
>
> > df <- data.frame(x=1:(10^7))
>
> Martin
>
> Martin Morgan <mtmorgan at fhcrc.org> writes:
>
>> write.table with large data frames takes quite a long time:
>>
>> > system.time({
>> +     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
>> + }, gcFirst=TRUE)
>>    user  system elapsed
>>  97.302   1.532  98.837
>>
>> One reason is that dimnames is always called, causing 'anonymous' row
>> names to be created as character vectors. Avoiding this in
>> src/library/utils, along the lines of
>>
>> Index: write.table.R
>> ===================================================================
>> --- write.table.R	(revision 44717)
>> +++ write.table.R	(working copy)
>> @@ -27,13 +27,18 @@
>>
>>      if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
>>
>> +    makeRownames <- is.logical(row.names) && !is.na(row.names) &&
>> +                    row.names==TRUE
>> +    makeColnames <- is.logical(col.names) && !is.na(col.names) &&
>> +                    col.names==TRUE
>>      if(is.matrix(x)) {
>>          ## fix up dimnames as as.data.frame would
>>          p <- ncol(x)
>>          d <- dimnames(x)
>>          if(is.null(d)) d <- list(NULL, NULL)
>> -        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
>> -        if(is.null(d[[2]]) && p > 0) d[[2]] <-  paste("V", 1:p, sep="")
>> +        if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
>> +        if(is.null(d[[2]]) && p > 0 && makeColnames)
>> +            d[[2]] <-  paste("V", 1:p, sep="")
>>          if(is.logical(quote) && quote)
>>              quote <- if(is.character(x)) seq_len(p) else numeric(0)
>>      } else {
>> @@ -53,8 +58,8 @@
>>                  quote <- ord[quote]; quote <- quote[quote > 0]
>>              }
>>          }
>> -        d <- dimnames(x)
>> -        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
>> +        d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
>> +                  if (makeColnames==TRUE) names(x) else NULL)
>>          p <- ncol(x)
>>      }
>>      nocols <- p==0
>>
>> improves performance at least in proportion to nrow(x):
>>
>> > system.time({
>> +     write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
>> + }, gcFirst=TRUE)
>>    user  system elapsed
>>   8.132   0.608   8.899
>>
>> Martin
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M2 B169
>> Phone: (206) 667-2793
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


