[R] how to get how many lines there are in a file.

Marc Schwartz MSchwartz at MedAnalytics.com
Mon Dec 6 18:50:12 CET 2004


On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:

> Marc alerted me off-list that count.fields() might spend time delimiting
> fields, which is not needed for the purpose of counting lines, and suggested
> using sep="\n" as a possible way to make it more efficient.  (Thanks, Marc!)
> 
> Here are some tests on a file with 14337 lines and 8900 fields (space
> delimited).
> 
> > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> [1] 48.86  0.24 49.30  0.00  0.00
> > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> [1] 42.19  0.26 42.60  0.00  0.00

Andy,

I suspect that the relatively modest gain to be had here is the result
of count.fields() still scanning the input buffer for the delimiting
character, even though it would occur only once per line using the
newline character. Thus, the overhead is not reduced substantially.

A scan of the source code for the .Internal function would validate
that.
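
For illustration, here is a minimal sketch of a line count that does no
field or delimiter handling at all: it reads the file in binary chunks and
counts newline bytes. The function name count_lines_raw and the chunk size
are hypothetical and not part of the original discussion, and no timings
are claimed for it. It assumes the last line ends with a newline. Note also
that count.fields() skips blank lines by default (blank.lines.skip = TRUE),
so the two counts can differ on files that contain empty lines.

```r
# Hypothetical helper (not from the thread): count lines by counting
# newline bytes (0x0A) in raw chunks, without parsing any fields.
count_lines_raw <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    buf <- readBin(con, what = "raw", n = chunk_size)
    if (length(buf) == 0L) break          # end of file reached
    n <- n + sum(buf == as.raw(10L))      # newline bytes in this chunk
  }
  n
}

# Small self-contained check on a temporary three-line file.
tmp <- tempfile()
writeLines(c("a b c", "d e f", "g h i"), tmp)
n_raw <- count_lines_raw(tmp)
n_cf  <- length(count.fields(tmp, sep = "\n"))
```

On this small file both approaches should agree; whether the raw-chunk
version is faster on a file the size of "hcv.ap" would need to be timed.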

Thanks for testing this.

As both you and Thomas mention, 'wc' is clearly the fastest way to go
based upon your additional figures.
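
One possible way to call 'wc' from within R is sketched below. It assumes a
Unix-like system with 'wc' on the PATH; the wrapper name count_lines_wc is
hypothetical and is not taken from Andy's or Thomas's messages.

```r
# Hypothetical wrapper (assumes a Unix-like system with 'wc' available).
count_lines_wc <- function(path) {
  # 'wc -l file' prints the count followed by the file name,
  # sometimes with leading spaces depending on the platform.
  out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
  as.integer(strsplit(trimws(out), "[[:space:]]+")[[1]][1])
}

# Small self-contained check on a temporary three-line file.
tmp_wc <- tempfile()
writeLines(c("a b c", "d e f", "g h i"), tmp_wc)
n_wc <- count_lines_wc(tmp_wc)
```

Like any newline-based count, 'wc -l' does not count a final line that
lacks a trailing newline.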

Best regards,

Marc