[R] how to get how many lines there are in a file.

Liaw, Andy andy_liaw at merck.com
Mon Dec 6 18:26:42 CET 2004


> From: Liaw, Andy
> 
> > From: Marc Schwartz
> > 
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file without
> > > reading the file's content into R as matrix or data frame, any
> > > methods or functions?
> > > thanks in advance.
> > > Regards
> > 
> > See ?readLines
> > 
> > You can use:
> > 
> > length(readLines("FileName"))
> > 
> > to get the number of lines read.
> > 
> > HTH,
> > 
> > Marc Schwartz
> 
> 
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that.  Otherwise 
> length(count.fields())
> might be a good choice.
> 
> Cheers,
> Andy

Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and suggested
using sep="\n" as a possible way to make it more efficient.  (Thanks, Marc!)

Here are some tests on a file with 14337 lines and 8900 fields (space
delimited).

> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86  0.24 49.30  0.00  0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19  0.26 42.60  0.00  0.00
> n
[1] 14337
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77  0.56 38.35  0.00  0.00
> n2
[1] 14337
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]], gcFirst=TRUE)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
> n3
[1] 14337
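
For reuse, the `wc' call above could be wrapped in a small function.  This
is only a sketch, assuming a Unix-like system and a reasonably recent R
(system2() and trimws() are newer than the R used for the timings above);
the name count_lines_wc() is made up for illustration.  It falls back to
readLines() when `wc' is not on PATH:

```r
# Hypothetical helper: count lines with the external `wc' utility when
# it is available, otherwise fall back to reading the file in R.
count_lines_wc <- function(path) {
  if (!nzchar(Sys.which("wc"))) {
    # No `wc' on PATH: slower, and holds the whole file in memory.
    return(length(readLines(path, warn = FALSE)))
  }
  # `wc -l file' prints "<count> <file>", possibly with leading blanks.
  out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
  as.numeric(strsplit(trimws(out), "[[:space:]]+")[[1]][1])
}
```

Note that `wc -l' counts newline characters, so the two branches can differ
by one when the last line of the file has no trailing newline.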

My only concern with the readLines() approach is that it still needs to read
the entire file into memory (if I'm not mistaken), which may not be
desirable:

> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72  0.48 37.24  0.00  0.00
> object.size(obj)/1024^2
[1] 244.6308

So it took 244+ MB just to store the text read in.  I would use a loop and
read the file in small chunks, if I really wanted to do it in R.
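
A minimal sketch of such a loop, assuming a plain text file; the function
name count_lines() and the default chunk size are arbitrary choices made
for illustration, not anything that was timed above:

```r
# Hypothetical helper: count lines by reading a fixed number of lines
# at a time, so only one chunk is held in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))        # close the connection even on error
  n <- 0                     # double, so very large counts do not overflow
  repeat {
    # On an open connection, readLines() continues from where the
    # previous call stopped.
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break   # nothing left to read
    n <- n + length(chunk)
  }
  n
}

# Usage: n4 <- count_lines("hcv.ap")
```

A smaller chunk_size lowers peak memory at the cost of more calls to
readLines().  Like readLines(), this counts a final line that lacks a
trailing newline, which `wc -l' does not.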

Cheers,
Andy



