[R] Exceptional slowness with read.csv

CALUM POLWART po|c1410 @end|ng |rom gm@||@com
Mon Apr 8 20:14:06 CEST 2024


data.table's fread is also fast. I'm not sure how it handles errors, but I can
merge 300 csvs with a total of 0.5 million lines and 50 columns in a couple of
minutes, versus a lifetime with read.csv or readr::read_csv.
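
A minimal sketch of the kind of merge described above, assuming the data.table package is installed. The directory, file names and toy data are made up for illustration; only fread, fwrite and rbindlist are real data.table functions.

```r
library(data.table)

# Hypothetical demo data: three small CSVs written to a temporary directory.
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(data.table(id = 1:4, value = i * (1:4)),
         file.path(tmp_dir, sprintf("part_%d.csv", i)))
}

# Read every file with fread and stack the results into one table.
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
merged <- rbindlist(lapply(csv_files, fread), use.names = TRUE)

# fread also takes skip and nrows, so a few rows from the middle of a
# file can be read on their own. skip = 2 skips the header line and the
# first data row; header = FALSE stops the next line being used as names.
peek <- fread(csv_files[1], skip = 2, nrows = 2, header = FALSE)

print(dim(merged))
print(peek)
```

For the file in this thread the equivalent call would be something like fread(file_name, skip = 2459465, nrows = 5, header = FALSE). I have not tested how that copes with the unmatched quotes.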



On Mon, 8 Apr 2024, 16:19 Stevie Pederson, <stephen.pederson.au using gmail.com>
wrote:

> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) to be
> helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like that it
> can take a bit of work resolving things. The command line tools awk & sed
> might even be a good plan for finding lines that have errors & figuring out
> a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddixon using swcp.com> wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
