[R] Exceptional slowness with read.csv

jim holtman jholtman at gmail.com
Tue Apr 9 01:31:01 CEST 2024


Try reading the lines in with readLines, then count the number of each
type of quote character in each line. Find the lines where either count
is odd and investigate those.
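
A minimal sketch of that check in base R (object names here are
illustrative):

  ## Count both quote characters on every line; an odd count flags a
  ## line with an unmatched quote.
  lines    <- readLines(file_name)
  n_double <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))
  n_single <- nchar(lines) - nchar(gsub("'", "", lines, fixed = TRUE))
  bad      <- which(n_double %% 2 == 1 | n_single %% 2 == 1)
  lines[bad]  # inspect the offending records
  ## Note: apostrophes in ordinary text also count as single quotes, so
  ## expect some false positives in the single-quote check.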

On Mon, Apr 8, 2024, 15:24 Dave Dixon <ddixon at swcp.com> wrote:

> I solved the mystery, but not the problem. The problem is that there's
> an unclosed quote somewhere in those 5 additional records I'm trying to
> access. So read.csv is reading million-character fields, and it's slow
> at that. That mystery is solved.
>
> However, the problem persists: how to fix what is obvious to the naked
> eye - a quote not adjacent to a comma - but that read.csv can't handle.
> readLines followed by read.csv(text = ...) works well because, in that
> case, read.csv knows where the record terminates. That is, read.csv
> throws an exception that I can catch and handle with a quick, clean
> regular expression.
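>
> A rough sketch of that catch-and-repair idea (the regex below is purely
> illustrative; the real pattern depends on where the stray quote sits,
> and this simple version would also strip a legitimate quote at the very
> start or end of the line):
>
>   parse_line <- function(line, col_names) {
>     tryCatch(
>       read.csv(text = line, header = FALSE, col.names = col_names),
>       warning = function(w) fix_and_reparse(line, col_names),
>       error   = function(e) fix_and_reparse(line, col_names)
>     )
>   }
>
>   fix_and_reparse <- function(line, col_names) {
>     ## Hypothetical cleanup: drop any double quote not adjacent to a comma,
>     ## then parse again.
>     cleaned <- gsub('(?<!,)"(?!,)', "", line, perl = TRUE)
>     read.csv(text = cleaned, header = FALSE, col.names = col_names)
>   }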
>
> Thanks, I'll take a look at vroom.
>
> -dave
>
> On 4/8/24 09:18, Stevie Pederson wrote:
> > Hi Dave,
> >
> > That's rather frustrating. I've found vroom (from the package vroom)
> > to be helpful with large files like this.
> >
> > Does the following give you any better luck?
> >
> > vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
> >
> > Of course, when you know you've got errors & the files are big like
> > that it can take a bit of work resolving things. The command line
> > tools awk & sed might even be a good plan for finding lines that have
> > errors & figuring out a fix, but I certainly don't envy you.
> >
> > All the best
> >
> > Stevie
> >
> > On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddixon at swcp.com> wrote:
> >
> >     Greetings,
> >
> >     I have a csv file of 76 fields and about 4 million records. I know
> >     that some of the records have errors - unmatched quotes,
> >     specifically. Reading the file with readLines and parsing the lines
> >     with read.csv(text = ...) is really slow. I know that the first
> >     2459465 records are good. So I try this:
> >
> >      > startTime <- Sys.time()
> >      > first_records <- read.csv(file_name, nrows = 2459465)
> >      > endTime <- Sys.time()
> >      > cat("elapsed time = ", endTime - startTime, "\n")
> >
> >     elapsed time =   24.12598
> >
> >      > startTime <- Sys.time()
> >      > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >      > endTime <- Sys.time()
> >      > cat("elapsed time = ", endTime - startTime, "\n")
> >
> >     This appears to never finish. I have been waiting over 20 minutes.
> >
> >     So why would (skip = 2459465, nrows = 5) take orders of magnitude
> >     longer than (nrows = 2459465)?
> >
> >     Thanks!
> >
> >     -dave
> >
> >     PS: readLines(n=2459470) takes 10.42731 seconds.
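> >
> >     A rough sketch of that route, just to pull out the five suspect
> >     records (the line count and offsets are illustrative and depend on
> >     how the header is counted):
> >
> >     raw_lines <- readLines(file_name, n = 2459470)
> >     suspect   <- tail(raw_lines, 5)
> >     second_records <- read.csv(text = suspect, header = FALSE,
> >                                col.names = names(first_records))
> >     ## If one of these lines has an unmatched quote, read.csv stops at
> >     ## the end of the supplied text instead of scanning deep into the
> >     ## file, so it finishes quickly (typically with an "EOF within
> >     ## quoted string" warning) and the bad line can be inspected.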
> >
