[R] Exceptional slowness with read.csv

Dave Dixon dd|xon @end|ng |rom @wcp@com
Mon Apr 8 22:21:49 CEST 2024


I solved the mystery, but not the problem. The mystery: there's an 
unclosed quote somewhere in those 5 additional records I'm trying to 
access, so read.csv ends up reading million-character fields, and it's 
slow at that. Mystery solved.
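
For anyone wanting to locate such records before parsing, here is a minimal sketch that counts double quotes per line and flags lines with an odd count. The sample lines are hypothetical, not taken from the actual file, and the check assumes no field legitimately spans several lines (a real multi-line quoted field would also show an odd count).

```r
# Hypothetical sample; the second line has a single stray quote.
lines <- c('1,"abc",3',
           '4,de"f,6',
           '7,"ghi",9')

# Count the double quotes on each line (0 when a line has none).
n_quotes <- lengths(regmatches(lines, gregexpr('"', lines, fixed = TRUE)))

# Lines with an odd number of quotes are candidates for an unclosed quote.
suspect <- which(n_quotes %% 2L == 1L)
```

On the real file, `lines` would come from readLines(), and `suspect` gives the line numbers worth inspecting by eye.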

However, the problem persists: how to fix what is obvious to the 
naked eye - a quote not adjacent to a comma - but that read.csv can't 
handle. readLines followed by read.csv(text= ) works great because, in 
that case, read.csv knows where the record terminates. Meaning, read.csv 
throws an exception that I can catch and handle with a quick and clean 
regular expression.
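
A rough sketch of that line-by-line approach follows. It is not the exact code used here: the three-field sample, the helper names parse_line and fix_line, and the repair rule (delete any quote that has a non-comma character on both sides) are all assumptions for illustration. Because an unclosed quote may surface as a warning rather than an error, the sketch catches both and also checks the field count.

```r
# Hypothetical three-field sample; the second record has a stray quote
# that is not adjacent to a comma.
lines <- c('1,"abc",3',
           '4,de"f,6',
           '7,"ghi",9')
n_fields <- 3L

# Parse one line. Return NULL if read.csv signals a warning or an error,
# or if the number of fields is not the expected one.
parse_line <- function(line) {
  out <- tryCatch(
    read.csv(text = line, header = FALSE, colClasses = "character"),
    warning = function(w) NULL,
    error   = function(e) NULL
  )
  if (is.null(out) || ncol(out) != n_fields) NULL else out
}

# Assumed repair rule: drop quotes with a non-comma character on both sides.
# Note that this would also strip doubled ("") quotes inside a field.
fix_line <- function(line) {
  gsub('(?<=[^,])"(?=[^,])', "", line, perl = TRUE)
}

# Try each line as-is; on failure, repair it and try once more.
records <- lapply(lines, function(line) {
  rec <- parse_line(line)
  if (is.null(rec)) rec <- parse_line(fix_line(line))
  rec
})
repaired <- do.call(rbind, records)
```

Reading every column as character keeps rbind from tripping over mixed types; the columns can be converted afterwards with type.convert(). Parsing one line at a time is slow on millions of records, so in practice it may be worth applying this only to the suspect lines.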

Thanks, I'll take a look at vroom.

-dave

On 4/8/24 09:18, Stevie Pederson wrote:
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) 
> to be helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like 
> that it can take a bit of work resolving things. The command line 
> tools awk & sed might even be a good plan for finding lines that have 
> errors & figuring out a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddixon using swcp.com> wrote:
>
>     Greetings,
>
>     I have a csv file of 76 fields and about 4 million records. I know that
>     some of the records have errors - unmatched quotes, specifically.
>     Reading the file with readLines and parsing the lines with
>     read.csv(text = ...) is really slow. I know that the first 2459465
>     records are good. So I try this:
>
>      > startTime <- Sys.time()
>      > first_records <- read.csv(file_name, nrows = 2459465)
>      > endTime <- Sys.time()
>      > cat("elapsed time = ", endTime - startTime, "\n")
>
>     elapsed time =   24.12598
>
>      > startTime <- Sys.time()
>      > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>      > endTime <- Sys.time()
>      > cat("elapsed time = ", endTime - startTime, "\n")
>
>     This appears to never finish. I have been waiting over 20 minutes.
>
>     So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
>     than (nrows = 2459465)?
>
>     Thanks!
>
>     -dave
>
>     PS: readLines(n=2459470) takes 10.42731 seconds.