[R] Exceptional slowness with read.csv

Dave Dixon ddixon at swcp.com
Mon Apr 8 07:47:52 CEST 2024


Greetings,

I have a CSV file with 76 fields and about 4 million records. I know that 
some of the records have errors, specifically unmatched quotes. Reading the 
file with readLines and parsing the lines with read.csv(text = ...) is 
really slow.
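Roughly, that slow approach looks like the sketch below; the chunk size, 
variable names, and tryCatch handling are illustrative, not my exact code:

 con <- file(file_name, open = "r")
 chunks <- list()
 repeat {
     lines <- readLines(con, n = 10000)
     if (length(lines) == 0) break
     ## a chunk containing an unmatched quote may fail to parse,
     ## so guard each read.csv call
     chunk <- tryCatch(read.csv(text = lines, header = FALSE),
                       error = function(e) NULL)
     if (!is.null(chunk)) chunks[[length(chunks) + 1]] <- chunk
 }
 close(con)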
I know that the first 2459465 records are good, so I try this:

 > startTime <- Sys.time()
 > first_records <- read.csv(file_name, nrows = 2459465)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

elapsed time =   24.12598

 > startTime <- Sys.time()
 > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

This appears never to finish; I have been waiting over 20 minutes.

So why would read.csv(file_name, skip = 2459465, nrows = 5) take orders of 
magnitude longer than read.csv(file_name, nrows = 2459465)?
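
In case it's relevant, the next thing I plan to try is disabling quote 
processing, on the assumption that an unmatched quote could make the parser 
read far past the 5 rows I asked for (a sketch, untested):

 > second_records <- read.csv(file_name, skip = 2459465, nrows = 5,
 +                            quote = "", header = FALSE)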

Thanks!

-dave

PS: readLines(file_name, n = 2459470) takes 10.42731 seconds.
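
Based on that, I'd have expected a connection-based version like the sketch 
below to be quick too, assuming read.csv picks up at the connection's 
current position after readLines (untested; if an unmatched quote is the 
culprit, I suppose it could stall the same way):

 > con <- file(file_name, open = "r")
 > invisible(readLines(con, n = 2459465))  # discard the lines skip would skip
 > second_records <- read.csv(con, header = FALSE, nrows = 5)
 > close(con)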


