[R] How long does skipping in read.table take

Sat Oct 23 16:52:02 CEST 2010

I just tried it on my work computer (Windows XP, only 2 GB of RAM).
I ran your code, specifying the separator "|" in read.table (on the
DF line) and adding the actual processing (writing each chunk out to
a named file); see below.
I got:
Error in textConnection(x) : cannot allocate memory for text connection

Thanks again for helping!
Dimitri

### New code from Gabor:
k <- 1000000 # no of rows per chunk
first <- TRUE
con <- file('myfile.txt', "r")
count<-1

repeat {

  start<-Sys.time()
  print(start)
  flush.console()

  # skip header
  if (first) hdgs <- readLines(con, 1)
  first <- FALSE

  x <- readLines(con, k)
  if (length(x) == 0) break
  DF <- read.table(textConnection(x), header = FALSE,sep="|")

  # process chunk -- we just print last row here
  end<-Sys.time()
  print(end-start)
  print(names(DF))
  print(tail(DF, 1))
  flush.console()
  filename<-paste("Chunk of 1 Mil number ",count,".txt",sep="")
  write.table(DF,sep="\t",col.names=FALSE,file=filename)
  count<-count+1
}
close(con)
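
One possible way around the textConnection allocation, sketched here as an assumption rather than something tested on the real file: read.table() can read straight from the already-open connection with nrows = k, so the chunk never has to be held as a character vector and wrapped in a textConnection. The temporary file, the small chunk size, and the output file names below are made up so that the sketch runs on its own.

```r
# Self-contained demo: build a small pipe-delimited file so the sketch runs as is.
# (The input file, chunk size and output names are illustrative assumptions.)
infile <- tempfile(fileext = ".txt")
writeLines(c("id|value", sprintf("%d|%d", 1:25, (1:25) * 10L)), infile)

k <- 10                      # rows per chunk; the post above used 1000000
con <- file(infile, "r")
hdgs <- readLines(con, 1)    # consume the header line once
count <- 0                   # number of chunks read so far
total <- 0                   # number of data rows read so far

repeat {
  # read.table() on an open connection continues from the current position,
  # so no intermediate character vector or textConnection is needed.
  # It signals an error once no lines are left; we treat that as end of file.
  DF <- tryCatch(
    read.table(con, header = FALSE, sep = "|", nrows = k),
    error = function(e) NULL
  )
  if (is.null(DF)) break

  count <- count + 1
  total <- total + nrow(DF)

  # process chunk -- here we just write it out, one file per chunk
  outfile <- file.path(tempdir(), paste("chunk_", count, ".txt", sep = ""))
  write.table(DF, file = outfile, sep = "\t",
              col.names = FALSE, row.names = FALSE)
}
close(con)
```

Note that the tryCatch() here swallows every error, not only the end-of-file one, so a malformed line would also end the loop silently; for real data it may be worth inspecting the error message before breaking. Column types are also guessed per chunk, so passing colClasses may be needed to keep chunks consistent.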

On Sat, Oct 23, 2010 at 10:19 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> I just tried it:
>>
>> for(i in 11:16){ #i<-11
>>  start<-Sys.time()
>>  print(start)
>>  flush.console()
>>  filename<-paste("skipped millions- ",i,".txt",sep="")
>>  mydata<-read.csv.sql("myfilel.txt", sep="|", eol="\r\n", sql =
>> "select * from file limit 1000000, (1000000*i-1)")
>
> The SQL statement does not know anything about R variables. You would
> need something like this:
>
>> i <- 1
>> s <- sprintf("select * from file limit 10, %d", 10*i-1)
>> s
> [1] "select * from file limit 10, 9"
>> read.csv.sql(..., sql = s, ...)
>
> Also if you just want to read it in as chunks reading from a
> connection in R would be sufficient:
>
> k <- 5000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.csv', "r")
> repeat {
>
>   # skip header
>   if (first) hdgs <- readLines(con, 1)
>   first <- FALSE
>
>   x <- readLines(con, k)
>   if (length(x) == 0) break
>   DF <- read.csv(textConnection(x), header = FALSE)
>
>   # process chunk -- we just print last row here
>   print(tail(DF, 1))
>
> }
> close(con)
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
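
To make the sprintf() suggestion above concrete for the loop over i, here is a minimal sketch. It assumes the intent is to skip i million rows and then read the next million, and it spells that out with SQLite's explicit "limit <count> offset <skip>" form; make_sql is a hypothetical helper name, not part of sqldf.

```r
chunk <- 1000000L   # rows per chunk, as in the thread

# Hypothetical helper: build SQL that skips i * chunk rows and returns
# the next `chunk` rows. The value of i is pasted into the string by
# sprintf(), because the SQL engine cannot see R variables.
make_sql <- function(i, chunk) {
  sprintf("select * from file limit %d offset %d", chunk, i * chunk)
}

# One statement per value of i used in the original loop
statements <- vapply(11:16, make_sql, character(1), chunk = chunk)
print(statements)
```

Each element of statements could then be passed as the sql argument of read.csv.sql inside the loop. That call is left out here because it needs the sqldf package and the actual data file.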

-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com