[R] reading very large files

jim holtman jholtman at gmail.com
Fri Feb 2 19:33:17 CET 2007


I had a file with 200,000 lines in it, and it took 1 second to select
3000 sample lines out of it.  The key is to use a connection so that
the file stays open, and then just 'skip' ahead to the next record to
read:



> input <- file("/tempxx.txt", "r")
> sel <- 3000
> remaining <- 200000
> # get the record numbers to select
> recs <- sort(sample(1:remaining, sel))
> # compute number to skip on each read; account for the record just read
> # (the leading 0 means nothing has been read before the first record)
> skip <- diff(c(0, recs)) - 1
> # allocate my data
> mysel <- vector('character', sel)
> system.time({
+ for (i in 1:sel){
+     mysel[i] <- scan(input, what="", sep="\n", skip=skip[i], n=1, quiet=TRUE)
+ }
+ })
[1] 0.97 0.02 1.00   NA   NA
> # close the connection when done
> close(input)
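
Here is a self-contained version of the same idea, for anyone who wants to check the skip arithmetic on a small file first. It is only a sketch: the temporary file, its "record N" contents, and the names n_lines, n_sel, path, con and picked are made up for illustration and are not from the session above.

```r
# Sketch: sample lines from a text file through one open connection.
# The file is a small temporary one built just for this illustration.
set.seed(1)                          # only so the demo is repeatable

n_lines <- 1000                      # hypothetical number of lines in the file
n_sel   <- 20                        # hypothetical number of lines to sample
path    <- tempfile(fileext = ".txt")
writeLines(sprintf("record %d", seq_len(n_lines)), path)

# Sorted line numbers to keep, and how many lines to skip before each read.
# The leading 0 stands for "no lines consumed yet".
recs <- sort(sample(n_lines, n_sel))
skip <- diff(c(0, recs)) - 1

con <- file(path, "r")               # open once; reads continue from the
picked <- character(n_sel)           # current position of the connection
for (i in seq_len(n_sel)) {
    picked[i] <- scan(con, what = "", sep = "\n", skip = skip[i],
                      n = 1, quiet = TRUE)
}
close(con)
unlink(path)

# Each picked line should be exactly the line whose number was sampled.
stopifnot(identical(picked, sprintf("record %d", recs)))
```

Because recs is sorted, the file is traversed only once from start to end. If the file can contain empty lines, scan's blank.lines.skip argument (TRUE by default) may throw the count off, so that case would need checking separately.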


On 2/2/07, juli g. pausas <pausas at gmail.com> wrote:
> Hi all,
> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> Each line is a string of characters. Specifically, I would like to randomly
> select 3000 lines. For smaller files, what I'm doing is:
>
> trs <- scan("myfile", what= character(), sep = "\n")
> trs<- trs[sample(length(trs), 3000)]
>
> And this works OK; however, my computer seems unable to handle the 1.8 GB
> file.
> I thought of an alternative way that does not require reading the whole file:
>
> sel <- sample(1:900000, 3000)
> for (i in 1:3000)  {
> un <- scan("myfile", what= character(), sep = "\n", skip=sel[i]-1, nlines=1)
> write(un, "myfile_short", append=TRUE)
> }
>
> This works on my computer; however, it is extremely slow, since it reads one
> line each time. It has been running for 25 hours and I think it has done less
> than half of the file (yes, I probably do not have a very good computer and
> I'm working under Windows ...).
> So my question is: do you know any other, faster way to do this?
> Thanks in advance
>
> Juli
>
> --
> http://www.ceam.es/pausas


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


