[R] Problem to resolve a step for reading a large TXT and split in several file

Rui Barradas ruipbarradas at sapo.pt
Wed May 16 15:11:21 CEST 2012


Hello,

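Your bug is that each pass through the loop reads twice and writes only 
once. The readLines call in the while condition consumes up to n lines and 
throws them away, then read.table reads the next block. The file pointer 
keeps moving forward, so with n = 469484 the second readLines uses up the 
last block and read.table finds no lines left, hence the error.
Use something like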

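while (length(pv <- readLines(con, n = n)) > 0) {  # note that this line changed
     i <- i + 1
     # pv is a character vector of raw lines: write it back without quotes,
     # row names or a header line
     write.table(pv, file = paste(fileNames.temp.1, "_", i, ".txt", sep = ""),
                 quote = FALSE, row.names = FALSE, col.names = FALSE)
}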

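(Or put the read.table call where you have readLines. Note that read.table 
stops with "no lines available in input" once the connection is exhausted, 
so that variant needs the error handled, for example with tryCatch.)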
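
The read-once, write-once loop can be tried on a small file before running 
it on the real one. This is only a sketch: the 25-line temporary file, the 
name outbase and the chunk size of 10 are made up for illustration, and I 
use writeLines to copy the lines through verbatim.

```r
# Small stand-in for the large input file (hypothetical data).
infile <- tempfile(fileext = ".txt")
writeLines(sprintf("row%d\t%d", 1:25, 1:25), infile)

outbase <- sub("\\.txt$", "", infile)  # hypothetical output prefix
n <- 10                                # lines per output file
i <- 0

con <- file(infile, open = "r")
# Each chunk is read exactly once, in the loop condition, and written at once.
while (length(pv <- readLines(con, n = n)) > 0) {
    i <- i + 1
    writeLines(pv, paste(outbase, "_", i, ".txt", sep = ""))
}
close(con)

# Check how many lines ended up in each output file.
outfiles <- paste(outbase, "_", seq_len(i), ".txt", sep = "")
chunk.sizes <- unname(sapply(outfiles, function(f) length(readLines(f))))
```

The last file simply gets whatever is left, so no special case is needed 
for a final short chunk.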

Anyway, I don't like it very much. If you know the number of lines in 
the input file, it would be much better to use integer division and 
modulus to determine how many times and how much to read.
Something like

n <- 1000000

passes <- number.of.lines.in.file %/% n
remaining <- number.of.lines.in.file %% n

for (i in seq_len(passes)) {   # seq_len() gives an empty loop when passes is 0

     [ ... read n lines at a time & process them... ]

}
if (remaining > 0) {
     n <- remaining

     [ ...read what's left... ]
}
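
A runnable version of this idea, as a sketch only. The small temporary 
file, the name outbase and the line count of 25 are assumptions for the 
example. Instead of a separate if block I fold the remainder into a 
vector of chunk sizes, which is a variation on the outline above and not 
the only way to do it.

```r
# Hypothetical small input file standing in for the real one.
infile <- tempfile(fileext = ".txt")
writeLines(sprintf("row%d", 1:25), infile)
outbase <- sub("\\.txt$", "", infile)  # hypothetical output prefix

number.of.lines.in.file <- 25          # assumed to be known in advance
n <- 10

passes <- number.of.lines.in.file %/% n     # number of full chunks
remaining <- number.of.lines.in.file %% n   # size of the last, short chunk

# One entry per read: 'passes' full chunks, plus the remainder if there is one.
sizes <- c(rep(n, passes), if (remaining > 0) remaining)

con <- file(infile, open = "r")
for (i in seq_along(sizes)) {
    pv <- readLines(con, n = sizes[i])
    writeLines(pv, paste(outbase, "_", i, ".txt", sep = ""))
}
close(con)
```

With the sizes worked out beforehand the loop never reads past the end of 
the file, so there is no end-of-input condition to test for.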


If you do not know how many lines there are in the file, see 
(package::function)

parser::nlines
R.utils::countLines
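
If you would rather not install a package, the lines can also be counted 
in base R by reading the file in chunks and discarding them. A sketch, 
where count.lines and its chunk argument are names I made up, not 
functions from either package above:

```r
# Count lines by reading the file chunk by chunk; only one chunk is
# held in memory at any time.
count.lines <- function(path, chunk = 10000) {
    con <- file(path, open = "r")
    on.exit(close(con))
    total <- 0
    while ((k <- length(readLines(con, n = chunk))) > 0) {
        total <- total + k
    }
    total
}

# Hypothetical small file for trying it out.
infile <- tempfile(fileext = ".txt")
writeLines(sprintf("row%d", 1:25), infile)
number.of.lines.in.file <- count.lines(infile, chunk = 10)
```

This costs one extra pass over the file, which is usually acceptable when 
the alternative is guessing the number of reads.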

Hope this helps,

Rui Barradas


On 16-05-2012 11:00, r-help-request at r-project.org wrote:
> Date: Tue, 15 May 2012 22:16:42 +0200
> From: gianni lavaredo<gianni.lavaredo at gmail.com>
> To:r-help at r-project.org
> Subject: [R] Problem to resolve a step for reading a large TXT and
> 	split in several file
> Message-ID:
> 	<CAJ6JbR-YwgjsFu8o0UnvET6M8p8WvP7YbosXw5nRdz48woDsrw at mail.gmail.com>
> Content-Type: text/plain
>
> Dear Researchers,
>
> It's the first time I am trying to resolve this problem. I have a TXT file
> with 1408452 rows. I wish to split file-by-file where each file has
> 1,000,000 rows with the following procedure:
>
> # split in two file one with 1,000,000 of rows and one with 408,452 of rows
>
> file<- "09G001_72975_7575_25_4025.txt"
> fileNames<- strsplit(as.character(file), ".", fixed = TRUE)
> fileNames.temp.1<- unique(as.vector(do.call("rbind", fileNames)[, 1]))
>
> con<- file(file, open = "r")
> # n is the number of row
> n<- 1000000
> i<- 0
> while (length(readLines(con, n=n))>  0 ) {
>      i<- i + 1
>      pv<- read.table(con,header=F,sep="\t", nrow=n)
>      write.table(pv, file = paste(fileNames.temp.1,"_",i,".txt",sep = ""),
> sep = "\t")
> }
> close(con)
>
>
> When I use 1,000,000 I get only "09G001_72975_7575_25_4025_1.txt" (with
> 1,000,000 rows) in the directory and not "09G001_72975_7575_25_4025_2.txt"
> (with 408,452 rows). I don't understand where my bug is.
>
> Furthermore, when I try for example to split into 3 files (where n is
> 469484 = 1408452/3) I get this message:
>
> *Error in read.table(con, header = F, sep = "\t", nrow = n) :
>    no lines available in input*
>
> Thanks for all the help, and sorry for the disturbance.
