[R] reading a text file, one line at a time

jim holtman jholtman at gmail.com
Thu Aug 19 12:25:22 CEST 2010


Here is how I would do it, reading 3 lines at a time and doing just a character substitution on the data:

> inFile <- textConnection("   V1 V2 V3 V4         V5
+ 1   1  b  b  a -0.4990719
+ 2   2  b  a  a  1.5134101
+ 3   3  a  b  b  1.9375467
+ 4   4  a  a  b  0.3310612
+ 5   5  a  b  a  0.2807520
+ 6   6  a  a  b  0.9646351
+ 7   7  b  a  b  0.6243979
+ 8   8  a  b  a -0.8076008
+ 9   9  a  b  b -1.7645273
+ 10 10  b  b  a  0.5460802
+ 11 11  c  c  b 12.3000000")
> output <- NULL  # initialize output file (just a vector in this case)
> while(length(input <- readLines(inFile, n=3)) > 0){
+     # replace 'b' with 'z'
+     for (i in seq_along(input)){
+         input[i] <- gsub('b', 'z', input[i])
+     }
+     output <- c(output, input)  # collect the output
+ }
> close(inFile)
> print(cbind(output))  # show converted data
      output
 [1,] "   V1 V2 V3 V4         V5"
 [2,] "1   1  z  z  a -0.4990719"
 [3,] "2   2  z  a  a  1.5134101"
 [4,] "3   3  a  z  z  1.9375467"
 [5,] "4   4  a  a  z  0.3310612"
 [6,] "5   5  a  z  a  0.2807520"
 [7,] "6   6  a  a  z  0.9646351"
 [8,] "7   7  z  a  z  0.6243979"
 [9,] "8   8  a  z  a -0.8076008"
[10,] "9   9  a  z  z -1.7645273"
[11,] "10 10  z  z  a  0.5460802"
[12,] "11 11  c  c  z 12.3000000"
>

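Since gsub() is vectorized over its input, the inner for loop above is not strictly required; it only marks where per-line processing would go. A minimal sketch of the same chunked read with one gsub() call per chunk follows. The shortened toy data and the use of fixed = TRUE are assumptions made for illustration, not part of the original example.

```r
# Shortened toy data (an assumption for brevity); each element is one line.
inFile <- textConnection(c("   V1 V2 V3",
                           "1   1  b  b",
                           "2   2  b  a",
                           "3   3  a  b",
                           "4   4  a  a"))
output <- character(0)  # collected output lines
while (length(input <- readLines(inFile, n = 3)) > 0) {
    # gsub() converts every line of the chunk in a single call;
    # fixed = TRUE treats 'b' as a literal string rather than a regex
    output <- c(output, gsub("b", "z", input, fixed = TRUE))
}
close(inFile)
print(output)
```

Growing a vector with c() is fine for a toy case; for a large file, writing each chunk out with writeLines() avoids holding everything in memory.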

On Wed, Aug 18, 2010 at 10:51 PM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> Hi Jim,
>
> I was trying to use your template without success. With the toy data
> below, could you explain how to use this template to change all "b"s
> to "z"s -- just as an exercise, reading in 3 lines at a time? I need to
> use this strategy for a larger problem, but I haven't been able to get
> the basics working.
>
> Thanks,
>
> Juliet
>
> myData <- structure(list(V1 = 1:11, V2 = structure(c(2L, 2L, 1L, 1L, 1L,
> 1L, 2L, 1L, 1L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"),
>    V3 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L,
>    3L), .Label = c("a", "b", "c"), class = "factor"), V4 = structure(c(1L,
>    1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("a",
>    "b"), class = "factor"), V5 = c(-0.499071939558026, 1.51341011554134,
>    1.93754671209923, 0.331061227463955, 0.280752001959284, 0.964635079229074,
>    0.624397908891502, -0.807600774484419, -1.76452730888732,
>    0.546080229326458, 12.3)), .Names = c("V1", "V2", "V3", "V4",
> "V5"), class = "data.frame", row.names = c(NA, -11L))
>
> On Sun, Aug 15, 2010 at 1:06 PM, jim holtman <jholtman at gmail.com> wrote:
>> For efficiency of processing, look at reading in several
>> hundred/thousand lines at a time.  Reading and writing one line at a
>> time will probably spend most of its time in the system calls that do
>> the I/O, so it will take a long time.  So do something like this:
>>
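>> con <- file('yourInputFile', 'r')
>> outfile <- file('yourOutputFile', 'w')
>> while (length(input <- readLines(con, n=1000)) > 0){
>>    output <- input  # processed lines replace the originals
>>    for (i in seq_along(input)){
>>        # ......your one line at a time processing, storing the result in output[i]
>>    }
>>    writeLines(output, con=outfile)
>> }
>> close(con)
>> close(outfile)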
>>
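The template above can be wrapped in a small function so that the connections are always closed, even if the processing fails part-way. This is only a sketch: processInChunks() and its arguments are hypothetical names, the files are created with tempfile() for the demonstration, and toupper() stands in for whatever per-line processing is needed.

```r
# processInChunks() is a hypothetical helper, not an existing R function.
processInChunks <- function(inPath, outPath, fun, n = 1000) {
    con <- file(inPath, "r")
    on.exit(close(con), add = TRUE)       # close input even on error
    outfile <- file(outPath, "w")
    on.exit(close(outfile), add = TRUE)   # close output even on error
    total <- 0
    while (length(input <- readLines(con, n = n)) > 0) {
        # apply 'fun' to each line; it must return a single string per line
        output <- vapply(input, fun, character(1), USE.NAMES = FALSE)
        writeLines(output, con = outfile)
        total <- total + length(input)
    }
    invisible(total)                      # number of lines processed
}

# Demonstration on temporary files (assumed names, for illustration only)
inPath  <- tempfile()
outPath <- tempfile()
writeLines(c("alpha", "beta", "gamma"), inPath)
nLines <- processInChunks(inPath, outPath, toupper, n = 2)
readLines(outPath)
```

Only one chunk of n lines is held in memory at a time, so the chunk size trades memory use against the number of I/O calls.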
>> On Sun, Aug 15, 2010 at 7:58 AM, Data Analytics Corp.
>> <walt at dataanalyticscorp.com> wrote:
>>> Hi,
>>>
>>> I have an upcoming project that will involve a large text file.  I want to
>>>
>>>  1. read the file into R one line at a time
>>>  2. do some string manipulations on the line
>>>  3. write the line to another text file.
>>>
>>> I can handle the last two parts.  Scan and read.table seem to read the whole
>>> file in at once.  Since this is a very large file (several hundred thousand
>>> lines), this is not practical.  Hence the idea of reading one line at a
>>> time.  The question is, can R read one line at a time?  If so, how?  Any
>>> suggestions are appreciated.
>>>
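On the narrow question of whether R can read one line at a time: a connection that has been opened explicitly keeps its position between calls, so repeated readLines(con, n = 1) calls return successive lines. A minimal sketch, using a temporary file with made-up contents as an assumption for the demonstration:

```r
# Temporary file with made-up contents, for illustration only
inPath <- tempfile()
writeLines(c("first", "second", "third"), inPath)

con <- file(inPath, "r")           # open once; the position is remembered
first  <- readLines(con, n = 1)    # line 1
second <- readLines(con, n = 1)    # line 2
rest   <- readLines(con)           # everything that remains
done   <- readLines(con, n = 1)    # zero-length vector at end of file
close(con)
```

The zero-length result at end of file is what lets a while loop on length(...) > 0 terminate.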
>>> Thanks,
>>>
>>> Walt
>>>
>>> ________________________
>>>
>>> Walter R. Paczkowski, Ph.D.
>>> Data Analytics Corp.
>>> 44 Hamilton Lane
>>> Plainsboro, NJ 08536
>>> ________________________
>>> (V) 609-936-8999
>>> (F) 609-936-3733
>>> walt at dataanalyticscorp.com
>>> www.dataanalyticscorp.com
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>





