[R] Read file

jim holtman jholtman at gmail.com
Tue Oct 5 04:16:17 CEST 2010


Is this what you are looking for:

> input <- readLines(textConnection(" 2010 10 01 00
+  *82599  -35.25  -5.91     52   1*
+  1008.0  -9999    115     3.1   298.6   294.6 64
+ 2010 10 01 00
+ *83649  -40.28 -20.26      4  7
+ *1011.0  -9999      0     0.0   298.4   296.1 64
+ 1000.0     96     40     5.7   297.9   295.1 32
+  925.0    782    325     3.1   295.4   294.1 32
+  850.0   1520    270     4.1   293.8   289.4 32
+  700.0   3171    240     8.7   284.1   279.1 32
+  500.0   5890    275     8.2   266.2   262.9 32
+  400.0   7600    335     9.8   255.4   242.4 32"))
> closeAllConnections()
> # remove the "*" since they seem to be inconsistent
> input <- gsub("\\*|^ ", "", input)
>
> date <- NULL  # hold the date
> station <- NULL  # hold the station ID
> # now parse each line, using the number of fields to tell what it is
> # 4 fields => date
> # 5 fields => station id
> # 7 fields => data
> result <- lapply(input, function(.line){
+     x <- as.numeric(strsplit(.line, '[[:space:]]+')[[1]])
+     if (length(x) == 4) date <<- x[1] * 1000000 + x[2] * 10000 +
+         x[3] * 100 + x[4]
+     else if (length(x) == 5) station <<- x[1]
+     else if (length(x) == 7) return(data.frame(date = date,
+         station = station,
+         x[1], x[2], x[3], x[4], x[5], x[6], x[7]))
+     else cat("invalid line:", .line, '\n')
+     return(NULL)
+ })
>
> # combine into single dataframe
> do.call(rbind, result)
        date station x.1.  x.2. x.3. x.4.  x.5.  x.6. x.7.
1 2010100100   82599 1008 -9999  115  3.1 298.6 294.6   64
2 2010100100   83649 1011 -9999    0  0.0 298.4 296.1   64
3 2010100100   83649 1000    96   40  5.7 297.9 295.1   32
4 2010100100   83649  925   782  325  3.1 295.4 294.1   32
5 2010100100   83649  850  1520  270  4.1 293.8 289.4   32
6 2010100100   83649  700  3171  240  8.7 284.1 279.1   32
7 2010100100   83649  500  5890  275  8.2 266.2 262.9   32
8 2010100100   83649  400  7600  335  9.8 255.4 242.4   32
>
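
If the goal is still the INSERT statements from the first message, the combined data frame can be split per station and turned into SQL strings. This is only a minimal sketch: 'obs' is a small stand-in for the data frame above, the column names p, hgt, wdir and wspd are made up here (the thread never names the columns), and using p and hgt as VAR1 and VAR2 is an arbitrary choice.

```r
# Stand-in for the result of do.call(rbind, result); the column names
# p, hgt, wdir and wspd are hypothetical, chosen only for this example.
obs <- data.frame(
    date    = c(2010100100, 2010100100, 2010100100),
    station = c(82599, 83649, 83649),
    p       = c(1008, 1011, 1000),
    hgt     = c(-9999, -9999, 96),
    wdir    = c(115, 0, 40),
    wspd    = c(3.1, 0.0, 5.7)
)

# One data frame per station, so each station can be worked on separately.
by.station <- split(obs, obs$station)

# One INSERT statement per row. The table and field names come from the
# original question; mapping p and hgt to VAR1 and VAR2 is an assumption.
sql <- sprintf(
    "INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (%.0f,%.0f,%.1f,%.0f)",
    obs$date, obs$station, obs$p, obs$hgt
)
```

by.station[["83649"]] then holds only the rows for station 83649, and writeLines(sql, "inserts.sql") would write the statements to a file (the file name is just an example).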

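On the grep() question quoted further down: square brackets define a character class, so "^[837]" matches a line whose first character is any one of 8, 3 or 7, while "^83" matches only lines that begin with the two characters 83. A small sketch on lines shaped like the posted data (the vector below is made up from the sample, not read from the real file):

```r
# Lines shaped like those in the posted file (illustrative only).
indata <- c("2010 10 01 00",
            "82599  -35.25  -5.91     52   1",
            "1008.0  -9999    115     3.1   298.6   294.6 64",
            "83649  -40.28 -20.26      4  7",
            "700.0   3171    240     8.7   284.1   279.1 32")

# Character class: first character is 8 OR 3 OR 7.
any.of <- grep("^[837]", indata)   # lines 2, 4 and 5

# Literal prefix: the line starts with "83".
prefix <- grep("^83", indata)      # line 4 only
```

Note that the character class also picks up the data line starting with 700.0, which is why it overcounts stations.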

On Mon, Oct 4, 2010 at 9:52 PM, Nilza BARROS <nilzabarros at gmail.com> wrote:
> Sorry, guys,
> I didn't explain clearly what I really wanted.
> I have a file with many stations and a lot of information for each one.
> I need to identify the line where each station's information starts. After
> that I'd like to store the data for that station so that it can be processed
> separately.
>
> If I were using another language such as Fortran, I would save the data in a
> vector.
> But in R I don't know how to do this :(
>
> ====David's Questions===========
>
> my.data<-file("d2010100100.txt",open="rt")
> indata <- readLines(my.data, n=20000)
> i<-grep("^[837]",indata)  #station number
>
> That would give you the line numbers for any line that had an 8 , _or_ a 3,
> _or_ a 7 as its first digit. Was that your intent? My guess is that you did
> not really want to use the square braces and should have been using "^837".
> ?regex  # Paragraph starting "A character class .... "
>
> ## In fact I am trying to find the stations in the file. As the Brazilian
> ## stations start with `83`, I intended to pick those up.
>
> my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
> stn<- my.data2$V1[i]
>
> - That would give you the first column values for the lines you earlier
> selected.
>
> ## It gave me all the stations that started with `873`. I did it only because
> ## I needed to know how many stations there were in the file. But it does not
> ## help me solve the problem.
> Thanks in advance,
> Nilza Barros
> On Sun, Oct 3, 2010 at 11:05 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>> On Oct 3, 2010, at 9:40 PM, Nilza BARROS wrote:
>>
>>> Hi, Michael
>>> Thank you for your help. I have already done what you said.
>>> But I am still having problems dealing with my data.
>>>
>>> I need to split the data according to station.
>>>
>>> I was able to identify where the station information starts using:
>>>
>>> my.data<-file("d2010100100.txt",open="rt")
>>> indata <- readLines(my.data, n=20000)
>>> i<-grep("^[837]",indata)  #station number
>>>
>>
>> That would give you the line numbers for any line that had an 8 , _or_ a 3,
>> _or_ a 7 as its first digit. Was that your intent? My guess is that you did
>> not really want to use the square braces and should have been using "^837".
>>
>> ?regex  # Paragraph starting "A character class .... "
>>
>>
>>> my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
>>> stn<- my.data2$V1[i]
>>>
>>
>> That would give you the first column values for the lines you earlier
>> selected.
>>
>>
>> ====
>>>
>>
>> This does not look like what I would expect as a value for stn. Is that
>> what you wanted us to think this was?
>>
>> --
>> David.
>>
>>
>>
>>> 2010 10 01 00
>>> *82599  -35.25  -5.91     52   1
>>> * 1008.0  -9999    115     3.1   298.6   294.6 64
>>> 2010 10 01 00
>>> *83649  -40.28 -20.26      4  7*
>>> 1011.0  -9999      0     0.0   298.4   296.1 64
>>> 1000.0     96     40     5.7   297.9   295.1 32
>>>  925.0    782    325     3.1   295.4   294.1 32
>>>  850.0   1520    270     4.1   293.8   289.4 32
>>>  700.0   3171    240     8.7   284.1   279.1 32
>>>  500.0   5890    275     8.2   266.2   262.9 32
>>>  400.0   7600    335     9.8   255.4   242.4 32
>>> ===========
>>> As you can see in the data above, the station line shows the number of
>>> levels (or lines) for each station.
>>> I need to capture these lines so that I can feed my database.
>>> By the way, I didn't understand the regular expression you've used. I've
>>> tried to run it but it did not work.
>>>
>>> Hope you can help me!
>>> Best Regards,
>>> Nilza
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Oct 3, 2010 at 2:18 AM, Michael Bedward <michael.bedward at gmail.com> wrote:
>>>
>>> Hello Nilza,
>>>>
>>>> If your file is small you can read it into a character vector like this:
>>>>
>>>> indata <- readLines("foo.dat")
>>>>
>>>> If your file is very big you can read it in batches like this...
>>>>
>>>> MAXRECS <- 1000  # for example
>>>> fcon <- file("foo.dat", open="r")
>>>> indata <- readLines(fcon, n=MAXRECS)
>>>>
>>>> The number of lines read will be given by length(indata).
>>>>
>>>> You can tell that the end of the file has been reached when a batch
>>>> comes back with fewer than MAXRECS lines:
>>>> length(indata) < MAXRECS
>>>>
>>>> If a leading "*" character is a flag for the start of a station data
>>>> block you can find this in the indata vector with grepl...
>>>>
>>>> start.pos <- which(grepl("^\\s*\\*", indata))
>>>>
>>>> When you're finished reading the file...
>>>> close(fcon)
>>>>
>>>> Hope this helps,
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 3 October 2010 13:31, Nilza BARROS <nilzabarros at gmail.com> wrote:
>>>>
>>>>> Dear R-users,
>>>>>
>>>>> I would like to know how I can read a file whose lines have different
>>>>> lengths.
>>>>>
>>>>> I need to read this file and create output to feed my database.
>>>>> So after reading it I'll need to create output like this:
>>>>>
>>>>> "INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (20100910,837460, 39,390)"
>>>>>
>>>>> I mean, each line should be read, but I don't know how to do this when
>>>>> the lines have different lengths.
>>>>>
>>>>> I really appreciate any help.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> ====Below the file that should be read ===========
>>>>>
>>>>>
>>>>> *2010 10 01 00
>>>>> 83746  -43.25 -22.81      6  51*
>>>>> 1012.0  -9999    320     1.5   299.1   294.4 64
>>>>> 1000.0    114    250     4.1   298.4   294.8 32
>>>>> 925.0    797      0     0.0   293.6   292.9 32
>>>>> 850.0   1524    195     3.1   289.6   288.9 32
>>>>> 700.0   3156    290    11.3   280.1   280.1 32
>>>>> 500.0   5870    280    20.1   266.1   260.1 32
>>>>> 400.0   7570    265    23.7   256.6   222.7 32
>>>>> 300.0   9670    265    28.8   240.2   218.2 32
>>>>> 250.0  10920    280    27.3   230.2   220.2 32
>>>>> 200.0  12390    260    32.4   218.7   206.7 32
>>>>> 176.0  -9999    255    37.6 -9999.0 -9999.0  8
>>>>> 150.0  14180    245    35.5   205.1   196.1 32
>>>>> 100.0  16560    300    17.0   195.2   186.2 32
>>>>> *2010 10 01 00
>>>>> 83768  -51.13 -23.33    569  41
>>>>> * 1000.0     79  -9999 -9999.0 -9999.0 -9999.0 32
>>>>> 946.0  -9999    270     1.0   295.8   292.1 64
>>>>> 925.0    763     15     2.1   296.4   290.4 32
>>>>> 850.0   1497    175     3.6   290.8   288.4 32
>>>>> 700.0   3140    295     9.8   282.9   278.6 32
>>>>> 500.0   5840    285    23.7   267.1   232.1 32
>>>>> 400.0   7550    255    35.5   255.4   231.4 32
>>>>> 300.0   9640    265    37.0   242.2   216.2 32
>>>>>
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> --
>>>>> Abraço,
>>>>> Nilza Barros
>>>>>
>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>
>
> --
> Abraço,
> Nilza Barros
>
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


