[R] reading data from web data sources

Tim Coote tim+r-project.org at coote.org
Sat Feb 27 21:28:36 CET 2010


Thanks, Gabor. My takeaway from this and Phil's post is that I'm
going to have to construct some code to do the parsing, rather than
use a standard function. I'm afraid that neither approach works yet:

Gabor's has an off-by-one error (days start on the 2nd, not the
first), and the years get messed up around the 29th day.  I think that
the na.omit(DF) line is throwing out the baby with the bathwater.  It's
interesting that this approach is based on read.table; I'd assumed
that I'd need read.ftable, whose documentation I couldn't
understand.  What is it that's removing the -999 and -888
values in this code? They seem to be gone, but I cannot see why.
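
A possible explanation for the vanishing -999 and -888 values, sketched
on made-up data: the character class "[^ 0-9.]" does not list "-", so
grepl flags every line that contains a negative special value, and the
whole line is dropped before read.table sees it. The same sketch shows
one way to carry the year down, by indexing the year rows only rather
than all rows of DF. The sample lines, the two-month layout, the column
names and the assumption that -999 and -888 are the only special values
are all invented for illustration; the real file has twelve readings
per day and may differ in other ways.

```r
# Synthetic sample following the informal grammar described in the thread.
# Layout and special values are assumptions, not taken from the real file.
sample.lines <- c(
  "Daily soil temperatures (made-up header)",
  "",
  "1910",
  " 1  5.1  6.0",
  " 2  5.3 -999",
  "1911",
  " 1  4.8  5.9",
  " 2 -888  6.1"
)

# "-" is not in the class [ 0-9.], so lines holding a negative special
# value are filtered out together with the comment lines.
keep.strict <- !grepl("[^ 0-9.]", sample.lines)

# Allow "-" as well, and require at least one digit so blank lines go.
keep <- grepl("^[ 0-9.-]+$", sample.lines) & grepl("[0-9]", sample.lines)
kept <- sample.lines[keep]

# Year lines are taken to be a four-digit number on a line of its own.
is.year <- grepl("^ *[0-9]{4} *$", kept)

# Read everything, turning the assumed special values into NA.
DF <- read.table(text = kept, fill = TRUE,
                 na.strings = c("-999", "-888"),
                 col.names = c("day", "m1", "m2"))

# Index the vector of year values (not all rows) by the running count
# of year lines, then drop the year lines themselves.
DF$year <- DF$day[is.year][cumsum(is.year)]
DF <- DF[!is.year, ]
print(DF)
```

Dropping the year lines by position, instead of calling na.omit, keeps
the days that have an NA in some month.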

Phil's reads in the data, but interleaves rows with just a year and  
all other values as NA.

Tim
On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:

> Mark Leeds pointed out to me that the code wrapped around in the post
> so it may not be obvious that the regular expression in the grep is
> (i.e. it contains a space):
> "[^ 0-9.]"
>
>
> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> Try this.  First we read the raw lines into R using grep to remove any
>> lines containing a character that is not a number or space.  Then we
>> look for the year lines and repeat them down V1 using cumsum.  Finally
>> we omit the year lines.
>>
>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>> raw.lines <- readLines(myURL)
>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
>>                  fill = TRUE)
>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>> DF <- na.omit(DF)
>> head(DF)
>>
>>
>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>>> Hullo
>>> I'm trying to read some time series data of meteorological records that are
>>> available on the web (eg
>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd
>>> like to be able to read in the digital data directly into R. However, I
>>> cannot work out the right function and set of parameters to use. It could
>>> be that the only practical route is to write a parser, possibly in some
>>> other language, reformat the files and then read these into R. As far as I
>>> can tell, the informal grammar of the file is:
>>>
>>> <comments terminated by a blank line>
>>> [<year number on a line on its own>
>>> <daily readings lines> ]+
>>>
>>> and the <daily readings> are of the form:
>>> <whitespace> <day number> [<whitespace> <reading on day of month>] 12
>>>
>>> Readings for days in months where a day does not exist have special values.
>>> Missing values have a different special value.
>>>
>>> And then I've got the problem of iterating over all relevant files to get a
>>> whole timeseries.
>>>
>>> Is there a way to read in this type of file into R? I've read all of the
>>> examples that I can find, but cannot work out how to do it. I don't think
>>> that read.table can handle the separate sections of data representing each
>>> year. read.ftable maybe can be coerced to parse the data, but I cannot see
>>> how after reading the documentation and experimenting with the parameters.
>>>
>>> I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
>>>
>>> Any help/suggestions would be greatly appreciated. I can see that this type
>>> of issue is likely to grow in importance, and I'd also like to give the data
>>> owners suggestions on how to reformat their data so that it is easier to
>>> consume by machines, while being easy to read for humans.
>>>
>>> The early records are a serious machine parsing challenge as they are tiff
>>> images of old notebooks ;-)
>>>
>>> tia
>>>
>>> Tim
>>> Tim Coote
>>> tim at coote.org
>>> vincit veritas
>>>
>>>
>>

Tim Coote
tim at coote.org
vincit veritas


