[R] reading data from web data sources

David Winsemius dwinsemius at comcast.net
Sat Feb 27 23:54:57 CET 2010


On Feb 27, 2010, at 4:33 PM, Gabor Grothendieck wrote:

> No one else posted so the other post you are referring to must have
> been an email to you, not a post.  We did not see it.
>
> By "off by one" I think you are referring to the row names, which are
> meaningless, rather than the day numbers.  The data for day 1 is
> present, not missing.  The example code replaced the day number column
> with the year, since the days were just sequential and therefore
> derivable, but it's trivial to keep them if that is important to you,
> and we have made that change below.
>
> The previous code used grep to kick out lines containing any character
> not in the set: space, digit and period; in this version we add the
> minus sign to that set.  We also corrected the year column, added
> column names and converted all -999 strings to NA.  Because of this
> last point we can no longer use na.omit, but we now have iy available,
> which distinguishes year rows from data rows.
>
> Every line here has been indented so anything that starts at the left
> column must have been word wrapped in transmission.
>
>  myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>  raw.lines <- readLines(myURL)
>  DF <- read.table(textConnection(raw.lines[!grepl("[^- 0-9.]", raw.lines)]),
>    fill = TRUE, col.names = c("day", month.abb), na.strings = "-999")
>
>  iy <- is.na(DF[[2]]) # is year row
>  DF$year <- DF[iy, 1][cumsum(iy)]
>  DF <- DF[!iy, ]
>
>  DF

Wouldn't they be of more value if they were sequential?

dta <- data.matrix(DF[, -c(1, 14)])
dtafrm <- data.frame(rdta = dta[!is.na(dta)],
                     dom = DF[row(dta)[!is.na(dta)], 1],
                     month = col(dta)[!is.na(dta)])
# adding a year column would be trivial.
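For concreteness, that trivial step might go something like this (a sketch, assuming DF still carries the year column constructed in the code quoted above):

```r
# sketch: pick up the year for each retained reading, reusing the same
# !is.na(dta) mask that built dtafrm, so rows line up with dtafrm's rows
dtafrm$year <- DF$year[row(dta)[!is.na(dta)]]
```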
> sum(dtafrm$month == 2)
[1] 282
> sum(dtafrm$month == 12)
[1] 310

plot(dtafrm$rdta, type = "l")

Yes, I know that zoo() might be better, but I'm still a "zoobie", or
would that be "newzer"?

So, is there a zooisher function I should learn that would strip out
the NA's and incorporate the data values?
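One possible sketch (assuming the zoo package is installed, and assuming a year column has been added to dtafrm, the step the comment above calls trivial):

```r
library(zoo)

# build real calendar dates from the year/month/day-of-month columns,
# then let zoo keep the series ordered by its date index; nonexistent
# days and -999 readings were already dropped when dtafrm was built
dates <- as.Date(paste(dtafrm$year, dtafrm$month, dtafrm$dom, sep = "-"))
z <- zoo(dtafrm$rdta, dates)

plot(z)
```

If NA readings had been kept in the data frame, na.omit(z) would strip them while preserving the date index, which may be the "zooisher" route being asked about.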

-- 
David.

>
>
> On Sat, Feb 27, 2010 at 3:28 PM, Tim Coote <tim+r-project.org at coote.org> wrote:
>> Thanks, Gabor. My takeaway from this and Phil's post is that I'm going to
>
> I think the other post must have been sent directly to you.  We didn't
> see it.
>
>> have to construct some code to do the parsing, rather than use a
>> standard function. I'm afraid that neither approach works, yet:
>>
>> Gabor's has an off-by-one error (days start on the 2nd, not the
>> first), and the years get messed up around the 29th day.  I think the
>> na.omit(DF) line is throwing out the baby with the bathwater.  It's
>> interesting that this approach is based on read.table; I'd assumed
>> I'd need read.ftable, which I couldn't understand the documentation
>> for.  What is it that's removing the -999 and -888 values in this
>> code?  They seem to be gone, but I cannot see why.
>>
>> Phil's reads in the data, but interleaves rows with just a year and
>> all other values as NA.
>>
>> Tim
>> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>>
>>> Mark Leeds pointed out to me that the code wrapped around in the
>>> post, so it may not be obvious that the regular expression in the
>>> grep contains a space:
>>> "[^ 0-9.]"
>>>
>>>
>>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>> Try this.  First we read the raw lines into R, using grep to remove
>>>> any lines containing a character that is not a number or space.
>>>> Then we look for the year lines and repeat them down V1 using
>>>> cumsum.  Finally we omit the year lines.
>>>>
>>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>>> raw.lines <- readLines(myURL)
>>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
>>>>   fill = TRUE)
>>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>>> DF <- na.omit(DF)
>>>> head(DF)
>>>>
>>>>
>>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>>>>>
>>>>> Hullo
>>>>> I'm trying to read some time series data of meteorological records
>>>>> that are available on the web (eg
>>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>>> I'd like to be able to read the digital data directly into R.
>>>>> However, I cannot work out the right function and set of parameters
>>>>> to use.  It could be that the only practical route is to write a
>>>>> parser, possibly in some other language, reformat the files and
>>>>> then read these into R.  As far as I can tell, the informal grammar
>>>>> of the file is:
>>>>>
>>>>> <comments terminated by a blank line>
>>>>> [<year number on a line on its own>
>>>>> <daily readings lines> ]+
>>>>>
>>>>> and the <daily readings> lines are of the form:
>>>>> <whitespace> <day number> [<whitespace> <reading on day of month>]{12}
>>>>>
>>>>> Readings for days in months where a day does not exist have special
>>>>> values.  Missing values have a different special value.
>>>>>
>>>>> And then I've got the problem of iterating over all relevant files
>>>>> to get a whole timeseries.
>>>>>
>>>>> Is there a way to read this type of file into R? I've read all of
>>>>> the examples that I can find, but cannot work out how to do it. I
>>>>> don't think that read.table can handle the separate sections of
>>>>> data representing each year. read.ftable maybe can be coerced to
>>>>> parse the data, but I cannot see how after reading the
>>>>> documentation and experimenting with the parameters.
>>>>>
>>>>> I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
>>>>>
>>>>> Any help/suggestions would be greatly appreciated. I can see that
>>>>> this type of issue is likely to grow in importance, and I'd also
>>>>> like to give the data owners suggestions on how to reformat their
>>>>> data so that it is easier for machines to consume, while remaining
>>>>> easy for humans to read.
>>>>>
>>>>> The early records are a serious machine-parsing challenge as they
>>>>> are tiff images of old notebooks ;-)
>>>>>
>>>>> tia
>>>>>
>>>>> Tim
>>>>> Tim Coote
>>>>> tim at coote.org
>>>>> vincit veritas
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>
>> Tim Coote
>> tim at coote.org
>> vincit veritas

David Winsemius, MD
Heritage Laboratories
West Hartford, CT


