[R] reading data from web data sources

Phil Spector spector at stat.berkeley.edu
Sat Feb 27 23:53:31 CET 2010


Sorry, I forgot to cc the group:

Tim -
    Here's a way to read the data into a list, with one entry per year:

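# Read the whole file; fill=TRUE pads the year-only rows with NA
x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
                 header=FALSE,fill=TRUE,skip=13)
# A row holding just a year has NA in all 12 month columns
cts = apply(x,1,function(x)sum(is.na(x)))
wh = which(cts == 12)
# Each year's data runs from the row after its year row to the row
# before the next year row (or to the end of the file)
start = wh+1
end = c(wh[-1] - 1,nrow(x))
ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
# Name each list entry after its year
names(ans) = x[wh,1]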
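
The same idea can be wrapped in a function so that it can be applied to several files in turn. What follows is only a minimal sketch under some assumptions: the name parse_years, its sentinels argument and the sample lines are made up for illustration, and it assumes that both -999 and -888 should end up as NA. It borrows the regular expression from Gabor's reply to drop the header comments instead of using skip=13, and it replaces the special values only after the split, because the year rows are found by counting NAs.

```r
# Hypothetical helper (not from the thread): turn the lines of one file
# into a named list with one data frame per year.
parse_years <- function(lines, sentinels = c(-999, -888)) {
  # Keep only lines made of digits, spaces, dots and minus signs;
  # this drops the free-text header comments.
  keep <- lines[!grepl("[^- 0-9.]", lines)]
  x <- read.table(text = keep, header = FALSE, fill = TRUE)
  # A year row has a value in the first column only.
  wh <- which(rowSums(is.na(x)) == ncol(x) - 1)
  start <- wh + 1
  end <- c(wh[-1] - 1, nrow(x))
  ans <- Map(function(i, j) x[i:j, ], start, end)
  names(ans) <- x[wh, 1]
  # Replace the special values only now, so that they cannot be
  # confused with the NAs that mark the year rows.
  lapply(ans, function(d) {
    d[] <- lapply(d, function(col) replace(col, col %in% sentinels, NA))
    d
  })
}

# Made-up sample data with 3 month columns instead of 12.
dec1 <- c("Sample header comment", "",
          "1910", " 1  5.1  5.3 -999", " 2  5.0  5.2  6.1",
          "1911", " 1  4.9 -888  6.0", " 2  4.8  5.0  5.9")
dec2 <- c("1920", " 1  4.0  4.1  4.2")

# Parse each "file" and join the per-year lists into one.
all_years <- do.call(c, lapply(list(dec1, dec2), parse_years))
names(all_years)
```

With the real data, each element passed to lapply would be the result of readLines() on one of the files rather than an in-memory vector.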

Hope this helps.
                                         - Phil Spector



On Sat, 27 Feb 2010, Gabor Grothendieck wrote:

> No one else posted so the other post you are referring to must have
> been an email to you, not a post.  We did not see it.
>
> By one off I think you are referring to the row names, which are
> meaningless, rather than the day numbers.  The data for day 1 is
> present, not missing.  The example code did replace the day number
> column with the year since the days were just sequential and therefore
> derivable, but it's trivial to keep them if that is important to you and
> we have made that change below.
>
> The previous code used grep to kick out lines that had any character
> not in the set: space, digit and dot, but in this version we add
> minus sign to that set.   We also corrected the year column and added
> column names and converted all -999 strings to NA.  Due to this last
> point we cannot use na.omit any more but we now have iy available that
> distinguishes between year rows and other rows.
>
> Every line here has been indented so anything that starts at the left
> column must have been word wrapped in transmission.
>
>  myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>  raw.lines <- readLines(myURL)
>  DF <- read.table(textConnection(raw.lines[!grepl("[^- 0-9.]", raw.lines)]),
>    fill = TRUE, col.names = c("day", month.abb), na.strings = "-999")
>
>  iy <- is.na(DF[[2]]) # is year row
>  DF$year <- DF[iy, 1][cumsum(iy)]
>  DF <- DF[!iy, ]
>
>  DF
>
>
> On Sat, Feb 27, 2010 at 3:28 PM, Tim Coote <tim+r-project.org at coote.org> wrote:
>> Thanks, Gabor. My take away from this and Phil's post is that I'm going to
>
> I think the other "post" must have been sent directly to you.  We didn't see it.
>
>> have to construct some code to do the parsing, rather than use a standard
>> function. I'm afraid that neither approach works, yet:
>>
>> Gabor's has an off-by-one error (days start on the 2nd, not the first),
>> and the years get messed up around the 29th day.  I think that na.omit (DF)
>> line is throwing out the baby with the bathwater.  It's interesting that
>> this approach is based on read.table, I'd assumed that I'd need read.ftable,
>> which I couldn't understand the documentation for.  What is it that's
>> removing the -999 and -888 values in this code? They seem to be gone, but I
>> cannot see why.
>>
>> Phil's reads in the data, but interleaves rows with just a year and all
>> other values as NA.
>>
>> Tim
>> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>>
>>> Mark Leeds pointed out to me that the code wrapped around in the post
>>> so it may not be obvious that the regular expression in the grep is
>>> (i.e. it contains a space):
>>> "[^ 0-9.]"
>>>
>>>
>>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>> Try this.  First we read the raw lines into R using grep to remove any
>>>> lines containing a character that is not a digit, dot or space.  Then we
>>>> look for the year lines and repeat them down V1 using cumsum.  Finally
>>>> we omit the year lines.
>>>>
>>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>>> raw.lines <- readLines(myURL)
>>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
>>>>                  fill = TRUE)
>>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>>> DF <- na.omit(DF)
>>>> head(DF)
>>>>
>>>>
>>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org>
>>>> wrote:
>>>>>
>>>>> Hullo
>>>>> I'm trying to read some time series data of meteorological records that
>>>>> are available on the web (eg
>>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>>> I'd like to be able to read in the digital data directly into R. However,
>>>>> I cannot work out the right function and set of parameters to use.  It
>>>>> could be that the only practical route is to write a parser, possibly in
>>>>> some other language, reformat the files and then read these into R. As
>>>>> far as I can tell, the informal grammar of the file is:
>>>>>
>>>>> <comments terminated by a blank line>
>>>>> [<year number on a line on its own>
>>>>> <daily readings lines> ]+
>>>>>
>>>>> and the <daily readings> are of the form:
>>>>> <whitespace> <day number> [<whitespace> <reading on day of month>] 12
>>>>>
>>>>> Readings for days in months where a day does not exist have special
>>>>> values. Missing values have a different special value.
>>>>>
>>>>> And then I've got the problem of iterating over all relevant files to
>>>>> get a whole timeseries.
>>>>>
>>>>> Is there a way to read in this type of file into R? I've read all of the
>>>>> examples that I can find, but cannot work out how to do it. I don't
>>>>> think that read.table can handle the separate sections of data
>>>>> representing each year. read.ftable maybe can be coerced to parse the
>>>>> data, but I cannot see how after reading the documentation and
>>>>> experimenting with the parameters.
>>>>>
>>>>> I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
>>>>>
>>>>> Any help/suggestions would be greatly appreciated. I can see that this
>>>>> type of issue is likely to grow in importance, and I'd also like to give
>>>>> the data owners suggestions on how to reformat their data so that it is
>>>>> easier to consume by machines, while being easy to read for humans.
>>>>>
>>>>> The early records are a serious machine parsing challenge as they are
>>>>> tiff images of old notebooks ;-)
>>>>>
>>>>> tia
>>>>>
>>>>> Tim
>>>>> Tim Coote
>>>>> tim at coote.org
>>>>> vincit veritas
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>
>> Tim Coote
>> tim at coote.org
>> vincit veritas
>>

