[R] Downloading data from the internet

Bogaso bogaso.christofer at gmail.com
Sat Sep 26 07:45:29 CEST 2009


Thanks Duncan for your input. However, I could not install the package
"RHTMLForms"; it says the package is not available:

> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R") 
Warning in install.packages("RHTMLForms", repos =
"http://www.omegahat.org/R") :
  argument 'lib' is missing: using
'C:\Users\Arrun's\Documents/R/win-library/2.9'
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
  package ‘RHTMLForms’ is not available

I found the package on the net at http://www.omegahat.org/RHTMLForms/. However,
it is a .tar.gz source file, which I could not use as I am a Windows user. Can
you please point me to an alternate source?
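
Or would building the source package directly work? A sketch I have not
tried, assuming RHTMLForms contains no compiled code (otherwise the Rtools
toolchain would also be needed):

  install.packages("RHTMLForms",
                   repos = "http://www.omegahat.org/R",
                   type  = "source")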

Thanks,



Duncan Temple Lang wrote:
> 
> 
> 
> Bogaso wrote:
>> Thank you so much for that help. However, I need a little more help. On
>> the site
>> "http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php",
>> if I scroll down there is an option "Historical CPI Index For USA".
>> If I then click on "Get Data", another table pops up, without any
>> significant change in the address bar. This table holds more data,
>> starting from 1999. Can you please help me get the values of this table?
>> 
> 
> 
> Hi again
> 
> Well, this is a little bit more involved, as this is an HTML form
> and so we need to be able to emulate submitting a form with
> values for the different parameters the form expects, along with
> ensuring they are correct inputs.  Ordinarily, this would involve
> looking at the source of the HTML document, finding the relevant
> <form> element, getting its action attribute and its inputs, and
> figuring out the possible values for each.  This is "straightforward"
> but involved. But we have an R package that does this reasonably
> well in an automated form. This is the RHTMLForms package from the
> www.omegahat.org/R repository.
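> 
> For comparison, the manual route would look roughly like the sketch below,
> using postForm() from the RCurl package.  The field names here
> ("startYear", "endYear") are only placeholders -- the real names have to be
> read off the <form> element in the page source, which is exactly the step
> that RHTMLForms automates.
> 
>  library(RCurl)
>   # hypothetical field names; check the page source for the real ones
>  xx = postForm("http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php",
>                startYear = "2001", endYear = "2008")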
> 
> We can use this with
>  install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
> 
> Then
> 
> library(RHTMLForms)
> 
> ff =
> getHTMLFormDescription("http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php")
> 
> # The form we want is the third one. We can determine this
> # from the names of the parameters.
> # So we request that this form description be turned into an R function
> 
> g = createFunction(ff[[3]])
> 
>   # Now we call this.
> xx = g("2001", "2008")
> 
> 
>   # This returns the content of an HTML document
>   # so we parse it and then pass this to readHTMLTable()
>   # (this is why readHTMLTable() has methods for parsed documents, not just URLs)
> 
> library(XML)
> doc = htmlParse(xx, asText = TRUE)
> tbls = readHTMLTable(doc)
> 
>   # we want the last of the tables.
> tbls[[length(tbls)]]
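> 
> Since the eventual goal is a zoo object, one possible conversion is
> sketched below.  The layout assumed here (a year column followed by twelve
> monthly columns) is only a guess -- check names() of the actual table
> first.
> 
> library(zoo)
> cpi   = tbls[[length(tbls)]]
> years = as.numeric(as.character(cpi[[1]]))
>   # t() so the values run Jan..Dec within each year
> vals  = as.numeric(t(as.matrix(cpi[, 2:13])))
> z     = zoo(vals, as.yearmon(rep(years, each = 12) + (0:11)/12))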
> 
> 
> So hopefully that helps solve your problem and introduces another Omegahat
> package that we hope people find through Google. The RHTMLForms package is
> an approach to the poor man's Web services - HTML forms - rather than REST
> and SOAP, which are becoming more relevant each day.  The RCurl and SSOAP
> packages address the latter.
> 
>   D.
> 
> 
> 
> 
> 
>> Thanks
>> 
>> 
>> Duncan Temple Lang wrote:
>>>
>>> Thanks for explaining this, Charlie.
>>>
>>> Just for completeness and to make things a little easier,
>>> the XML package has a function named readHTMLTable()
>>> and you can call it with a URL and it will attempt
>>> to read all the tables in the page.
>>>
>>>  tbls =
>>> readHTMLTable('http://www.rateinflation.com/consumer-price-index/usa-cpi.php')
>>>
>>> yields a list with 10 elements; the table of interest, containing the
>>> data, is the 10th one.
>>>
>>>  tbls[[10]]
>>>
>>> The function does the XPath voodoo and sapply() work for you and uses
>>> some heuristics. There are various controls one can specify, and also
>>> various methods for working with sub-parts of the HTML document
>>> directly.
>>>
>>>   D.
>>>
>>>
>>>
>>> cls59 wrote:
>>>>
>>>> Bogaso wrote:
>>>>> Hi all,
>>>>>
>>>>> I want to download data from these two different sources directly into
>>>>> R:
>>>>>
>>>>> http://www.rateinflation.com/consumer-price-index/usa-cpi.php
>>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>>
>>>>> The first one is the CPI of the US and the second one is the WPI of
>>>>> India. Can anyone please give me a clue how to download them directly
>>>>> into R? I want to make them zoo objects for further analysis.
>>>>>
>>>>> Thanks,
>>>>>
>>>> The following site did not load for me:
>>>>
>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>
>>>> But I was able to extract the table from the US CPI site using Duncan
>>>> Temple Lang's XML package:
>>>>
>>>>   library(XML)
>>>>
>>>>
>>>> First, download the website into R:
>>>>
>>>>   html.raw <- readLines(
>>>> 'http://www.rateinflation.com/consumer-price-index/usa-cpi.php' )
>>>>
>>>> Then, convert to an HTML object using the XML package:
>>>>
>>>>   html.data <- htmlTreeParse( html.raw, asText = T, useInternalNodes = T )
>>>>
>>>> A quick scan of the page source in the browser reveals that the table
>>>> you want is encased in a div with a class of "dynamicContent" -- we
>>>> will use an XPath specification[1] to retrieve all rows in that table:
>>>>
>>>>   table.html <- getNodeSet( html.data,
>>>> '//div[@class="dynamicContent"]/table/tr' )
>>>>
>>>> Now, the data values can be extracted from the cells in the rows using
>>>> a little sapply() and xpathSApply() voodoo:
>>>>
>>>>   table.data <- t( sapply( table.html, function( row ){
>>>>
>>>>     row.data <-  xpathSApply( row, './td', xmlValue )
>>>>     return( row.data)
>>>>
>>>>   }))
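>>>>
>>>> To turn that character matrix into something easier to work with, one
>>>> extra step might be the following -- assuming the first row of
>>>> table.data holds the column headers (worth checking, since header cells
>>>> marked up as <th> rather than <td> would not have been picked up by the
>>>> './td' XPath above):
>>>>
>>>>   cpi.df <- as.data.frame( table.data[ -1, ], stringsAsFactors = FALSE )
>>>>   names( cpi.df ) <- table.data[ 1, ]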
>>>>
>>>>
>>>> Good luck!
>>>>
>>>> -Charlie
>>>>  
>>>>   [1]:  http://www.w3schools.com/XPath/xpath_syntax.asp
>>>>
>>>> -----
>>>> Charlie Sharpsteen
>>>> Undergraduate
>>>> Environmental Resources Engineering
>>>> Humboldt State University
>>>
>>
> 
> 
> 




