[R] How to import HTML and SQL files

Warren Young warren at etr-usa.com
Wed Feb 4 15:36:18 CET 2009


Arup wrote:
> I can't import any HTML or SQL files into R..:confused:

Yeah, I'm confused, too.

What exactly is it you're trying to do?  Not the technical task you 
asked about, but the effect you're trying to achieve?  Can you give 
details about the exact nature of your data sources, or, better, examples?

I ask because actually importing HTML and SQL files is almost certainly 
the wrong approach.  You almost never want to handle texts in either 
language directly in R.

For SQL, you usually don't have "SQL files": files literally containing 
SQL queries.  Or if you do happen to have SQL query files, you probably 
don't want to parse them with R.  I expect what you really want is to be 
able to query a database using SQL.  For that, look up the DBI package on 
CRAN, along with the driver package for your particular database.  This 
will let you connect R to a database server, and use SQL to get data 
from it in a format that R can process directly.
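
To make that concrete, here is a minimal sketch of the DBI approach.  It 
assumes the RSQLite driver package is installed and uses a throwaway 
in-memory SQLite database so it runs without a server.  The table name 
"measurements" and its columns are made up for illustration.  With a real 
server you would swap in the matching driver and connection details.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database
# (assumption: the RSQLite package is installed).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample data standing in for a table on a real server.
dbWriteTable(con, "measurements",
             data.frame(site = c("a", "b", "c"), value = c(1.5, 2.5, 4.0)))

# Send SQL to the database; the result comes back as an ordinary data frame.
res <- dbGetQuery(con,
                  "SELECT site, value FROM measurements
                   WHERE value > 2 ORDER BY site")

dbDisconnect(con)
```

The point is that R never parses the SQL itself.  It hands the query to 
the database and only sees the resulting rows.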

For HTML, the problem is that HTML is a very difficult language to parse 
correctly in the general case.  Much of the reason for that is that few 
web pages are actually legal HTML, but browsers will quietly cope with 
many classes of errors.  To parse such stuff in R, it's usually best to 
take a case-by-case approach, matching particular structures within the 
file so you can extract the few bits of data you want.  You might want 
to post a snippet of the HTML here to get suggestions.
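
As a rough illustration of the case-by-case approach, the sketch below 
pulls two numbers out of a deliberately sloppy HTML fragment using only 
base R regular expressions.  The fragment, the "price" class and the 
pattern are all invented for the example; a real page would need its own 
pattern.

```r
# Hypothetical snippet of sloppy HTML: the <td> and <tr> tags are never closed.
html <- "<table><tr><td class=price>12.50<td class=price>7.25</table>"

# Match only the structure we care about: the number after each price cell tag.
pattern <- "<td class=price>([0-9.]+)"
hits <- regmatches(html, gregexpr(pattern, html))[[1]]

# Strip the tag from each match and convert what is left to numbers.
prices <- as.numeric(sub(pattern, "\\1", hits))
```

This ignores everything else in the file, which is why it copes with 
markup a strict parser would reject.  It also breaks as soon as the page 
layout changes, so it only suits known, stable pages.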

If you really do have to be able to accept arbitrary HTML, I'd suggest 
running the HTML through a filter that converts it to XHTML (HTML Tidy is 
one such tool), then using the XML package from CRAN to load it up into R.
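
Once the markup is well-formed, loading it might look something like the 
sketch below.  It assumes the XML package is installed, and the XHTML 
fragment is a made-up stand-in for converter output.  Real XHTML usually 
declares a default namespace, which this fragment omits; with one present, 
the XPath expression would need to be namespace-aware.

```r
library(XML)

# Hypothetical well-formed XHTML fragment, as a converter might produce it.
xhtml <- "<html><body><table>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table></body></html>"

# Parse the string (not a file name) into an internal document tree.
doc <- xmlParse(xhtml, asText = TRUE)

# XPath: the text of the first cell in every table row.
labels <- xpathSApply(doc, "//tr/td[1]", xmlValue)
```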

You might also want to look into the RCurl package, if the HTML lives on 
a web server.  It lets you download the page directly into R instead of 
saving it out to an HTML file first.  Then you can use the methods above 
to process it.
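
A minimal sketch of the download step, assuming the RCurl package is 
installed.  The helper name fetch_html and the URL in the comment are 
placeholders, not anything from your setup.

```r
library(RCurl)

# Hypothetical helper: fetch a page and return its HTML as a single string.
fetch_html <- function(url) {
  # followlocation = TRUE asks libcurl to follow HTTP redirects.
  getURL(url, followlocation = TRUE)
}

# Example call (placeholder URL, needs network access):
# html <- fetch_html("http://www.example.com/")
```

The returned string can then go straight into whichever parsing approach 
fits the page.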



