[R] Converting scraped data

Brian Diggs diggsb at ohsu.edu
Wed Oct 6 22:32:24 CEST 2010


On 10/6/2010 8:52 AM, Simon Kiss wrote:
> Dear Colleagues,
> I used this code to scrape data from the URL contained within. This code
> should be reproducible.
>
> require("XML")
> library(XML)
> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
> tables <- readHTMLTable(theurl)
> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
> class(tables)
> test<-data.frame(tables, stringsAsFactors=FALSE)
> test[16,c(2:5)]
> as.numeric(test[16,c(2:5)])
> quartz()
> plot(c(1:4), test[15, c(2:5)])
>
> Calling the values from the row of interest using test[16, c(2:5)]
> brings them up as represented on the screen, but plotting them or
> coercing them to numeric changes the values in a way that doesn't make
> sense to me. My intuition is that there is something going on with the
> way the characters are coded or classed when they're scraped into R.
> I've looked around the help files for converting from character to
> numeric but can't find a solution.
>
> I also tried this:
>
> as.numeric(as.character(test[16,c(2:5)])) and that also changed the values
> from what they originally were.
>
> I'm grateful for any suggestions.
> Yours, Simon Kiss

str() gives you an indication of how things are stored and can help in 
these situations.

 > str(test)
'data.frame':   45 obs. of  10 variables:
  $ NULL.V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1 1 35 1 1 1 23 18 2 32 ...
  $ NULL.V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1 32 3 ...
  $ NULL.V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA NA 30 1 ...
  $ NULL.V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA NA 30 NA ...
  $ NULL.V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1 NA NA 29 NA ...
  $ NULL.V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
  $ NULL.V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
  $ NULL.V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
  $ NULL.V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
  $ NULL.V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...

So columns 2-5 are factors, despite the stringsAsFactors=FALSE in the 
data.frame call.  That is because they were already factors in tables:

 > str(tables)
List of 1
  $ NULL:'data.frame':   45 obs. of  10 variables:
   ..$ V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1 1 35 1 1 1 23 18 2 32 ...
   ..$ V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1 32 3 ...
   ..$ V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA NA 30 1 ...
   ..$ V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA NA 30 NA ...
   ..$ V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1 NA NA 29 NA ...
   ..$ V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
   ..$ V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
   ..$ V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
   ..$ V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
   ..$ V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
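
A minimal sketch of why the stringsAsFactors=FALSE argument has no effect in this situation. The data frame toy_tbl and its values are invented for illustration and are not the scraped table. The argument only controls how character vectors are converted, so a column that is already a factor passes through data.frame() unchanged:

```r
# Toy stand-in for one scraped table; the values are invented for illustration.
toy_tbl <- data.frame(V1 = c("7.2", "9.1", "-"), stringsAsFactors = TRUE)
toy_tables <- list(toy_tbl)

# Wrapping the list in data.frame() does not undo the existing factor.
rewrapped <- data.frame(toy_tables, stringsAsFactors = FALSE)
class(rewrapped[[1]])   # the column is still a factor
```

The conversion therefore has to be done explicitly on each column, or avoided when the table is first read.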

So your idea that the "numbers" you see are really character 
representations and not actually numbers is right.  And you are almost 
there with the as.numeric(as.character()) construct.  That would work 
for a single factor, but doesn't work for a data.frame.

 > test[16,c(2:5)]
    NULL.V2 NULL.V3 NULL.V4 NULL.V5
16     7.2     9.1     7.7    15.2
 > as.character(test[16,c(2:5)])
[1] "25" "27" "26" "14"

You get a string representation of the underlying integer codes of the 
factors, not the level labels.  If you do this column-by-column, it does 
work.  Since data.frames are special types of lists, you can use lapply:

 > test[16,c(2:5)]
    NULL.V2 NULL.V3 NULL.V4 NULL.V5
16     7.2     9.1     7.7    15.2
 > lapply(test[16,c(2:5)], as.character)
$NULL.V2
[1] "7.2"

$NULL.V3
[1] "9.1"

$NULL.V4
[1] "7.7"

$NULL.V5
[1] "15.2"

 > as.numeric(lapply(test[16,c(2:5)], as.character))
[1]  7.2  9.1  7.7 15.2
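
For a single factor, the difference between the codes and the labels can be sketched as follows. The values are made up, with "-" standing in for a non-numeric cell like those in the scraped table. The last line is the equivalent idiom from the R FAQ, which converts each level only once:

```r
# Invented values; "-" mimics a non-numeric cell in the scraped table.
f <- factor(c("7.2", "9.1", "-", "15.2"))

codes   <- as.numeric(f)                 # integer codes of the levels, not the printed values
values  <- as.numeric(as.character(f))   # printed values; "-" becomes NA with a warning
values2 <- as.numeric(levels(f))[f]      # same result, converting each level only once

values
```

The NA produced for "-" is expected: that entry has no numeric interpretation.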

That said, I'd extract the responses part of the data out, clean it all, 
and then do whatever you planned with it:

responses <- test[11:42,1:5]
responses[,1] <- factor(responses[,1])
responses[,2:5] <- lapply(responses[,2:5], function(x) {as.numeric(as.character(x))})
names(responses) <- c("Response", "Q1", "Q2", "Q3", "Q4")

 > str(responses)
'data.frame':   32 obs. of  5 variables:
  $ Response: Factor w/ 32 levels "Afghanistan/Military",..: 5 6 4 8 9 10 11 12 14 15 ...
  $ Q1      : num  2.4 2.1 NA 5.6 2.3 7.2 1 1.8 28.4 0.6 ...
  $ Q2      : num  3.3 1.6 NA 5.6 1.8 9.1 0.4 2.4 19.4 2.1 ...
  $ Q3      : num  3.4 1.3 0.3 5.3 2.6 7.7 0.3 1.3 21 1.7 ...
  $ Q4      : num  2.7 1.5 0.6 5.1 1.3 15.2 0.2 0.7 16.7 2 ...
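
If the same cleanup is needed for several tables, the column-wise conversion can be wrapped in a small function. This is only a sketch: to_numeric_cols is a hypothetical helper name, and toy_resp holds invented values rather than the real survey data. Silencing the coercion warning is a choice made here on the assumption that cells such as "-" are meant to become NA:

```r
# Hypothetical helper: convert the factor or character columns named in `cols` to numeric.
to_numeric_cols <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(x) {
    # Non-numeric entries such as "" or "-" become NA; the warning is silenced on purpose.
    suppressWarnings(as.numeric(as.character(x)))
  })
  df
}

# Invented data shaped like the responses table.
toy_resp <- data.frame(
  Response = c("Economy", "Health care", "Other"),
  Q1 = c("2.4", "-", "7.2"),
  Q2 = c("3.3", "1.6", ""),
  stringsAsFactors = TRUE
)

clean <- to_numeric_cols(toy_resp, c("Q1", "Q2"))
str(clean)
```

Columns not listed in cols, such as Response, are left as they were.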



-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


