[R] Create single vector after looping through multiple data frames with GREP

Simon Kiss sjkiss at gmail.com
Sun Oct 10 18:35:34 CEST 2010


Hello all, 

I changed the subject line of the e-mail because the question I'm posing now is different from the first one. I hope this is proper etiquette. However, the original chain is included below.

I've incorporated bits of both Ethan's and Brian's code into the script below, but there's one aspect I can't get my head around. I'm totally new to programming with control structures. The reproducible code below creates a list containing 19 data frames, one for each year of the "Most Important Problem" survey data for Canada.

What I'd like at this stage is a loop that searches through all the data frames for rows containing the search term and then binds those rows together in a plottable format.

At the bottom of the code below, you'll find my first attempt to make use of a search string and to put the result into a plottable format. It only partially works: I can only get the numbers for one year, whereas I'd like to get a string of numbers for several years. On the upside, grep appears to do the trick in terms of selecting rows.

Can anyone suggest a solution?
Yours truly,
Simon Kiss

#This is the reproducible code to set up all the data frames
library(XML)
#This gets the data from the web and lists them
mylist <- paste ("http://www.queensu.ca/cora/_trends/mip_",
c(1987:2001,2003:2006), ".htm", sep="")
alltables <- lapply(mylist, readHTMLTable)

#convert to dataframes
r<-lapply(alltables, function(x) {as.data.frame(x)} )

#This is just some house-cleaning; structuring all the tables so they are uniform 
r[[1]][3]<-r[[1]][2]
r[[1]][2]<-c(" ")
r[[2]][4]<-r[[2]][2]
r[[2]][5]<-r[[2]][3]
r[[2]][2:3]<-c(" ")
r[[3]][4:5]<-r[[3]][3:4]
r[[3]][3]<-c(" ")

#This loop deletes some superfluous columns and rows, turns the first column into character strings and the data into numeric
for (i in 1:19) {
n.rows<-dim(r[[i]])[1]
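# Note: 15:n.rows-3 parses as (15:n.rows)-3, i.e. rows 12 to n.rows-3; written out explicitly here
r[[i]] <- r[[i]][12:(n.rows - 3), 1:5]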
n.rows<-dim(r[[i]])[1]
row.names(r[[i]]) <-NULL
names(r[[i]]) <- c("Response", "Q1", "Q2", "Q3", "Q4")

r[[i]][, 1]<-as.character(r[[i]][,1])
#r[[i]][,2:5]<-as.numeric(as.character(r[[i]][,2:5]))
r[[i]][, 2:5]<-lapply(r[[i]][, 2:5], function(x) {as.numeric(as.character(x))})
#n.rows<-dim(r[[i]])[1]
#r[[i]]<-r[[i]][9
}
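
As a small illustration of why the loop above goes through as.character() before as.numeric(): the sketch below uses a made-up factor, not the scraped data, and assumes the scraped columns arrive as factors, as Ethan points out in the chain below. The names x, codes and values are only for illustration.

```r
# Toy factor standing in for one scraped column (values are made up)
x <- factor(c("31", "8", "-", "12"))

# as.numeric() on a factor returns the internal integer codes, not the labels
codes <- as.numeric(x)

# Going through as.character() first recovers the printed values;
# the "-" placeholder becomes NA (the coercion warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(x)))
```

Printing codes and values side by side shows the difference between the level codes and the numbers that were actually in the table.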

#This code is my first attempt at introducing a search string, getting the rows, binding and plotting;
economy<-r[[10]][grep('Economy', r[[10]][,1]),]
economy_2<-r[[11]][grep('Economy', r[[11]][,1]),]
test<-cbind(economy, economy_2)
plot(as.numeric(test), type='l')

#here's another attempt I'm trying....
economy<-data.frame
for (i in 15:19) {
economy[i,] <-r[[i]][grep('Economy', r[[i]][,1]), ]
}
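
One possible way to collect the matching rows from every data frame, sketched on made-up toy data rather than the scraped tables. It assumes each element of the list has the columns Response, Q1, Q2, Q3 and Q4 produced by the clean-up loop above, and that the search term matches exactly one row per data frame. The names toy, hits and series are only for illustration.

```r
# Toy stand-in for the list r: three "years", each with the cleaned-up columns
toy <- list(
  data.frame(Response = c("Economy", "Health"),
             Q1 = c(10, 5), Q2 = c(11, 6), Q3 = c(12, 7), Q4 = c(13, 8),
             stringsAsFactors = FALSE),
  data.frame(Response = c("Health", "Economy"),
             Q1 = c(4, 14), Q2 = c(3, 15), Q3 = c(2, 16), Q4 = c(1, 17),
             stringsAsFactors = FALSE),
  data.frame(Response = c("Economy", "Taxes"),
             Q1 = c(18, 9), Q2 = c(19, 9), Q3 = c(NA, 9), Q4 = c(21, 9),
             stringsAsFactors = FALSE)
)

# Pull the rows whose Response matches the search term from each data frame
hits <- lapply(toy, function(d) d[grep("Economy", d$Response), ])

# Stack the one-row results into a single data frame, one row per year
economy <- do.call(rbind, hits)

# Flatten the quarterly columns row by row into one numeric vector:
# Q1..Q4 of the first year, then Q1..Q4 of the second year, and so on
series <- as.vector(t(as.matrix(economy[, c("Q1", "Q2", "Q3", "Q4")])))
```

If those assumptions hold for the real data, swapping toy for r (or for a subset such as r[15:19]) should give one long vector that plot(series, type='l') can draw. If a search term matches more than one row in some year, the rows would need to be reduced to one per year first.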

Begin forwarded message:

> From: Simon Kiss <sjkiss at gmail.com>
> Date: October 7, 2010 4:59:46 PM EDT
> To: Simon Kiss <simonjkiss at yahoo.ca>
> Subject: Fwd: [R] Converting scraped data
> 
> 
> 
> Begin forwarded message:
> 
>> From: Ethan Brown <ethancbrown at gmail.com>
>> Date: October 6, 2010 4:22:41 PM GMT-04:00
>> To: Simon Kiss <sjkiss at gmail.com>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Converting scraped data
>> 
>> Hi Simon,
>> 
>> You'll notice the "test" data.frame has a whole mix of characters in
>> the columns you're interested in, including a "-" for missing values, and
>> that those columns are in fact factors.
>> 
>> as.numeric(factor) returns the factor's internal integer codes, not the
>> values of its levels (see ?levels and ?factor); that's why it's giving you
>> those irrelevant integers. I always end up using something like this handy
>> code snippet to deal with the situation:
>> 
>> unfactor <- function(factors)
>> # From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
>> # Transform a factor back into its factor names
>> {
>>  return(levels(factors)[factors])
>> }
>> 
>> Then, to get your data to where you want it, I'd do this:
>> 
>> require(XML)
>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>> tables <- readHTMLTable(theurl)
>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>> class(tables)
>> test<-data.frame(tables, stringsAsFactors=FALSE)
>> 
>> 
>> result <- test[11:42, 1:5] #Extract the actual data we want
>> names(result) <- c("Response", "Q1", "Q2","Q3","Q4")
>> for(i in 2:5) {
>> # Convert factor columns to numeric via their level labels
>> result[,i] <- as.numeric(unfactor(result[,i]))
>> }
>> result
>> 
>> From here you should be able to plot or do whatever else you want.
>> 
>> Hope this helps,
>> Ethan Brown
>> 
>> 
>> On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>> Dear Colleagues,
>>> I used this code to scrape data from the URL contained within.  This code
>>> should be reproducible.
>>> 
>>> require("XML")
>>> library(XML)
>>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>>> tables <- readHTMLTable(theurl)
>>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>>> class(tables)
>>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>> test[16,c(2:5)]
>>> as.numeric(test[16,c(2:5)])
>>> quartz()
>>> plot(c(1:4), test[15, c(2:5)])
>>> 
>>> Calling the values from the row of interest using test[16, c(2:5)] brings
>>> them up as they appear on the screen, but plotting them or coercing them to
>>> numeric changes the values in a way that doesn't make sense to me. My
>>> intuition is that there is something going on with the way the characters
>>> are coded or classed when they're scraped into R.  I've looked around the
>>> help files for converting from character to numeric but can't find a
>>> solution.
>>> 
>>> I also tried this:
>>> 
>>> as.numeric(as.character(test[16,c(2:5)]))
>>> 
>>> and that also changed the values from what they originally were.
>>> 
>>> I'm grateful for any suggestions.
>>> Yours, Simon Kiss
>>> 
>>> 
>>> 
>>> *********************************
>>> Simon J. Kiss, PhD
>>> Assistant Professor, Wilfrid Laurier University
>>> 73 George Street
>>> Brantford, Ontario, Canada
>>> N3T 2C9
>>> Cell: +1 519 761 7606
>>> 

*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada
N3T 2C9
Cell: +1 519 761 7606


