[R] Very confused with class

Dan Davison davison at stats.ox.ac.uk
Thu Aug 21 17:57:54 CEST 2008


On Thu, Aug 21, 2008 at 04:20:57PM +0100, Williams, Robin wrote:

> Hi Dan, 
>   Thanks for the reply, yes, I am using read.csv on the attached file.

OK, so how about using the colClasses argument. Your problem is that
some malfunctioning software has inserted the value "#VALUE!" into
some of your supposedly numeric cells. So deal with that with the
na.strings argument. Like I said, when reading in data, it's worth
spending a minute looking at the documentation for read.table/read.csv
rather than spending an hour messing about with the results of not
doing so.

> Southwest <-  read.csv("southwest.csv", colClasses=c("character",rep("numeric",10), "character"), na.strings="#VALUE!")
> str(Southwest)
'data.frame':	1530 obs. of  12 variables:
 $ date      : chr  "5/1/1997" "5/2/1997" "5/3/1997" "5/4/1997" ...
 $ maxtemp   : num  18.8 21.8 16.6 14.9 14.2 9.3 9.9 12.7 12.8 13.2 ...
 $ mintemp   : num  7.7 9.8 11 12.2 11.3 4.5 2.1 5.7 6.7 7.3 ...
 $ pressure  : num  1028 1023 1015 1001  989 ...
 $ humid     : num  59 44 83 80 87 57 64 83 70 69 ...
 $ wind      : num  8.4 11.1 8.2 17.4 13.8 16.2 11.1 14.9 12.7 16.6 ...
 $ rain      : num  0 0 6 1 3.3 2.6 4.3 6 3.2 1.6 ...
 $ index     : num  1 2 3 4 5 6 7 8 9 10 ...
 $ admissions: num  5.00 4.72 5.16 3.67 3.62 ...
 $ detrended : num  4.79 4.47 5.30 3.91 3.51 ...
 $ detrended2: num  4.79 4.47 5.30 3.91 3.51 ...
 $ d.o.w.    : chr  "Thu" "Fri" "Sat" "Sun" ...
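
For anyone without the attached file, here is a minimal self-contained sketch of the same problem and fix. The three-row CSV text is a made-up stand-in for southwest.csv, and the names csv_text, bad and good are only for illustration. stringsAsFactors = TRUE is set explicitly because that was the default when this was written; since R 4.0.0 the default is FALSE, and the damaged column would come back as character rather than factor.

```r
# Hypothetical stand-in for southwest.csv; the values are invented.
csv_text <- "date,pressure,d.o.w.
5/1/1997,1028,Thu
5/2/1997,#VALUE!,Fri
5/3/1997,1015,Sat"

# Without na.strings, the stray "#VALUE!" stops the column being numeric.
# stringsAsFactors = TRUE mimics the pre-4.0.0 default of read.csv.
bad <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(bad$pressure)       # "factor"
as.numeric(bad$pressure)  # integer codes of the levels, not the pressures

# Declaring the column classes and the NA marker up front avoids the problem.
good <- read.csv(text = csv_text,
                 colClasses = c("character", "numeric", "character"),
                 na.strings = "#VALUE!")
str(good)
```

In good, the "#VALUE!" cell becomes NA and pressure is numeric, so mean(good$pressure, na.rm = TRUE) behaves as expected.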

NB you could coerce those dates to a date class rather than character
but I'll leave that up to you.
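
A possible sketch of that coercion, assuming the dates are month/day/year. That reading is an assumption, though it is consistent with the d.o.w. column (1 May 1997 was a Thursday). The vector dates_chr just copies the first four values shown by str() above.

```r
# First four date strings from the str() output above.
dates_chr <- c("5/1/1997", "5/2/1997", "5/3/1997", "5/4/1997")

# Assumed format: month/day/four-digit year.
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
class(dates)                # "Date"
format(dates, "%Y-%m-%d")   # ISO 8601 rendering of the parsed dates
```

On the real data frame the same call would be Southwest$date <- as.Date(Southwest$date, format = "%m/%d/%Y"); any string that does not match the format comes back as NA, so check with str() or anyNA() afterwards.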

str() is your friend.

Dan

> However, as when I do 
> Southwest <- data.frame(read.csv("southwest.csv"))

read.csv returns a data frame; no need to wrap it in data.frame()

> names(southwest)
>   the output is the column headings (i.e. the variables), and looking at
> the data I only get the numbers, I assume the column headings haven't
> become confused with the data. 
> I.e. if I just do 
> Southwest$pressure
> The output is correct, i.e. the values contained in the pressure column.
> 
>   Apologies for my repeated question, but I'm somewhat confused on this
> one and my lack of experience with R isn't helping matters. I don't even
> understand why R is interpreting these figures as factors in the first
> place, doesn't this imply that any similar data would be interpreted as
> factors?   
> Thanks for any further help.
> Robin Williams 
> Met Office summer intern - Health Forecasting 
> robin.williams at metoffice.gov.uk 
> -----Original Message-----
> From: Dan Davison [mailto:davison at stats.ox.ac.uk] 
> Sent: Thursday, August 21, 2008 4:11 PM
> To: Williams, Robin
> Cc: r-help at r-project.org
> Subject: Re: [R] Very confused with class
> 
> Hi Robin,
> 
> You haven't said where you're getting the data from. But if the answer
> is that you're using read.table, read.csv or similar to read the data
> into R, then I advise you to go back to that stage and get it right from
> the outset. It's very, very common to see people who are relatively new
> to R splattering their code with calls to as.numeric, just because they
> haven't read the data in properly in the first place. It's also common
> in those who aren't new to R... So e.g. if you are using read.table,
> then use the colClasses argument to specify the classes of your columns,
> and use str() on the result until you're happy with the data frame
> produced.
> 
> It's not entirely clear why you would have ended up with factors if your
> data are numeric. That often happens when people mix characters with
> numbers. Perhaps you have mixed the header row up with the data?
> 
> Anyway, what you are seeing are the integer encodings of the factors.
> E.g. 
> 
> > f <- factor(11:20)
> > str(f)
>  Factor w/ 10 levels "11","12","13",..: 1 2 3 4 5 6 7 8 9 10
> > as.numeric(f)
>  [1]  1  2  3  4  5  6  7  8  9 10
> 
> But don't mess with them. Just make sure that things which shouldn't be
> factors never become factors.
> 
> Dan
> 
> On Thu, Aug 21, 2008 at 03:40:58PM +0100, Williams, Robin wrote:
> > Hi all,
> >   I am very confused with class.
> >   I am looking at some weather data which I want to use as explanatory
> > variables in an lm. R has treated these variables as factors (i.e. 
> > with different levels), whereas I want them treated as discretely 
> > measured continuous variables. So I need to reassign the class of 
> > these variables, right?
> > Indeed, doing
> > class(southwest$pressure)
> > (pressure being air pressure), I get
> > #> factor.
> >   Now what class should I use to reassign them so that my model 
> > fitting process goes as I want it to? I have obviously done something 
> > wrong. I did southwest$pressure <- as(southwest$pressure,"numeric") 
> > numeric seeming like a reasonable class to assign to this variable.
> > However, doing some summary stats like
> > mean(southwest$pressure)
> > #> 341,
> > max(southwest$pressure)
> > #> 761,
> > which is clearly nonsense, as my maximum value is around 1040. 
> > Something similar has happened to maxtemp (maximum temperature), which
> > I also reassigned from a factor to class numeric, which now apparently
> > has a maximum value of 147!
> >   Clearly it must be the reassignment of class that has caused these 
> > problems, as summary stats on the data before I reassigned the classes
> > were fine. What is wrong with the class numeric? Reading the numeric 
> > help page didn't reveal anything to me. Can someone suggest the 
> > correct class?
> > Many thanks for any help.  
> > Robin Williams
> > Met Office summer intern - Health Forecasting 
> > robin.williams at metoffice.gov.uk
> 
> --
> http://www.stats.ox.ac.uk/~davison



-- 
http://www.stats.ox.ac.uk/~davison


