[R] Why do data frame column types vary across apply, lapply?

Erik Iverson eriki at ccbr.umn.edu
Fri Apr 30 17:45:18 CEST 2010


> 
> I still have little ability to predict how these functions will treat the
> columns of data frames:

All of this is explained by knowing what class of data the functions 
*work on*, and what class of data *you have*.
> 
>> # Here's a data frame with a column "a" of integers, 
>> # and a column "b" of characters: 
>> df <- data.frame( 
> + a = 1:2, 
> + b = c("a","b") 
> + ) 
>> df 
>   a b 
> 1 1 a 
> 2 2 b 

First, let's see what we have, using str(df):

str(df)
'data.frame':	2 obs. of  2 variables:
  $ a: int  1 2
  $ b: Factor w/ 2 levels "a","b": 1 2

So we have a data.frame with two variables, one of class integer and one 
of class factor.  Notice that neither is of class character.
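
A minimal sketch of that check, in case it helps. Note one assumption: I set stringsAsFactors = TRUE explicitly so the result matches the str() output above, because the default changed to FALSE in R 4.0.0. The names col_classes, df_chr and chr_classes are just for illustration.

```r
# Rebuild the data frame from the question. stringsAsFactors = TRUE is an
# assumption made here to reproduce the factor column shown above; since
# R 4.0.0 the default is FALSE.
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

# Class of each column: integer for a, factor for b.
col_classes <- sapply(df, class)
print(col_classes)

# With stringsAsFactors = FALSE, column b stays a character vector.
df_chr <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
chr_classes <- sapply(df_chr, class)
print(chr_classes)
```

So whether b is a factor or a character vector depends on how the data frame was built, which is one more reason to look at str(df) first.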

>> # Except -- both columns are characters: 
>> apply (df, 2, typeof) 
>           a           b 
> "character" "character" 

See ?apply.  The apply function works on *matrices*.  You're not passing 
it a matrix, you're passing a data.frame.  Matrices are two-dimensional 
vectors and are of *ONE* type.  So apply could either

1) report an error saying "give me a matrix"

or


2) try to convert whatever you gave it to a matrix.

apply does (2), and converts the data.frame to the best thing it can, a 
character matrix.  It can't be a numeric matrix since you have mixed 
types of data, so it goes to the "lowest common denominator", a matrix 
of characters.  This is all explained in the first paragraph of ?apply.
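
You can watch the conversion happen by calling as.matrix yourself, which is roughly what apply does to a data.frame before it loops. This is only a sketch; df_num is a made-up all-numeric data frame added for contrast, and stringsAsFactors = TRUE is set to match the example above.

```r
# Same data frame as in the question (factor column forced explicitly).
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

# Mixed column types collapse to a single type: character.
m <- as.matrix(df)
print(typeof(m))

# Hypothetical all-numeric data frame: here the matrix can stay numeric.
df_num <- data.frame(a = 1:2, c = c(1.5, 2.5))
m_num <- as.matrix(df_num)
print(typeof(m_num))

# apply only ever sees the converted matrix, so every column reports the
# matrix's single type.
apply_types <- apply(df, 2, typeof)
print(apply_types)
```

If you want per-column answers from a data.frame, lapply or sapply is usually the better fit, since they treat the data.frame as a list of columns and leave each column's type alone.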


>> # Except -- they're both integers: 
>> lapply (df, typeof) 
> $a 
> [1] "integer" 
> 
> $b 
> [1] "integer" 

typeof is probably not very useful for casual R use.  I've never used 
it.  More useful is class (see ?class).  typeof shows you how R stores 
an object at a low level.  Factors are just integer codes with labels, 
and you have an integer variable and a factor variable, so typeof 
reports "integer" for both.

Try lapply(df, class)
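
Here is a small side-by-side sketch of the two, plus a peek at what a factor looks like underneath. The variable names (types, classes, codes) are mine, and stringsAsFactors = TRUE is again set explicitly to reproduce the factor column.

```r
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

# Low-level storage type versus class, column by column.
types   <- sapply(df, typeof)
classes <- sapply(df, class)
print(rbind(typeof = types, class = classes))

# A factor is integer codes plus a "levels" attribute holding the labels.
codes <- unclass(df$b)
print(codes)
print(levels(df$b))
```

Both columns are stored as integers, but only class tells you that b is meant to be treated as categorical data.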
> 
>> # Except -- only one of those integers is numeric: 
>> lapply (df, is.numeric) 
> $a 
> [1] TRUE 
> 
> $b 
> [1] FALSE 

Yes, because b is a factor.  In the first 3 paragraphs of ?is.numeric 
(the same help page as ?as.numeric), you'd see:

      Factors are handled by the default method, and there
      are methods for classes ‘"Date"’ and ‘"POSIXt"’ (in all three
      cases the result is false).  Methods for ‘is.numeric’ should only
      return true if the base type of the class is ‘double’ or ‘integer’
      _and_ values can reasonably be regarded as numeric (e.g.
      arithmetic on them makes sense).

See, it all makes perfect sense :).
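
A quick sketch of how that plays out in practice, for example when you want to keep only the columns you can do arithmetic on. The names numeric_cols and df_numeric are just illustrative, and stringsAsFactors = TRUE is set to match the example.

```r
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

# is.numeric is FALSE for the factor even though its storage type is integer.
numeric_cols <- sapply(df, is.numeric)
print(numeric_cols)

# Keep only the numeric columns; drop = FALSE keeps the result a data.frame.
df_numeric <- df[, numeric_cols, drop = FALSE]
str(df_numeric)

# Dates behave the same way: stored as double, but not "numeric".
today <- Sys.Date()
print(typeof(today))
print(is.numeric(today))
```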

My advice?  Don't worry about typeof.  *Always* know what class your 
objects are, and what class the functions you're using expect.  Use 
str() liberally (see ?str).


