[R] cbind() and factors.

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sat Dec 11 14:42:35 CET 2004


Gabor Grothendieck wrote:
> michael watson (IAH-C) <michael.watson <at> bbsrc.ac.uk> writes:
> 
> : 
> : Hi
> : 
> : I'm seeing some "odd" behaviour with cbind().  My code is:
> : 
> : > cat <- read.table("cogs_category.txt", sep="\t", header=TRUE,
> : quote=NULL, colClasses="character")
> : > colnames(cat)
> : [1] "Code"        "Description"
> : > is.factor(cat$Code)
> : [1] FALSE
> : > is.factor(cat$Description)
> : [1] FALSE
> : > is.factor(rainbow(nrow(cat)))
> : [1] FALSE
> : > cat <- cbind(cat,"Color"=rainbow(nrow(cat)))
> : > is.factor(cat$Color)
> : [1] TRUE
> : > ?cbind
> : 
> : I read a text file in which has two columns, Code and Description.
> : Neither of these are factors.  I want to add a column of colours to the
> : data frame using rainbow().  The rainbow function also does not return a
> : factor.  However, if I cbind my data frame (which has no factors in it)
> : and the results of rainbow() (which is a vector, not a factor), then for
> : some reason the new column is a factor...??
> 
> Others have already explained the problem and given what is likely
> the best solution but here is one other idea, just in case.
> 
> You may require a data frame, depending on what you want to do, but
> if you don't, you could alternatively use a character matrix,
> since that won't result in any conversions to factor.
> 
> Let's call the data frame from read.table Cat.df, and our
> matrix Cat.m.  Using cat as a name is not wrong, but it's confusing
> since there is a common R function called cat.  Now we can
> write the following and don't have to worry about factors:
> 
> Cat.df <- read.table(...)
> # create a character matrix and cbind Colors to it
> Cat.m <- cbind(as.matrix(Cat.df), Color = rainbow(nrow(Cat.df)))
> 
> If you do find you need a data frame later you can convert it back
> like this:
> 
> Cat.df <- as.data.frame(Cat.m)
> Cat.df[] <- Cat.m  # clobber factors with character data
> 
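
The matrix route described above can be tried without the original file.
What follows is only a minimal sketch: the toy values stand in for
cogs_category.txt and are made up, and the stringsAsFactors argument is
assumed to behave as it does in current versions of R.

```r
# Toy stand-in for the table read from cogs_category.txt (values are made up)
Cat.df <- data.frame(Code = c("J", "K", "L"),
                     Description = c("Translation", "Transcription",
                                     "Replication"),
                     stringsAsFactors = FALSE)

# A character matrix has one storage mode, so cbind() cannot create a factor
Cat.m <- cbind(as.matrix(Cat.df), Color = rainbow(nrow(Cat.df)))

# Convert back to a data frame while keeping every column as character
Cat.df2 <- as.data.frame(Cat.m, stringsAsFactors = FALSE)

# Alternative that never leaves the data frame: assign the column directly,
# which stores the character vector as-is
Cat.df$Color <- rainbow(nrow(Cat.df))
```

Direct assignment with $<- is usually the simplest choice when only one
column is being added; the matrix round trip is mainly useful when the
data are all character anyway.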

For speed, the mApply function in the Hmisc package (used by the Hmisc
summarize function) loops over strata for stratified statistical summaries
by operating on matrices rather than data frames.  Factors are converted
to numeric codes, and service routines can save and restore the levels and
other attributes.  Here is an example from the summarize help file, plus
related examples:

# To run mApply on a data frame:
m <- mApply(asNumericMatrix(x), race, h)
# Here assume h is a function that returns a matrix similar to x
at <- subsAttr(x)  # get original attributes and storage modes
matrix2dataFrame(m, at)
# Get stratified weighted means
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)

# Compare speed of mApply vs. by for computing stratified medians
d <- data.frame(sex=sample(c('female','male'),100000,TRUE),
                 country=sample(letters,100000,TRUE),
                 y1=runif(100000), y2=runif(100000))
g <- function(x) {
   y <- c(median(x[,'y1']-x[,'y2']),
          med.sum =median(x[,'y1']+x[,'y2']))
   names(y) <- c('med.diff','med.sum')
   y
}
system.time(by(d, llist(sex=d$sex,country=d$country), g))
system.time({
              x <- asNumericMatrix(d)
              a <- subsAttr(d)
              m <- mApply(x, llist(sex=d$sex,country=d$country), g)
             })
system.time({
              x <- asNumericMatrix(d)
              summarize(x, llist(sex=d$sex, country=d$country), g)
             })
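
The general idea behind the timings above (flatten to a numeric matrix,
summarize by stratum, restore factor levels afterwards) can also be
sketched in base R alone.  This is not how Hmisc implements mApply or
asNumericMatrix; it is only an illustration under simple assumptions,
and the names lev, idx, res and sex_back are made up for the example.

```r
set.seed(1)
d <- data.frame(sex     = factor(sample(c("female", "male"), 1000, TRUE)),
                country = factor(sample(letters[1:5], 1000, TRUE)),
                y1 = runif(1000), y2 = runif(1000))

# Save the factor levels before flattening (NULL for non-factor columns)
lev <- lapply(d, levels)

# data.matrix() replaces each factor by its integer codes
x <- data.matrix(d)

# Compute the row indices of each stratum once
idx <- split(seq_len(nrow(x)), list(d$sex, d$country), drop = TRUE)

# Summary function that works on a plain numeric matrix
g <- function(m) c(med.diff = median(m[, "y1"] - m[, "y2"]),
                   med.sum  = median(m[, "y1"] + m[, "y2"]))

# One row of results per stratum
res <- t(sapply(idx, function(i) g(x[i, , drop = FALSE])))

# Restore a factor column from its codes and the saved levels
sex_back <- factor(lev$sex[x[, "sex"]], levels = lev$sex)
```

Indexing rows of a numeric matrix avoids the per-stratum data frame
subsetting that by() performs, which is the main source of the speed
difference being measured.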

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



