[R] perhaps 'aggregate()' (was: How to write efficient R code)

Wed Feb 18 14:40:14 CET 2004

Sebastian and Andy  -

Yes, Andy has read the question correctly.  A similar task that
I do quite often is to subtract the mean of a class from all of
the members of the class, and do this within every column of a
(numeric) data frame.  Kurt Hornik's function  aggregate()  is
the one to use.  Here's an example using the same data set that
he uses in the example on the help page.  (Only the commands are
shown here.  You'll have to try them to see the output, because
I cannot cut and paste easily into my email.)

data(state)
ls()
	#  This data set puts individual columns into your workspace,
	#  rather than making a data frame of them.

example <- data.frame(state.abb, state.name, state.region, state.x77)
str(example)
means   <- aggregate(example[ ,3+seq(8)], list(example[ ,3]), mean)
str(means)
residuals <- example[ ,3+seq(8)] - means[as.numeric(example[ ,3]), -1]
result  <- cbind(example[ ,seq(3)], residuals)
str(result)

 -- Ah, I think this example would be easier to read if I had used
the columns from the workspace directly, rather than packaging them
into a data frame 'example' first, then using numeric subscripts on
the data frame.  But, at least this illustrates some common ways of
subscripting subsets of columns from a data frame.
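
For comparison, here is a minimal sketch of the same calculation done on the workspace columns directly. The name 'resid77' is made up for illustration, and match() is swapped in for as.numeric() so that the lookup does not depend on the row order of 'means'. Treat it as one possible way to write this, not the only one.

```r
data(state)

# Class means of every column of the matrix state.x77, one row per region.
# aggregate() wants a data frame, so the matrix is converted first.
means <- aggregate(as.data.frame(state.x77), list(region = state.region), mean)

# For each state, find the row of 'means' holding its region, then subtract.
# (match() is an assumption of this sketch; the example above indexes by
# as.numeric() of the factor instead.)
idx     <- match(state.region, means$region)
resid77 <- state.x77 - as.matrix(means[idx, -1])

# Put the identifying columns back alongside the residuals.
result <- data.frame(state.abb, state.name, state.region, resid77)
str(result)
```

Within each region the residuals in every column should average to zero, up to rounding error, which makes a quick check on the result.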

Again, see  help("aggregate"), help("Subscript")  to see what I am
doing here.
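
On the original question of successive differences within each id: the error quoted below is what one would expect when a one-row data frame, x[nrow(x), ], is subtracted from one with nrow(x) - 1 rows; x[-nrow(x), ] was presumably intended. Here is a minimal sketch, under assumed example data, that avoids lists altogether by using ave(), which returns its result in the original row order. The names 'lagdiff', 'diffs' and the column layout of 'df' are hypothetical.

```r
# Hypothetical example data: an id column and three numeric columns.
set.seed(1)
df <- data.frame(ID = rep(1:5, each = 3), matrix(rnorm(45), 15, 3))

# Difference between each value and the one before it, with NA in the
# first position.  x is one column restricted to one group.
lagdiff <- function(x) c(NA, diff(x))

# ave() applies lagdiff within each level of ID and hands back a vector
# as long as the input, so each column can be processed in place.
diffs <- as.data.frame(lapply(df[ , -1],
                              function(col) ave(col, df$ID, FUN = lagdiff)))

# Back to a data frame with the id column in front.
result <- cbind(df["ID"], diffs)
str(result)
```

The first row of every id group comes out as NA, and the result has the same number of rows as 'df', so it can be bound onto the original data frame with cbind().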

-  best  -  tom blackwell  -  u michigan medical school  -  ann arbor  -

(Ah, I see that Andy has just replied this morning as well.  I'll see
what his reply was as soon as I send off this one.)

On Tue, 17 Feb 2004, Sebastian Luque wrote:

> Hi,
>
> This is exactly what I meant Andy, the stratifying variable can be
> thought of as a factor. However, I tried your code and I get the error:
> "Error in Ops.data.frame......- only defined for equally-sized data
> frames". What may be happening?
> The result of 'apply' functions, or 'split' or 'by' and the like give
> lists as results, with a names attribute that, in my case, would have
> the levels of the factor. How can one get the results back to a
> data.frame object, with the newly calculated variables? The indexing for
> lists is not as straightforward as for data frames.
>
> Thanks to both of you for showing me the power of indexing in R functions!
>
> Sebastian
>
>
> Liaw, Andy wrote:
>
> >I'm guessing what Sebastian wants is to do the differencing by a stratifying
> >variable such as ID; e.g., the data may look like:
> >
> >df <- as.data.frame(cbind(ID=rep(1:5, each=3), x=matrix(rnorm(45), 15, 3)))
> >
> >So using Tom's solution, one would do something like:
> >
> >mdiff <- function(x) x[-1,] - x[nrow(x),]
> >sapply(split(df[,-1], df[,1]), mdiff)
> >
> >There could well be more efficient ways!
> >
> >Andy
> >
> >
> >
> >>From: Tom Blackwell
> >>
> >>Sebastian  -
> >>
> >>For successive differences within a single column 'x'
> >>
> >>differences <- c(NA, diff(x)),
> >>
> >>same as
> >>
> >>differences <- c(NA, x[-1] - x[-length(x)]).
> >>
> >>See  help("diff"), help("Subscript").  The second version also
> >>works when  x  is a matrix or a data frame, except now the result
> >>is a matrix or data frame of the same size.
> >>
> >>x <- data.frame(matrix(rnorm(1e+5), 1e+4))
> >>dim(x)               # 10000    10
> >>differences <- rbind(rep(NA, 10), x[-1, ] - x[-dim(x)[1], ])
> >>dim(differences)     # 10000    10
> >>
> >>However, you write "I need to do this for all the subsets of data
> >>created by the numbers in one of the columns of the data frame ..."
> >>and I'm not sure I understand how an 'id' column would create many
> >>subsets of the data.  So the simple examples above may not answer
> >>the question you are asking.
> >>
> >>-  tom blackwell  -  u michigan medical school  -  ann arbor  -
> >>
> >>On Tue, 17 Feb 2004, Sebastian Luque wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>In fact, I've been trying to get rid of loops in my code for more
> >>>than a week now, but nothing I try seems to work. It sounds as if
> >>>you have lots of experience with loops, so would appreciate any
> >>>pointers you may have on the following.
> >>>
> >>>I want to create a column showing the difference between the ith
> >>>row and i-1. Of course, the first row won't have any value in it,
> >>>because there is nothing above it to subtract. This is fairly
> >>>easy to do with a simple loop, but I need to do this for all the
> >>>subsets of data created by the numbers in one of the columns of
> >>>the data frame (say, an id column). I would greatly appreciate
> >>>any idea you may have on this.
> >>>
> >>>Thanks in advance.
> >>>
> >>>Best regards,
> >>>Sebastian
> >>>--
> >>>  Sebastian Luque
> >>>
> >>>sluque at mun.ca