[R] Iteratively subsetting data by factor level across multiple variables
William Dunlap
wdunlap at tibco.com
Thu Jan 15 22:46:20 CET 2015
There are lots of ways to do this. You have to decide on how you want to
organize the results.

Here are two ways that use only core R packages. Many people like the plyr
package for this split-data/analyze-parts/combine-results sort of thing.
> df <- data.frame(x=1:27,response=log2(1:27),
g1=rep(letters[1:2],len=27),g2=rep(LETTERS[24:26],c(10,10,7)))
> s <- split(seq_len(nrow(df)), df[c("g1","g2")])
> mean(subset(df, df$g1=="a" & df$g2=="Z")$response)
[1] 4.578656
> vapply(s, function(si)mean(df$response[si]), FUN.VALUE=0) # a.Z part is previous result
     a.X      b.X      a.Y      b.Y      a.Z      b.Z
1.976834 2.381378 3.880430 3.976834 4.578656 4.581611
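How the results are organized is a matter of choice. As a minimal sketch of two other base R layouts for the same group means (the names m and long are only illustrative, and the data frame is rebuilt so the block runs on its own), tapply() gives a g1-by-g2 matrix and aggregate() gives one row per group:

```r
# Same example data as above, rebuilt so this block is self-contained.
df <- data.frame(x = 1:27, response = log2(1:27),
                 g1 = rep(letters[1:2], length.out = 27),
                 g2 = rep(LETTERS[24:26], c(10, 10, 7)))

# Wide layout: a matrix of means with rows for g1 and columns for g2.
m <- with(df, tapply(response, list(g1 = g1, g2 = g2), mean))

# Long layout: a data frame with one row per g1/g2 combination.
long <- aggregate(response ~ g1 + g2, data = df, FUN = mean)
```

Here m["a", "Z"] and the row of long with g1 == "a" and g2 == "Z" should both hold the a.Z mean shown above.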
> coef(lm(response~x, data=subset(df, df$g1=="a" & df$g2=="Z"))) # regression example
(Intercept)           x
 3.12905040  0.06040022
> vapply(s, function(si)coef(lm(response ~ x, data=df[si,])),
FUN.VALUE=rep(0,2))
                  a.X       b.X        a.Y        b.Y        a.Z        b.Z
(Intercept) 0.0862735 0.6882213 2.40741927 2.50763309 3.12905040 3.13556268
x           0.3781121 0.2821928 0.09820075 0.09182506 0.06040022 0.06025202
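The same per-group regressions can also be written by splitting the data frame itself rather than its row indices. This is only a sketch of that variant, not a different method; parts and coefs are illustrative names, and the data frame is rebuilt so the block is self-contained:

```r
# Same example data as above, rebuilt so this block is self-contained.
df <- data.frame(x = 1:27, response = log2(1:27),
                 g1 = rep(letters[1:2], length.out = 27),
                 g2 = rep(LETTERS[24:26], c(10, 10, 7)))

# A list of six data frames, one per g1/g2 combination (a.X, b.X, ...).
parts <- split(df, df[c("g1", "g2")])

# Fit the same model to each piece; each fit returns two coefficients,
# so the result is a 2-row matrix with one column per group.
coefs <- vapply(parts, function(d) coef(lm(response ~ x, data = d)),
                FUN.VALUE = numeric(2))
```

Splitting the indices, as above, avoids copying the data; splitting the data frame keeps each piece available for further analysis.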
For the particular case of fitting the same regression to every part of a
partition of the data you can use lm() once, which gives the same numbers
organized in a different way:
> coef(lm(response ~ x * (g1:g2) - x - 1, data=df))
   g1a:g2X    g1b:g2X    g1a:g2Y    g1b:g2Y    g1a:g2Z    g1b:g2Z
0.08627350 0.68822126 2.40741927 2.50763309 3.12905040 3.13556268
 x:g1a:g2X  x:g1b:g2X  x:g1a:g2Y  x:g1b:g2Y  x:g1a:g2Z  x:g1b:g2Z
0.37811212 0.28219281 0.09820075 0.09182506 0.06040022 0.06025202
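The examples above split on combinations of levels. The question quoted below asks instead for one subset per level of each variable taken separately. A minimal sketch of that case, using the data frame from the question (vars, subsets and means are illustrative names, not anything from the original post):

```r
# Data frame from the original question.
variation <- c("A", "B", "C", "D")
element1 <- as.factor(c(0, 1, 0, 1))
element2 <- as.factor(c(0, 0, 1, 1))
response <- c(4, 2, 6, 2)
df <- data.frame(variation, element1, element2, response)

# Variables to subset on; extend this vector for more variables.
vars <- c("element1", "element2")

# For each variable, split df by its levels and label the pieces
# "<variable>_level<level>"; then join the per-variable lists into one.
subsets <- do.call(c, lapply(vars, function(v) {
  parts <- split(df, df[[v]])
  names(parts) <- paste0(v, "_level", names(parts))
  parts
}))

# Apply the analysis (here, the mean response) to every subset.
means <- vapply(subsets, function(d) mean(d$response), FUN.VALUE = 0)
```

This should scale to more variables or more levels per variable without changes beyond the vars vector, since split() picks up whatever levels each factor has.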
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Thu, Jan 15, 2015 at 11:42 AM, Reid Bryant <reidbryant at gmail.com> wrote:
> Hi R experts!
>
> I would like to have a scripted solution that will iteratively subset data
> across many variables per factor level of each variable.
>
> To illustrate, if I create a dataframe (df) by:
>
> variation <- c("A","B","C","D")
> element1 <- as.factor(c(0,1,0,1))
> element2 <- as.factor(c(0,0,1,1))
> response <- c(4,2,6,2)
> df <- data.frame(variation,element1,element2,response)
>
> I would like a function that would allow me to subset the data into four
> groups and perform analysis across the groups. One group for each of the
> two factor levels across two variables. In this example it's fairly easy
> because I only have two variables with two levels each, but I would
> like this to be extendable across situations where I am dealing with more
> than two variables and/or more than two factor levels per variable. I am
> looking for a result that will mimic the output of the following:
>
> element1_level0 <- subset(df,df$element1=="0")
> element1_level1 <- subset(df,df$element1=="1")
> element2_level0 <- subset(df,df$element2=="0")
> element2_level1 <- subset(df,df$element2=="1")
>
> The purpose would be to perform analysis on the df across each subset.
> Simplistically this could be represented as follows:
>
> mean(element1_level0$response)
> mean(element1_level1$response)
> mean(element2_level0$response)
> mean(element2_level1$response)
>
> Thanks,
> Reid