[R] Difficult subset challenge
Joshua Wiley
jwiley.psych at gmail.com
Sat Dec 10 23:21:22 CET 2011
Hi Noah,
I am unclear if the 0s should be standardized or not---I am assuming
since you want them excluded from the calculation of the mean and SD,
you do not want (0 - M) / sigma. If that is the case, here is an
example:
## read in your data
## FYI: providing via dput() would be easier next time
d <- read.table(textConnection("
code v1 v2
G1 1.2 2.3
G1 0 2.4
G1 1.4 3.4
G2 2.9 2.3
G2 4.3 4.4"), header = TRUE)
closeAllConnections()
## temporary data as a matrix
tmp <- as.matrix(d[-1])
## index 0s and set to missing
tmp[index.0 <- which(tmp == 0, arr.ind = TRUE)] <- NA
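## scale by column within each level of d$code and pull back to a matrix
## (the rbind() result comes back ordered by level of code, so this
## assumes the rows of d are already sorted by code, as they are here)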
tmp <- do.call("rbind", by(tmp, d$code, scale))
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
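The by()/rbind() step above returns rows grouped by level of code, so it lines up with d only when d is already sorted by code. If your real data are not sorted, here is a minimal sketch of an alternative using ave(), which returns results in the original row order. The data frame d2 and the helper scale.nonzero() are made up for illustration; the helper assumes each group has at least two nonzero values (otherwise sd() gives NA).

```r
## toy data with the groups deliberately interleaved (illustrative only)
d2 <- data.frame(code = c("G1", "G2", "G1", "G2", "G1"),
                 v1 = c(1.2, 2.9, 0, 4.3, 1.4),
                 v2 = c(2.3, 2.3, 2.4, 4.4, 3.4))

## hypothetical helper: z-score the nonzero values, leave 0s (and NAs) alone
scale.nonzero <- function(x) {
  keep <- !is.na(x) & x != 0
  x[keep] <- (x[keep] - mean(x[keep])) / sd(x[keep])
  x
}

## ave() applies the helper within each level of code and puts the
## results back in the original row order
vars <- c("v1", "v2")
d2[vars] <- lapply(d2[vars], function(x) ave(x, d2$code, FUN = scale.nonzero))
```

This trades the single call to scale() for one call to ave() per column, so it may be slower with many columns, but it does not depend on how the rows are ordered.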
If you want the zeros standardized, it will take a slightly different
approach. The other issue that could come up here is speed, but that
can be very dataset dependent (e.g., what is most efficient for a few
levels of code may not be what is most efficient for many columns,
etc.). That said, it would not take much work to create a
parallelized version of what by() is doing, and scale() is already
vectorized, so it works pretty darn fast assuming you pass it a matrix.
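For what it is worth, here is a rough sketch of the "zeros standardized" variant, assuming that means: compute the mean and SD from the nonzero values only, then apply (x - M) / sigma to every value, zeros included. The names d3, scale.all(), rows, and res are made up for illustration. It also splits the work by group explicitly, which is roughly what by() does internally.

```r
## same toy data as in the example above
d3 <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                 v1 = c(1.2, 0, 1.4, 2.9, 4.3),
                 v2 = c(2.3, 2.4, 3.4, 2.3, 4.4))

## hypothetical helper: mean and SD come from the nonzero values only,
## but the transformation is applied to all values, zeros included
scale.all <- function(x) {
  nz <- x[!is.na(x) & x != 0]
  (x - mean(nz)) / sd(nz)
}

m <- as.matrix(d3[-1])

## row numbers belonging to each level of code
rows <- split(seq_len(nrow(m)), d3$code)

## standardize each column within each group
res <- lapply(rows, function(i) apply(m[i, , drop = FALSE], 2, scale.all))

## write each group's result back into its original rows
for (g in names(rows)) m[rows[[g]], ] <- res[[g]]
d3[, 2:3] <- m
```

Because the groups are handled independently, the lapply() call is the natural place to parallelize, for example by swapping in mclapply() from the parallel package on a Unix-alike system. Whether that pays off will depend on the number and size of the groups.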
Cheers,
Josh
On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <noahsilverman at ucla.edu> wrote:
> Hi,
>
> I'm having difficulty coming up with a good way to subset some data to generate statistics.
>
> My data frame has multiple observations by group.
>
> Here is an overly-simplified toy example of the data
> ==========================
> code v1 v2
> G1 1.2 2.3
> G1 0 2.4
> G1 1.4 3.4
> G2 2.9 2.3
> G2 4.3 4.4
> etc..
> =========================
>
> I want to normalize the data *by group* for certain variables. But, I want to ignore 0 values when calculating the mean and standard deviation.
>
> What I *want* to do is something like this:
> =======================
> for (code in unique (d$code) ){
> mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
> sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
> d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
> }
> =======================
>
> My goal, if it isn't apparent, is to replace values with their normalized value. (But, the statistics used for normalization are calculated skipping zero values.)
>
> This doesn't work as the indexing from the which command is relative (1,2,3, etc.)
>
>
> Suggestions?
>
>
>
> --
> Noah Silverman
> UCLA Department of Statistics
> 8208 Math Sciences Building
> Los Angeles, CA 90095
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/