[R] Difficult subset challenge

Sat Dec 10 23:21:22 CET 2011

Hi Noah,

I am not sure whether the 0s themselves should be standardized.  Since
you want them excluded from the calculation of the mean and SD, I am
assuming you do not want (0 - M) / sigma for them.  If that is the
case, here is an example:

## read in your data
## FYI: providing via dput() would be easier next time
d <- read.table(text = "
code    v1      v2
G1              1.2     2.3
G1              0       2.4
G1              1.4     3.4
G2              2.9     2.3
G2              4.3     4.4", header = TRUE)

## temporary data as a matrix
tmp <- as.matrix(d[-1])
## index 0s and set to missing
index.0 <- which(tmp == 0, arr.ind = TRUE)
tmp[index.0] <- NA
## scale by column within each level of d$code and bind back into a matrix
## (rbind() returns rows in group order, so this assumes d is sorted by code)
tmp <- do.call("rbind", by(tmp, d$code, scale))
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
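
If the rows are not already grouped by code, the rbind() step above
returns them in group order rather than in their original order.  Here
is a minimal sketch of a variant that keeps the row order by using
ave().  The helper name scale_nonzero and the shuffled toy data are
made up for illustration, and it assumes the 0s should stay 0:

```r
## toy data with the groups interleaved (hypothetical ordering)
d <- data.frame(code = c("G1", "G2", "G1", "G2", "G1"),
                v1 = c(1.2, 2.9, 0, 4.3, 1.4),
                v2 = c(2.3, 2.3, 2.4, 4.4, 3.4))

## hypothetical helper: standardize the nonzero values of x, leave 0s as 0
scale_nonzero <- function(x) {
  nz <- !is.na(x) & x != 0
  x[nz] <- (x[nz] - mean(x[nz])) / sd(x[nz])
  x
}

## ave() applies FUN within each level of d$code and returns the
## results in the original row order
vars <- c("v1", "v2")
d[vars] <- lapply(d[vars], function(x) ave(x, d$code, FUN = scale_nonzero))
```

Note that a group with only one nonzero value has an undefined SD, so
its nonzero entry comes back as NA.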

If you want the zeros standardized, it will take a slightly different
approach.  The other issue that could come up here is speed, but that
can be very dataset dependent (e.g., what is most efficient for a few
levels of code may not be what is most efficient for many columns).
That said, it would not take much work to create a parallelized
version of what by() is doing, and scale() is already vectorized, so it
works pretty darn fast assuming you pass it a matrix.
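
One possible reading of the "different approach" for standardizing the
zeros as well: compute each group's column means and SDs from the
nonzero values only, then pass them to scale() so that every value, 0s
included, is centered and scaled.  This is only a sketch under that
assumption; the helper name scale_all is made up for illustration:

```r
d <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                v1 = c(1.2, 0, 1.4, 2.9, 4.3),
                v2 = c(2.3, 2.4, 3.4, 2.3, 4.4))

## hypothetical helper: center and scale every value in a block (0s
## included) using means and SDs computed from the nonzero values only
scale_all <- function(m) {
  m <- as.matrix(m)
  nz <- replace(m, m == 0, NA)
  scale(m, center = colMeans(nz, na.rm = TRUE),
        scale = apply(nz, 2, sd, na.rm = TRUE))
}

## split the row indices by group so results are written back in place,
## whatever the row order
rows <- split(seq_len(nrow(d)), d$code)
res <- lapply(rows, function(i) scale_all(d[i, -1]))
for (g in names(rows)) d[rows[[g]], 2:3] <- res[[g]]
```

Since each group is handled independently, swapping lapply() for
parallel::mclapply() would be one way to spread the groups across
cores on Unix-alikes; whether that pays off depends on the data.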

Cheers,

Josh

On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <noahsilverman at ucla.edu> wrote:
> Hi,
>
> I'm having difficulty coming up with a good way to subset some data to generate statistics.
>
> My data frame has multiple observations by group.
>
> Here is an overly-simplified toy example of the data
> ==========================
> code    v1      v2
> G1              1.2     2.3
> G1              0       2.4
> G1              1.4     3.4
> G2              2.9     2.3
> G2              4.3     4.4
> etc..
> =========================
>
> I want to normalize the data *by group* for certain variables.  But, I want to ignore 0 values when calculating the mean and standard deviation.
>
> What I *want* to do is something like this:
> =======================
>         for (code in unique (d$code) ){
>                 mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
>                 sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
>                 d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
>         }
> =======================
>
> My goal, if it isn't apparent, is to replace values with their normalized value.  (But, the statistics used for normalization are calculated skipping zero values.)
>
> This doesn't work as the indexing from the which command is relative (1,2,3, etc.)
>
>
> Suggestions?
>
>
>
> --
> Noah Silverman
> UCLA Department of Statistics
> 8208 Math Sciences Building
> Los Angeles, CA 90095
>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/