[R] Computing sd across an array with missing values

Fri Feb 27 02:54:53 CET 2009

It just so happens that I created a vectorized SD function the other
day. See the column and row versions below. If there are any
rows/columns with only one non-NA value, it will return NaN.

col_sd = function(x){
	dimx = dim(x)
	x.mean = .Internal(colMeans(x,dimx[1],dimx[2],na.rm=TRUE))
	err = t(t(x)-x.mean)
	err.sq = err*err
	sum.err.sq = .Internal(colSums(err.sq,dimx[1],dimx[2],na.rm=TRUE))
	n = .Internal(colSums(!is.na(x),dimx[1],dimx[2],na.rm=TRUE))
	sqrt(sum.err.sq/(n-1))
}

row_sd = function(x){
	dimx = dim(x)
	x.mean = .Internal(rowMeans(x,dimx[1],dimx[2],na.rm=TRUE))
	err = x-x.mean
	err.sq = err*err
	sum.err.sq = .Internal(rowSums(err.sq,dimx[1],dimx[2],na.rm=TRUE))
	n = .Internal(rowSums(!is.na(x),dimx[1],dimx[2],na.rm=TRUE))
	sqrt(sum.err.sq/(n-1))
}

On Wed, Feb 25, 2009 at 5:11 PM, Jorge Ivan Velez
<jorgeivanvelez at gmail.com> wrote:
> Dear Matt,
> You're absolutely right. The reason why my code gives different results is
> because it calculates the standard deviation for each row/column in all the
> arrays rather than for every cell. My bad.
>
> You can easily get the results you posted running
>
>>apply(p,1:2,sd,na.rm=TRUE)
>
> Here is another option which gives the same results you posted :-)
>  Unfortunately there is not timing improvement :(   (Note that I ran both
> sd1 and sd2 functions using pp instead of p)
>
> # -------------
> # System times
> # -------------
>> pp <- array(c(1:5, rep(NA, times = 3)), dim = c(1000, 1000, 60))
>> system.time(apply(pp, 1:2, sd1))
> #   user  system elapsed
> #  93.16    0.23   94.87
>> system.time(apply(pp, 1:2, sd2))
> #   user  system elapsed
> #  66.51    0.30   67.84
>
> # ----------
> # Functions
> # ----------
> sd1<-function(x) sd(x,na.rm=TRUE)
>
> sd2<- function(i){
> if(sum(!is.na(i))==0) temp.sd <- NA
> else temp.sd <- sd(i, na.rm = TRUE)
> temp.sd
> }
>
>
> Regards,
>
> Jorge
>
>
> On Wed, Feb 25, 2009 at 3:23 PM, Matt Oliver <moliver at udel.edu> wrote:
>
>> Hi Jorge, this does not seem to return the same thing as
>>
>> p <- array(c(1:5, rep(NA, times = 3)), dim = c(5, 5, 3))
>> sd_fun <- function(i){
>> if(sum(!is.na(i))==0){
>> temp.sd <- NA
>> }else{
>> temp.sd <- sd(i, na.rm = TRUE)
>> }
>> return(temp.sd)
>> }
>>
>> apply(p, 1:2, sd_fun)
>>
>> Am I missing something basic here?
>>
>>
>>
>> On Wed, Feb 25, 2009 at 11:47 AM, Jorge Ivan Velez <
>> jorgeivanvelez at gmail.com> wrote:
>>
>>>
>>> Hi Mark,
>>> There is a typo in the first way. My apologies. It should be:
>>> # First
>>> res<-apply(p,3,function(X)
>>>        c(scols=apply(X,2,sd,na.rm=TRUE),srows=apply(X,1,sd,na.rm=TRUE))
>>>        )
>>> res
>>>
>>> HTH,
>>>
>>> Jorge
>>>
>>>
>>> On Wed, Feb 25, 2009 at 11:42 AM, Jorge Ivan Velez <
>>> jorgeivanvelez at gmail.com> wrote:
>>>
>>>>
>>>> Dear Matt,
>>>>
>>>> Here are two ways:
>>>>
>>>> # Data
>>>> p <- array(c(1:5, rep(NA, times = 3)), dim = c(5, 5, 3))
>>>>
>>>> # First
>>>> res<-apply(p,3,function(X)
>>>>        c(scols=apply(X,2,sd,na.rm=TRUE),srows=apply(X,2,sd,na.rm=TRUE))
>>>>        )
>>>> res
>>>>
>>>> # Second
>>>> res2<-apply(p,3,function(X)
>>>>
>>>> list(scols=apply(X,2,sd,na.rm=TRUE),srows=apply(X,1,sd,na.rm=TRUE))
>>>>        )
>>>>
>>>> lapply(res2,function(x) do.call(rbind,x))
>>>>
>>>> HTH,
>>>>
>>>> Jorge
>>>>
>>>>
>>>> On Wed, Feb 25, 2009 at 10:53 AM, Matt Oliver <moliver at udel.edu> wrote:
>>>>
>>>>> Dear help, suppose I have this array and want to compute sd aross rows
>>>>> and
>>>>> columns.
>>>>>
>>>>> p <- array(c(1:5, rep(NA, times = 3)), dim = c(5, 5, 3))
>>>>>
>>>>> apply(p, 1:2, sd) fails because sd requires at least 2 numbers to
>>>>> compute sd
>>>>>
>>>>> apply(p, 1:2, sd, na.rm = TRUE) fails for the same reason
>>>>>
>>>>> I crafted my own function that does what I want
>>>>>
>>>>> sd_fun <- function(i){
>>>>> if(sum(!is.na(i))==0){
>>>>> temp.sd <- NA
>>>>> }else{
>>>>> temp.sd <- sd(i, na.rm = TRUE)
>>>>> }
>>>>> return(temp.sd)
>>>>> }
>>>>>
>>>>>
>>>>> apply(p, 1:2, sd_fun)
>>>>>
>>>>> This does what I want, but when I scale up to large arrays like
>>>>>
>>>>> pp <- array(c(1:5, rep(NA, times = 3)), dim = c(1000, 1000, 60))
>>>>>
>>>>> the apply function takes a long time to run.
>>>>>
>>>>> Is there a faster, more efficient way to do this?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> --
>>>>> Matthew J. Oliver
>>>>> Assistant Professor
>>>>> College of Marine and Earth Studies
>>>>> University of Delaware
>>>>> 700 Pilottown Rd.
>>>>> Lewes, DE, 19958
>>>>> 302-645-4079
>>>>> http://www.ocean.udel.edu/people/profile.aspx?moliver
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Matthew J. Oliver
>> Assistant Professor
>> College of Marine and Earth Studies
>> University of Delaware
>> 700 Pilottown Rd.
>> Lewes, DE, 19958
>> 302-645-4079
>> http://www.ocean.udel.edu/people/profile.aspx?moliver
>>
>>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Mike Lawrence
Graduate Student
Department of Psychology
Dalhousie University

Looking to arrange a meeting? Check my public calendar:
http://tinyurl.com/mikes-public-calendar

~ Certainty is folly... I think. ~