[R] Average 2 Columns when possible, or return available value

Sat Jun 26 02:24:08 CEST 2010

Just want to add that if you want to clean out the NA rows in a matrix
or data frame, take a look at ?complete.cases. Can be handy to use
with big datasets. I got curious, so I just ran the codes given here
on a big dataset, before and after removing NA rows. I have to be
honest, this is surely an illustration of the power of rowMeans. I'm
amazed myself.

DF <- data.frame(
  A=rep(DF$A,10000),
  B=rep(DF$B,10000)
)

> system.time(apply(DF,1,mean,na.rm=TRUE))
   user  system elapsed
  13.26    0.06   13.46

> system.time(matrix(rowMeans(DF, na.rm=TRUE), ncol=1))
   user  system elapsed
   0.03    0.00    0.03

> system.time(t(as.matrix(aggregate(t(as.matrix(DF)),list(rep(1:1,each=2)),mean,
+ na.rm=TRUE)[,-1]))
+ )

Timing stopped at: 227.84 1.03 249.31  -- I got impatient and pressed the escape

> DF <- DF[complete.cases(DF),]

> system.time(apply(DF,1,mean,na.rm=TRUE))
   user  system elapsed
   0.39    0.00    0.39

> system.time(matrix(rowMeans(DF, na.rm=TRUE), ncol=1))
   user  system elapsed
   0.01    0.00    0.02

> system.time(t(as.matrix(aggregate(t(as.matrix(DF)),list(rep(1:1,each=2)),mean,
+ na.rm=TRUE)[,-1]))
+ )
   user  system elapsed
  10.01    0.07   13.40

Cheers
Joris

On Sat, Jun 26, 2010 at 1:08 AM, emorway <emorway at engr.colostate.edu> wrote:
>
> Forum,
>
> Using the following data:
>
> DF<-read.table(textConnection("A B
> 22.60 NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
> 102.00 NA
>  19.20 NA
>  19.20 NA
>  NA NA
>  NA NA
>  NA NA
>  11.80 NA
>  7.62 NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
>  NA NA
>  75.00 NA
>  NA NA
>  18.30 18.2
>  NA NA
>  NA NA
>  8.44 NA
>  18.00 NA
>  NA NA
>  12.90 NA"),header=T)
> closeAllConnections()
>
> The second column is a duplicate reading of the first column, and when two
> values are available, I would like to average column 1 and 2 (example code
> below).  But if there is only one reading, I would like to retain it, but I
> haven't found a good way to exclude NA's using the following code:
>
> t(as.matrix(aggregate(t(as.matrix(DF)),list(rep(1:1,each=2)),mean)[,-1]))
>
> Currently, row 24 is the only row with a returned value.  I'd like the
> result to return column "A" if it is the only available value, and average
> where possible.  Of course, if both columns are NA, NA is the only possible
> result.
>
> The result I'm after would look like this (row 24 is an avg):
>
>  22.60
>    NA
>    NA
>    NA
>    NA
>    NA
>    NA
>    NA
> 102.00
>  19.20
>  19.20
>    NA
>    NA
>    NA
>  11.80
>  7.62
>    NA
>    NA
>    NA
>    NA
>    NA
>  75.00
>    NA
>  18.25
>    NA
>    NA
>  8.44
>  18.00
>    NA
>  12.90
>
> This is a small example from a much larger data frame, so if you're
> wondering what the deal is with list(), that will come into play for the
> larger problem I'm trying to solve.
>
> Respectfully,
> Eric
> --
> View this message in context: http://r.789695.n4.nabble.com/Average-2-Columns-when-possible-or-return-available-value-tp2269049p2269049.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php