[R] paired wilcox test on each row of a large dataframe

Dan Davison davison at stats.ox.ac.uk
Fri Feb 12 18:41:31 CET 2010


gauravbhatti <gaurav15984 <at> hotmail.com> writes:

> 
> 
> Hi,

> I have to calculate the V statistic for each row of a large data frame
> (28000 rows). I cannot use the multtest package for the paired Wilcoxon
> test. I have been using a for loop, which is slow. Is there a way to
> speed up the computation with another method, like apply or tapply?

Using a for loop is fine here (and basically unavoidable). If you need
it to be faster, use a matrix rather than a data.frame, i.e. make a
matrix containing columns 1-12, which are all numeric and so do not need
to be in a data frame.
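
As a minimal sketch of that conversion, assuming the data frame is
called `dat` (a hypothetical name, filled here with made-up numbers)
and that its first 12 columns are the numeric measurements:

```r
# Hypothetical stand-in for the real data: 5 rows, 12 numeric columns.
set.seed(1)
dat <- as.data.frame(matrix(runif(60), nrow = 5))
rownames(dat) <- paste0("row", 1:5)

# as.matrix() on an all-numeric data frame gives a numeric matrix
# and keeps the row and column names.
m <- as.matrix(dat[, 1:12])

# Indexing a matrix row returns a plain numeric vector, so no
# as.numeric() conversion is needed inside the loop.
x <- m[1, 1:6]
y <- m[1, 7:12]
```

The row names stay available through rownames(m), so nothing is lost
by leaving the data frame behind for the computation.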

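Below are versions using apply, sapply and an explicit for loop, all on
a matrix. There's not much difference in speed between them (about 2.5
to 2.7 seconds elapsed for 1000 rows). But the last one, in which the
data is in a data.frame with rownames, is much slower (about 5.9
seconds). Note that these calls were timed without paired=TRUE; add
that argument to wilcox.test to get the paired test and its V statistic.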


> d <- matrix(rnorm(12000), nrow=1000)
> system.time(ans <- apply(d, 1, function(row)
+     unlist(wilcox.test(row[1:6], row[7:12])[c("p.value","statistic")])))
   user  system elapsed 
  2.660   0.064   2.730 
> system.time(ans2 <- sapply(1:nrow(d), function(i)
+     unlist(wilcox.test(d[i,1:6], d[i,7:12])[c("p.value","statistic")])))
   user  system elapsed 
  2.480   0.108   2.583 
> system.time({
+     ans3 <- matrix(nrow=nrow(d), ncol=2)
+     for(i in 1:nrow(d)) {
+         ans3[i,] <- unlist(
+             wilcox.test(d[i,1:6], d[i,7:12])[c("p.value","statistic")])
+     }
+ })
   user  system elapsed 
  2.504   0.000   2.503 

> d <- as.data.frame(d)
> rownames(d) <- paste(letters, 1:nrow(d))
> system.time(ans2 <- sapply(1:nrow(d), function(i)
+     unlist(wilcox.test(as.numeric(d[i,1:6]),
+                        as.numeric(d[i,7:12]))[c("p.value","statistic")])))
   user  system elapsed 
  5.673   0.212   5.885 
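
Since the original question asks for the paired test and its V
statistic, here is a sketch of the matrix version with paired = TRUE
added. The helper name `paired_row` and the simulated data are
illustrative assumptions and were not part of the timings above;
vapply is used in place of sapply only to fix the shape of the result.

```r
set.seed(42)
d <- matrix(rnorm(12000), nrow = 1000)   # simulated data, 1000 rows x 12 columns

# Hypothetical helper: paired Wilcoxon signed-rank test on row i,
# columns 1-6 against columns 7-12.
paired_row <- function(i) {
    res <- wilcox.test(d[i, 1:6], d[i, 7:12], paired = TRUE)
    c(res$p.value, unname(res$statistic))
}

# One row of output per row of d: p-value in column 1, V in column 2.
ans <- t(vapply(seq_len(nrow(d)), paired_row, numeric(2)))
colnames(ans) <- c("p.value", "V")
head(ans)
```

The two columns of `ans` can then be attached to the original data
with cbind() if the results need to sit next to the measurements.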

Dan


> My data set looks like this:
>              11573_MB   11911_MB   11966_MB   12091_MB  12168_MB   12420_MB ...
> cg00000292 0.62123125 0.82663502 0.74687013 0.61774927 0.7337809 0.73203721
> cg00002426 0.63631315 0.64408750 0.61975158 0.72500713 0.5753110 0.65146526
> cg00003994 0.05035499 0.05189776 0.05882848 0.11198073 0.1313330 0.03883439
> cg00005847 0.13936423 0.14967690 0.31874454 0.15876243 0.1111117 0.15070058
> cg00006414 0.09059770 0.09915681 0.09952658 0.13955982 0.1757718 0.07566312
> cg00007981 0.05622769 0.04143790 0.07167018 0.08051046 0.1378107 0.05439999
> ...          11573_CB   11911_CB   11966_CB   12091_CB   12168_CB  12420_CB
> cg00000292 0.83059018 0.65396035 0.74519819 0.76007659 0.70335691 0.7857631
> cg00002426 0.61450928 0.59160923 0.69857198 0.73028911 0.71808719 0.6741295
> cg00003994 0.04223668 0.07910444 0.05416764 0.06156407 0.06381321 0.0643354
> cg00005847 0.13897704 0.06407313 0.20449931 0.15683154 0.18936196 0.1610695
> cg00006414 0.06520757 0.12243180 0.11380134 0.10957321 0.15759518 0.1236715
> cg00007981 0.04789030 0.11699024 0.07143036 0.05996888 0.10829510 0.1069037
> ...
> There are 12 columns and 27000 rows. I have to perform a paired test on
> each row (1:6 vs 7:12) and store the p-value and statistic in two
> columns. What's the fastest way?
> Gaurav Bhatti
>


