[R] resampling for correlation and testing

Benton, Paul hpaul.benton08 at imperial.ac.uk
Thu Mar 29 04:03:08 CEST 2012


On Mar 29, 2012, at 1:41 AM, ilai wrote:

> On Wed, Mar 28, 2012 at 3:53 PM, Benton, Paul
> <hpaul.benton08 at imperial.ac.uk> wrote:
>> Hello all R-er,
>> 
>> I'm trying to run a resampling method on some data. The current method I have takes 2+ days or a lot of memory . I was wondering if anyone has a better suggestion.
>> 
>> Currently I take a matrix and get the correlation matrix from it. This will be called rho.A. Each element in this will be tested against the distribution from the resampled correlation B matrix.
>> 
>> Some example code:
>> 
>> A<-matrix(rnorm(100), ncol=10)
>> B<-matrix(rnorm(100), ncol=10)
>> 
>> rho.A<-cor(A)
>> 
>> {
>> idx<-sample(1:10, 10)
>> idx
>> # [1] 8 4 5 7 1 9 2 10 6  3
>> 
>> rho.B<-cor(B[,idx])
>> } ## repeat this x time (currently 500)
>> 
>> ## in essence we then have the following :
>> rho.arrayB<-array(runif((10*10)*500), dim=c(10,10,500))
> 
> Err... no we don't. sample(10,10) ; sample(10,10) ... only permutes
> the columns, so the 500 cor(B) have exactly the same values in
> different off diag positions. Using runif they are unique

My apologies for trying to make some example data. I should have done exactly what I was doing, which would have been harder to read. Since runif doesn't give exactly the same thing I'll give my functions that I'm using.

## note rperm was originally submitted to the stackexchange.com
## http://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations
rperm <- function(m, size=2, replaceTF=TRUE) { # Obtain m unique permutations of 1:size
    # Function to obtain a new permutation.
    newperm <- function() {
        count <- 0                # Protects against infinite loops
        repeat {
            # Generate a permutation and check against previous ones.
            p <- sample(1:size, replace=replaceTF)
            hash.p <- paste(p, collapse="") # make a name for the list
            if (is.null(cache[[hash.p]])) break 
            ## stop if we have not already made this seq - repeat to make another new seq

            # Prepare to try again.
            count <- count+1
            if (count > 1000) {   # 1000 is arbitrary; adjust to taste
                p <- NA           # NA indicates a new permutation wasn't found
                hash.p <- ""
                break
            }
        }
        cache[[hash.p]] <<- TRUE  # Update the list of permutations found
        p                         # Return this (new) permutation
    }

    # Obtain m unique permutations.
    cache <- list()
    replicate(m, newperm())  
} 

library(parallel)
cl <- makeCluster(detectCores())
bootComb<-t(rperm(times, ncol(mat.B))) 
pValues.mat<-matrix(NA, nrow=nrow(mat.B), ncol=nrow(mat.B))
	for(i in 1:nrow(mat.B)){
		cat(round(i/nrow(mat.B),2)*100, "% \r")
		rho<-parApply(cl, bootComb, 1, function(idx, xmat, ymat){
			rho<-apply(ymat[,idx], 1, function(ymat, xmat){
				## parallel here due to memory sizes
				rho<-cor(xmat, ymat)
			}, xmat[idx]) ##gives back a vector of rho for n vs m
		}, mat.B[i, ], mat.B) ## returns a matrix of rho's for each combination
		
		rho.all<- cbind(rho, rho.A[i,])
		pValues.mat[i, ] <- parApply(cl, rho.all, 1, function(rhoVec){
			tl <- length(rhoVec) ## test index
			bl <- tl-1 ## dist index
			wilcox.test(rhoVec[1:bl], rhoVec[tl])$p.value
		})
	}
	stopCluster(cl)


>> 
>> ## Then test if rho.A[1,1] come from the distribution of rho.B[1,1]
>> pvalueMat[1,1]<-wilcox.test(rho.array[1,1,] , rho.A[1,1])$p.value
>> 
> 
> From what I know cor(A)[ i , i ] = cor(B)[ j , j ] = 1   for any
> choice of A,B,i and j

No, cor(a)[i, j] != cor(b)[i , j] 

If your concern is because they are coming from the same distribution then again this is example data. Even then I would imagine that the correlation would be different for small n. Either way resampling the columns will give different  correlation values. Please try the code yourself.

> I don't think Wilcox intended his test to be used in this way….
Probably true …. care to suggest a different statistical test ?

> 
> I would start with fixing these issues first so you don't wait 2 days
> for a vector of NaN's

Actually I didn't get a vector or NaN's I got a matrix of pvalues. Which has proved useful. Now I want to make the function faster and was looking for a bit of help.

> 
> Cheers
> 
> 
>> However, my array size would be 2300 x 2300 x 500 which R won't let me even make as an empty structure. Any suggestion are more than welcomed !!
>> 
>> Cheers,
>> 
>> Paul
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list