[BioC] [ComBat] ComBat outputting NAs with Large Batches - Switch to Log Likelihood

Jason Stein steinja at ucla.edu
Wed Nov 21 20:00:33 CET 2012


Hi fellow ComBat fans,

Not sure if my previous message posted to the listserv because of the  
attachment.  Now re-sending without attachment:

I was having a problem with ComBat when using large batch sizes and  
the non-parametric parameter estimation.  Almost all probe intensity  
values after ComBat correction would be NAs.

I've traced the source of this problem and have a solution (switching  
from likelihood to log likelihood) described below.  I hope you all  
find it helpful, or that you would let me know if you find any errors.  
  Also, I can send the attachment which shows why I changed what I  
changed if anyone would like to take a look.

ComBat correction

ComBat was sometimes outputting NAs for all probe values using  
non-parametric parameter estimates.  I traced this to the likelihood  
calculation in the int.eprior function.  The original function is  
pasted below:

# Monte Carlo integration function to find the nonparametric  
adjustments for a particular batch
int.eprior <- function(sdat,g.hat,d.hat){
   g.star <- d.star <- NULL
   r <- nrow(sdat)
   for(i in 1:r){
     g <- g.hat[-i]
     d <- d.hat[-i]
     x <- sdat[i,!is.na(sdat[i,])]
     n <- length(x)
     j <- numeric(n)+1
     dat <- matrix(as.numeric(x),length(g),n,byrow=T)
     resid2 <- (dat-g)^2
     sum2 <- resid2%*%j
     LH <- 1/(2*pi*d)^(n/2)*exp(-sum2/(2*d))
     LH[LH=="NaN"]=0
     g.star <- c(g.star,sum(g*LH)/sum(LH))
     d.star <- c(d.star,sum(d*LH)/sum(LH))
     ##if(i%%1000==0){cat(i,'\n')}
   }
   adjust <- rbind(g.star,d.star)
   rownames(adjust) <- c("g.star","d.star")
   adjust
}

The likelihood calculation LH here goes to zero when sum2 is very  
large due to numerical precision errors.  This causes the parameter  
estimates g.star and d.star to become NaN from a divide by zero error,  
and consequently for the output of the program to be NAs. This can  
happen when there are a lot of samples in a batch or when there are  
large residuals (or both).  To correct this, can use the  
log-likelihood, which is less susceptible to numerical precision  
errors since it is not so small.

I changed the function to use log-likelihood and the entire edited  
function is now pasted below:

# Monte Carlo integration function to find the nonparametric  
adjustments for a particular batch
int.eprior <- function(sdat,g.hat,d.hat){
   #Get the number of probes
   r <- nrow(sdat);
   #The number of probes removing the probe of interest
   G = r - 1;
   #g.star and d.star are the output matrices
   g.star <- d.star <- matrix(NA,nrow=G,ncol=1);
   #Loop over each probe
   for(i in 1:r){
     #Remove the beta value of the probe of interest
     g <- g.hat[-i];
     #Remove the variance of the probe of interest
     d <- d.hat[-i];
     #Isolate the value of this probe across all samples
     x <- sdat[i,!is.na(sdat[i,])];
     #Number of samples in this batch
     n <- length(x)
     #A matrix of dimension gxn which contains the probe values  
repeated in each row
     dat <- matrix(as.numeric(x),length(g),n,byrow=T)
     #The residual between the probe values and the regression coefficient
     resid2 <- (dat-g)^2
     #Calculate Log-likelihood using Gaussian
     LL = matrix(0,nrow=length(G),ncol=1);
     for (j in 1:n) {
       LL = LL + -0.5*log(2*pi*d)-(resid2[,j]/(2*d));
     }
     #Order by the log-likelihood values to avoid Inf in parameter estimates
     orderind = order(LL,decreasing=TRUE);
     LL = LL[orderind];
     g = g[orderind];
     d = d[orderind];
     #Determine the parameter estimates
     numerator.g = 1;
     denominator = 1;
     numerator.d = 1;
     for (k in 2:G) {
       numerator.g = numerator.g + (g[k]/g[1]) * exp(LL[k]-LL[1]);
       denominator = denominator + exp(LL[k]-LL[1]);
       numerator.d = numerator.d + (d[k]/d[1]) * exp(LL[k]-LL[1]);
     }
     g.star[i] = g[1]*(numerator.g/denominator);
     d.star[i] = d[1]*(numerator.d/denominator);
   }
   adjust <- rbind(g.star,d.star)
   rownames(adjust) <- c("g.star","d.star")
   adjust
}



More information about the Bioconductor mailing list