[BioC] Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data

Gordon K Smyth smyth at wehi.EDU.AU
Sun Jun 30 05:42:06 CEST 2013


Dear Tom,

I must say that this is first time I have heard anyone suggest that a 
mixed linear model is easier or simpler than an ordinary linear model. 
Having used the lme4 package myself, the opposite seems true to me by a 
very long way.

Do you know that mixed models have been proposed and implemented several 
times for microarray data?  For example the maanova package or the 
lmscFit() separate channel approach of the limma package.  A recent paper 
by Naomi Altman and myself gives a comparison of some of the approaches:

   http://www.biomedcentral.com/1471-2105/14/165

For some designs, a mixed model or separate approach can indeed recover 
more information than a classic log-ratio analysis.  I don't fully follow 
the approach that you describe, but it sounds to have some similarities 
with the limma separate channel approach as well as with maanova.

limma can be generalized to RNA-seq, see

   http://www.statsci.org/smyth/pubs/VoomPreprint.pdf

limma does allow for a treatment of randomized block designs but, apart 
from this exception, it is certainly true that RNA-seq analysis methods do 
not allow random effect models.  So there is a gap in the market if you 
are keen to fill it.

There isn't any data object class called RGLimma that I know of.  Do you 
mean RGList?

Best wishes
Gordon

> Date: Sat, 29 Jun 2013 02:49:53 +0000
> From: Tom Wenseleers <Tom.Wenseleers at bio.kuleuven.be>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: [BioC] Statistical power of limma vs mixed model approach to
> 	analyze microarray and differential proteomics/peptidomics experiments
> 	and Poisson/binomial mixed models for RNAseq data
>
> Dear all,

> I was just analyzing some data from a differential (dual label LC/MS) 
> single factor peptidomics experiment using both a per-feature mixed 
> model approach (using lme4's lmer function, using a model of the form 
> intensity/abundance ~ (1|run)+label+treatment, and testing significance 
> and calculating least square means of the differential expression ratios 
> for the different treatments using lmerTest and lsmeans, using a 
> Satterthwaite df approximation) and an empirical Bayes standard 
> deviation shrinkage method as implemented in limma (by putting my data 
> in an RGLimma object). To my surprise, I am finding more statistically 
> significantly differentially expressed features using the mixed model 
> approach than using limma (independent of the types of normalizations I 
> use in limma, and I did check that the sd vs abundance was flat). 
> Before, I have also observed the same phenomenon in some microarray 
> datasets.
>
> I am wondering if this would point towards a mixed model approach having 
> superior statistical power than limma, even for this very simple design. 
> Could this be the case? I only did 8 replicate runs in this case 
> (corresponding to arrays in 2-color microarray exps), so not terribly 
> many... Could it be that limma has superior statistical power for the 
> analysis of relatively small designs with few experimental replicates 
> (where the SD shrinkage could be beneficial), but that for larger number 
> of replicates and also for complex designs (in particular when there are 
> multiple sources of dependencies in the data) a mixed model approach 
> would work better?
>
> Also, would a mixed model approach in general not be much easier to 
> specify (just requiring the model formula to be specified, as opposed to 
> all the contrasts, which becomes pretty tedious for complex designs) and 
> statistically be more flexible and powerful for multifactorial 
> experiments (e.g. easily allowing for treatment interaction effects to 
> be tested, and also allowing for multiple crossed random factors, e.g. 
> to take into account several dependencies in the data, e.g. owing to 
> spatial correlations in the microarray slides/print tip effects, other 
> random factors, e.g. due to the use of samples with multiple genetic 
> backgrounds or the presence of repeated measures factors, etc). For 
> repeated measures designs I would also think that mixed model fits 
> obtained using nlme could be even more flexible, e.g. allowing for 
> various custom error covariance structures (e.g. to take into account 
> temporal autocorrelations, etc); in fact, even lmer supports 
> unstructured covariance mode! ls (which would allow variances to be 
> different in your different groups, which I think could be quite 
> common). Model selection, on the other hand, would appear a little more 
> tricky, but also feasible, e.g. by minimizing the AIC of a model that is 
> fit over all features/genes combined (as in 
> http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b).
>
> Finally, an important advantage that I would see of using this approach 
> is that it would readily generalize to RNAseq data if one would use 
> generalized mixed models (glmer's) with a Poisson error structure 
> (correcting for the total nr of mapped reads, etc and other important 
> covariates) (or, alternatively, using a binomial error structure, to 
> analyze the proportion of reads mapping to a particular gene). 
> Overdispersion in the data in this approach could readily be taken into 
> account by including an observation-level random factor in the model, 
> cf. 
> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015867.html. 
> The advantage here again would be that complex designs, e.g. with 
> multiple crossed random factors, could readily be taken into account (to 
> my understanding, edgeR or DEseq do not allow for this, right?). Would 
> anyone perhaps have any thoughts why not to go for such a generic 
> approach?
>
> Would there perhaps be any market for a package to analyze differential 
> expression in limma RGLimma and ESet data (or CountDataSet or DGEList 
> objects for RNAseq data) using the approach I outline above? If there is 
> enough interest, I would be keen to develop it, as it would seem pretty 
> straightforward to do...
>
> Cheers,
> Tom Wenseleers
>
>
> _______________________________________________________________________________________
>
> Prof. Tom Wenseleers
> *      Lab. of Socioecology and Social Evolution
>           Dept. of Biology
>           Zoological Institute
>           K.U.Leuven
>           Naamsestraat 59, box 2466
>           B-3000 Leuven
>           Belgium
> * +32 (0)16 32 39 64 / +32 (0)472 40 45 96
> * tom.wenseleers at bio.kuleuven.be
> http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list