[BioC] Your thoughts on Limma Array Weights?

Fri Jun 27 06:41:27 CEST 2008

HI Mark,
The example I showed was a triplicate biological replicate and there I
decided to do use the weights over removing the strange array. In a
previous experiment I removed the "strange" array and did not use
weights. But I did not elaborate; in that latter case there was some
biology at work. The cells used in that experiment were all grown form
tissues and then passed through a process of differentiation. So I had
reason to believe the strange array was different due to biology (state
of differentiation perhaps) rather than some other "more technical"
effect (as the array and RNA otherwise appeared fine in other QC
measurements). So with that additional motivation I decided to drop it,
as it appeared to be a true biological outlier, and no magical weighting
can fix that, but you never know for sure if you've made the right
decision, you can only see if what you get makes sense... use GOstats
etc.

The paper describing the arrayWeights method
http://www.biomedcentral.com/content/pdf/1471-2105-7-261.pdf
shows that in simulations with 3 arrays per class that dropping a poor
array is worse than just keeping it and the best performance is obtained
from the weighting strategy. **My impressions** from hearing Gordon
Smyth talk is that his philosophy is to keep the arrays you have and
weight them appropriately, to mot remove them. Does he always weight
them in his own work, I don't know. It's probably in the grey judgment
call zone, like everything else.

 That paper also shows the same results for a simple t-test, arguing
that there is no strange "double-dipping in linear models" when using
linear models to get weights and a linear model to get differential
expression (though of course neither is just a simple linear model)

So, along the same lines unless there is something particularly glaring
you should keep your 2 arrays per class and perhaps try the weighing,
and gently remind your investigators (using bolt-cutters and a
blow-torch) how difficult it is to get a standard deviation from 2
measurements...

Best of luck
Paul

-----Original Message-----
From: Mark Cowley [mailto:m.cowley0 at gmail.com] 
Sent: Friday, June 27, 2008 11:37 AM
To: Matt Ritchie
Cc: Paul Leo; bioconductor
Subject: Re: [BioC] Your thoughts on Limma Array Weights?

Hi Paul and Matt,
have either of you compared situations with only small number of  
arrays in a 2 group comparison, eg 2 vs 2, or 3 vs 3 and your either  
throw one array away (due to QC), or just down-weight it?
I commonly throw the poor array away, but one data set that i'm  
currently working on only has 2 vs 2 (read: "pilot experiment"), and  
when you remove an array, then it's 2 vs 1 which is not much fun.

cheers,
Mark

On 27/06/2008, at 9:53 AM, Matt Ritchie wrote:

> Dear Paul,
>
> I have noticed cases where the results are 'better' (i.e. you get more
> extreme moderated t-statistics or log-odds) if you remove suspect  
> arrays.
> In one recent example I recall, the experimenter eventually  
> discovered that
> the genotype of a sample hybridised to one of their arrays was not  
> what they
> originally thought.  This meant that the linear model they were  
> fitting was
> not right.  Although the weight assigned to this array was small,  
> removing
> it from the analysis altogether still produced better results.  The  
> array
> weights method cannot correct for these kinds of gross errors.
>
> I usually take a try it and see approach in my own analyses, similar  
> to what
> you have done (i.e. run the analysis with equal weights, with array  
> weights,
> or after removing any suspect arrays altogether, then look at the  
> results of
> each to see which gives the most extreme statistics).
>
> Best wishes,
>
> Matt
>
>> I use limma quite a bit but have not really been using arrayWeights
>> much, until recently.
>> I like it a lot but have  found, in some cases, that it appears   
>> better
>> just ditch the very poorly performing arrays..and then I proceed  
>> without
>> weighing .
>>
>> What are peoples real world experience with arrayWeights, are you  
>> using
>> it routinely ?
>>
>> For example my typical usage... time series with biological  
>> triplicates
>>
>>> design
>>   t0hr t6hr t24hr t24p6hr
>> 1     1    0     0       0
>> 2     1    0     0       0
>> 3     1    0     0       0
>> 4     0    1     0       0
>> 5     0    1     0       0
>> 6     0    1     0       0
>> 7     0    0     1       0
>> 8     0    0     1       0
>> 9     0    0     1       0
>> 10    0    0     0       1
>> 11    0    0     0       1
>> 12    0    0     0       1
>>
>> arrayw<-arrayWeights(selDataMatrix,design=design)
>>> arrayw
>>        1         2         3         4         5         6         7
>> 8         9
>> 1.6473168 1.2716081 1.5170375 1.0310794 1.1010048 1.2787543 0.8198722
>> 0.7162097 2.3992850
>>       10        11        12
>> 0.1744961 1.3821469 0.6379648  ## note array 10:  which was a  
>> outlier in
>> hierarchical clustering (though was still more similar to arrays its
>> biological replicates than any other arrays (based on genes where
>> sd/mean> 0.1)..
>>
>> fit <- lmFit(selDataMatrix, design,weights=arrayw)
>> fit <- lmFit(selDataMatrix, design)
>>
>> cont.matrix <- makeContrasts(
>> tchange6hr="t6hr-t0hr" ,
>> tchange24hr="t24hr-t0hr" ,
>> tchange24p6hr="t24p6hr-t0hr" ,
>> diff0to6="t6hr-t0hr" ,
>> diff6to24="t24hr-t6hr" ,
>> diff24to24p6="t24p6hr-t24hr" ,
>> levels=design)
>>
>> fit2 <- contrasts.fit(fit, cont.matrix)
>> fit2 <- eBayes(fit2)
>>
>> ** Get
>>> sum(topTable(fit2,coef=1,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 2927
>>> sum(topTable(fit2,coef=2,adjust="fdr",number=6000)[,"B"]>1)
>> [1] 5263
>>> sum(topTable(fit2,coef=3,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 2083
>>> sum(topTable(fit2,coef=4,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 2927
>>> sum(topTable(fit2,coef=5,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 2931
>>> sum(topTable(fit2,coef=6,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 3810
>>
>> ####################### AS APPOSED TO THE TYPICAL:
>>
>> fit <- lmFit(selDataMatrix, design)
>> fit2 <- contrasts.fit(fit, cont.matrix)
>> fit2 <- eBayes(fit2)
>>
>> ** Get
>>> sum(topTable(fit2,coef=1,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 1725
>>> sum(topTable(fit2,coef=2,adjust="fdr",number=6000)[,"B"]>1)
>> [1] 3438
>>> sum(topTable(fit2,coef=3,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 1512
>>> sum(topTable(fit2,coef=4,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 1725
>>> sum(topTable(fit2,coef=5,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 1605
>>> sum(topTable(fit2,coef=6,adjust="fdr",number=5000)[,"B"]>1)
>> [1] 2610
>>
>> Is more differential expression  better .. always... I guess so  
>> unless
>> there are more false positives right?  I am slightly worried that in
>> using a linear model to access array quality and produce weights ,  
>> that
>> this will then naturally bias a method such as limma that then   
>> using a
>> linear model, again, to determine differential expression. After  
>> trying
>> a few different permutations (use weights, remove "worst" arrays and
>> redo without weights) that this is not a big concern but would  
>> welcome
>> some feedback from others and insight into how they are using this
>> function .
>>
>> Thanks
>> Paul
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor