[BioC] combining GSA and lmFit

Wed Apr 29 21:21:13 CEST 2009

Hi,
  a very slightly contrary point of view:

Gordon K Smyth wrote:
> Dear Dick,
> 
> On Tue, 28 Apr 2009, Dick Beyer wrote:
> 
>> Hi Gordon,
>>
>> Thanks for sharing your views on this topic.
>>
>> I was wondering, when you say "Thirdly, GSA computes p-values from
>> permutation, and permutation does not perform well for linear models,"
>> what is it about the permutation approach that does not perform very
>> well?  Is it due to the null hypothesis being equality of
>> distributions rather than assuming you are testing equality of means?
> 
> Well, you could put it like that.  Suppose you have a one-way layout.
> The trouble is that the groups other than the two you want to compare
> interfere rather than help with the permutations.  Have a look at the
> research literature -- there's very little on permutation outside the
> two group problem.

  AFAIK there has been a great deal of work, and some of the fundamental
problems in dealing with complex experiments have been addressed in recent
years. Pesarin's "Multivariate Permutation Tests" is one reasonable book,
and the coin package, from CRAN, offers another view.
  There are a few other tools around.

> 
>> I see that the Bioconductor package GSEAlm which uses linear models
>> with GSEA and uses sample permutations might have similar problems as
>> combining GSA and lmFit.  I guess I was thinking a GSA/lmFit
>> combination would be OK because of people using GSEAlm.  But if the
>> underlying null assumption is equality of distributions rather than a
>> weaker null of equality of means, then that would be important to keep
>> in mind when interpreting the resulting p-values.
> 
> My understanding is that GSEAlm implements proposals from a nice paper
> published in Bioinformatics and is concerned with issues different to
> your GSA/lm combination, but the authors may have more to say.

  Not sure I am following the necessary distinction here. Yes, in general
permutation tests are modeling equality of distributions, and that is what
is going on at present in GSEAlm. However, one can take much stronger views
using either of the references posted above.
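
  To make the "equality of distributions" null concrete, here is a minimal
sketch in Python (standard library only) of a label-permutation test in the
one-way layout mentioned earlier in the thread. It is not taken from GSA,
GSEAlm, limma or coin; the function name perm_pvalue, its arguments and the
data are all made up for illustration. It compares groups A and B either by
shuffling labels within A and B only, or by shuffling labels across all
three groups, which simulates the null that every group is exchangeable.

```python
import random
from statistics import mean


def perm_pvalue(x, y, extra=(), n_perm=2000, seed=1):
    """Two-sided permutation p-value for |mean(x) - mean(y)|.

    Hypothetical helper for illustration only. `extra` holds observations
    from groups that are not being compared. If it is non-empty, labels are
    shuffled across all observations, so the simulated null is that every
    group is exchangeable, not just the two groups of interest.
    """
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y) + list(extra)
    n_x, n_y = len(x), len(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Relabel: first n_x values play group x, the next n_y play group y.
        stat = abs(mean(pooled[:n_x]) - mean(pooled[n_x:n_x + n_y]))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: A and B differ by about 1, C sits far away from both.
group_a = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0]
group_b = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0]
group_c = [100.0, 100.1, 99.9, 100.2, 99.8, 100.0]

p_restricted = perm_pvalue(group_a, group_b)
p_all = perm_pvalue(group_a, group_b, extra=group_c)
print(f"A vs B, permuting A and B only:     p = {p_restricted:.4f}")
print(f"A vs B, permuting all three groups: p = {p_all:.4f}")
```

  With these toy data the restricted scheme gives a small p-value, while
shuffling across all three groups gives a large one, because the third group
dominates the permutation distribution of the A-vs-B statistic. That is one
way to read the remark that the other groups "interfere rather than help".
The restricted scheme avoids this but makes no use of group C at all,
whereas a linear model would use it to estimate the residual variance.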

  best wishes
    Robert

> 
>> I wonder if there would be a useful way to test the issues you raise
>> about the effect on gene set analysis when using limma or SAM
>> statistics which depend on ensembles of genes, as well as the effect
>> of the moderated statistics of limma or SAM on the GSA standardization
>> method. I'm not sure I understand the GSA steps enough to know if
>> those are designed to take care of problems you might otherwise use a
>> SAM-type statistic to deal with.
> 
> If it was easy to do GSA for linear models, I expect that Efron and
> Tibshirani would have done it.  I'm not sure it is wise to cobble
> something ad hoc together if you don't understand the theory very well.
> 
> I didn't reply in order to point you to my work, but my group has taken
> an approach to GSEA for linear models which we're very happy with, which
> is implemented in the roast() and romer() functions of the limma package.
> 
> Best wishes
> Gordon
> 
>> Lots to think about.  Thanks very much for your comments.
>>
>> Cheers,
>> Dick
>> *******************************************************************************
>>
>> Richard P. Beyer, Ph.D.    University of Washington
>> Tel.:(206) 616 7378    Env. & Occ. Health Sci. , Box 354695
>> Fax: (206) 685 4696    4225 Roosevelt Way NE, # 100
>>             Seattle, WA 98105-6099
>> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
>> http://staff.washington.edu/~dbeyer
>> *******************************************************************************
>>
>>
>> On Tue, 28 Apr 2009, Gordon K Smyth wrote:
>>
>>> Dear Dick,
>>>
>>> Anything in GSA which works with the SAM statistic should also work fine
>>> with limma moderated t-statistics.
>>>
>>> However there are several issues that come to my mind which affect both
>>> statistics.  Firstly, both SAM and limma statistics depend on the whole
>>> ensemble of genes, i.e., they are not merely computed genewise.  This is
>>> unlike the floored mean statistics assumed in the GSA theory paper.
>>> This has clear computational implications, but also could give rise to
>>> some theoretical issues.
>>>
>>> Secondly, it's not too clear to me whether it makes sense to compute
>>> regularized or moderated statistics after the standardization steps
>>> that GSA does.
>>>
>>> Thirdly, GSA computes p-values from permutation, and permutation does not
>>> perform well for linear models.
>>>
>>> These are simply my thoughts, which you asked for.  You may have ways
>>> around all these issues.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Date: Sun, 26 Apr 2009 20:43:28 -0700 (PDT)
>>>> From: Dick Beyer <dbeyer at u.washington.edu>
>>>> Subject: [BioC] combining GSA and lmFit
>>>> To: Bioconductor <bioconductor at stat.math.ethz.ch>
>>>>
>>>> Hi All,
>>>>
>>>> I have extended the GSA code (http://www-stat.stanford.edu/~tibs/GSA/) to
>>>> include lmFit() from the limma package so as to have linear model
>>>> capabilities with GSA.  Basically, I'm using the modified t-statistic
>>>> values from lmFit just like the SAM-like t-statistic values are used in
>>>> the GSA code.
>>>>
>>>> I was wondering if anyone had any thoughts on whether this was, in
>>>> principle, an OK thing to be doing.  I am worrying about whether
>>>> there is an underlying issue I'm not aware of in using the moderated
>>>> t-statistic values from limma as opposed to the SAM t-statistic values
>>>> that use the s0 term in the denominator.
>>>>
>>>> My tests on some microarray data I have show that, in a qqplot of
>>>> t-statistic values from the two methods, they are in pretty close
>>>> agreement except at large t-values.
>>>>
>>>> If anyone knows of reasons not to be doing this or could point me to
>>>> places with possible explanations, I'd be very grateful.
>>>>
>>>> Cheers,
>>>> Dick

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org