[BioC] "romer"ing and "roast"ing around gene sets

Mon Jul 19 11:18:12 CEST 2010

Dear Gordon,

Thanks a lot for your answer. It clarifies all the issues. And, of course, 
thanks for such a nice piece of software.

Best,

R.

On Monday 19 July 2010 01:57:21 Gordon K Smyth wrote:
> Dear Ramon,
> 
> I agree.  Using roast() on a database of gene sets is fine as long as you
> allow for multiple testing in an appropriate way.  We provide the mroast()
> function to try to make this easier.  My lab recently had occasion to use
> mroast() with all canonical pathways and we found that it took only a few
> minutes on an oldish PC with nrotations=9999.
> 
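On allowing for multiple testing across a whole database of sets: one common choice is the Benjamini-Hochberg FDR adjustment, which is what p.adjust(p, method="BH") computes in R. The following is only a minimal Python sketch of that arithmetic. The name bh_adjust and the example p-values are invented for illustration, and nothing here is claimed to be what mroast() does internally.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper for illustration only; not part of limma.
    """
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented per-set p-values, one per gene set tested.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = bh_adjust(raw)
print(adjusted)
```

With ~1500 GO categories or 690 canonical pathways, the same function would simply be applied to the full vector of per-set p-values.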
> You're right, romer() and roast() are answering different questions.  As
> long as you're aware of this, then you're on firm ground.  The reason
> we suggest romer() for really large-scale testing is simply that
> roast() can give so many statistically significant results that they
> become harder to interpret, especially if you use set.statistic="msq".  This
> might not be a problem for you.
> 
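The "different questions" can be seen in a toy calculation based on the thought experiment quoted further down (every set DE to exactly the same degree). This is a rough Python sketch under strong assumptions, not limma code: roast() and romer() are rotation tests, whereas the two functions here use a simulated N(0,1) null and random resampling of genes purely for illustration. The names self_contained_p and competitive_p and all gene-level statistics are invented.

```python
import random


def self_contained_p(set_stats, n_sim=999, seed=1):
    """Self-contained question: is this set DE at all?

    Compares the set's mean statistic with means of N(0,1) draws
    (an assumed null for this sketch; one-sided, 'up' direction).
    """
    rng = random.Random(seed)
    k = len(set_stats)
    observed = sum(set_stats) / k
    hits = sum(
        1 for _ in range(n_sim)
        if sum(rng.gauss(0.0, 1.0) for _ in range(k)) / k >= observed
    )
    return (hits + 1) / (n_sim + 1)


def competitive_p(all_stats, set_index, n_sim=999, seed=1):
    """Competitive question: is this set more DE than other genes?

    Compares the set's mean statistic with means of random gene sets
    of the same size drawn from all genes.
    """
    rng = random.Random(seed)
    k = len(set_index)
    observed = sum(all_stats[i] for i in set_index) / k
    hits = sum(
        1 for _ in range(n_sim)
        if sum(rng.sample(all_stats, k)) / k >= observed
    )
    return (hits + 1) / (n_sim + 1)


gene_set = list(range(50))  # the first 50 of 1000 genes

# Scenario A: every gene is DE to exactly the same degree.
all_equal = [2.0] * 1000
p_self_a = self_contained_p([all_equal[i] for i in gene_set])
p_comp_a = competitive_p(all_equal, gene_set)

# Scenario B: only the genes in the set are DE.
only_set = [2.0] * 50 + [0.0] * 950
p_self_b = self_contained_p([only_set[i] for i in gene_set])
p_comp_b = competitive_p(only_set, gene_set)

print(p_self_a, p_comp_a, p_self_b, p_comp_b)
```

In scenario A the self-contained p-value is small while the competitive one is not, because no set stands out from the rest. In scenario B both are small.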
> At this stage, roast() is the more mature software product.  While we've
> used romer() for a study published in Blood, we haven't yet published the
> methodology in its own right, and it will probably be refined a bit more
> before we do so.
> 
> Thanks for the P.S. about the documentation.  I've updated it now.
> 
> Best regards
> Gordon
> 
> On Fri, 16 Jul 2010, Ramon Diaz-Uriarte wrote:
> > Dear Gordon,
> >
> > Reading your email, I think there is something I am not following
> > completely. You say, regarding the GSEA-like approach in "romer"
> >
> >> This is actually a biologically well-motivated approach when you are
> >> testing large numbers of sets.
> >>
> >> If you want to test every set in the MSigDB, then testing one by one
> >> with roast() would probably be just too slow anyway.  romer() is more
> >> efficient when the number of sets is very large.
> >
> > What I found very attractive about roast is that the differential
> > expression test is done for groups of genes so, in addition to possible
> > increases in power, interpretation is simplified (e.g., if we use all
> > the GO categories, we deal only with ~ 1500 entities).  Even if the
> > examples in your Bioinformatics paper involve just a few sets, I was
> > thinking about systematically using roast in, say, all GO categories, or
> > all the 690 canonical pathways.
> >
> > Moreover, if we want to use the "focused gene testing", even if roast
> > takes longer, I do not see how the larger efficiency of romer would make
> > it an alternative procedure: they are answering different questions,
> > right?
> >
> >
> > But now, I am starting to think that maybe the idea of systematically
> > testing all 1500 GO categories might be a bad idea.
> >
> >
> > Best,
> >
> > R.
> >
> > P.S. The help for roast says that y must be a numeric matrix. But I think
> > it works fine with ExpressionSet objects directly, too.
> >
> > On Thursday 15 July 2010 03:29:49 Gordon K Smyth wrote:
> >> Dear Robert,
> >>
> >> I'm just adding briefly to Di's comments.
> >>
> >>> From: "Robert M. Flight" <rflight79 at gmail.com>
> >>> To: bioconductor at stat.math.ethz.ch
> >>> Subject: [BioC] "romer"ing and "roast"ing around gene sets
> >>>
> >>> Hi All,
> >>>
> >>> I am having trouble with the distinction between the functions "roast"
> >>> and "romer" in the limma package. From the publication describing
> >>> "roast" (http://dx.doi.org/10.1093/bioinformatics/btq401), it seems
> >>> that it tests a particular gene set for differential expression,
> >>> whereas "romer" tests a battery of sets to find those that are
> >>> differentially expressed compared to the rest?
> >>
> >> Yes.
> >>
> >>> I am really having trouble discerning the true difference between these
> >>> two, and how they compare to GSEA. I always thought that the primary
> >>> purpose of GSEA was to determine those gene sets that are significantly
> >>> associated with a phenotypic comparison, i.e. those gene sets showing
> >>> differential expression.
> >>
> >> This is an understandable assumption, which isn't quite true!  GSEA
> >> actually tries to pick out the sets that stand out as more strongly
> >> differentially expressed (DE) than others.  So, if all the sets were DE
> >> to exactly the same degree, then GSEA wouldn't find anything
> >> significant, because no set would stand out from the others.  This is
> >> actually a biologically well-motivated approach when you are testing
> >> large numbers of sets.
> >>
> >> If you want to test every set in the MSigDB, then testing one by one
> >> with roast() would probably be just too slow anyway.  romer() is more
> >> efficient when the number of sets is very large.
> >>
> >> Beware that romer(), like GSEA, tends to give pretty modest p-values.
> >> The ranking of the sets may be more useful than the absolute p-values.
> >>
> >> Best wishes
> >> Gordon
> >>
> >>> If any one can help me clear this up, that would be great, because as
> >>> of now I am thoroughly confused. To me, if I have a dataset, and I want
> >>> to know which gene sets (from say MSigDB) are differentially expressed,
> >>> then it sounds like I would use "roast", but the way it is described in
> >>> the publication (and the help in limma), this isn't what I would do,
> >>> but rather I should use "romer", and see if any of the sets show
> >>> differential expression compared to the rest in the database.
> >>>
> >>> Color me confused,
> >>>
> >>> -Robert
> >>>
> >>> Robert M. Flight, Ph.D.
> >>> Bioinformatics and Biomedical Computing Laboratory
> >>> University of Louisville
> >>> Louisville, KY
> >>>
> >>> PH 502-852-0467
> >>> EM robert.flight at louisville.edu
> >>> EM rflight79 at gmail.com
> >>>
> >>> Williams and Holland's Law:
> >>>     If enough data is collected, anything may be proven by
> >>> statistical methods.
> 