[BioC] which genes to choose for being truly differentially expressed from a long list

Tue Dec 3 17:15:48 CET 2013

Hi EJ,

This is the time to revisit the original hypothesis you are testing or, 
if you are trying to generate hypotheses, to decide which hypotheses you 
would find interesting.

In general, a set of differentially expressed genes isn't particularly 
compelling on its own, especially when you start to think about 
biological relevance and measurement error. For example, it's pretty 
easy to come up with a single gene that appears to be differentially 
expressed but is in fact a false positive. And what does a single 
differentially expressed gene (even if a true positive) mean in the 
greater biological context of your experiment?

So univariate statistics aren't that useful IMO in this context, and 
trying to reduce your set of genes to a tractable number that you can 
eyeballometrically test for 'interestingness' is likely not the way to 
go. In other words, taking a list of genes and scanning them to see if 
you can find particular genes that you already know are implicated in 
the process you are examining is not likely to be a useful exercise.

Instead, you might re-consider the rationale for doing this experiment 
and see if there is something you can do to further that cause. As an 
example, perhaps you are trying to find pathways that are perturbed by 
some treatment. There are any number of ways to try to tease that sort 
of information out; you can do GO hypergeometric tests (or KEGG tests 
if you prefer). Or you could look for interesting gene sets at the 
Broad, and do GSEA against those. Or maybe you want to mine the data 
for interesting relationships using something like WGCNA (a Google 
search will get you there). Note that WGCNA is particularly useful if 
you have other (preferably continuous) phenotypes.
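
To make the GO/KEGG hypergeometric test mentioned above concrete, here is a 
minimal sketch of the underlying arithmetic in Python, using only the 
standard library. The function name and every count in the example are 
hypothetical, chosen purely for illustration; they are not from EJ's data. 
In R the same upper-tail probability is given by 
phyper(k - 1, K, N - K, n, lower.tail = FALSE).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: genes in the universe (e.g. all genes measured on the array)
    K: genes in the universe annotated to the term or pathway
    n: genes in the differentially expressed list
    k: genes in that list that are annotated to the term
    """
    total = comb(N, n)
    upper = min(K, n)  # the overlap cannot exceed either set
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts, for illustration only: a universe of 20000 genes,
# 200 of them annotated to one term, a list of 500 differentially
# expressed genes, 15 of which carry that annotation (5 expected by chance).
p_value = hypergeom_upper_tail(k=15, K=200, n=500, N=20000)
print(f"P(overlap >= 15) = {p_value:.3g}")
```

A test like this is run once per term, so the resulting p-values still need 
a multiple-testing adjustment, and the choice of universe (all genes on the 
array versus only those that passed filtering) changes the answer.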

Given that you apparently have a big signal here, I would recommend 
trying to use that signal rather than chopping away at your data simply 
to reduce the dimensionality.

Best,

Jim

On Tuesday, December 03, 2013 10:39:05 AM, Ekta Jain wrote:
> Thank you James. I actually meant to ask: if I have a list of 500
> genes and they all have fold change values in the range of 2-3, and I
> condense the list based on a cutoff of, say, 2.8 fold change, I
> still have a list of 200-some genes. How could I eliminate other genes
> by including more parameters in the threshold that defines them as
> showing differential expression? I am not sure if there is any such
> thing, but I suppose there might be some definition which helps to
> identify genes as perfect candidates for being differentially expressed.
>
> Appreciate your help,
>
> EJ
>
>
> On 3 December 2013 20:42, James W. MacDonald <jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>> wrote:
>
>     Hi EJ,
>
>
>     On Tuesday, December 03, 2013 10:04:01 AM, EJ [guest] wrote:
>
>
>         Dear All,
>         I used LIMMA for a dataset on Human plus 2.0 array to get fold
>         change values for differentially expressed genes. I have a
>         long list of 500 some genes with fold changes > 2 from the
>         topTable function. How can I select genes which are most
>         differentially expressed from this list?
>
>
>     You will have to first define what you mean by 'most
>     differentially expressed'. If you mean 'largest fold change' then
>     please see ?topTable, specifically the sort.by
>     argument.
>
>     Best,
>
>     Jim
>
>
>
>         Thank you,
>         EJ
>
>           -- output of sessionInfo():
>
>         R version 3.0.1 (2013-05-16)
>         Platform: i386-w64-mingw32/i386 (32-bit)
>
>         locale:
>         [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
>         [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
>         [5] LC_TIME=English_India.1252
>
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets
>          methods
>         [8] base
>
>         other attached packages:
>         [1] limma_3.16.7       affy_1.38.1        Biobase_2.20.1
>         BiocGenerics_0.6.0
>
>         loaded via a namespace (and not attached):
>         [1] affyio_1.28.0         BiocInstaller_1.10.3
>          preprocessCore_1.22.0
>         [4] zlibbioc_1.6.0
>
>         --
>         Sent via the guest posting facility at bioconductor.org
>         <http://bioconductor.org>.
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099