[BioC] Overlapping genes in subsets of lists

Martin Morgan mtmorgan at fhcrc.org
Wed Oct 8 15:50:56 CEST 2008

"Sean Davis" <sdavis2 at mail.nih.gov> writes:

> On Wed, Oct 8, 2008 at 8:34 AM, Heike Pospisil
> <pospisil at zbh.uni-hamburg.de> wrote:
>> Hello there,
>> I have 100 lists of differentially expressed genes, and I am trying to find
>> genes overrepresented in these 100 lists (I call them a 'cluster of genes').
>> What's worse, I expect not only one cluster of genes, but three or four or
>> five of them. That is why a simple intersection() will not help. I wish I
>> had a function that can select all genes which appear in 100% of 33 lists of
>> genes (cluster 1), all genes which appear in 100% of 22 lists (cluster 2) and
>> all genes which appear in 100% of the remaining 45 lists (cluster 3). (I hope
>> my explanation is clear).
>> Does anybody know a package or a strategy for defining such clusters?
> Just a thought, but you could make a matrix with "gene lists" as the
> columns (i.e., gene list 1 in column 1, gene list 2 in column 2, etc.)
> and rows with the union of all genes.  Put a "1" in each cell for a
> gene that is present in a gene list and "0" elsewhere.  Once you have
> this matrix, you can use normal clustering methods to look for
> patterns.  For example, you could produce a heatmap of these data and
> look for blocks.
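
A minimal base-R sketch of the matrix-and-clustering idea above. The
gene lists are made-up toy data, the choice of binary (Jaccard) distance
with average linkage is only one reasonable option, and k = 2 is an
assumption that would normally come from looking at the dendrogram.

```r
## Hypothetical toy data: six gene lists drawn from two underlying groups.
lists <- list(
  list1 = c("g1", "g2", "g3", "g7"),
  list2 = c("g1", "g2", "g3"),
  list3 = c("g1", "g2", "g3", "g8"),
  list4 = c("g4", "g5", "g6"),
  list5 = c("g4", "g5", "g6", "g9"),
  list6 = c("g4", "g5", "g6", "g7")
)

## Rows are the union of all genes, columns are the gene lists;
## a cell is 1 if the gene is in that list and 0 otherwise.
genes <- sort(unique(unlist(lists)))
m <- sapply(lists, function(x) as.integer(genes %in% x))
rownames(m) <- genes

## Cluster the lists (columns) on their binary (Jaccard) distance.
hc <- hclust(dist(t(m), method = "binary"), method = "average")
groups <- cutree(hc, k = 2)   # k is a guess here; inspect plot(hc) first

## Genes present in 100% of the lists of each cluster.
core <- lapply(split(names(groups), groups), function(nms) {
  genes[rowSums(m[, nms, drop = FALSE]) == length(nms)]
})
```

With real data, heatmap(m, scale = "none") is a quick way to look for
the blocks before settling on a number of clusters.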

One way of building and querying such a matrix, using the GSEABase
package, might be...

> library(GSEABase)
> data(sample.ExpressionSet)
> obj = sample.ExpressionSet

> gs1 = GeneSet(obj[200:230,], setName="set1")
> gs2 = GeneSet(obj[210:240,], setName="set2")
> gs3 = GeneSet(obj[220:250,], setName="set3")
> gsc = GeneSetCollection(gs1, gs2, gs3)
> inc = incidence(gsc)   # 0/1 matrix: gene sets in rows, genes in columns
> colnames(inc)[colSums(inc) == 3]   # genes present in all 3 sets
 [1] "31459_i_at" "31460_f_at" "31461_at"   "31462_f_at" "31463_s_at"
 [6] "31464_at"   "31465_g_at" "31466_at"   "31467_at"   "31468_f_at"
[11] "31469_s_at"
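
The same column-sum test extends to any subset of the sets, which is
what the original question needs once the clusters of lists are known.
A sketch in base R, assuming a 0/1 matrix with sets as rows and genes as
columns; the toy sets and the helper name in_all are hypothetical and
not part of GSEABase.

```r
## Hypothetical gene sets standing in for the GeneSets above.
sets <- list(
  set1 = c("a", "b", "c", "d"),
  set2 = c("b", "c", "d", "e"),
  set3 = c("c", "d", "e", "f")
)

## 0/1 matrix with sets as rows and genes as columns,
## the same orientation that the colSums() call above relies on.
genes <- sort(unique(unlist(sets)))
inc <- t(sapply(sets, function(x) as.integer(genes %in% x)))
colnames(inc) <- genes

## Helper (hypothetical name): genes found in every one of the named sets.
in_all <- function(inc, which_sets) {
  sub <- inc[which_sets, , drop = FALSE]
  colnames(sub)[colSums(sub) == nrow(sub)]
}

in_all(inc, c("set1", "set2"))          # "b" "c" "d"
in_all(inc, c("set1", "set2", "set3"))  # "c" "d"
```

The drop = FALSE keeps the subset a matrix even when only one set is
selected, so colSums() and colnames() still work.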

(if the gene sets are in a list 'lst', e.g., because they were created
in an lapply, then

> gsc = do.call("GeneSetCollection", lst)

saves some typing / coordination).


> Sean

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
