[BioC] yet another gene universe question

Max Kuhn mxkuhn at gmail.com
Thu Sep 30 19:52:49 CEST 2010


I have access to gene sets from 19 different databases (including GO
and KEGG). Some of these sets are highly curated collections for one
specific biological area (such as metabolism) while others are larger
(~6K gene sets). The distribution of gene sets per database is:

> stem(tbl)
  The decimal point is 3 digit(s) to the right of the |

  0 | 01122333446688925
  2 | 4
  4 |
  6 | 3

Appropriately defining the universe is critical, as people on this
list have previously demonstrated.
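
To make the sensitivity concrete, here is a minimal sketch of a one-sided hypergeometric (over-representation) test run against two universe sizes. All of the counts are hypothetical and chosen only for illustration; `enrich_p` is a made-up helper name, not a function from any package.

```r
# Hypothetical numbers, chosen only for illustration.
n_selected <- 200   # genes called significant
n_in_set   <- 100   # genes belonging to the gene set
n_overlap  <- 10    # significant genes that are also in the set

# One-sided hypergeometric p-value, P(X >= n_overlap), for a given universe size.
# phyper(q, m, n, k): m = genes in the set, n = genes not in the set, k = genes drawn.
enrich_p <- function(universe_size) {
  phyper(n_overlap - 1, n_in_set, universe_size - n_in_set, n_selected,
         lower.tail = FALSE)
}

p_small <- enrich_p(4000)    # e.g. only genes that appear in some gene set
p_large <- enrich_p(20000)   # e.g. every gene measured on the platform
print(c(small_universe = p_small, large_universe = p_large))
```

The same overlap looks far more surprising against the larger universe, since the expected overlap drops from 5 genes to 1.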

Does anyone have an opinion about how to define the gene universe when:

1) the set of genes included across all the gene sets is small (say,
20% of the total number of genes).
2) only specific gene sets across databases are tested at once. For
example, someone might want to get all the gene sets for a specific
area (say, cell cycle) across the different databases and test those
at once.

I've been thinking that the universe ought to be the set of genes that
are available across all the gene sets being tested. In case 1 above,
this seems too small while in case 2 it seems excessively large (cue
the Goldilocks jokes).
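
As a rough sketch of the candidate definitions, assuming the gene sets are held as a named list of character vectors per database (the database names, set names, gene identifiers, and the helper `universe_of` are all hypothetical):

```r
# Hypothetical gene sets, keyed by database; identifiers are made up.
gene_sets <- list(
  dbA = list(cell_cycle = c("g1", "g2", "g3"), metabolism = c("g4", "g5")),
  dbB = list(cell_cycle = c("g2", "g3", "g6"), apoptosis = c("g7", "g8", "g9"))
)

# Universe = every gene that appears in at least one of the supplied sets.
universe_of <- function(sets) unique(unlist(sets, use.names = FALSE))

# Flatten to a single list of sets named "dbA.cell_cycle", "dbA.metabolism", ...
all_sets <- unlist(gene_sets, recursive = FALSE)

# Case 2: only the cell cycle sets are tested at once.
tested <- all_sets[grepl("cell_cycle$", names(all_sets))]

universe_tested <- universe_of(tested)    # genes in the tested sets only
universe_all    <- universe_of(all_sets)  # genes in any set from any database
print(c(tested = length(universe_tested), all = length(universe_all)))
```

The two universes are nested, so the choice comes down to which background the tested genes should be compared against.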

Thanks,

Max


