[BioC] GO term enrichment analysis over whole genome, copy number aberration investigation

michael watson (IAH-C) michael.watson at bbsrc.ac.uk
Fri Sep 5 13:21:07 CEST 2008


I'd say you have two choices of universe.  Your list of IDs consists of
HUGO gene ids, so your universe must be based on these.

Choice 1: List of HUGO ids in your genome of choice.  Of course, the
number of HUGO ids may not match the number of Entrez Gene ids, or any
other ids, as no two ID systems map perfectly one-to-one.

Choice 2: List of HUGO ids in your genome of choice that have at least
one GO term.  

For some genomes, choices one and two will be the same; for others they
will be radically different.
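
To see why the choice matters, here is a rough sketch in base R of the
same overlap scored against both universes with a plain hypergeometric
test.  Every count is made up for illustration, and enrich_p is a
hypothetical helper, not a function from any package.  A conditional
test, as in the original question, would not give these exact numbers.

```r
# P(X >= overlap) when list_size genes are drawn without replacement
# from a universe in which term_size genes carry the GO term.
enrich_p <- function(overlap, term_size, universe_size, list_size) {
  phyper(overlap - 1, term_size, universe_size - term_size, list_size,
         lower.tail = FALSE)
}

# Choice 1 (hypothetical counts): 20000 ids in the universe,
# 300 significant ids, 10 of them annotated to a term of size 200.
p_all <- enrich_p(overlap = 10, term_size = 200,
                  universe_size = 20000, list_size = 300)

# Choice 2 (hypothetical counts): only 12000 ids have a GO term, and
# only 200 of the 300 significant ids survive the same filter.
p_go <- enrich_p(overlap = 10, term_size = 200,
                 universe_size = 12000, list_size = 200)

print(c(all_ids = p_all, ids_with_go = p_go))
```

The term size and the overlap stay the same under both choices because
every gene annotated to the term has, by definition, a GO term.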

If you choose the second, then you must ensure that your list of
"significant" HUGO ids also contains only IDs with a GO term.
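
Something along these lines would build both universes and apply the
same GO filter to the significant list.  It is only a sketch: the annot
table, the GENE_* symbols and the symbol-to-GO pairings are all invented,
and in practice the table would come from whatever annotation source you
trust.

```r
# Hypothetical annotation table: one row per (symbol, GO id) pair.
# Symbols with no GO term carry NA.  The pairings are not real.
annot <- data.frame(
  symbol = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  go_id  = c("GO:0000001", "GO:0000002", "GO:0000001", "GO:0000003",
             NA, NA),
  stringsAsFactors = FALSE
)

# Choice 1: every symbol in the genome.
universe_all <- unique(annot$symbol)

# Choice 2: only symbols with at least one GO term.
universe_go <- unique(annot$symbol[!is.na(annot$go_id)])

# Hypothetical "significant" list.
significant <- c("GENE_A", "GENE_C", "GENE_D")

# Under choice 2 the significant list gets the same filter.
significant_go <- intersect(significant, universe_go)

# Worth inspecting: which significant ids were dropped, and why.
dropped <- setdiff(significant, universe_go)
```

Checking the dropped ids by hand is a cheap way to catch symbols that
are simply spelled differently in the annotation source.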

For an analysis per chromosome, you'd have to subset both your
"significant" list and your universe by chromosome.  
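
A minimal sketch of that subsetting in base R, assuming you have a table
giving the chromosome of each id.  The genes table and the GENE_* symbols
are invented for the example.

```r
# Hypothetical table of universe ids with their chromosome.
genes <- data.frame(
  symbol     = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  chromosome = c("1", "1", "2", "2", "X"),
  stringsAsFactors = FALSE
)

# Hypothetical "significant" list.
significant <- c("GENE_A", "GENE_C", "GENE_D")

# One universe per chromosome.
universe_by_chr <- split(genes$symbol, genes$chromosome)

# The significant ids found in each chromosome's universe.
significant_by_chr <- lapply(universe_by_chr, intersect, y = significant)

# Each chromosome is then tested against its own universe only.
for (chr in names(universe_by_chr)) {
  cat(chr, ":", length(significant_by_chr[[chr]]), "significant of",
      length(universe_by_chr[[chr]]), "ids\n")
}
```

A chromosome with no significant ids (X in this toy data) ends up with
an empty vector and can simply be skipped.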

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Nathan
Harmston
Sent: 05 September 2008 11:11
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] GO term enrichment analysis over whole genome,copy
number aberration investigation

Hi,

I currently have a list of HUGO gene ids which relate to genes in
areas of gain over a whole chromosome, and would like to perform GO
enrichment analysis on them. So I have 2 problems:

1. Currently I have been defining my gene universe based on Affymetrix
arrays; however, now I am working over the whole genome.
gene_universe = getBM(c("entrezgene"), mart = ensembl)
However, this leaves me with a gene_universe of 20275 gene ids (is this
right?)
2. Moving from my HUGO identifiers to Entrez Gene ids. I can do this
using biomaRt:
test = getBM(c("entrezgene"), filters = "hgnc_symbol", values =
stGained, mart = ensembl)

However, this is not the same length as my number of HUGO gene
identifiers (in my case 30 are missing). Why is this? Is this just
some weird annotation bug that can't be fixed, or is it the way I'm
doing it? Does Bioconductor have the GO information for all genes
in the genome and not just those in the annotation files for the
Affymetrix arrays?

Finally, what are the statistical implications of performing GO
enrichment (I'm using a conditional test) over a whole genome? Would it
be better to run the gene set enrichment analysis on each chromosome
(I don't think so)? I'm trying to find evidence that genes relating to
certain functions are gained over the whole chromosome (cancer study).
I've run one test and have found some things which make sense.

Many thanks in advance,

Nathan



