[BioC] Difficulties in using the mgsa package for Gene Set Analysis

Mon Jan 21 10:29:17 CET 2013

Dear Sebastian,

Thanks for your reply, I wasn't aware of the existence of the @itemAnnotations data frame! I have tried to convert the primary IDs to gene symbols, but in doing so I've become aware of another problem: the RGD IDs are stored as the row names of the data frame, and some of them refer to the same symbol, so you can't simply replace the RGD IDs with the gene symbols (row names must be unique).

This also means that the gene sets, as defined by the readGAF function, include repeated genes, which would likely affect the results of the analysis. As an example, the first set defined (GO:0000002) consists of 17 RGD IDs but only 14 different genes, as 3 of them appear twice, each time under a different RGD ID.

I believe that, by using the information in the different slots of the MgsaGoSets object, it should be possible to remove the duplicated entries from both the sets and the item annotations, or at least to create a list that the mgsa() function can use for the analysis, so I'll start looking into that. I've also looked into the suggestion of using BioMart, which was a very good idea, but I'd still be facing the problem of the duplicated elements in the gene sets.
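
A minimal sketch of the deduplication I have in mind, using toy data rather than the real MgsaGoSets object. All IDs, symbols and set contents below are made up, and it is written in Python only to show the logic, not as anything the mgsa package provides:

```python
# Toy stand-ins for the two pieces of information in the annotation object:
# a primary-ID -> symbol table and the gene sets keyed by GO term.
# All IDs and symbols here are hypothetical.
id_to_symbol = {
    "1001": "GeneA",
    "1002": "GeneB",
    "1003": "GeneA",  # a second primary ID pointing at the same symbol
    "1004": "GeneC",
}
gene_sets = {
    "GO:0000002": ["1001", "1002", "1003"],
    "GO:0000003": ["1004", "1002"],
}


def sets_by_symbol(gene_sets, id_to_symbol):
    """Translate each set to symbols, keeping each symbol once (first-seen order)."""
    result = {}
    for term, ids in gene_sets.items():
        # IDs without an annotation are skipped rather than guessed at.
        symbols = [id_to_symbol[i] for i in ids if i in id_to_symbol]
        # dict.fromkeys drops repeats while preserving order.
        result[term] = list(dict.fromkeys(symbols))
    return result


dedup_sets = sets_by_symbol(gene_sets, id_to_symbol)
```

Here the toy set GO:0000002 shrinks from three IDs to two symbols, which is the same effect I would expect on the real set (17 RGD IDs down to 14 genes).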

Thanks again!

Juan
________________________________________
From: Sebastian Bauer [sebastian.bauer at charite.de]
Sent: Thursday, January 17, 2013 11:39 AM
To: Juan M.Adrian [guest]
Cc: bioconductor at r-project.org; Adrian Segarra, Juan
Subject: Re: Difficulties in using the mgsa package for Gene Set Analysis

Dear Juan,

[...]
> Item annotations:
>          symbol                              name
> 1302934 St8sia5 ST8 alpha-N-acetyl-neuraminide...
> ...
> 1302939   Eef1g eukaryotic translation elongat...
> ... and  29261  other items.
>
> Applying the function mgsa() to my list of differentially expressed genes
> and these gene sets doesn't work, as it looks for matches between the
> 'symbol' category in the gene sets and the genes of interest. However, the
> numbers in the 'symbol' category are RGD IDs (from the Rat Genome
> Database, http://rgd.mcw.edu/), and I haven't been able to find a way to
> either change these to something else (Entrez ID, gene symbol, etc) or
> somehow get the RGD IDs for my genes of interest without looking for them
> manually.
>
> So, in order to apply MGSA to my data, I am hoping to get some help on how
> to do one of these three things:
>
> 1) Modify the MgsaGoSets object so it uses as 'symbol' a more common gene
> ID, such as Entrez ID, instead of RGD ID.

I've peeked into the RGD association file. As far as I understand it (I found
no documentation in the README), it provides both RGD IDs and gene symbols.
The readGAF() function reads in both, as you can see in the output.
However, only the primary id is used by mgsa(), and the primary id is the
RGD ID. If you can turn your list into a list of gene symbols, you could use
the undocumented gaf@itemAnnotations data frame to convert from one
namespace to the other.
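
The lookup itself could be sketched roughly as below. This uses a toy table in place of the real itemAnnotations data frame; the IDs and symbols are hypothetical, and Python is used only to illustrate the logic. Note that one symbol may map to several primary ids.

```python
from collections import defaultdict

# Hypothetical primary id -> symbol table standing in for the annotation data.
annotations = {"1001": "GeneA", "1002": "GeneB", "1003": "GeneA"}

# Hypothetical list of genes of interest, given as symbols.
genes_of_interest = ["GeneA", "GeneX"]

# Invert the table: each symbol collects every primary id that carries it.
symbol_to_ids = defaultdict(list)
for primary_id, symbol in annotations.items():
    symbol_to_ids[symbol].append(primary_id)

# Expand the symbols into primary ids; symbols with no match are reported
# separately instead of being dropped silently.
primary_ids = [pid for s in genes_of_interest for pid in symbol_to_ids.get(s, [])]
unmatched = [s for s in genes_of_interest if s not in symbol_to_ids]
```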

> 2) Obtain the RGD IDs of my list of differentially expressed genes from a
> more common gene ID.

Unfortunately I'm no expert in this, but maybe you can use BioMart at
Ensembl for this. The site isn't working for me at the moment, so I
couldn't try it out.

See http://www.ensembl.org/info/data/biomart.html

Hope this helps.

Bye
Sebastian