[BioC] Using GOstats for a non-model organism

James W. MacDonald jmacdon at med.umich.edu
Tue Feb 15 16:51:41 CET 2011


Hi Maureen,

On 2/14/2011 5:50 PM, Maureen J. Donlin wrote:
> James,
>
> Thanks for the reply. I figured out how to get the data into a data frame.
> I was doing 2 things wrong, but here is the code that worked.
>
>  > CneoGO <- read.table("Cneo_GOannot.txt", header=TRUE)
>  > head(CneoGO)
> Goterm Evidence GeneID
> 1 GO:0015893 IEA CNAG_00003
> 2 GO:0043231 IEA CNAG_00003
> 3 GO:0015203 IEA CNAG_00003
> 4 GO:0044425 IEA CNAG_00003
> 5 GO:0044444 IEA CNAG_00003
> 6 GO:0015846 IEA CNAG_00003
>
>  > goframeData = data.frame(CneoGO$Goterm, CneoGO$Evidence, CneoGO$GeneID)
>  > head(goframeData)
> CneoGO.Goterm CneoGO.Evidence CneoGO.GeneID
> 1 GO:0015893 IEA CNAG_00003
> 2 GO:0043231 IEA CNAG_00003
> 3 GO:0015203 IEA CNAG_00003
> 4 GO:0044425 IEA CNAG_00003
> 5 GO:0044444 IEA CNAG_00003
> 6 GO:0015846 IEA CNAG_00003

This step is unnecessary. The result of read.table() *is* a data.frame, 
so you are just creating another data.frame here.
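
A quick way to check this for yourself is sketched below. The inline rows
and the text= argument are only stand-ins so the example runs without
Cneo_GOannot.txt; they are not part of the original session.

```r
# Inline rows stand in for the real annotation file (illustration only).
lines <- c("Goterm Evidence GeneID",
           "GO:0015893 IEA CNAG_00003",
           "GO:0043231 IEA CNAG_00003",
           "GO:0015203 IEA CNAG_00003")
CneoGO <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

# read.table() already returns a data.frame, so no data.frame() call is needed.
class(CneoGO)
is.data.frame(CneoGO)

# The three columns keep the names given in the header row.
colnames(CneoGO)
```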

>
> So continuing with the tutorial guide, I executed the following:
>
>  > library("GSEABase")
> Loading required package: annotate
>
>  > goFrame = GOFrame(goframeData, organism = "Cryptococcus neoformans")
> Loading required package: GO.db
>
>  > goFrame
> An object of class "GOFrame"
> Slot "data":
> CneoGO.Goterm CneoGO.Evidence CneoGO.GeneID
> 1 GO:0015893 IEA CNAG_00003
> 2 GO:0043231 IEA CNAG_00003
> ...
> Slot "organism":
> [1] "Cryptococcus neoformans"
>
>  > goAllFrame = GOAllFrame(goFrame)
>
>  > goAllFrame
> An object of class "GOAllFrame"
> Slot "data":
> go_id evidence gene_id
> 1 GO:0000001 IEA CNAG_00006
> 2 GO:0000001 IEA CNAG_00088
> ...
> Slot "organism":
> [1] "Cryptococcus neoformans"
>
>
>  > gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
>  > gsc
> GeneSetCollection
> names: GO:0000001, GO:0000002, ..., GO:2000045 (6658 total)
> unique identifiers: CNAG_00006, CNAG_00088, ..., CNAG_06995 (4822 total)
> types in collection:
> geneIdType: GOAllFrameIdentifier (1 total)
> collectionType: GOCollection (1 total)
>
>  > universe = Lkeys(CneoGO)
> Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "Lkeys", for signature
> "data.frame"

Here you are mixing up what Marc had to do to get his example to run 
with what you need to do. Lkeys() works on the annotation package 
object (org.Hs.egGO in his example), not on a data.frame, hence the 
error. The 'universe' is just the complete set of gene IDs from which 
your significant set was chosen.

If you had an org.Cn.eg.db package, then you would do something similar. 
However, you don't, which is the point of this exercise. The 
corresponding set of gene IDs that you do have is the third column of 
the data.frame you created above (goframeData or CneoGO).

Note here that you want to make sure that the gene IDs you use are 
character values, not factors. The default for R when reading in a 
data.frame is to convert a vector of strings to factor, so you either 
want to use

CneoGO <- read.table("Cneo_GOannot.txt", header=TRUE, stringsAsFactors=FALSE)

and then

universe <- CneoGO[,3]

or proceed as you already have, but then

universe <- as.character(CneoGO[,3])
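
A small sketch of the two routes, again using made-up inline rows in place
of the real file. Note that R 4.0.0 and later default to
stringsAsFactors = FALSE, so the factor case mainly arises on older
versions such as the R 2.12.1 used in this thread. The unique() call at
the end is an optional extra, not something required above.

```r
# Inline rows stand in for the real annotation file (illustration only).
lines <- c("Goterm Evidence GeneID",
           "GO:0015893 IEA CNAG_00003",
           "GO:0043231 IEA CNAG_00003",
           "GO:0000001 IEA CNAG_00006")

# Route 1: keep strings as character when reading.
asChar <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)
universe1 <- asChar[, 3]

# Route 2: the column was read as a factor; convert it afterwards.
asFactor <- read.table(text = lines, header = TRUE, stringsAsFactors = TRUE)
universe2 <- as.character(asFactor[, 3])

is.factor(asFactor[, 3])          # the factor case to watch out for
identical(universe1, universe2)   # both routes give the same character vector

# Each gene appears once per GO term; unique() keeps one copy of each ID.
universe <- unique(universe1)
```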

In addition, note that you will need to construct your 'genes' vector 
differently from what is shown on p.3 of the vignette, instead selecting 
the set of significant genes from the results of your analysis (again, 
using the CNAG gene IDs).

From that point on, you continue as Marc shows in the vignette.
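
To make the roles of 'genes' and 'universe' concrete, here is a base-R
sketch of a hypergeometric over-representation test for a single GO term.
This is not the GOstats code path from the vignette, only an illustration
of the idea; the gene IDs, the cluster, and the term membership are all
made up.

```r
# Made-up universe of ten gene IDs (hypothetical, for illustration only).
universe <- sprintf("CNAG_%05d", 1:10)

# Made-up cluster of interest, i.e. the 'genes' vector.
genes <- c("CNAG_00001", "CNAG_00002", "CNAG_00003", "CNAG_00007")

# Made-up set of genes annotated to one GO term.
termGenes <- c("CNAG_00001", "CNAG_00002", "CNAG_00003", "CNAG_00009")

# The selected genes must be character IDs drawn from the universe.
stopifnot(is.character(genes), all(genes %in% universe))

k <- length(intersect(genes, termGenes))     # annotated genes in the cluster
m <- length(intersect(universe, termGenes))  # annotated genes in the universe
n <- length(universe) - m                    # unannotated genes in the universe
N <- length(genes)                           # cluster size

# P(X >= k) under the hypergeometric distribution: over-representation p-value.
pval <- phyper(k - 1, m, n, N, lower.tail = FALSE)
pval
```

For a set of clusters like yours, the same calculation would be repeated
with each cluster's gene list as 'genes' against the one shared 'universe'.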

Best,

Jim



>
> Am I missing some data that is found in the library("org.Hs.egGO")? I
> can do the same commands with it and the structure of the goFrame,
> goAllFrame and gsc seem to be the same.
>
> Here's what I am trying to do. I have a microarray data set from a time
> course experiment done with a fungal genome, C. neoformans. I have
> clusters of genes which are associated based on how their expression
> changed in relation to the other genes on the array. So what I have are
> gene lists, with no expression data or fold changes. For each list of
> genes, I want to know what GO terms are over-represented.
>
> I apologize if these questions are too basic. It's just that most of the
> software out there for gene enrichment analysis is designed for model
> organisms.
>
> Again, any help is greatly appreciated.
>
> Regards,
> Maureen
>
>
>
>
>
> On 2/14/11 3:23 PM, James W. MacDonald wrote:
>> Hi Maureen,
>>
>> On 2/14/2011 3:27 PM, Maureen J. Donlin wrote:
>>> Hi all,
>>>
>>> I'm new to R and have some very basic questions about using GOstats with
>>> a non-model organism.
>>> I'm trying to use the tutorial by Marc Carlson "How to Use GOstats
>>> and...with unsupported model organisms."
>>>
>>> I've created a GO to gene mapping file with the following 3 columns of
>>> data:
>>> Goterm Evidence GeneID
>>> GO:0015893 IEA CNAG_00003
>>> GO:0043231 IEA CNAG_00003
>>> GO:0015203 IEA CNAG_00003
>>> GO:0044425 IEA CNAG_00003
>>> ...
>>>
>>> I can import it using read.table, but I don't seem to be able to invoke
>>> the data frame correctly.
>>
>> When you read it in using read.table(), you automatically have a
>> data.frame.
>>
>>>
>>> The tutorial reads:
>>> library("org.Hs.eg.db")
>>> frame = toTable(org.Hs.egGO)
>>> goFrameData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
>>
>> Yep, this is just some code that Marc uses to create a data.frame so
>> he can give an example.
>>
>>>
>>> I imported the data into an object using read.table
>>> >CneoGOanno <- read.table("Cneo_GOannot.txt")
>>>
>>> I tried to create a frame using:
>>> > frame = toTable(CneoGOanno)
>>> Error in function (classes, fdef, mtable) :
>>> unable to find an inherited method for function "toTable", for signature
>>> "data.frame"
>>>
>>> Do I have to create some sort of database for this organism first? If
>>> so, what is its format?
>>>
>>> Any suggestions would be most appreciated.
>>
>> Just go to the next step, which will be something like
>>
>> goFrame <- GOFrame(CneoGOanno, organism = "Cryptococcus neoformans")
>> goAllFrame <- GOAllFrame(goFrame)
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>>
>>> Regards,
>>> Maureen Donlin
>>>
>>> At the risk of too long of an email, here's the session info:
>>> > sessionInfo()
>>> R version 2.12.1 (2010-12-16)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] org.Hs.eg.db_2.4.6 GOstats_2.16.0 RSQLite_0.9-4 DBI_0.2-5
>>> graph_1.28.0 Category_2.16.0 AnnotationDbi_1.12.0
>>> [8] Biobase_2.10.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.28.0 genefilter_1.32.0 GO.db_2.4.5 GSEABase_1.12.2
>>> RBGL_1.26.0 splines_2.12.1 survival_2.36-2 tools_2.12.1
>>> [9] XML_3.2-0 xtable_1.5-6
>>>
>>>
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826


