[BioC] clustering genes in GO categories

Wed Aug 29 15:17:53 CEST 2012

On 08/29/2012 12:34 AM, Assa Yeroslaviz wrote:
> Hello bioC users,
>
> as you can see below, this was posted over a year ago. Unfortunately I
> tried the same today and for some mysterious reason it is no longer
> working correctly.
> What I have is the same data.frame:
>> dat
>              id flybasename_gene flybase_gene_id entrezgene
> 1 1616608_a_at             Gpdh     FBgn0001128      33824
> 2 1622892_s_at          CG33057     FBgn0053057     318833
> 3 1622892_s_at            mkg-p     FBgn0035889      38955
> 4   1622893_at              IM3     FBgn0040736      50209
> 5   1622894_at          CG15120     FBgn0034454      37248
>
> GOMF
> 1 carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding:
> 2 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
> 3 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
> 4 aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding
> 5 protein binding
>
> What I would like to have is a second data frame with the GO categories as
> row names and, for each GO category, the IDs of the genes that belong to
> it, like this:
>
>
> GO    genes
> protein binding     FBgn0001128    FBgn0053057     FBgn0035889 etc.
> ammonium transmembrane transporter activity      FBgn0053057    FBgn0035889
> hydrolase activity   FBgn0040736     FBgn0001128
>
>
> Below is the script I used before, and as far as I can remember it worked
> very well:
>
>
> lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
> lst2 <- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
>
> unlst <- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
> use.names = FALSE))
> done <- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
> done_df <- lapply(done, paste, collapse = ",")
> out <- data.frame(GO = names(done_df), FBgn = unlist(done_df))
>
> But the result I am getting does not contain the GO categories. Instead the
> GO column is just a running number next to the gene IDs, like this:
>
>> out
>    GO                    FBgn
> 1  1             FBgn0040736
> 2  2             FBgn0001128
> 3  3 FBgn0035889,FBgn0053057
> 4  4             FBgn0034454

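Probably GOMF is now a factor, whereas it was a character vector when the
script last worked, so the script ends up with the integer level codes
instead of the GO labels. Convert the column back first: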

dat$GOMF <- as.character(dat$GOMF)
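
To illustrate why a factor produces numbers here, the following is a minimal sketch on a made-up two-element column (the values are hypothetical, modelled on the GOMF strings above; it is not the original data). In this R version data.frame() and read.table() convert strings to factors by default (stringsAsFactors = TRUE), and as soon as the factor class is dropped somewhere along the way only the integer codes are left:

```r
# Hypothetical toy values, modelled on the GOMF strings in the post.
gomf_chr <- c("protein binding", "ATP binding:protein binding")
gomf_fac <- factor(gomf_chr)

# With the factor class intact, as.character() returns the labels.
labels_ok <- as.character(gomf_fac)

# Once the class is dropped, only the integer level codes remain;
# converting those to character gives strings such as "1" and "2",
# which is the pattern seen in the GO column of `out`.
codes <- as.character(unclass(gomf_fac))

# A guard for scripts that may receive either type.
if (is.factor(gomf_fac)) gomf_fac <- as.character(gomf_fac)
pieces <- strsplit(gomf_fac, ":", fixed = TRUE)
```

Passing stringsAsFactors = FALSE when the data frame is created avoids the conversion in the first place.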

Here's an alternative using Biobase::reverseSplit (load it with
library(Biobase) first):

map <- with(dat, strsplit(setNames(GOMF, flybase_gene_id), ":"))
revmap <- sapply(reverseSplit(map), paste, collapse=",")
data.frame(GO=names(revmap), FBgn = as.vector(revmap))
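
If Biobase is not at hand, roughly the same reverse map can be built with base R only. This is a sketch under some assumptions: the toy `dat` below is retyped from the rows shown in the question and keeps only the two columns that matter, the helper names (`terms`, `long`, `genes_by_go`) are made up for illustration, and lengths() needs R 3.2.0 or later (sapply(terms, length) does the same on older versions).

```r
# Toy data retyped from the question; strings kept as character, not factor.
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0053057", "FBgn0035889",
                      "FBgn0040736", "FBgn0034454"),
  GOMF = c(
    "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding:",
    "nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity",
    "nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity",
    "aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding",
    "protein binding"),
  stringsAsFactors = FALSE
)

# One character vector of GO terms per gene.
terms <- strsplit(dat$GOMF, ":", fixed = TRUE)

# Long format: one row per (gene, GO term) pair.
long <- data.frame(gene = rep(dat$flybase_gene_id, lengths(terms)),
                   GO = unlist(terms),
                   stringsAsFactors = FALSE)
long <- long[nzchar(long$GO), ]   # drop empty terms, if any

# Reverse the map: gene IDs grouped by GO term, then collapsed to one string.
genes_by_go <- split(long$gene, long$GO)
out <- data.frame(GO = names(genes_by_go),
                  FBgn = vapply(genes_by_go, paste, character(1), collapse = ","),
                  row.names = NULL, stringsAsFactors = FALSE)
```

The long-format intermediate is kept on purpose: it makes it easy to check which gene contributed which term before collapsing.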

Martin
>
> I would like to know whether something changed in the apply family of
> functions that would explain why the results differ from before. I would
> appreciate your help.
>
> Thanks
> Assa
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793