[BioC] building enzyme centric metabolic network

Mon Apr 20 19:14:16 CEST 2009

Hi Martin:
Thanks a lot. This _is_ a much better way of creating the incidence matrix.
Saroj

Martin Morgan wrote:
> Hi Saroj --
>
> All those 'for' loops and the idea that you'd have to 'monitor
> progress' make me think there's a better way to do this. Without
> commenting on the merits of the approach, if the goal is to get an
> 'incidence matrix' listing which genes (rows) appear in which pathways
> (columns) I might
>
> load the library
>
>   library(org.Mm.eg.db)
>
> get a data frame of gene and path ids (probably there's a more
> AnnotationDbi way of doing this...)
>
>   tbl <- toTable(org.Mm.egPATH)
>
> get the unique gene names
>
>   ugenes <- unique(tbl$gene_id)
>
> For each path_id, create a logical vector with 'TRUE' when a gene_id
> appears in the path. This is a little tricky here; get("%in%") gets
> the "%in%" function, usually one could just use a function name;
> x=ugenes provides the 'x' (first) argument to %in%, forcing the genes
> in each pathway into the 'table' (second) argument to %in%, ordering
> ugenes %in% genes means that every pathway ends up with an identical
> length and order logical vector
>
>   occur <- with(tbl, tapply(gene_id, path_id, get("%in%"), x=ugenes))
>
> convert the result into a matrix by unlisting the tapply result
>
>   m <- matrix(unlist(occur), ncol=dim(occur),
>               dimnames=list(gene=ugenes, path=names(occur)))
>
> and voila             
>
>   
>> m[1:5,1:10]
>>     
>        path
> gene    00010 00020 00030 00031 00040 00051 00052 00053 00061 00062
>   11298 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>   11303 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>   11304 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>   11305 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>   11306 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> some sanity checks
>
>   
>> dim(tbl) ## 11039 gene / path combinations
>>     
> [1] 11039     2
>   
>> sum(m)
>>     
> [1] 11039
>
>   
>> colSums(m)[1:5] ## number of genes in the first 5 pathways
>>     
> 00010 00020 00030 00031 00040 
>    54    27    25     2    22 
>   
>> nrow(toTable(revmap(org.Mm.egPATH)["00010"]))
>>     
> [1] 54
>   
>> nrow(toTable(revmap(org.Mm.egPATH)["00020"]))
>>     
> [1] 27
>
>   
>> max(rowSums(m)) # maximum paths a gene belongs to
>>     
> [1] 31
>   
>> which.max(rowSums(m))  # gene 26413, row 2354
>>     
> 26413 
>  2354 
>   
>> nrow(toTable(org.Mm.egPATH["26413"]))
>>     
> [1] 31
>
> It would be possible to change the structure of m into a more
> space-efficient one, but as rowSums and colSums indicate above it's
> very useuful in its current form.
>
> Martin
>
>
> Saroj K Mohapatra <smohapat at vbi.vt.edu> writes:
>
>   
>> Hi Anupam:
>>
>> Using your second criterion, i.e., two enzymes are linked if they are in the same pathway, one strategy would be to use the information from org. packages for creating a network for each species.
>>
>> # For mouse, use the package org.Mm.eg.db to get the list of all entrez ids, associated KEGG id
>> library("org.Mm.eg.db")
>> egKEGGids = as.list(org.Mm.egPATH)
>>
>> # many entrez ids have no KEGG id; discard these
>> sum(is.na(egKEGGids))
>> egKEGGids = egKEGGids[!is.na(egKEGGids)]
>>
>> # For all possible entrez id pairs, find a KEGG id; if not available, drop that pair
>> keggnet = matrix(ncol=3, nrow=0) # Empty matrix with 3 columns
>> colnames(keggnet) = c("EG1", "EG2", "KEGG")
>> mycounter = 1 # to count the current entrez id pair
>> numEg = length(egKEGGids) # number of entrez ids
>> for(i in 1:(numEg-1)) {
>>   for(j in (i+1):numEg) {
>>     eg1 = names(egKEGGids)[i]
>>     eg2 = names(egKEGGids)[j]
>>     kegg1 = as.character(unlist(egKEGGids[i])) 
>> 		# kegg id for the first gene
>>     kegg2 = as.character(unlist(egKEGGids[j])) 
>> 		# kegg id for the 2nd gene
>>     common = intersect(kegg1, kegg2) # common between two
>>     numCommon = length(common) # how many shared
>>     if(numCommon>0) {
>>       for(k in 1:numCommon) {
>> 	keggnet = rbind(keggnet, c(eg1, eg2, common[k]))
>> 	cat(i,  "\t", j,   "\t", numEg,"\t | ") 
>> 	cat(eg1,"\t", eg2, "\t", common[k], "\n")
>> 		# monitor progress
>> 	mycounter = mycounter+1 # move the counter
>>       }
>>     }
>>   }
>> }
>>
>> # now write the data to a file for future use
>> write.table(keggnet, file="mouse_KEGG_network.txt", 
>> 	col.names=T, row.names=F, sep="\t", quote=F)
>>
>> The resulting file has three columns for two entrez ids (EG1, EG2) and KEGG (ID) shared by the two entrez ids. There might be a faster way of using it though (in stead of the for loops). Also, there might already be such networks available within bioconductor. 
>>
>> I imagine that you would need one such file for each species of your interest (to study "evolutionary aspects").
>>
>> Best,
>>
>> Saroj
>>
>> ----- Original Message -----
>> From: "anupam sinha" <anupam.contact at gmail.com>
>> To: Bioconductor at stat.math.ethz.ch
>> Sent: Sunday, April 19, 2009 4:35:12 AM GMT -05:00 US/Canada Eastern
>> Subject: [BioC] building enzyme centric metabolic network
>>
>> Hi all,
>>          I am working on the evolutionary aspects of metabolic net
>> works(enzyme centric). This network will consists of nodes that are enzymes
>> and any two enzymes are linked if they share a metabolite (or in the same
>> pathway). Can anyone suggest me a way to build these networks using
>> KEGGgraph or KEGGSOAP packages ? Thanks in advance.
>>
>> Regards,
>>
>> Anupam Sinha
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>     
>
>