[BioC] KEGGGraph: some complexed proteins are orphans in graphNEL

Fri May 1 15:06:57 CEST 2009

Hi David,

Thanks very much for your response.  A few comments:

1) If the orphan status of TSC1 is the result of an omission in the  
KGML, then I am all in favor of your suggestion to ask the KEGG staff  
if they can fix it.  Have you had any luck with such requests in  
the past?  Your second 'guess' option might be worth adding if they  
are not very responsive.

2) You observe that TSC1 and TSC2 'physically interacting with each  
other [is] supported by PPI data'.  This suggests a topic we have been  
mulling over; maybe you have some ideas.  Judicious combination of  
interaction and pathway data is our goal.  The danger, of course, is  
that you get a rather useless hairball.  Have you experimented with  
merging PPI data with a KEGG graph?

3) As pathway and interaction data matures, and as packages such as  
yours and Tony Chiang's make it available in bioc, we may try to  
identify some good conventions for handling, within the bioc graph  
classes, things like complexes, and multiple edges between nodes which  
are the result of combining datasets.
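
As a rough illustration of the multiple-edge issue, here is a minimal sketch in base R, assuming each dataset has first been reduced to a plain edge table. The node names A, B, C and all edge types are toy values, not taken from KEGG or any PPI resource, and the 'source' column is only one possible convention.

```r
# Toy edge tables; nodes A, B, C and all edge types are made up for illustration.
pathway <- data.frame(from = c("A", "B"), to = c("B", "C"),
                      type = c("activation", "inhibition"),
                      source = "pathway", stringsAsFactors = FALSE)
ppi <- data.frame(from = "A", to = "B", type = "binding",
                  source = "ppi", stringsAsFactors = FALSE)

# Stack the tables: the parallel A -> B edges stay distinguishable by 'source'.
combined <- rbind(pathway, ppi)

# One row per node pair, listing every dataset that supports it.
support <- aggregate(source ~ from + to, data = combined,
                     FUN = function(s) paste(sort(unique(s)), collapse = "+"))
print(support)
```

Keeping the stacked table preserves every edge and its type, while the per-pair summary is what one would probably load into a graph object that allows only one edge per node pair. Note that this sketch treats the PPI edge as directed, which a real merge would have to decide on.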

4) The National Cancer Institute's "Pathway Interaction Database"  
offers files in BioPAX level 2 format.  Could you be tempted to create  
a package (like KEGGgraph) for that data?

5) I notice that you store node and edge attributes for a KEGGgraph as  
(respectively) a single large environment in nodeDataDefaults and  
edgeDataDefaults.  It seems a little more natural to me to separate  
the environment out into single entries assigned to node- and edge- 
specific nodeData and edgeData.  What do you think?  Your current  
approach is certainly usable, of course.
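
To make the suggestion concrete, here is a minimal sketch using the attribute accessors of the graph package. The three-node graph, the attribute names 'symbol' and 'subtype', and the 'placeholder' value are assumptions chosen for illustration; this is not KEGGgraph's actual attribute layout.

```r
library(graph)

# Toy graph; node IDs are borrowed from the thread, everything else is a placeholder.
g <- new("graphNEL", nodes = c("hsa:7248", "hsa:7249", "hsa:6009"),
         edgemode = "directed")
g <- addEdge("hsa:7249", "hsa:6009", g)

# Declare each attribute once with a default, then set per-node values.
nodeDataDefaults(g, attr = "symbol") <- NA_character_
nodeData(g, n = "hsa:7248", attr = "symbol") <- "TSC1"
nodeData(g, n = "hsa:7249", attr = "symbol") <- "TSC2"

# Same pattern for edges: a default first, then an edge-specific value.
edgeDataDefaults(g, attr = "subtype") <- NA_character_
edgeData(g, from = "hsa:7249", to = "hsa:6009", attr = "subtype") <- "placeholder"

# Look up attributes node by node or edge by edge.
nodeData(g, n = "hsa:7248", attr = "symbol")
edgeData(g, from = "hsa:7249", to = "hsa:6009", attr = "subtype")
```

With this layout, nodes that were never assigned a value simply report the default, so callers need not check whether an entry exists in a shared environment.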

Thanks for all the good work!

  - Paul

On May 1, 2009, at 12:19 AM, Jitao David Zhang wrote:

> Hi Paul,
>
>   Thank you again for the feedback.
>
>   I have checked the KGML file (ftp://ftp.genome.jp/pub/kegg/xml/organisms/hsa/hsa04150.xml).
> The reason for the discrepancy you reported seems to be that  
> KEGG did not record these two molecules as a complex (which is given  
> by a 'group' entry, mostly representing a protein complex; for  
> example, see the ELK1/ELK4/SRF complex in the MAPK pathway,
> http://www.genome.jp/kegg/pathway/hsa/hsa04010.html; the KGML file
> can be downloaded at ftp://ftp.genome.jp/pub/kegg/xml/organisms/hsa/hsa04010.xml).
>
>   Since at the moment KEGGgraph parses KGML files faithfully, without  
> any post-modification, it cannot reflect the relation between TSC1  
> and TSC2 in this example, even if visually it seems that KEGG wanted  
> to present them as a protein/functional complex (btw, these two are  
> also physically interacting with each other, supported by PPI data). I  
> suggest that we could do 2 things:
> 	• Inform the staff at KEGG and discuss whether the relation between  
> them should be recorded (maybe as a complex node containing TSC1  
> and TSC2, where all the interactions are directed to the complex  
> and an interaction is established between the two)
> 	• Add a new feature to KEGGgraph to 'guess' the  
> relationship between nodes based on their graphical attributes: say,  
> if two nodes share a boundary, they may represent a protein complex. This  
> may, under certain circumstances, lead to erroneous results, hence  
> I would not add it to the default functionality but rather offer it as  
> an additional feature for advanced users.
>    A few words about using a complex as a node: it is definitely fine.  
> Two potential problems arise: the edges to the complex may be  
> ambiguous if not annotated correctly (I had a 'bad' example of this  
> and will try to find it), and there is no uniform identifier available  
> (so far as I know) for these complexes, while normally the  
> KEGGgraphs are indexed by GeneID. A work-around may be to assign a  
> complex name like 'C1'; however, these ids are not unique across  
> graphs and will lead to problems when graphs from different sources  
> have to be merged. Personally I tend to use clusters to represent  
> these complexes. This feature is still in testing and I will update  
> the package once it is stable and productive.
>
>   Thank you again for the feedback; I am open to further  
> discussion.
>
>   Best wishes,
> David
>
> 2009/5/1 Paul Shannon <pshannon at systemsbiology.org>
> We have been using the admirable KEGGgraph package to obtain  
> pathways in graphNEL form.  It is very useful.
>
> mTor is the signalling pathway we are working with: http://www.genome.jp/dbget-bin/get_pathway?org_name=hsa&mapno=04150
>
> We find that proteins which appear only as members of a complex are  
> orphans in the graphNEL.
>
> For instance, "hsa:7248" (TSC1) forms a complex with
> "hsa:7249" (TSC2).  TSC2 is well connected, but its complex partner TSC1
> is an orphan.
>
> There are a number of ways to handle this, some quite sophisticated,  
> some not.  One could define a node for the complex, create edges to  
> that node, and then specify (with a 'complex membership' edge) that  
> TSC1 and TSC2 both belong.
>
> mTor presents a good (though challenging) use case: there are two  
> differently-acting complexes which include mTor and GBL.  The third  
> member of the complex is different, however, as are the interactions  
> the two complexes participate in.   This seems to argue for  
> 'complex' being a node type.
>
> One simple improvement, which solves some of the 'orphan complex  
> node' problem, could be this workaround:  all members of each  
> complex participate in all the interactions which belong to the  
> complex.
>
> Here is some incomplete (but suggestive) evidence of the orphan  
> status of TSC1.  A more detailed search reveals that TSC1 is not  
> found in the target nodes of any of the edges of g.mTor.
>
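> library (KEGGgraph)   # provides parseKGML2Graph
> f <- '~/s/data/public/kegg/hsa04150.xml'   # local copy of the mTor pathway KGML
> g.mTor <- parseKGML2Graph (f)
> tsc1 <- 'hsa:7248'
> tsc2 <- 'hsa:7249'
> # both proteins are present as nodes
> tsc1 %in% nodes (g.mTor)  #  TRUE
> tsc2 %in% nodes (g.mTor)  #  TRUE
> # both appear as keys of the adjacency list
> tsc2 %in% names (edges (g.mTor)) # TRUE
> tsc1 %in% names (edges (g.mTor)) # TRUE
> # but only TSC2 has an outgoing edge
> edges (g.mTor)[[tsc1]]   # character(0)
> edges (g.mTor)[[tsc2]]   # "hsa:6009"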
>
> Thanks,
>
>  - Paul
>
>
> sessionInfo ()
>
> R version 2.9.0 (2009-04-17)
> i386-apple-darwin8.11.1
>
> locale:
> en_US/en_US/en_US/C/en_US/en_US
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] RBGL_1.20.0         gaggle_1.12.0       rJava_0.6-2          
> org.Hs.eg.db_2.2.6  RUnit_0.4.22        KEGG.db_2.2.5        
> RSQLite_0.7-1
>  [8] DBI_0.2-4           AnnotationDbi_1.6.0 Biobase_2.4.0        
> KEGGgraph_1.0.0     graph_1.22.0        XML_2.3-0
>
> loaded via a namespace (and not attached):
> [1] cluster_1.11.13 tools_2.9.0
>
> -- 
> Cheers,
> David