[BioC] pathview puzzle

Thu Aug 22 20:00:25 CEST 2013

Colleagues, 

I'd like to use pathview with E. coli data. 
While the Homo sapiens example from the manual works just fine:

pv.out <- pathview(gene.data = gse16873.d[, 1], pathway.id = demo.paths$sel.paths[i], species="hsa", out.suffix="gse1683", kegg.native=TRUE)

an analogous run with E. coli data keeps failing: 

eco.out <- pathview(gene.data = data02010, pathway.id = "02010", out.suffix = "ecotest", species = "eco", kegg.native=TRUE)
[1] "Downloading xml files for eco02010, 1/1 pathways.."
[1] "Downloading png files for eco02010, 1/1 pathways.."
Error in mol.data[as.character(items[hit]), ] : subscript out of bounds
In addition: Warning messages:
1: In node.map(gene.data, node.data, node.types = gene.node.type, node.sum = node.sum) :
 NAs introduced by coercion
2: In FUN(1:153[[1L]], ...) : NAs introduced by coercion

I've tried several variations of the input data structure, including subsetting the genes to only those used in the pathway to be colored (as shown here), and the "subscript out of bounds" error was still there. 

In fact, if we compare the structure of the vignette data and the custom data, they are the same:

str(gse16873.d[, 1])
 Named num [1:11979] -0.3076 0.4159 0.1985 -0.2316 -0.0449 ...
 - attr(*, "names")= chr [1:11979] "10000" "10001" "10002" "10003" ...

str(data02010)
 Named num [1:47] 2.95 2.25 1.97 1.72 1.72 ...
 - attr(*, "names")= chr [1:47] "b0365" "b0366" "b0829" "b0830" ...
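One difference the str() output does show is that the hsa names are numeric strings while the b-numbers are not. The base R sketch below uses the first few names copied from the output above. Whether this is what triggers the "NAs introduced by coercion" warnings inside pathview is an assumption on my part, not something I have confirmed in the package code.

```r
# First few names from each data set, copied from the str() output
hsa_ids <- c("10000", "10001", "10002", "10003")
eco_ids <- c("b0365", "b0366", "b0829", "b0830")

# as.numeric() warns "NAs introduced by coercion" on non-numeric strings;
# the warning is suppressed here so that only the result is inspected
hsa_num <- suppressWarnings(as.numeric(hsa_ids))
eco_num <- suppressWarnings(as.numeric(eco_ids))

hsa_has_na <- anyNA(hsa_num)        # numeric strings convert cleanly
eco_all_na <- all(is.na(eco_num))   # b-numbers cannot be converted
print(hsa_has_na)
print(eco_all_na)
```

So the two vectors have the same structure, but any step that treats the names as numbers would behave differently for them.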

If we look at the respective XML files, we see consistency as well:

<entry id="2" name="hsa:51343" type="gene"
 link="http://www.kegg.jp/dbget-bin/www_bget?hsa:51343">
 <graphics name="FZR1, CDC20C, CDH1, FZR, FZR2, HCDH, HCDH1" fgcolor="#000000" bgcolor="#BFFFBF"
 type="rectangle" x="919" y="536" width="46" height="17"/>
 </entry>

 <entry id="4" name="eco:b1513" type="gene"
 link="http://www.kegg.jp/dbget-bin/www_bget?eco:b1513">
 <graphics name="lsrA" fgcolor="#000000" bgcolor="#BFFFBF"
 type="rectangle" x="339" y="1882" width="46" height="17"/>
 </entry>

I.e., XML gene entries have name="Organism_ID:GeneID", and the GeneIDs are expected to be the names attached to the expression data.
This is true in both cases; however, the hsa example works and the eco example does not. 
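For reference, this kind of ID comparison can be sketched in base R as follows. The entry lines are the two copied from the XML excerpts above (in practice they would come from readLines() on the downloaded file), and the expr vector is a hypothetical stand-in for the expression data, not the real data02010 object. The sketch assumes one gene per name attribute; entries listing several genes would need splitting first.

```r
# Entry lines copied from the XML excerpts above
entry_lines <- c(
  '<entry id="2" name="hsa:51343" type="gene"',
  '<entry id="4" name="eco:b1513" type="gene"'
)

# Extract the name="..." attribute, then strip the organism prefix and quotes
name_attr <- regmatches(entry_lines, regexpr('name="[^"]+"', entry_lines))
node_ids <- sub('^name="[a-z]+:', "", name_attr)
node_ids <- sub('"$', "", node_ids)

# Hypothetical expression vector named by b-number (illustration only)
expr <- c(b1513 = 2.95, b0365 = 2.25)

# IDs present both in the data names and in the pathway entries
matched <- intersect(names(expr), node_ids)
print(node_ids)
print(matched)
```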

Counterintuitively, the "subscript out of bounds" error seems to stem not from having unrecognizable IDs in the expression data but rather from having RECOGNIZABLE IDs there. If we change the IDs in the expression data to nonsense, the function accepts them and the "out of bounds" error disappears. (This observation came from an attempt to use gene names instead of b-numbers in the expression data; I checked the behaviour several times in clean environments.)

Example (with the bla.data object in the attached rda file):

bla.out <- pathview(gene.data = bla.data, out.suffix = "bla", species = "eco", pathway.id = "02010", kegg.native=TRUE)
Working in directory ....
Writing image file eco02010.bla.png
There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FUN(1:153[[153L]], ...) : NAs introduced by coercion

With nonsense IDs, the graphics files are generated just fine, though of course without any coloring. 

By contrast, using real IDs that match the XML file contents always resulted in the "out of bounds" error (the data02010 object is included in the attached file).

Any ideas?

Thanks,

Oleg