[BioC] gene2pathway retrain: which model is more complete?
b.t.tokovenko at imbg.org.ua
Sun Aug 22 16:00:38 CEST 2010
I have 2 PCs: a server running Debian Lenny and R 2.7.1, and a home machine
running Debian Testing and R 2.11.1. Both have gene2pathway 1.6.1 (and
When running `model.rno = retrain(organism = "rno")`, I got slightly
different outputs describing the components to build the model:
genes: 4055 of 5667
level detectors: 74
genes: 3987 of 5577
level detectors: 75
Question 1: The retrain() manual states that all the data for model
training is fetched from KEGG and Ensembl. How then could these
differences be possible? I've run each twice, to be sure it was not
a momentary glitch.
Seeing this, I decided to supply the gene2Domains mapping manually.
Using BioMart, I asked for all entrez-interpro pairs:
> model.rno = retrain(organism = "rno", gene2Domains = entrez2interpro_list)
Feeding this list to retrain(), I got these numbers:
genes: 5677 of 5677
level detectors: 78
Question 2 (main question): Of these 3 models I now have, which one is
theoretically better to use? The one with most genes, most level
detectors, or most features?
Question 3: Is the format of my entrez2interpro_list correct? There
were no errors, but that list has duplicate rownames. I wonder if each
EntrezID should be in the list only once, with all relevant IPRs
packed into a nested list.
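
To make the "each EntrezID only once" alternative concrete, here is a minimal sketch of collapsing a table of entrez-interpro pairs into a named list with one element per gene. The data frame `pairs`, its column names, and the IDs in it are hypothetical stand-ins for a BioMart export, and `entrez2interpro_collapsed` is a name made up for illustration. Whether retrain() actually expects this shape for gene2Domains is exactly the open question; the sketch only shows how the collapsed form could be built.

```r
# Hypothetical example data: one row per (Entrez ID, InterPro ID) pair.
# The IDs are invented for illustration, not real annotations.
pairs <- data.frame(
  entrez   = c("24152", "24152", "24153", "24157", "24157", "24157"),
  interpro = c("IPR000001", "IPR000002", "IPR000001", NA,
               "IPR000003", "IPR000003"),
  stringsAsFactors = FALSE
)

# Drop rows without a domain annotation, then remove exact duplicate pairs.
pairs <- unique(pairs[!is.na(pairs$interpro) & pairs$interpro != "", ])

# Collapse to a named list: one element per Entrez ID, each holding a
# character vector of all InterPro IDs for that gene.
entrez2interpro_collapsed <- split(pairs$interpro, pairs$entrez)

# Each Entrez ID now appears exactly once as a list name.
stopifnot(!anyDuplicated(names(entrez2interpro_collapsed)))
str(entrez2interpro_collapsed)
```

With the list built this way, duplicate names cannot occur, so comparing a retrain() run on the collapsed list against one on the original list would show whether the duplicates affect the reported numbers.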
(possibly related) Question 4: How could it happen that there are only
1852 features for the most complete coverage of gene mappings, in the
"manual gene2Domains" case?
Laboratory of Systems Biology,
Department of Genetic Information Translation Mechanisms,
Institute of Molecular Biology and Genetics, Kyiv, Ukraine