[BioC] gene2pathway retrain: which model is more complete?

Mon Aug 23 15:15:06 CEST 2010

I believe there is a plausible explanation for Question 1: quite a
number of packages differ in version between home and the server,
*including* gene2pathway, which is 1.6.0 on the server and 1.6.1 at
home. I had wrongly assumed the gene2pathway versions were the same.
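
In case it helps anyone hitting the same thing, here is a minimal sketch
of how installed versions could be compared between two machines. The
helper name diff_versions and the example vectors are only illustrative
(examplePkg is a made-up package name); on a real machine the input
would come from installed.packages()[, "Version"].

```r
# Hypothetical helper: report packages whose versions differ between two
# named character vectors (names = package names, values = versions).
diff_versions <- function(a, b) {
  common  <- intersect(names(a), names(b))
  differs <- common[a[common] != b[common]]
  data.frame(package = differs,
             first   = unname(a[differs]),
             second  = unname(b[differs]),
             stringsAsFactors = FALSE)
}

# Illustrative input only. On each machine one would instead save
#   v <- installed.packages()[, "Version"]
# and load both vectors into a single session.
server <- c(gene2pathway = "1.6.0", examplePkg = "1.0.0")
home   <- c(gene2pathway = "1.6.1", examplePkg = "1.0.0")
print(diff_versions(server, home))
```

Packages installed on only one of the two machines are ignored by this
sketch; setdiff(names(a), names(b)) would list those.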

That leaves only Question 2 open. Once the final model is retrained,
I'll compare average prediction errors across the models and see which
one does better by that measure.
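
For the archive: one way to do the conversion mentioned in the quoted
message below, from one list element per Entrez-InterPro pair to one
character vector per Entrez ID, is sketched here on a toy list with the
same shape as my data. The names flat and nested are illustrative, and
this is only one possible approach.

```r
# Toy input in the same shape as entrez2interpro_list: one list element
# per (Entrez ID, InterPro ID) pair, so names are repeated.
flat <- list(`679594` = "IPR019956", `679594` = "IPR019954",
             `679594` = "IPR019955", `679594` = "IPR000626",
             `682397` = "IPR019956", `682397` = "IPR019954")

# Group the InterPro IDs by Entrez ID; each ID then appears once and
# maps to a character vector of all its domains.
nested <- split(unlist(flat, use.names = FALSE), names(flat))
nested <- lapply(nested, unique)   # drop repeated pairs, if any

str(nested)
```

The result has one element per Entrez ID, which is the shape shown in
the head(entrez2interpro_nested) output quoted below.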

On 23 August 2010 14:52, Bogdan <b.t.tokovenko at imbg.org.ua> wrote:
> After converting my custom gene2Domains mapping into a list of vectors
>
>> head(entrez2interpro_nested)
> $`679594`
>  [1] "IPR019956" "IPR019954" "IPR019955" "IPR000626"
>
> $`682397`
> [1] "IPR019956" "IPR019954"
>
> and feeding that into retrain(), I now have the 4th model (most
> complete?), built using
> genes: 5667 of 5667
> features: 4007
> level detectors: 78
>
> This obsoletes my Questions 3 and 4 from my previous email.
> However, Questions 1 and 2 are still not fully clear to me.
>
> I would now rephrase Q2 as:
> Of all the retrain()-generated models I now have, which one is
> theoretically better to use?
> The one with the most genes, most level detectors, or most features (domains)?
> Or the one with the lowest average prediction error, disregarding all
> other factors?
>
> On 22 August 2010 17:00, Bogdan <b.t.tokovenko at imbg.org.ua> wrote:
>> Dear all,
>>
>> I have 2 PCs: server running Debian Lenny and R 2.7.1, and home
>> running Debian Testing and R 2.11.1. Both have gene2pathway 1.6.1 (and
>> dependencies) installed.
>>
>> When running `model.rno = retrain(organism = "rno")`, I got slightly
>> different outputs describing the components to build the model:
>>
>> (server)
>> genes: 4055 of 5667
>> features: 3553
>> level detectors: 74
>>
>> (home)
>> genes: 3987 of 5577
>> features: 3488
>> level detectors: 75
>>
>> Question 1: The retrain() manual states that all the data for model
>> training is fetched from KEGG and Ensembl. How then could these
>> differences (above) be possible? I've run each retrain twice, to be
>> sure that was not a momentary glitch.
>>
>>
>> Seeing this, I've decided to manually supply gene2Domains mapping.
>> Using BioMart, I asked for all entrez-interpro pairs:
>>> head(entrez2interpro_list)
>> $`679594`
>> [1] "IPR019956"
>>
>> $`679594`
>> [1] "IPR019954"
>>
>> $`679594`
>> [1] "IPR019955"
>>
>> $`679594`
>> [1] "IPR000626"
>>
>> $`682397`
>> [1] "IPR019956"
>>
>> $`682397`
>> [1] "IPR019954"
>>
>>> length(unique(names(entrez2interpro_list)))
>> [1] 17666
>>
>>> model.rno = retrain(organism = "rno", gene2Domains = entrez2interpro_list)
>>
>> Feeding entrez2interpro_list to retrain(), I got these numbers:
>>
>> (manual gene2Domains)
>> genes: 5677 of 5677
>> features: 1852
>> level detectors: 78
>>
>> Question 2 (main question): Of these 3 models I now have, which one is
>> theoretically better to use? The one with most genes, most level
>> detectors, or most features?
>>
>> Question 3: Is the format of my entrez2interpro_list correct? There
>> were no errors, but that list has duplicate names. I wonder if each
>> EntrezID should be in the list only once, with all relevant IPRs
>> packed into a nested list.
>> (possibly related) Question 4: How could it happen that there are only
>> 1852 features for the most complete coverage of gene mappings in
>> "manual gene2Domains" case?

-- 
Regards,
Bogdan Tokovenko
--
Laboratory of Systems Biology,
Department of Genetic Information Translation Mechanisms,
Institute of Molecular Biology and Genetics, Kyiv, Ukraine
http://SysBio.org.ua/
http://BioMed.org.ua/COTRASIF/