[BioC] athPkgBuilder

Sat Aug 19 04:36:16 CEST 2006

Dear All,

As promised, I have updated athPkgBuilder (AnnBuilder v 1.11.8) and just
committed it to the devel svn repository. Here are the changes:

1. Previously the URLs of the data sources were specified in parameter
fileExt. This remains unchanged. But now you can use function getFileExt
to generate the list, and feed it directly to parameter fileExt:
> getFileExt("AG")
$base
[1] "Microarrays/Affymetrix/affy_AG_array_elements-2006-07-14.txt"

$estAssign
[1] "Genes/est_mapping/est.Assignment.Locus"

$seqGenes
[1] "Genes/TAIR_sequenced_genes"

$go
[1] "Ontologies/Gene_Ontology/ATH_GO_GOSLIM.20060815.txt"

$aliases
[1] "Genes/gene_aliases.20060620"

$aracyc
[1] "Pathways/aracyc_dump_20060214"

$kegg
[1] "/ath/ath_gene_map.tab"

$pmid
[1] "User_Requests/LocusPublished.08012006.txt"

Function getFileExt takes the chip name as input, either "ATH" or "AG".
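
If you only want to change one source, something like the following
should work. This is just a sketch: the list is typed in by hand from
the output above (a few slots only) so that it runs without AnnBuilder,
the replacement GO file name is made up, and the athPkgBuilder call is
left as a comment because its other arguments are not shown here.

```r
# Stand-in for the result of getFileExt("AG"), copied from the output above
# (only some slots shown) so the sketch runs without AnnBuilder installed.
fileExt <- list(
  base = "Microarrays/Affymetrix/affy_AG_array_elements-2006-07-14.txt",
  go   = "Ontologies/Gene_Ontology/ATH_GO_GOSLIM.20060815.txt",
  kegg = "/ath/ath_gene_map.tab",
  pmid = "User_Requests/LocusPublished.08012006.txt"
)

# Override a single slot; the new file name is a made-up example.
myExt <- modifyList(fileExt,
                    list(go = "Ontologies/Gene_Ontology/my_go_file.txt"))

str(myExt)

# With AnnBuilder loaded, the list would then be passed to parameter
# fileExt, e.g.
# athPkgBuilder(..., fileExt = myExt)   # other arguments omitted here
```

modifyList leaves the other slots untouched, so the remaining URLs stay
as getFileExt produced them.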

2. If you compare the above list with the one athPkgBuilder had before,
there are a few changes:

$base: a new slot, the URL of the probe-to-gene mapping file. It is used
only when no baseName is given. In other words, users can supply their
own mapping, or set baseName=NULL (the default) and use TAIR's.

$aracyc: the slot name was $path before. I changed it to $aracyc to
clarify that the data comes from AraCyc. The enzyme annotation from
AraCyc is stored in environment "ENZYME" in the final package. The
pathway annotation is stored in environment "ARACYC" in the final package.

$kegg: a new slot, the URL of KEGG's pathway data. The pathway
annotation is stored in environment "PATH" in the final package. Note
that environment "PATH" was obtained from AraCyc before, so this is a
change. The main reason for the change is that we get pathway data from
KEGG for all other annotation packages.

$pmid: now uses a different file from TAIR. Thanks to Tine for the
contribution.

3. When a probeset ID matches multiple genes:

There is a new parameter "indexby" for function "athPkgBuilder". The
value is either "PROBE" (default) or "ACCNUM".

(1) If indexby="PROBE":
If a probeset ID matches multiple genes, it is annotated with the
character string "multiple" in all annotations (e.g. agACCNUM, agGO,
etc.). There is also a new environment "MULTIHIT" (e.g. agMULTIHIT),
whose keys are probeset IDs and whose values are AGI locus IDs. All
probeset IDs are included. If a probeset matches one gene or none, its
value in "MULTIHIT" is NA; otherwise it is a vector of all matching AGI
locus IDs.
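
To make the rule concrete, here is a toy version in plain R. It is not
the code inside athPkgBuilder, only a sketch of the behaviour described
above: the probeset and locus IDs are invented, the environments are
built by hand, and storing NA in the ACCNUM stand-in for a probeset
with no match is my assumption for the example.

```r
# Hypothetical probe-to-gene mapping; the IDs are invented for illustration.
mapping <- data.frame(
  probe = c("p1_at", "p2_at", "p2_at", "p3_at"),
  locus = c("AT1G01010", "AT1G01020", "AT1G01030", NA),
  stringsAsFactors = FALSE
)

# Collect the non-missing locus IDs for every probeset.
hits <- lapply(split(mapping$locus, mapping$probe),
               function(x) x[!is.na(x)])

accnum   <- new.env()  # plays the role of agACCNUM
multihit <- new.env()  # plays the role of agMULTIHIT
for (p in names(hits)) {
  n <- length(hits[[p]])
  # More than one gene: the regular annotation becomes "multiple".
  # (NA for "no match" is an assumption made for this sketch.)
  assign(p, if (n > 1) "multiple" else if (n == 1) hits[[p]] else NA,
         envir = accnum)
  # MULTIHIT holds all loci only in the multi-gene case, NA otherwise.
  assign(p, if (n > 1) hits[[p]] else NA, envir = multihit)
}

get("p2_at", envir = accnum)    # the multi-gene probeset
get("p2_at", envir = multihit)  # its matching loci
get("p1_at", envir = multihit)  # a single-gene probeset
```

So a value of "multiple" in any annotation environment is the cue to
look the probeset up in "MULTIHIT".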

(2) If indexby="ACCNUM":
All annotations are indexed by AGI locus IDs rather than probeset IDs.
For example, environment "agGO" uses the AGI locus ID as key and the GO
annotation as value. All AGI locus IDs that occur in the probe-to-gene
mapping file are included. Environment "ACCNUM" (e.g. agACCNUM) then
provides the probe-to-gene mapping: the key is the probeset ID, and the
value is the AGI locus ID.
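
With this indexing, getting from a probeset to its annotation takes two
lookups. A rough sketch with hand-made environments follows; the
probeset ID and the values are for illustration only, the GO value is
simplified to a plain character vector, and goForProbe is a
hypothetical helper, not a function in AnnBuilder.

```r
# Toy environments; the IDs and GO terms are for illustration only.
accnum <- new.env()  # plays the role of agACCNUM: probeset ID -> locus ID
go     <- new.env()  # plays the role of agGO: locus ID -> GO annotation

assign("p1_at", "AT1G01010", envir = accnum)
assign("AT1G01010", c("GO:0003700", "GO:0005634"), envir = go)

# Hypothetical helper: probeset ID -> locus ID(s) -> GO annotation.
goForProbe <- function(probe) {
  locus <- get(probe, envir = accnum)
  # mget returns a list named by locus ID; NA where a locus has no entry.
  mget(locus, envir = go, ifnotfound = list(NA))
}

goForProbe("p1_at")
```

The result is a list named by AGI locus ID, so the same helper would
also cope with a probeset whose ACCNUM value holds several locus IDs.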

4. Other issues:

(1) GO annotation: Thomas suggested getting the GO annotation from
GO.org instead of TAIR. I contacted TAIR, and here is the reply:

The 2 files should be the same at same point in time. The GO database is
more up to date because it gets updates from our curation database every
night, whereas the TAIR database is updated every 2 weeks, at least for
the time being. However, as an exception, last night there was a problem
with the update in the GO database and the data there is incomplete.
After tonight's update the data in GO should be fine.

It seems the two files are almost the same. Therefore, I prefer not to
change them, just because I am lazy :)

(2) CHRLOC: the data are currently obtained from
ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus
Maybe we should change the source, but again, I will follow your
suggestions.

I didn't update the data packages ath1121501 and ag, because the source
data for all annotation packages were supposed to be obtained in April
2006.

Any feedback or bug reports on the changes are highly appreciated. Thanks!

nianhua