[BioC] athPkgBuilder data source

Wed Aug 9 20:46:09 CEST 2006

Dear list,

I posted some questions about the data sources used by athPkgBuilder on the
bioc-devel list two months ago, but got no reply. I would like to try one more
time here. Sorry for the double posting.

----------------------------------------------------------------

I took a close look at the athPkgBuilder function in AnnBuilder (the builder of
ath1121501 and ag) and have some questions about the data sources being used:

1. probeset id to gene mapping:
The current mapping strategy is:
1) map probeset id to "Representative.Public.ID" using the Affymetrix GeneChip
annotation data
2) use "Representative.Public.ID" as if it were an AGI locus id to get other
annotations (pathway, go, etc) from TAIR

It seems that "Representative.Public.ID" is a mix of AGI locus ids, UniGene
cluster ids and a small number of identifiers from other sources. In the
Affymetrix annotation file there is another column called "Transcript ID (Array
Design)", which has almost the same values as "Representative.Public.ID". I
suspect it originated from ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/ .
I am not sure whether Affymetrix updates those two columns on a regular basis.
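
To check how mixed that column really is, something like the following rough
sketch could be run over the values. The patterns are my assumptions about the
usual identifier formats (AGI locus ids such as AT1G01010, UniGene clusters such
as At.12345), and classify_id / summarize are names made up for illustration:

```python
import re
from collections import Counter

# Assumed formats, not taken from the Affymetrix file documentation:
# AGI locus ids look like AT1G01010 (chromosomes 1-5, C or M),
# UniGene clusters for Arabidopsis look like At.12345.
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)
UNIGENE = re.compile(r"^At\.\d+$")


def classify_id(identifier):
    """Return 'agi_locus', 'unigene' or 'other' for one identifier."""
    identifier = identifier.strip()
    if AGI_LOCUS.match(identifier):
        return "agi_locus"
    if UNIGENE.match(identifier):
        return "unigene"
    return "other"


def summarize(identifiers):
    """Count how many identifiers fall into each class."""
    return Counter(classify_id(i) for i in identifiers)
```

Running summarize over the "Representative.Public.ID" column would show how many
probesets can actually be treated as AGI loci.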

But if all the annotations (chromosome, go, pathway) come from TAIR, maybe we
should use TAIR's mapping of probeset id to AGI locus id:
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :
    "The oligonucleotide sequences of the probes were mapped to the Arabidopsis
    Transcripts dataset from the Arabidopsis genome TAIR6 version (released
    November 11, 2005). The dataset included mitochondria and chloroplast
    genes, as well as pseudogenes and non-coding RNAs. The mapping to the TAIR6
    Transcripts was performed using the BLASTN program with e-value cutoff <
    9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required
    match length to achieve this e-value is 23 or more identical nucleotides.
    To assign a probe set to a given locus, at least 9 of the probes included
    in the probe set were required to match a transcript at that locus."
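
As I read the assignment rule quoted above, it could be sketched roughly as
follows. This is only my interpretation, not TAIR's code. It assumes the BLASTN
hits have already been filtered by the e-value cutoff and are given as
(probeset id, probe id, locus id) tuples; the function and parameter names are
made up:

```python
from collections import defaultdict

MIN_MATCHING_PROBES = 9  # threshold taken from the TAIR README text


def assign_probesets(probe_hits, min_probes=MIN_MATCHING_PROBES):
    """Map each probeset to the loci matched by at least min_probes probes.

    probe_hits: iterable of (probeset_id, probe_id, locus_id) tuples, assumed
    to contain only hits that already passed the e-value cutoff.
    """
    # Collect the distinct probes of each probeset that hit each locus.
    probes_per_locus = defaultdict(set)
    for probeset_id, probe_id, locus_id in probe_hits:
        probes_per_locus[(probeset_id, locus_id)].add(probe_id)

    # Keep only the (probeset, locus) pairs that reach the threshold.
    assignments = defaultdict(list)
    for (probeset_id, locus_id), probes in probes_per_locus.items():
        if len(probes) >= min_probes:
            assignments[probeset_id].append(locus_id)
    return {ps: sorted(loci) for ps, loci in assignments.items()}
```

Note that under this rule a probeset can end up with zero, one or several loci.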

Not all probeset ids have matching AGI locus ids. Do we need to provide mappings
to other gene identifiers, such as GenBank accession numbers or Entrez Gene IDs,
to make the annotations more complete? Affymetrix has started to provide
probeset id to Entrez Gene ID mappings in their annotation files. Should we
include that information? Also, I can see three possible ways to get the
probe-to-GenBank mapping: 1) from the Affymetrix annotation file directly,
2) probe to AGI locus and then AGI locus to GenBank accession, all from TAIR,
3) probe to Entrez Gene from Affymetrix, and then Entrez Gene to GenBank from
NCBI. Which way is best? Or should we use the "voting" algorithm used by
ABPkgBuilder?
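
To make the voting idea concrete, here is a minimal sketch of a majority vote
over the three routes. It is not the ABPkgBuilder code; the source names and the
rule of returning nothing on a tie are my own assumptions:

```python
from collections import Counter


def vote(candidates):
    """Pick the identifier proposed by the most sources.

    candidates: dict of source name -> identifier (or None if that source has
    no mapping). Returns None if no source has a mapping or if the top two
    identifiers tie.
    """
    counts = Counter(value for value in candidates.values() if value)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear winner
    return ranked[0][0]
```

For example, vote({"affy": "X", "tair": "X", "ncbi": "Y"}) returns "X", since
two of the three routes agree.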

2. chromosome location
The current package gets chromosome locations from
ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus
Although the file seems to be updated very often, the directory it is located in
and the README file are not, so it is not clear to me how the file is
generated/updated. Any hint on that? Would
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a better source?
The meaning of chromosome location in those two sources may be different though.
The former means the location of a GenBank EST, and the latter means "chromosome
coordinates of the best probe set match to the Transcripts dataset".

3. gene description (ath1121501GENENAME)
The current package (1.12.1) gets the description from
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes . The
descriptions are the same as those in
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ . Both give the
description of the AGI locus corresponding to an Affy probeset.
In the Affymetrix annotation file there is a column called "Target
Description", which describes the gene that a probeset targets. All probesets
have a description there, so we would get better coverage than by taking the
descriptions from TAIR. When the "Representative Public ID" (or "Transcript ID")
is an AGI locus id, it seems the description was retrieved from TAIR. However,
it is not clear how this information is updated, or whether it is synchronized
with TAIR's updates. Another possible source of descriptions is Entrez Gene,
since Affymetrix maps probesets to Entrez Gene.

4. pathway
Pathway information is currently obtained from AraCyc, a pathway tool in TAIR:
http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I believe it only
contains metabolic pathways (I may be wrong, as I only read the introduction).
KEGG contains regulatory pathways as well, and it is also manually curated.
Those two sources are independent of each other. Shall we include both of them?
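
If we did include both, one option would be to keep the source attached to each
pathway rather than mixing the names together. A small sketch, assuming each
source has already been parsed into a dict of AGI locus id -> list of pathway
names (merge_pathways is a made-up name):

```python
def merge_pathways(aracyc, kegg):
    """Combine two locus -> pathways mappings, keeping the source of each.

    Returns a dict of upper-cased locus id -> {source name: sorted pathways}.
    """
    merged = {}
    for source, mapping in (("AraCyc", aracyc), ("KEGG", kegg)):
        for locus, pathways in mapping.items():
            # Normalise locus case so At1g01010 and AT1G01010 are merged.
            merged.setdefault(locus.upper(), {})[source] = sorted(set(pathways))
    return merged
```

Keeping the sources separate would also let them be exposed as two different
environments in the package.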

5. pubmed
Probeset to pubmed mapping is currently obtained from
ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt .
The pubmed ids represent the publications that TAIR used to map an AGI locus id
to a concept in Plant Ontology. But I think an environment like ath1121501PUBMED
should represent the publications about the gene matching a probeset. I
didn't find an AGI locus to pubmed mapping in TAIR, so we would have to get it
from either Entrez Gene IDs or GenBank accessions. This gets back to the first
question: what is the best way to map probesets to GenBank/Entrez Gene?
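
Whichever route is chosen, the pubmed environment would then be a composition of
two mappings (probeset -> gene id, gene id -> pubmed ids). A rough sketch of
that step, with made-up names and with both inputs assumed to be dicts of id ->
list of ids:

```python
def compose(first, second):
    """Chain two one-to-many mappings: key -> mid (first), mid -> target (second).

    Keys whose intermediate ids have no targets are left out of the result.
    """
    result = {}
    for key, intermediates in first.items():
        targets = set()
        for mid in intermediates:
            targets.update(second.get(mid, ()))
        if targets:
            result[key] = sorted(targets)
    return result
```

Probesets missing from the result are the ones for which the chosen route gives
no publications, which is one way to compare the coverage of the three routes.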

Hope this email is not too long. Any feedback will be highly appreciated. If we
decide to use a better data source, I will be happy to do the implementation.

many thanks

Nianhua Li
computational biology, public health, FHCRC