[BioC] athPkgBuilder data source :missing probesets

Thu Aug 10 16:08:40 CEST 2006

Dear Nianhua,

as Tine and you pointed out, there are some probesets that don't match a 
gene.

We used to map the probesets ourselves, also based on the oligos, and 
came to very similar conclusions as TAIR.
Also, the very old mappings that matched every single probeset were 
based on the target sequences, i.e. the sequences the oligos were 
designed against, not the actual oligos.

Thus, by including the "missing" ones you would in most cases get 
spurious or wrong assignments.
Most (if not all) of the missing ones simply don't hit a gene model 
from the latest TAIR release and should therefore not be annotated with 
any gene.
I personally would prefer not to map the ones that, given current 
knowledge, sample different genes or no gene at all.

The only thing you might want to change is TAIR's thresholds, but I 
think they are quite reasonable, and if you/me/everyone relies on their 
mapping, we are at least talking about the same thing.
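
For reference, the TAIR rule quoted below (BLASTN e-value < 9.9e-6, at 
least 9 probes of a probe set matching at one locus) could be sketched 
roughly like this. It is only an illustration, not TAIR's actual 
pipeline: the function name, the input layout and the ids are made up, 
and it counts probes per locus rather than per individual transcript, 
which is a simplification.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 9.9e-6  # cutoff stated in the TAIR README quoted below
MIN_PROBES = 9           # probes of one probe set that must hit the locus


def assign_probesets(hits, e_cutoff=E_VALUE_CUTOFF, min_probes=MIN_PROBES):
    """Map each probe set to the loci supported by enough of its probes.

    hits: iterable of (probeset_id, probe_id, locus_id, e_value) tuples,
          one per BLASTN hit (hypothetical layout, assumed for this sketch).
    Returns {probeset_id: sorted list of locus ids}. Probe sets without
    any locus reaching the threshold are left out, i.e. stay unannotated.
    """
    # Collect the distinct probes that hit each (probe set, locus) pair.
    probes_per_locus = defaultdict(set)
    for probeset, probe, locus, e_value in hits:
        if e_value < e_cutoff:
            probes_per_locus[(probeset, locus)].add(probe)

    # Keep only loci hit by at least min_probes distinct probes.
    assignment = defaultdict(list)
    for (probeset, locus), probes in probes_per_locus.items():
        if len(probes) >= min_probes:
            assignment[probeset].append(locus)
    return {ps: sorted(loci) for ps, loci in assignment.items()}


# Made-up example: ps_A has 11 probes on one locus, ps_B only 5.
example_hits = [("ps_A", f"probe{i}", "AT1G00001", 1e-7) for i in range(11)]
example_hits += [("ps_B", f"probe{i}", "AT2G00002", 1e-7) for i in range(5)]
example_result = assign_probesets(example_hits)
```

In this made-up example ps_A is assigned to its locus, while ps_B stays 
unassigned, which is the behaviour I would prefer for the "missing" 
probesets.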

Cheers,
Björn

Nianhua Li wrote:
> Dear list,
> 
> I had some doubts about the data sources used by athPkgBuilder, which I posted
> on the bioc-devel list two months ago, but got no reply. I would like to try one
> more time here. Sorry for the double posting.
> 
> ----------------------------------------------------------------
> 
> I took a close look at the athPkgBuilder function in AnnBuilder (builder of
> ath1121501 and ag) and have some questions about the data sources being used:
> 
> 1. probeset id to gene mapping:
> The current mapping strategy was
> 1) map probe id to "Representative.Public.ID" by using Affymetrix GeneChip
> annotation data
> 2) use "Representative.Public.ID" as if it was AGI locus id to get other
> annotations (pathway, go, etc) from TAIR
> 
> It seems that "Representative.Public.ID" is a mix of AGI locus ids, UniGene
> clusters and a small number of ids from other sources. In the Affymetrix
> annotation file, there is another column called "Transcript ID (Array Design)",
> which has almost the same values as "Representative.Public.ID". I suspect it
> originated from ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/. I am not
> sure whether Affymetrix updates those two columns on a regular basis.
> 
> But if all the annotations (chromosome, go, pathway) come from TAIR, maybe we
> should use TAIR's mapping of probeset id to AGI locus id:
> ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :
>     "The oligonucleotide sequences of the probes were mapped to the
>     Arabidopsis Transcripts dataset from the Arabidopsis genome TAIR6 version
>     (released November 11, 2005). The dataset included mitochondria and
>     chloroplast genes, as well as pseudogenes and non-coding RNAs. The mapping
>     to the TAIR6 Transcripts was performed using the BLASTN program with
>     e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the Affy
>     chips, the required match length to achieve this e-value is 23 or more
>     identical nucleotides. To assign a probe set to a given locus, at least 9
>     of the probes included in the probe set were required to match a
>     transcript at that locus."
> 
> Not all probeset ids have matching AGI locus ids. Do we need to provide mappings
> to other gene identifiers such as GenBank accession numbers or Entrez Gene IDs to
> make the annotations more complete? Affymetrix has started to provide probeset id
> to Entrez Gene ID mappings in their annotation files. Should we include that
> information? Also, I can see three possible ways to get the probe-to-GenBank
> mapping: 1) from the Affymetrix annotation file directly, 2) probe to AGI locus
> and then AGI locus to GenBank accession, all from TAIR, 3) probe to Entrez Gene
> from Affymetrix, and then Entrez Gene to GenBank from NCBI. Which way is the
> best? Or should we use the "voting" algorithm used by ABPkgBuilder?
> 
> 2. chromosome location
> The current package gets chromosome locations from
> ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus 
> Even though the file seems to be updated very often, the directory it is located
> in and the README file are not, so it is not clear to me how it is
> generated/updated. Any hints on that? Would
> ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a better source?
> The meaning of chromosome location in those two sources may be different though.
> The former means the location of a GenBank EST, and the latter means "chromosome
> coordinates of the best probe set match to the Transcripts dataset".
> 
> 3. gene description (ath1121501GENENAME)
> The current package (1.12.1) gets the description from
> ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes The descriptions
> are the same as in ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/
> Both give the description of the AGI locus corresponding to an Affymetrix
> probeset. In the Affymetrix annotation file, there is a column called "Target
> Description". It is the description of the gene that a probeset targets.
> All probesets have descriptions there, so we would get better coverage than by
> taking the description from TAIR. When the "Representative Public ID" (or
> "Transcript ID") is an AGI locus id, it seems the description was retrieved from
> TAIR. However, it is not clear how this information is updated, and whether it
> is synchronized with TAIR's updates. Another possible source of descriptions is
> Entrez Gene, since Affymetrix maps probesets to Entrez Gene.
> 
> 4. pathway
> Pathway information is currently obtained from AraCyc, a pathway tool in TAIR:
> http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I think it only
> contains metabolic pathways (I may be wrong, as I only read the introduction).
> KEGG contains regulatory pathways as well, and it is also manually curated.
> Those two sources are independent of each other. Shall we include both of them?
> 
> 5. pubmed
> The probeset to pubmed mapping is currently obtained from
> ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt .
> The pubmed ids represent the publications that TAIR used to map an AGI locus id
> to a concept in the Plant Ontology. But I think an environment like
> ath1121501PUBMED should represent the publications about the gene matching a
> probeset. I didn't find an AGI locus to pubmed mapping in TAIR, so we would have
> to get it from either Entrez Gene id or GenBank accession. This gets back to the
> first question: what is the best way to map probesets to GenBank/Entrez Gene?
> 
> Hope this email is not too long. Any feedback will be highly appreciated. If we
> decide to use a better data source, I will be happy to do the implementation.
> 
> many thanks
> 
> Nianhua Li
> computational biology, public health, FHCRC
> 

-- 
-+-+-+-+-+-+-+-+-+-+-+-
Björn Usadel, PhD

Max Planck Institute of Molecular Plant Physiology
System Regulation Group

Am Mühlenberg 1
D-14476 Golm
Germany

Tel    (+49 331) 567-8114

Email  usadel at mpimp-golm.mpg.de
WWW    mapman.mpimp-golm.mpg.de