[BioC] annotation package for chicken affyprobes

Nianhua Li nli at fhcrc.org
Mon Aug 28 19:36:46 CEST 2006


Dear Lina,

ABPkgBuilder annotates in two steps. It first uses mybasefile and all
other mapping sources you provide to build a mapping between probeset
IDs and Entrez Gene IDs, and then uses the Entrez Gene IDs to retrieve
annotations from public databases. The probeset ID to Entrez Gene ID
mapping is therefore the basis for all other annotations. I noticed
this line in your QC data: "chickenLOCUSID found 9722 of 38535". It
means that only 9722 of the 38535 probeset IDs (about 25%) were mapped
to Entrez Gene, which I think is the reason for the low annotation
coverage. The GenBank to Entrez Gene mapping is based on
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz, and the UniGene
to Entrez Gene mapping is based on
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene (I am less sure about
this one). Could you please trace a few probeset IDs that did not map
to Entrez Gene and see what is going on?
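
One way to do that tracing outside R is sketched below in Python. This is
only an illustration, not the ABPkgBuilder code: the function names and
the toy rows are hypothetical, and I am assuming that gene2accession is
tab-delimited with tax_id in column 1, GeneID in column 2 and the RNA
nucleotide accession.version in column 4. Please check the header of the
file you actually downloaded and adjust the column positions.

```python
import csv
import io

CHICKEN_TAX_ID = "9031"  # NCBI taxonomy ID for Gallus gallus


def load_accession_to_gene(lines, tax_id=CHICKEN_TAX_ID,
                           tax_col=0, gene_col=1, acc_col=3):
    """Build {accession without version: GeneID} from gene2accession-style rows.

    The column positions are assumptions; adjust them to the real header.
    """
    acc_to_gene = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, the "#" header line, and truncated rows.
        if not row or row[0].startswith("#") or len(row) <= acc_col:
            continue
        if row[tax_col] != tax_id:
            continue
        accession = row[acc_col].split(".")[0]  # drop the ".version" suffix
        if accession != "-":                    # "-" marks a missing value
            acc_to_gene[accession] = row[gene_col]
    return acc_to_gene


def split_mapped(probe_to_acc, acc_to_gene):
    """Return (mapped, unmapped): probes with and without an Entrez Gene ID."""
    mapped, unmapped = {}, []
    for probe, acc in probe_to_acc.items():
        gene = acc_to_gene.get(acc.split(".")[0])
        if gene is None:
            unmapped.append(probe)
        else:
            mapped[probe] = gene
    return mapped, unmapped


# Toy data (made up) standing in for gene2accession and for mybasefile.
toy_gene2accession = io.StringIO(
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\n"
    "9031\t111111\tVALIDATED\tNM_000001.1\n"
    "9606\t222222\tREVIEWED\tNM_000002.2\n"
)
toy_probes = {"probe_A": "NM_000001", "probe_B": "XX000003"}

acc_to_gene = load_accession_to_gene(toy_gene2accession)
mapped, unmapped = split_mapped(toy_probes, acc_to_gene)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

The unmapped list is the place to start: looking up a few of those
accessions by hand at NCBI should show whether they are missing from
gene2accession or are being lost somewhere else.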

chickenCHRLOC should contain chromosome location information from the
UCSC Genome database. The mapping is based on two files, refGene.txt.gz
and refLink.txt.gz, in
http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Gallus_gallus/database/
The first file provides chromosome locations for RefSeq IDs, and the
second provides the Entrez Gene to RefSeq mapping. It is surprising that
only 156 of the 9722 mapped probeset IDs received UCSC annotations. If
you could send me some of the Entrez Gene IDs, I can trace the problem
in the code.
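
The two-file join described above can be sketched as follows. This is a
minimal Python sketch under stated assumptions, not the code the package
builder runs: the function name is hypothetical, the rows are made up,
and the column positions (refLink: mrnaAcc at index 2, locusLinkId at
index 6; refGene: name at index 1, chrom at 2, strand at 3, txStart at
4) are what I believe the UCSC tables use and should be checked against
the table schema.

```python
def entrez_to_locations(reflink_rows, refgene_rows,
                        link_acc_col=2, link_gene_col=6,
                        gene_acc_col=1, chrom_col=2,
                        strand_col=3, start_col=4):
    """Join Entrez Gene -> RefSeq (refLink) with RefSeq -> location (refGene).

    Returns {Entrez Gene ID: [(chrom, strand, start), ...]}.
    Column positions are assumptions about the UCSC table layout.
    """
    refseq_to_gene = {}
    for row in reflink_rows:
        gene_id = row[link_gene_col]
        if gene_id not in ("", "0"):          # skip rows with no gene ID
            refseq_to_gene[row[link_acc_col]] = gene_id

    locations = {}
    for row in refgene_rows:
        gene_id = refseq_to_gene.get(row[gene_acc_col])
        if gene_id is not None:
            locations.setdefault(gene_id, []).append(
                (row[chrom_col], row[strand_col], int(row[start_col])))
    return locations


# Toy rows (made up), already split on tabs.
toy_reflink = [
    ["GENE1", "product 1", "NM_000001", "NP_000001", "0", "0", "111111", "0"],
    ["GENE3", "product 3", "NM_000003", "NP_000003", "0", "0", "333333", "0"],
]
toy_refgene = [
    ["585", "NM_000001", "chr1", "+", "1000", "2000"],
]

locations = entrez_to_locations(toy_reflink, toy_refgene)
wanted = ["111111", "333333"]
missing = [g for g in wanted if g not in locations]
print("located:", locations)
print("missing:", missing)
```

A gene ends up in the missing list whenever its RefSeq ID has no row in
refGene, or its Entrez Gene ID has no row in refLink, so checking both
files for a few missing IDs shows which half of the join is failing.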

chickenCHR and chickenMAP should contain chromosome location
information from Entrez Gene. The information comes from
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz.
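
To check what gene_info holds for a given gene, a sketch like the one
below may help. It is hypothetical Python, not the builder's code, and
it assumes tab-delimited rows with tax_id, GeneID, chromosome and
map_location at indices 0, 1, 6 and 7, with "-" marking a missing
value. The rows are made up.

```python
def chromosome_annotations(gene_info_rows, tax_id="9031",
                           tax_col=0, gene_col=1, chr_col=6, map_col=7):
    """Collect chromosome and map_location per Entrez Gene ID.

    Returns (chrom, cyto): two dicts keyed by GeneID. A "-" value is
    treated as missing and produces no entry. Column positions are
    assumptions about the gene_info layout.
    """
    chrom, cyto = {}, {}
    for row in gene_info_rows:
        if row[tax_col] != tax_id:
            continue                      # keep only the requested organism
        if row[chr_col] != "-":
            chrom[row[gene_col]] = row[chr_col]
        if row[map_col] != "-":
            cyto[row[gene_col]] = row[map_col]
    return chrom, cyto


# Toy rows (made up), already split on tabs.
toy_gene_info = [
    ["9031", "111111", "GENE1", "-", "-", "-", "1", "-"],
    ["9031", "444444", "GENE4", "-", "-", "-", "-", "-"],
    ["9606", "222222", "GENE2", "-", "-", "-", "7", "7q31"],
]

chrom, cyto = chromosome_annotations(toy_gene_info)
print("chromosome:", chrom)
print("map_location:", cyto)
```

In this sketch a gene with a chromosome but a "-" in map_location gets
a CHR entry and no MAP entry, so it is worth looking at both columns
for a few of your Entrez Gene IDs.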

Hope this helps.

nianhua


Lina Hultin-Rosenberg wrote:
> Hi again!
>
> I managed to build the annotation package for chicken - thanks for all your
> help! 
>
> I was a bit surprised though by the low annotation coverage, see QC data
> below. I don't really know how the data is collected but I would think more
> information on chromosome location is known for the probesets. When reading
> about the new chicken genome assembly (http://genome.ucsc.edu) it says that
> around 95% of the sequence has been anchored to chromosomes. I thought the
> annotation process in R used this information? 
>
> What can be the reason for the very few anchored probesets? I might be doing
> something wrong, or perhaps it is a problem of mapping probe IDs to other
> identifiers? I used the GenBank mappings as mybasefile and the UniGene and
> Entrez mappings as other sources. Is there a way within R to increase
> annotation coverage? I am especially interested in chromosome location
> (number), but maybe this is a problem that is best solved outside R?
>
> Would greatly appreciate some help!
>
> Thank you, 
> Lina
>
>
> =======================================================================
> QC data:
> Number of probes: 38535
> Probe number missmatch: None
> Probe missmatch: None
> Mappings found for probe based rda files:
>          chickenACCNUM found 25654 of 38535
>          chickenCHR found 9707 of 38535
>          chickenCHRLOC found 156 of 38535
>          chickenENZYME found 52 of 38535
>          chickenGENENAME found 0 of 38535
>          chickenGO found 4224 of 38535
>          chickenLOCUSID found 9722 of 38535
>          chickenMAP found 0 of 38535
>          chickenPATH found 87 of 38535
>          chickenPMID found 283 of 38535
>          chickenREFSEQ found 9709 of 38535
>          chickenSUMFUNC found 0 of 38535
>          chickenSYMBOL found 9722 of 38535
>          chickenUNIGENE found 289 of 38535
> Mappings found for non-probe based rda files:
>          chickenENZYME2PROBE found 33
>          chickenGO2ALLPROBES found 1785
>          chickenGO2PROBE found 930
>          chickenORGANISM found 1
>          chickenPATH2PROBE found 31
>          chickenPFAM found 7418
>          chickenPMID2PROBE found 101
>          chickenPROSITE found 5490
> ========================================================================== 
>
>


