[BioC] GEOquery and different types of GPL annotation files

Peter bioconductor-mailinglist at maubp.freeserve.co.uk
Fri Jan 20 15:44:46 CET 2006


I wrote:
> Does anyone know what the difference is between these two GEO GPL files?

It looks like the two files contain rather different annotation
information, with very little overlap; i.e. one is not just a subset
of the other.

I suspect different users will have different preferences...

-----------------------------------------------------------------------

Looking at the E. coli chip (GPL199),

> GPL199.annot (540kb)
> GPL199.soft (2166kb)

The larger (.soft) file includes a list of all GSM and GSE references 
using the platform, and the following columns:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL199&form=text&view=full

#ID = Affymetrix Probe Set ID ...
#ORF =
#Species Scientific Name = The genus and species of the ...
#Annotation Date = The date that the annotations for this ...
#SPOT_ID = Sequence Type: Indicates whether the sequence is ...
#Sequence Source = The database from which the sequence used ...
#Transcript ID(Array Design) =
#Representative Public ID = The accession number of a ...
#Alignments = Position of the alignment of the target sequence ...
#Gene Title = Title of Gene represented by the probe set.
#Gene Symbol = A gene symbol, when one is available (from UniGene).

The smaller (.annot) file includes the following columns:

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL199.annot.gz

#ID =  Platform reference identifier
#Gene = Description field extracted from Entrez Gene
#Unigene = Cluster ID extracted from Entrez UniGene
#UniGene title = UniGene title extracted from Entrez UniGene
#Nucleotide = Title extracted from Entrez Nucleotide
#Protein = Title extracted from Entrez Protein
#GI = GenBank identifier(s)
#GenBank Accession = GenBank accession(s)
#Gene symbol = Gene name field extracted from Entrez Gene
#Platform_CLONEID = CLONE_ID column from GEO Platform data table
#Platform_ORF = ORF column from GEO Platform data table
#Platform_SPOTID = SPOT_ID column from GEO Platform data table
#Platform_SPACC = SP_ACC column from GEO Platform data table
#Platform_PTACC = PT_ACC column from GEO Platform data table
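
For anyone wanting to check the column lists themselves, here is a minimal
Python sketch that collects "#Name = description" header lines like those
quoted above into a dictionary. The function name parse_column_descriptions
and the sample list are my own, purely for illustration; this is not part of
GEOquery, and it assumes every column header line starts with "#" and
contains an "=" sign, as in the excerpts in this message.

```python
def parse_column_descriptions(lines):
    """Collect '#Name = description' header lines into a dict.

    Hypothetical helper for illustration; assumes the header layout
    seen in the GPL excerpts quoted in this message.
    """
    columns = {}
    for line in lines:
        line = line.strip()
        # Skip anything that is not a '#Name = ...' header line.
        if not line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so descriptions may contain '='.
        name, _, description = line[1:].partition("=")
        columns[name.strip()] = description.strip()
    return columns


# A few header lines copied from the listings above.
sample = [
    "#ID =  Platform reference identifier",
    "#Gene = Description field extracted from Entrez Gene",
    "#Platform_ORF = ORF column from GEO Platform data table",
    "#ORF =",
]

columns = parse_column_descriptions(sample)
for name, description in columns.items():
    print(f"{name}: {description or '(no description)'}")
```

The same function could be fed the lines of a downloaded (and decompressed)
file; columns with an empty description, such as ORF above, come back as an
empty string rather than being dropped.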

-----------------------------------------------------------------------

In the case of the HG-U133A human chip (GPL96), the absolute file size
difference is much larger, which matters for load times:

GPL96.annot (3115kb)
GPL96.soft (11979kb)

The larger (.soft) file includes a list of all GSM and GSE references 
using the platform, and the following columns:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL96&form=text&view=full

#ID = Affymetrix Probe Set ID ...
#Species Scientific Name = The genus and species of the ...
#Annotation Date = The date that the annotations for ...
#GB_LIST = GenBank Accession Number ...
#SPOT_ID = Sequence Type: Indicates whether the sequence is ...
#Sequence Source = The database from which the sequence used ...
#Representative Public ID = The accession number of a ...
#Gene Title = Title of Gene represented by the probe set.
#Gene Symbol = A gene symbol, when one is available (from UniGene).
#Entrez Gene = Entrez Gene database UID ...
#RefSeq Transcript ID = References to multiple sequences in RefSeq. ...
#Gene Ontology Biological Process = ...
#Gene Ontology Cellular Component = ...
#Gene Ontology Molecular Function = ...

The smaller (.annot) file has the following columns, which differ from
those in the .soft file but match the GPL199.annot list above:

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL96.annot.gz

#ID =  Platform reference identifier
#Gene = Description field extracted from Entrez Gene
#Unigene = Cluster ID extracted from Entrez UniGene
#UniGene title = UniGene title extracted from Entrez UniGene
#Nucleotide = Title extracted from Entrez Nucleotide
#Protein = Title extracted from Entrez Protein
#GI = GenBank identifier(s)
#GenBank Accession = GenBank accession(s)
#Gene symbol = Gene name field extracted from Entrez Gene
#Platform_CLONEID = CLONE_ID column from GEO Platform data table
#Platform_ORF = ORF column from GEO Platform data table
#Platform_SPOTID = SPOT_ID column from GEO Platform data table
#Platform_SPACC = SP_ACC column from GEO Platform data table
#Platform_PTACC = PT_ACC column from GEO Platform data table
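
To put a number on "very little overlap", here is a rough Python sketch that
compares the two GPL96 column lists as quoted in this message. The column
names are copied by hand from the listings above; the helper name
shared_columns and the choice to ignore letter case are my own assumptions,
not anything GEO or GEOquery does.

```python
# Column names copied from the GPL96 listings in this message.
soft_columns = [
    "ID", "Species Scientific Name", "Annotation Date", "GB_LIST",
    "SPOT_ID", "Sequence Source", "Representative Public ID",
    "Gene Title", "Gene Symbol", "Entrez Gene", "RefSeq Transcript ID",
    "Gene Ontology Biological Process",
    "Gene Ontology Cellular Component",
    "Gene Ontology Molecular Function",
]
annot_columns = [
    "ID", "Gene", "Unigene", "UniGene title", "Nucleotide", "Protein",
    "GI", "GenBank Accession", "Gene symbol", "Platform_CLONEID",
    "Platform_ORF", "Platform_SPOTID", "Platform_SPACC", "Platform_PTACC",
]


def shared_columns(first, second):
    """Return the column names present in both lists, ignoring case.

    Hypothetical helper for illustration only.
    """
    first_names = {name.lower() for name in first}
    second_names = {name.lower() for name in second}
    return sorted(first_names & second_names)


shared = shared_columns(soft_columns, annot_columns)
print(shared)  # → ['gene symbol', 'id']
print(f"{len(shared)} of {len(soft_columns)} .soft columns "
      f"also appear in the .annot file")
```

Note that a shared name does not guarantee shared content: going by the
descriptions above, the .soft "Gene Symbol" comes from UniGene while the
.annot "Gene symbol" is extracted from Entrez Gene.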


