[BioC] Affy Human Gene ST1.0 cdf

James W. MacDonald jmacdon at med.umich.edu
Thu Jun 14 22:24:28 CEST 2007


Hi Raffaele,

rcaloger wrote:
> "Anthony Bosco" wrote:
> 
>>Hi,
>>Is anyone currently working on annotation and cdf packages for the
>>Affymetrix Human Gene ST1.0 microarrays?
> 
> 
> I have generated a cdf file for Human Gene ST 1.0 (hugene10stv1cdf) using the :
> Windows version: http://www.bioinformatica.unito.it/downloads/hugene10stv1cdf_1.0.0.zip
> Unix version:  http://www.bioinformatica.unito.it/downloads/hugene10stv1cdf_1.0.0.tar.gz

I would use some caution with these cdfs. The problem with this chip 
(and possibly the Exon chip as well) is that Affy has changed the 
probe/probeset relationship. In the past, as far as I know, each probeset 
consisted of a unique set of probes that were not shared with any 
other probeset. Unfortunately, this is no longer the case. From Affy's 
documentation of the pgf format, found here:

https://www.affymetrix.com/support/developer/fusion/File_Format_PGF_aptv161.pdf

Probe Level (level 2)
   probe_id (required): an integer id >= 0 which is a foreign key
   into the CLF file; a specific probe may be present in more
   than one probeset and as such is not guaranteed to be
   unique in the PGF file. Also note that the additional columns
   of information at the probe level may be context dependent.
   So for example a particular probe could potentially be a PM
   probe in one probeset and an MM probe in another. While
   unlikely, this is a possibility.
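
To make the quoted point concrete, here is a minimal sketch of how one 
might look for probes listed under more than one probeset. It runs on a 
tiny made-up PGF-like text, not the real file. The layout it assumes 
(probeset lines with no leading tab, atom lines with one, probe lines 
with two) is my reading of the format, and all IDs are invented.

```r
# Toy PGF-like text with invented IDs. Assumed layout: probeset lines have
# no leading tab, atom lines have one, probe lines have two.
pgf_lines <- c(
  "#%header0=probeset_id\ttype",
  "#%header1=\tatom_id",
  "#%header2=\t\tprobe_id\ttype",
  "100\tmain",
  "\t1",
  "\t\t11\tpm:st",
  "\t2",
  "\t\t12\tpm:st",
  "200\tmain",
  "\t3",
  "\t\t12\tpm:st",
  "\t4",
  "\t\t13\tpm:st"
)

body  <- pgf_lines[!grepl("^#", pgf_lines)]      # drop header lines
depth <- nchar(sub("^(\t*).*$", "\\1", body))    # number of leading tabs
first <- sub("^\t*([^\t]*).*$", "\\1", body)     # first field on each line

# Carry the most recent probeset ID forward to every line below it.
is_set <- depth == 0
set_id <- first[is_set][cumsum(is_set)]

probes <- data.frame(probeset = set_id[depth == 2],
                     probe    = first[depth == 2],
                     stringsAsFactors = FALSE)

# Probes that appear under more than one probeset.
n_sets <- tapply(probes$probeset, probes$probe,
                 function(x) length(unique(x)))
shared <- names(n_sets)[n_sets > 1]
print(shared)
```

On a real .pgf file the same idea would start from readLines() on the 
file, but I have not checked this sketch against the actual HuGene file.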


Both makecdfenv and makePlatformDesign assume that each probe belongs to 
exactly one probeset, so when the cdfenv/PDenv is produced, a probe that 
maps to more than one probeset is kept in only one of them and dropped 
from the others.

For instance:

 > get("8023935", hugene10stv1cdf)
            pm mm
  [1,]  601698 NA
  [2,]  357305 NA
  [3,]  922075 NA
  [4,]  912769 NA
  [5,]  851484 NA
  [6,]  880231 NA
  [7,]  459294 NA
  [8,]  454968 NA
  [9,]  484219 NA
 [10,]  707347 NA
 [11,]   31081 NA
 [12,]  996475 NA
 [13,]  201717 NA
 [14,]  393001 NA
 [15,]  269817 NA
 [16,]  616714 NA
 [17,]   30396 NA
 [18,]  982711 NA
 [19,]   24048 NA
 [20,]  436382 NA
 [21,]  866428 NA
 [22,]  215841 NA
 [23,]  488004 NA
 [24,] 1095598 NA
 [25,]  896814 NA
 [26,]  817403 NA
 [27,]  774426 NA
 [28,]  721396 NA
 [29,]  768925 NA
 [30,]  276653 NA
 [31,]  909900 NA
 [32,]  573479 NA
 > get("7991665", hugene10stv1cdf)
          pm mm
 [1,] 744933 NA
 > get("7896738", hugene10stv1cdf)
Error in get(x, envir, mode, inherits) : variable "7896738" was not found

You can either grep these probesets from the .pgf file for this chip, or 
simply go to NetAffx and query these three probesets to see that they 
all interrogate the same transcript. Probeset 7896738 has 31 probes, all 
of which are also in 8023935, so they are removed and the probeset is 
missing from the cdf. Probeset 7991665 has 33 probes, only one of which 
is unique (the other 32 are in 8023935).
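
Since get() throws an error for a probeset that was dropped entirely, a 
check with exists() is a gentler way to survey a list of probesets. The 
sketch below uses a toy environment as a stand-in for a cdf environment. 
The name toy_cdf, the probeset names, and the probe indices are all 
invented for illustration.

```r
# Toy stand-in for a cdf environment: one pm/mm index matrix per probeset.
# Probeset names and indices are invented for illustration.
toy_cdf <- new.env()
assign("1001", cbind(pm = c(10L, 20L, 30L), mm = NA_integer_), envir = toy_cdf)
assign("1002", cbind(pm = 40L, mm = NA_integer_), envir = toy_cdf)

ids <- c("1001", "1002", "1003")

# exists() avoids the error that get() raises for a missing probeset.
present <- vapply(ids, exists, logical(1), envir = toy_cdf, inherits = FALSE)

# Number of probes retained per probeset (0 if the probeset is absent).
n_probes <- vapply(ids, function(id) {
  if (present[[id]]) nrow(get(id, envir = toy_cdf)) else 0L
}, integer(1))
print(n_probes)
```

With the real package loaded, the same pattern should work with 
hugene10stv1cdf in place of toy_cdf.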

About 5% of the probes on this chip get removed because they are 
duplicates. Not exactly a huge problem, but something to be aware of.
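
As a rough illustration of the effect, here is a toy version of the 
three-probeset case above. This is not how makecdfenv is implemented. It 
only mimics the outcome on made-up probe IDs, and it assumes that 
whichever probeset is listed first keeps a shared probe.

```r
# Toy probe-to-probeset assignments (made-up IDs), shaped like the example:
# "big" has 32 probes, "small" has 31 (all in "big"), "mixed" has 33
# (32 of them in "big").
shared <- 1:31
map <- data.frame(
  probeset = c(rep("big", 32), rep("small", 31), rep("mixed", 33)),
  probe    = c(shared, 32, shared, shared, 32, 33),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: the first probeset listed keeps a shared probe.
kept <- map[!duplicated(map$probe), ]

sets   <- unique(map$probeset)
before <- table(factor(map$probeset,  levels = sets))
after  <- table(factor(kept$probeset, levels = sets))
print(rbind(before, after))

# Share of probe-to-probeset assignments dropped in this toy example.
frac_dropped <- 1 - nrow(kept) / nrow(map)
print(round(frac_dropped, 3))
```

The fraction printed here describes only the toy data and says nothing 
about the roughly 5% figure for the real chip.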

Best,

Jim


> 
> It was generated with the following code:
> library(affy)
> library(makecdfenv)
> tmp <- read.celfile("TisMap_Brain_01_v1_WTGene1.CEL")
> pname <- cleancdfname(tmp$HEADER$cdfName)
> 
> make.cdf.package("HuGene-1_0-st-v1.r3.cdf",
>        packagename  = pname,
>        version      = "1.0.0",
>        author       = "Raffaele A. Calogero",
>        maintainer   = "Raffaele A. Calogero <raffaele.calogero at unito.it>",
>        species      = "Homo_sapiens",
>        compress     = TRUE,
>        verbose      = TRUE)
> # then build the package from the shell:
> # R CMD build hugene10stv1cdf
> 
> Cheers
> Raffaele
> 
> 
>  
> 
> ----------------------------------------
> Prof. Raffaele A. Calogero
> Bioinformatics and Genomics Unit
> Dipartimento di Scienze Cliniche e Biologiche
> c/o Az. Ospedaliera S. Luigi
> Regione Gonzole 10, Orbassano
> 10043 Torino
> tel.   ++39 0116705417
> Lab.   ++39 0116705408
> Fax    ++39 0119038639
> Mobile ++39 3333827080
> email: raffaele.calogero at unito.it
>        raffaele[dot]calogero[at]gmail[dot]com
> www:   www.bioinformatica.unito.it
> 


-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

