[BioC] Mouse Gene ST v1 CDF Issues (MoGene10stv1): Failure of affyPLM and pdInfoBuilder

Mon Dec 1 23:26:41 CET 2008

I am having some issues with the Affymetrix Mouse Gene ST 1.0 array
(MoGene10stv1) and Bioconductor. I can see that there are known issues with
this array and the unsupported CDF that can be downloaded from Affymetrix, but
I was able to create the mogene10stv1cdf library as outlined in this thread:

https://stat.ethz.ch/pipermail/bioc-devel/2007-October/001403.html

I have processed the data using both Bioconductor's affy package and the
aroma.affymetrix package, but I get different results. I believe the difference
arises because aroma.affymetrix is using the affyPLM model. I wanted to check
this using the Bioconductor affyPLM package, but it will not run:

Method 1 - works fine:

library(affy)
AffyRaw <- ReadAffy()
AffyEset <- rma(AffyRaw)
data.affy <- exprs(AffyEset)
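
To put a number on how far two sets of estimates differ, something like the following could be used. This is only a sketch: compare_expr is a hypothetical helper (it is not part of affy, affyPLM or aroma.affymetrix), and the toy matrices stand in for exprs(AffyEset) and the estimates from the other pipeline.

```r
# Compare two expression matrices (probe sets in rows, arrays in columns).
# compare_expr is a hypothetical helper, not part of any Bioconductor package.
compare_expr <- function(a, b) {
  # Restrict to probe sets and arrays present in both matrices
  rows <- intersect(rownames(a), rownames(b))
  cols <- intersect(colnames(a), colnames(b))
  a <- a[rows, cols, drop = FALSE]
  b <- b[rows, cols, drop = FALSE]
  list(
    n_common     = length(rows),
    max_abs_diff = max(abs(a - b)),
    # Per-array Pearson correlation between the two pipelines
    correlation  = sapply(cols, function(j) cor(a[, j], b[, j]))
  )
}

# Toy data standing in for the real expression estimates
m1 <- matrix(c(7.1, 8.2, 9.3, 6.0, 7.5, 8.8), nrow = 3,
             dimnames = list(c("p1", "p2", "p3"), c("a1", "a2")))
m2 <- m1 + 0.1
res <- compare_expr(m1, m2)
print(res)
```

With real data, the two arguments would be the matrices produced by each pipeline, with matching row and column names.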

Method 2 - fails:

library(affyPLM)
AffyRaw <- ReadAffy()
fit <- fitPLM(AffyRaw, verbos=9)

 Background correcting PM
 Normalizing PM
 Fitting models
 Error in fitPLM(AffyRaw, verbos = 9) :
   Realloc could not re-allocate (size 1150530304) memory

I also tried the following, but it still would not run:

fit <- fitPLM(AffyRaw, output.param=list(weights=FALSE, residuals=FALSE,
varcov="none", resid.SE=FALSE))

Finally, I dropped the number of arrays from 16 to 6, then down to 2, but still
no luck.
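
For scale, the size in the Realloc error message can be converted to more readable units. A minimal sketch using only base R arithmetic; the variable names are mine:

```r
# Size reported by Realloc in the fitPLM error message, in bytes
requested_bytes <- 1150530304

# Convert to binary units (MiB and GiB)
requested_mib <- requested_bytes / 2^20
requested_gib <- requested_bytes / 2^30

cat(sprintf("Requested: %.1f MiB (%.2f GiB)\n", requested_mib, requested_gib))
```

So the failing call is asking for a single block of roughly 1.07 GiB.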

From piecing together different threads, I wondered if the issue lay with the
unsupported CDF. So I attempted to use the pdInfoBuilder / oligo pipeline as
outlined in this thread:

http://article.gmane.org/gmane.science.biology.informatics.conductor/18963/match=mogene

Again, I ran into problems:

> pgfFile <- "MoGene-1_0-st-v1.r3.pgf"
> clfFile <- "MoGene-1_0-st-v1.r3.clf"
> transFile <- "MoGene-1_0-st-v1.na26.mm9.transcript.txt"
> probeFile <- "MoGene-1_0-st-v1.probe.tab"
> pkg <- new("AffyGenePDInfoPkgSeed", author="Peter White",
+     email="peter.white at nationwidechildrens.org", version="0.1.3",
+     genomebuild="UCSC mm9,  July 2007", biocViews="AnnotationData",
+     pgfFile=pgfFile, clfFile=clfFile, transFile=transFile,
+     probeFile=probeFile)
> makePdInfoPackage(pkg, destDir=".")
Creating package in ./pd.mogene.1.0.st.v1
loadUnitsByBatch took 54.44 sec
loadAffyCsv took 53.58 sec
loadAffySeqCsv took 80.68 sec
DB sort, index creation took 90.24 sec
[1] TRUE
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I then closed R, opened a command prompt and navigated to the directory
containing the package:

R CMD INSTALL pd.mogene.1.0.st.v1\

installing to 'c:/PROGRA~2/R/R-28~1.0/library'

---------- Making package pd.mogene.1.0.st.v1 ------------
  adding build stamp to DESCRIPTION
  installing NAMESPACE file and metadata
  installing R files
  installing inst files
FIND: Parameter format not correct
make[2]: *** [c:/PROGRA~2/R/R-28~1.0/library/pd.mogene.1.0.st.v1/inst] Error 2
make[1]: *** [all] Error 2
make: *** [pkg-pd.mogene.1.0.st.v1] Error 2
*** Installation of pd.mogene.1.0.st.v1 failed ***

Removing 'c:/PROGRA~2/R/R-28~1.0/library/pd.mogene.1.0.st.v1'

So the installation fails and I cannot work out why (I have Rtools and Cygwin
installed). I did notice some inconsistencies in the annotation files for these
arrays that can be downloaded from the Affymetrix site, and wondered if these
could be the source of the problem:

1.	From the file MoGene-1_0-st-v1.probe.tab there are 35,605 distinct
Transcript IDs.
2.	From the file MoGene-1_0-st-v1.na26.mm9.transcript.csv there are 35,567
transcript IDs, so 38 transcript IDs are missing from this file. What are they
and why were they not included? (10412488, 10412495, 10412500, 10412503,
10412520, 10417226, 10417239, 10417269, 10417286, 10441511, 10468907, 10490232,
10501544, 10535342, 10536010, 10536044, 10536095, 10536114, 10536118, 10536163,
10550163, 10550775, 10560746, 10577361, 10598118, 10598141, 10598159, 10598207,
10598220, 10598603, 10599086, 10606573, 10608226, 10608440, 10608551, 10608554,
10608603, 10608606)
3.	From the file MoGene-1_0-st-v1.r3.cdf there are 35,512 Transcript IDs.
So we are now missing an additional 93 probe sets (all of these can be found in
the transcript file: 10338002, 10338005, 10338006, 10338007, 10338008,
10338009, 10338010, 10338011, 10338012, 10338013, 10338014, 10338015, 10338016,
10338018, 10338019, 10338020, 10338021, 10338022, 10338023, 10338024, 10338027,
10338028, 10338030, 10338031, 10338032, 10338033, 10338034, 10338038, 10338039,
10338040, 10338043, 10338045, 10338046, 10338048, 10338049, 10338050, 10338051,
10338052, 10338053, 10338054, 10338055, 10338057, 10338058, 10338061, 10338062,
10349381, 10350469, 10354866, 10361826, 10362430, 10362438, 10362444, 10362452,
10362872, 10369759, 10374030, 10391748, 10395778, 10411504, 10422960, 10436496,
10436660, 10446349, 10453719, 10457089, 10458079, 10460144, 10461932, 10481652,
10482786, 10487009, 10498317, 10501216, 10502040, 10502768, 10503414, 10513713,
10521665, 10532622, 10535929, 10546555, 10552810, 10553535, 10560364, 10582560,
10582566, 10582570, 10582576, 10585872, 10586931, 10592453, 10601614,
10602194). Again, why were they not included?
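
The comparisons above are plain set differences, which can be sketched in R as follows. The ID vectors here are toy values of my own; with the real files, each vector would hold the transcript IDs read from the corresponding annotation file.

```r
# Toy transcript IDs standing in for the IDs read from each annotation file
# (hypothetical values; the real files hold roughly 35,000 IDs each)
probe_tab_ids  <- c(10338001, 10338002, 10412488, 10598118)
transcript_ids <- c(10338001, 10338002, 10598118)
cdf_ids        <- c(10338001, 10598118)

# IDs present in the first set but absent from the second
missing_from_transcript <- setdiff(probe_tab_ids, transcript_ids)
missing_from_cdf        <- setdiff(transcript_ids, cdf_ids)

print(missing_from_transcript)
print(missing_from_cdf)
```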

BTW: I am using R 2.8.0 and the latest release of Bioconductor (2.3) on a
Windows XP 64-bit machine.
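
In case it helps with diagnosis, the environment details can be collected from within R as sketched below. Checking which find executable comes first on the PATH is only a guess on my part: the "FIND: Parameter format not correct" message reads like output from the Windows find command rather than the Unix-style one that comes with Rtools, but I have not confirmed that this is the cause.

```r
# Report the R version, platform and attached packages
print(sessionInfo())

# Show which 'find' executable is found first on the PATH
# (an empty string means none was found)
find_path <- Sys.which("find")
print(find_path)
```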

Any help out there would be greatly appreciated.

Thanks,

Peter

Peter White, Ph.D.
Director, Biomedical Genomics Core
Research Assistant Professor of Pediatrics
The Research Institute at
Nationwide Children's Hospital and
The Ohio State University

Mailing Address:

The Research Institute at
Nationwide Children's Hospital
700 Children's Drive, W510
Columbus, OH 43205

Office: (614) 355-2671
Lab: (614) 355-5252
Fax: (614) 722-2818
Web: http://genomics.nchresearch.org/