[BioC] Mouse Gene ST v1 CDF Issues (MoGene10stv1): Failure of affyPLM and pdfInfoBuilder

Fri Dec 5 16:51:30 CET 2008

Here is the response I received from Affymetrix (thanks) regarding the
differences in the CDF, probe.tab, and annotation.csv files for the
Mouse Gene 1.0 ST V1 array:

Hello Dr White-

1) We no longer use a cdf file for our software.  The unsupported one
was made for use in third party software so there are no current plans
to create a different version of the cdf file.

2) The cause for the missing 38 TCs in question #2 is that they were not
mappable to the mm9 version of the mouse genome assembly (NCBI build
37). The probe tab file contains all probes that exist on the array as
it was designed, and it was designed on the basis of the mm8 version of
the mouse genome (NCBI build 36).

So the probe.tab file is a design-time view of the probes, while the
NetAffx annotation CSV is an annotation-time view of the transcript
clusters. To provide the most accurate, biologically realistic
annotations, we used the most recent version of the genome. When a given
genomic region gets re-organized in the updated assembly, this can
prevent or substantially change the way the probes of a transcript
cluster map to the new genome version. These transcript clusters are
removed from the NetAffx annotation analysis, where they could cause
faulty RNA assignments. So these 38 TCs can be ignored. 

We have considered making available a mm9-version of the probe tab file
in the future, which would avoid this sort of confusion.

3) There are 93 transcript_cluster_ids on the MoGene 1.0 ST chip that
are listed in the CSV annotation file, and searchable in the MoGene chip
at NetAffx, but that are not present in the [unsupported] CDF file from
NetAffx. 45 of these IDs are present in the MoGene PGF file, and
correspond to the antigenomic probesets, but the remaining 48 are not in
the PGF file either. The remaining 48 transcript cluster IDs the
customer identified as not in the PGF file are from what we call
low-coverage transcript clusters: those having fewer than 4 probes. These
tend to be very short, non-biologically interesting sequences and were
excluded from the PGF with the intent that they should not be analyzed
by users. So the advice is that the user can safely ignore them. In
the NA27 release of the annotations (due out end of next week) those
low-coverage transcript clusters should now be removed from the NetAffx
annotation CSV file for all of the Gene arrays.
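
For anyone who wants to check their own files against the counts
discussed above, here is a minimal R sketch of the comparison, using
nothing beyond base R. The three vectors are toy stand-ins, not real
file contents: a few IDs are borrowed from the lists quoted below, and
the 99000001/99000002 values are made-up placeholders. With real data
the vectors would be read from the ID column of each file, and the
column names there are something you would need to check yourself.

```r
# Toy stand-ins for the transcript cluster ID columns of the three sources.
# 10412488 and 10412495 are two of the 38 TCs not mappable to mm9;
# 10338002 is one of the 93 IDs reported missing from the CDF;
# 99000001 and 99000002 are made-up placeholder IDs.
probe_tab_ids  <- c(99000001, 10412488, 10412495, 99000002)
annotation_ids <- c(10338002, 99000001, 99000002)
cdf_ids        <- c(99000001, 99000002)

# In the design-time probe.tab but not in the mm9 annotation CSV
unmapped <- setdiff(probe_tab_ids, annotation_ids)

# In the annotation CSV but not in the unsupported CDF
not_in_cdf <- setdiff(annotation_ids, cdf_ids)

# IDs present in all three sources, i.e. what is left once the
# transcript clusters that can be ignored are dropped
usable <- Reduce(intersect, list(probe_tab_ids, annotation_ids, cdf_ids))

print(unmapped)
print(not_in_cdf)
print(usable)
```

Restricting downstream analysis to the `usable` set is one way of
acting on the advice that the remaining transcript clusters can be
ignored.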

> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at med.umich.edu]
> Sent: Tuesday, December 02, 2008 9:32 AM
> To: White, Peter
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Mouse Gene ST v1 CDF Issues (MoGene10stv1): Failure
> of affyPLM and pdfInfoBuilder
> 
> Hi Peter,
> 
> I won't comment on aroma.affymetrix, nor building cdf packages using
> makecdfenv as the former has its own mailing list and the latter isn't
> really supported - the list archive you quote is Ben Bolstad showing
> that you _could_ use makecdfenv, but then raising several questions
> that have not been resolved to my knowledge.
> 
> As for building a pdInfoPackage, this works fine for me:
> 
>  > makePdInfoPackage(pkg, destDir=".")
> Creating package in ./pd.mogene.1.0.st.v1
> loadUnitsByBatch took 46.92 sec
> loadAffyCsv took 19.19 sec
> loadAffySeqCsv took 51.92 sec
> DB sort, index creation took 20.82 sec
> [1] TRUE
> Warning messages:
> 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
>  > sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices datasets  utils
> [8] methods   base
> 
> other attached packages:
> [1] pdInfoBuilder_1.6.0  oligo_1.6.0          oligoClasses_1.4.0
> [4] AnnotationDbi_1.4.0  preprocessCore_1.4.0 affxparser_1.14.0
> [7] RSQLite_0.7-1        DBI_0.2-4            Biobase_2.2.0
> 
> Note that it would have been helpful for you to give us your
> sessionInfo() as well.
> 
> The install went fine:
> 
> ---------- Making package pd.mogene.1.0.st.v1 ------------
>    adding build stamp to DESCRIPTION
>    installing NAMESPACE file and metadata
>    installing R files
>    installing inst files
>    preparing package pd.mogene.1.0.st.v1 for lazy loading
> Loading required package: RSQLite
> Loading required package: DBI
> Loading required package: oligoClasses
> Loading required package: Biobase
> Loading required package: tools
> 
> Welcome to Bioconductor
> 
>    Vignettes contain introductory material. To view, type
>    'openVignette()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation(pkgname)'.
> 
>    no man files in this package
>    installing indices
>    installing help
>    adding MD5 sums
> 
> * DONE (pd.mogene.1.0.st.v1)
> 
> I would bet that your problem stems from having Cygwin installed as
> well as the Windows Toolset (Rtools). If you don't have your path set
> correctly, then you may find the wrong version of certain tools and
> things won't build correctly.
> 
> I have personally found that Cygwin is problematic when installed, and
> can make matters worse if you then uninstall because for whatever
> reason you then cannot find certain tools. Does the install directory
> of the Windows Toolset reside higher up in the PATH than Cygwin?
> 
> Best,
> 
> Jim
> 
> 
> 
> Peter White wrote:
> > I am having some issues with the Affymetrix Mouse Gene ST 1.0 array
> > (MoGene10stv1) and Bioconductor. I can see that there are issues
> > regarding this array and the unsupported CDF that can be downloaded
> > from Affy, but I was able to create the mogene10stv1cdf library as
> > outlined in the thread:
> >
> > https://stat.ethz.ch/pipermail/bioc-devel/2007-October/001403.html
> >
> > I have processed the data using both Bioconductor's affy package and
> > the aroma.affymetrix package but get different results. I believe the
> > issue is that aroma is using the affyPLM model. I wanted to check this
> > using the Bioconductor affyPLM package but it will not work:
> >
> > Method 1 - works fine:
> >
> > library(affy)
> > AffyRaw <- ReadAffy()
> > AffyEset <- rma(AffyRaw)
> > data.affy <- exprs(AffyEset)
> >
> > Method 2 - fails:
> >
> > library(affyPLM)
> > AffyRaw <- ReadAffy()
> > fit <- fitPLM(AffyRaw, verbos=9)
> >
> >  Background correcting PM
> >  Normalizing PM
> >  Fitting models
> >  Error in fitPLM(AffyRaw, verbos = 9) :
> >    Realloc could not re-allocate (size 1150530304) memory
> >
> > I also tried the following but it still could not run:
> >
> > fit <- fitPLM(AffyRaw, output.param=list(weights=FALSE,
> >   residuals=FALSE, varcov="none", resid.SE=FALSE))
> >
> > Finally, I dropped the number of arrays from 16 to 6, then down to 2,
> > but still no luck.
> >
> > So from piecing together different threads I wondered if the issue
> > lay with the unsupported CDF. So I attempted to use the pdInfoBuilder /
> > oligo pipeline as outlined in this thread:
> >
> > http://article.gmane.org/gmane.science.biology.informatics.conductor/18963/match=mogene
> >
> > Again, I ran into problems:
> >
> >> pgfFile <- "MoGene-1_0-st-v1.r3.pgf"
> >> clfFile <- "MoGene-1_0-st-v1.r3.clf"
> >> transFile <- "MoGene-1_0-st-v1.na26.mm9.transcript.txt"
> >> probeFile <- "MoGene-1_0-st-v1.probe.tab"
> >> pkg <- new("AffyGenePDInfoPkgSeed", author="Peter White",
> >     email="peter.white at nationwidechildrens.org", version="0.1.3",
> >     genomebuild="UCSC mm9, July 2007", biocViews="AnnotationData",
> >     pgfFile=pgfFile, clfFile=clfFile, transFile=transFile,
> >     probeFile=probeFile)
> >> makePdInfoPackage(pkg, destDir=".")
> > Creating package in ./pd.mogene.1.0.st.v1
> > loadUnitsByBatch took 54.44 sec
> > loadAffyCsv took 53.58 sec
> > loadAffySeqCsv took 80.68 sec
> > DB sort, index creation took 90.24 sec
> > [1] TRUE
> > Warning messages:
> > 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> > 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> >
> > Close R and start the command prompt and navigate to the directory
> > with the package:
> >
> > R CMD INSTALL pd.mogene.1.0.st.v1\
> >
> > installing to 'c:/PROGRA~2/R/R-28~1.0/library'
> >
> > ---------- Making package pd.mogene.1.0.st.v1 ------------
> >   adding build stamp to DESCRIPTION
> >   installing NAMESPACE file and metadata
> >   installing R files
> >   installing inst files
> > FIND: Parameter format not correct
> > make[2]: *** [c:/PROGRA~2/R/R-28~1.0/library/pd.mogene.1.0.st.v1/inst] Error 2
> > make[1]: *** [all] Error 2
> > make: *** [pkg-pd.mogene.1.0.st.v1] Error 2
> > *** Installation of pd.mogene.1.0.st.v1 failed ***
> >
> > Removing 'c:/PROGRA~2/R/R-28~1.0/library/pd.mogene.1.0.st.v1'
> >
> > So the installation fails and I cannot work out why (I have RTools
> > and Cygwin installed). I did notice some inconsistencies in the
> > annotation files for these arrays that can be downloaded from the Affy
> > site and wondered if these could be the source of the problem:
> >
> > 1. From the file MoGene-1_0-st-v1.probe.tab there are 35,605 distinct
> >    transcript IDs.
> > 2. From the file MoGene-1_0-st-v1.na26.mm9.transcript.csv there are
> >    35,567 transcript IDs. 38 transcript IDs are missing from this file.
> >    What are they and why were they not included?
> >    (10412488, 10412495, 10412500, 10412503, 10412520, 10417226,
> >    10417239, 10417269, 10417286, 10441511, 10468907, 10490232,
> >    10501544, 10535342, 10536010, 10536044, 10536095, 10536114,
> >    10536118, 10536163, 10550163, 10550775, 10560746, 10577361,
> >    10598118, 10598141, 10598159, 10598207, 10598220, 10598603,
> >    10599086, 10606573, 10608226, 10608440, 10608551, 10608554,
> >    10608603, 10608606)
> > 3. From the file MoGene-1_0-st-v1.r3.cdf there are 35,512 transcript
> >    IDs. So we are now missing an additional 93 probe sets, all of which
> >    can be found in the transcript file:
> >    (10338002, 10338005, 10338006, 10338007, 10338008, 10338009,
> >    10338010, 10338011, 10338012, 10338013, 10338014, 10338015,
> >    10338016, 10338018, 10338019, 10338020, 10338021, 10338022,
> >    10338023, 10338024, 10338027, 10338028, 10338030, 10338031,
> >    10338032, 10338033, 10338034, 10338038, 10338039, 10338040,
> >    10338043, 10338045, 10338046, 10338048, 10338049, 10338050,
> >    10338051, 10338052, 10338053, 10338054, 10338055, 10338057,
> >    10338058, 10338061, 10338062, 10349381, 10350469, 10354866,
> >    10361826, 10362430, 10362438, 10362444, 10362452, 10362872,
> >    10369759, 10374030, 10391748, 10395778, 10411504, 10422960,
> >    10436496, 10436660, 10446349, 10453719, 10457089, 10458079,
> >    10460144, 10461932, 10481652, 10482786, 10487009, 10498317,
> >    10501216, 10502040, 10502768, 10503414, 10513713, 10521665,
> >    10532622, 10535929, 10546555, 10552810, 10553535, 10560364,
> >    10582560, 10582566, 10582570, 10582576, 10585872, 10586931,
> >    10592453, 10601614, 10602194)
> >    Again, why were they not included?
> >
> > BTW: I am using R 2.8.0 and the latest release of Bioconductor (2.3)
> > on a Windows XP 64-bit machine.
> >
> > Any help out there would be greatly appreciated.
> >
> > Thanks,
> >
> > Peter
> >
> > Peter White, Ph.D.
> > Director, Biomedical Genomics Core
> > Research Assistant Professor of Pediatrics
> > The Research Institute at
> > Nationwide Children's Hospital and
> > The Ohio State University
> >
> > Mailing Address:
> >
> > The Research Institute at
> > Nationwide Children's Hospital
> > 700 Children's Drive, W510
> > Columbus, OH 43205
> >
> > Office: (614) 355-2671
> > Lab: (614) 355-5252
> > Fax: (614) 722-2818
> > Web: http://genomics.nchresearch.org/
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> --
> James W. MacDonald, M.S.
> Biostatistician
> Hildebrandt Lab
> 8220D MSRB III
> 1150 W. Medical Center Drive
> Ann Arbor MI 48109-0646
> 734-936-8662