[BioC] Zeros in ReadAffy was Re: farms package fails on estrogen data

Ben Bolstad bmb at bmbolstad.com
Wed Mar 17 01:51:33 CET 2010


Yes I am aware that allocMatrix does not initialize.

However, it is expected that a data point will be read from the CEL file
for every location in this matrix.

In the case of the GCOS/XDA and Calvin/CommandConsole format CEL file
the order in which things are in the file tells you where to place the
data in the matrix. This should be such that for a non-corrupt file
every entry in the matrix would get assigned an intensity value.

In the case of text format CEL file there are actually x and y
coordinates given to each probe intensity. No specific ordering is
specified in the file documentation, so these x and y coordinates are
used to determine where the relevant intensity value should be placed in
the matrix. not much checking is being done on the validity of these
currently. Note of course that text format CEL files are starting to
become more rare. 

Now it turns out that in estrogen there are 9 CEL files. All 9 are text
format.

> library(affy)
>  setwd(system.file("extdata", package="estrogen"))
> list.celfiles()
[1] "bad.cel"      "high10-1.cel" "high10-2.cel" "high48-1.cel"
"high48-2.cel"
[6] "low10-1.cel"  "low10-2.cel"  "low48-1.cel"  "low48-2.cel" 
> ab = ReadAffy()
> range(intensity(ab))
[1]     0 46131
> apply(intensity(ab),2,range)
     bad.cel high10-1.cel high10-2.cel high48-1.cel high48-2.cel low10-1.cel
[1,]       0         32.5         49.5        420.5          425           0
[2,]   46131      18308.0      45794.0      46111.0        46108       16457
     low10-2.cel low48-1.cel low48-2.cel
[1,]           0       360.5         395
[2,]       20763     46102.0       46115

> which(intensity(ab)==0)
[1]  140224 2143523 2202236 2337976 2841090
> 140224%%640
[1] 64
> 140224%/%640
[1] 219

because things are zero based want to look for entry at 63 219 in
bad.cel

Low and behold:

 61     219     129.0   25.0     25
 62     219     84.0    14.1     25
 6219   133.3   32.4     20
 64     219     4241.0  634.0    25
 65     219     3215.0  565.7    25

so corrupt x/y location. What about in "low10-1.cel"


> which(intensity(ab)[,6] == 0)%%640
 95523 154236 289976 
   163    636     56 
> which(intensity(ab)[,6] == 0)%/%640
 95523 154236 289976 
   149    240    453 

And again

161     149     58.8    7.4      20
162     049     100.0   11.4     25
163     149     156.5   31.9     20

and again

634     240     63.3    7.5      20
63240   57.3    8.8      20
636     240     70.0    10.3     25

and here too

 54     453     132.0   21.5     25
 55     053     94.0    11.5     25
 56     453     145.8   19.5     16

In theory at least some these cases should have been caught by the
parser, since there is meant to be a check that ensures that no of those
coordinates are larger the the chip dimensions (which 6219 and 63240
most certainly are). The parser should most certainly be throwing its
hands up in error at this point.

In any case I would think it bad form to be distributing corrupt CEL
files as part of a package.

Going into debugger land and looking at that first cel

gdb) p buffer
$6 = " 63\b219\t133.3\t32.4\t 20\n\000 ..........

and it turns out that it takes x location as 63, y location as 133 and
the intensity value as 32.4 (so in other words it is putting an
incorrect intensity in the wrong place). Now what the heck a "\b" (aka
^H terminal code) is doing here I can't fathom, but you live with what
you get.

Now I could enforce a check that there are five distinct entries (as
separated by tabs) on each row (which is something I avoided in the past
for speed since typically npixels and stddev are ignored) and that would
catch two of those cases. There is not much that can be done about the
catching the other two unless one were to enforce that x/y locations are
sequential. The file documentation used to implement the parser does not
specify that this has to be the case for text format files.

 





On Tue, 2010-03-16 at 14:55 -0700, Seth Falcon wrote:
> On 3/16/10 2:37 PM, bmb at bmbolstad.com wrote:
> > ReadAffy uses allocMatrix() to allocate the memory into which it intends
> > to place probe intensities read from the file.
> 
> Not sure there is an issue here or not based on your comments, but just 
> so we are all on the same page, allocMatrix allocates but does not 
> initialize memory for a matrix.
> 
> + seth
> 
> 
>   Based on information in the
> > CEL file header it determines grid dimensions to determine how many probe
> > intensities are expected. Exact parsing depends on the CEL file format of
> > which there are a few, but generally speaking it is expected that there is
> > an intensity to be read for every location in the grid. Zeros can arise in
> > truncated and corrupted CEL files. But many of those types of situations
> > should be detected and an error thrown. There may be other mechanisms for
> > generation of such probe intensities which make them valid.
> >
> >
> >
> >> On 03/16/2010 08:25 AM, Javier Pérez Florido wrote:
> >>> Dear list,
> >>> I've tried to use l.farms (farms package)
> >>> http://www.bioinf.jku.at/software/farms/farms.html
> >>> for preprocessing the estrogen data set available at
> >>> http://www.bioconductor.org/packages/release/data/experiment/html/estrogen.html
> >>>
> >>>
> >>> When trying to preprocess using l.farms, the following error comes up:
> >>>
> >>> background correction: none
> >>> normalization: loess
> >>> PM/MM correction : pmonly
> >>> expression values: farms
> >>> background correcting...done.
> >>> normalizing...Error: NA/NaN/Inf in call to external function (arg 1)
> >>>
> >>> q.farms works for the same data set and l.farms works for the Dilution
> >>> data set, so, I don't know what is going on here.
> >>>
> >>> Any suggestions?
> >>
> >> Hi Javier --
> >>
> >> If I
> >>
> >>    library(farms)
> >>    library(estrogen)
> >>    setwd(system.file("extdata", package="estrogen"))
> >>    ab = ReadAffy()
> >>    l.farms(ab)
> >>
> >> I see the error you report. It is because
> >>
> >>> sum(intensity(ab) == 0)
> >> [1] 5
> >>
> >> so that log transformation generates -Inf and then the error. It doesn't
> >> seem too bad to
> >>
> >>    intensity(ab) = 1 + intensity(ab)
> >>
> >> [I don't know enough about the inner workings of affyio, but I'm
> >> surprised that I sometimes see
> >>
> >>> range(intensity(ReadAffy()))
> >> [1]     0 46131
> >>
> >>> range(intensity(ReadAffy))
> >> [1] 1.563578e-316  4.613100e+04
> >>
> >> suggesting either initialized memory or unanticipated 'empty' intensity
> >> grid coordinates)]
> >>
> >> Martin
> >>
> >>
> >>> Javier
> >>>
> >>> sessionInfo()
> >>> R version 2.10.1 (2009-12-14)
> >>> i386-pc-mingw32
> >>>
> >>> locale:
> >>> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
> >>> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
> >>> [5] LC_TIME=Spanish_Spain.1252
> >>>
> >>> attached base packages:
> >>> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>>
> >>> other attached packages:
> >>> [1] affyPLM_1.22.0       preprocessCore_1.8.0 gcrma_2.18.1
> >>> [4] affydata_1.11.10     farms_1.4.0          MASS_7.3-4
> >>> [7] hgu95av2cdf_2.5.0    affy_1.24.2          Biobase_2.6.1
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] affyio_1.14.0      Biostrings_2.14.12 IRanges_1.4.11
> >>> splines_2.10.1
> >>> [5] tools_2.10.1
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at stat.math.ethz.ch
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> --
> >> Martin Morgan
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Seth Falcon
> Bioconductor Core Team | FHCRC
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list