[BioC] Zeros in ReadAffy was Re: farms package fails on estrogen data

Thu Mar 18 01:13:56 CET 2010

Hi Ben --

Yes the estrogen files seem to have some minor indiscretions; these seem
to be quite 'ancient' (e.g., in the 1.8 experiment data repository) and
perhaps a clean version is not recoverable (this is still being
investigated).

I guess a relatively quick way to 'validate' the files on input is to
set intensity values to NA_REAL after allocMatrix and then confirm
!any(is.na()) on exit.

Martin

On 03/16/2010 05:51 PM, Ben Bolstad wrote:
> Yes I am aware that allocMatrix does not initialize.
> 
> However, it is expected that a data point will be read from the CEL file
> for every location in this matrix.
> 
> In the case of the GCOS/XDA and Calvin/CommandConsole format CEL file
> the order in which things are in the file tells you where to place the
> data in the matrix. This should be such that for a non-corrupt file
> every entry in the matrix would get assigned an intensity value.
> 
> In the case of text format CEL file there are actually x and y
> coordinates given to each probe intensity. No specific ordering is
> specified in the file documentation, so these x and y coordinates are
> used to determine where the relevant intensity value should be placed in
> the matrix. not much checking is being done on the validity of these
> currently. Note of course that text format CEL files are starting to
> become more rare. 
> 
> Now it turns out that in estrogen there are 9 CEL files. All 9 are text
> format.
> 
>> library(affy)
>>  setwd(system.file("extdata", package="estrogen"))
>> list.celfiles()
> [1] "bad.cel"      "high10-1.cel" "high10-2.cel" "high48-1.cel"
> "high48-2.cel"
> [6] "low10-1.cel"  "low10-2.cel"  "low48-1.cel"  "low48-2.cel" 
>> ab = ReadAffy()
>> range(intensity(ab))
> [1]     0 46131
>> apply(intensity(ab),2,range)
>      bad.cel high10-1.cel high10-2.cel high48-1.cel high48-2.cel low10-1.cel
> [1,]       0         32.5         49.5        420.5          425           0
> [2,]   46131      18308.0      45794.0      46111.0        46108       16457
>      low10-2.cel low48-1.cel low48-2.cel
> [1,]           0       360.5         395
> [2,]       20763     46102.0       46115
> 
>> which(intensity(ab)==0)
> [1]  140224 2143523 2202236 2337976 2841090
>> 140224%%640
> [1] 64
>> 140224%/%640
> [1] 219
> 
> because things are zero based want to look for entry at 63 219 in
> bad.cel
> 
> Low and behold:
> 
>  61     219     129.0   25.0     25
>  62     219     84.0    14.1     25
>  6219   133.3   32.4     20
>  64     219     4241.0  634.0    25
>  65     219     3215.0  565.7    25
> 
> so corrupt x/y location. What about in "low10-1.cel"
> 
> 
>> which(intensity(ab)[,6] == 0)%%640
>  95523 154236 289976 
>    163    636     56 
>> which(intensity(ab)[,6] == 0)%/%640
>  95523 154236 289976 
>    149    240    453 
> 
> And again
> 
> 161     149     58.8    7.4      20
> 162     049     100.0   11.4     25
> 163     149     156.5   31.9     20
> 
> and again
> 
> 634     240     63.3    7.5      20
> 63240   57.3    8.8      20
> 636     240     70.0    10.3     25
> 
> and here too
> 
>  54     453     132.0   21.5     25
>  55     053     94.0    11.5     25
>  56     453     145.8   19.5     16
> 
> In theory at least some these cases should have been caught by the
> parser, since there is meant to be a check that ensures that no of those
> coordinates are larger the the chip dimensions (which 6219 and 63240
> most certainly are). The parser should most certainly be throwing its
> hands up in error at this point.
> 
> In any case I would think it bad form to be distributing corrupt CEL
> files as part of a package.
> 
> Going into debugger land and looking at that first cel
> 
> gdb) p buffer
> $6 = " 63\b219\t133.3\t32.4\t 20\n\000 ..........
> 
> and it turns out that it takes x location as 63, y location as 133 and
> the intensity value as 32.4 (so in other words it is putting an
> incorrect intensity in the wrong place). Now what the heck a "\b" (aka
> ^H terminal code) is doing here I can't fathom, but you live with what
> you get.
> 
> Now I could enforce a check that there are five distinct entries (as
> separated by tabs) on each row (which is something I avoided in the past
> for speed since typically npixels and stddev are ignored) and that would
> catch two of those cases. There is not much that can be done about the
> catching the other two unless one were to enforce that x/y locations are
> sequential. The file documentation used to implement the parser does not
> specify that this has to be the case for text format files.
> 
>  
> 
> 
> 
> 
> 
> On Tue, 2010-03-16 at 14:55 -0700, Seth Falcon wrote:
>> On 3/16/10 2:37 PM, bmb at bmbolstad.com wrote:
>>> ReadAffy uses allocMatrix() to allocate the memory into which it intends
>>> to place probe intensities read from the file.
>>
>> Not sure there is an issue here or not based on your comments, but just 
>> so we are all on the same page, allocMatrix allocates but does not 
>> initialize memory for a matrix.
>>
>> + seth
>>
>>
>>   Based on information in the
>>> CEL file header it determines grid dimensions to determine how many probe
>>> intensities are expected. Exact parsing depends on the CEL file format of
>>> which there are a few, but generally speaking it is expected that there is
>>> an intensity to be read for every location in the grid. Zeros can arise in
>>> truncated and corrupted CEL files. But many of those types of situations
>>> should be detected and an error thrown. There may be other mechanisms for
>>> generation of such probe intensities which make them valid.
>>>
>>>
>>>
>>>> On 03/16/2010 08:25 AM, Javier Pérez Florido wrote:
>>>>> Dear list,
>>>>> I've tried to use l.farms (farms package)
>>>>> http://www.bioinf.jku.at/software/farms/farms.html
>>>>> for preprocessing the estrogen data set available at
>>>>> http://www.bioconductor.org/packages/release/data/experiment/html/estrogen.html
>>>>>
>>>>>
>>>>> When trying to preprocess using l.farms, the following error comes up:
>>>>>
>>>>> background correction: none
>>>>> normalization: loess
>>>>> PM/MM correction : pmonly
>>>>> expression values: farms
>>>>> background correcting...done.
>>>>> normalizing...Error: NA/NaN/Inf in call to external function (arg 1)
>>>>>
>>>>> q.farms works for the same data set and l.farms works for the Dilution
>>>>> data set, so, I don't know what is going on here.
>>>>>
>>>>> Any suggestions?
>>>>
>>>> Hi Javier --
>>>>
>>>> If I
>>>>
>>>>    library(farms)
>>>>    library(estrogen)
>>>>    setwd(system.file("extdata", package="estrogen"))
>>>>    ab = ReadAffy()
>>>>    l.farms(ab)
>>>>
>>>> I see the error you report. It is because
>>>>
>>>>> sum(intensity(ab) == 0)
>>>> [1] 5
>>>>
>>>> so that log transformation generates -Inf and then the error. It doesn't
>>>> seem too bad to
>>>>
>>>>    intensity(ab) = 1 + intensity(ab)
>>>>
>>>> [I don't know enough about the inner workings of affyio, but I'm
>>>> surprised that I sometimes see
>>>>
>>>>> range(intensity(ReadAffy()))
>>>> [1]     0 46131
>>>>
>>>>> range(intensity(ReadAffy))
>>>> [1] 1.563578e-316  4.613100e+04
>>>>
>>>> suggesting either initialized memory or unanticipated 'empty' intensity
>>>> grid coordinates)]
>>>>
>>>> Martin
>>>>
>>>>
>>>>> Javier
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.10.1 (2009-12-14)
>>>>> i386-pc-mingw32
>>>>>
>>>>> locale:
>>>>> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
>>>>> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
>>>>> [5] LC_TIME=Spanish_Spain.1252
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] affyPLM_1.22.0       preprocessCore_1.8.0 gcrma_2.18.1
>>>>> [4] affydata_1.11.10     farms_1.4.0          MASS_7.3-4
>>>>> [7] hgu95av2cdf_2.5.0    affy_1.24.2          Biobase_2.6.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] affyio_1.14.0      Biostrings_2.14.12 IRanges_1.4.11
>>>>> splines_2.10.1
>>>>> [5] tools_2.10.1
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>> --
>>>> Martin Morgan
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> -- 
>> Seth Falcon
>> Bioconductor Core Team | FHCRC
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793