[BioC] affxparser updateCelUnit question

Wed Jul 13 20:44:04 CEST 2011

Hi.

On Wed, Jul 13, 2011 at 10:49 AM, M Behnke <mkbehnke at gmail.com> wrote:
> Hi,
> Thanks for the reply!  I didn't think I could use updateCel since I only
> have the 247,899 PM values.

You are correct that the HG-U133A_2.cdf file specifies 247,899 PM probes;

> pathname <- "annotationData/chipTypes/HG-U133A_2/HG-U133A_2.cdf";
> cellsList <- readCdfCellIndices(pathname, stratifyBy="pm");
> cells <- unlist(cellsList, use.names=FALSE);
> str(cells);
 int [1:247899] 315119 315120 315121 315122 315123 ...

in addition to MM probes and some other control probes.

FYI, in affxparser, we prefer to refer to probes as "cells", hence the
above names.

> I did try it though and something strange is
> happening:
>
> With this code:
> files=list.files(pattern="[.](CEL)$")
> names=colnames(normed.1)
>
> pathname <- file.path(getwd(), basename(files[1]))
> intens=(normed.1[,1])
> idxs=rownames(normed.1)

Here 'idxs' is a *character* vector (because you use rownames()), but
updateCel() requires a *numeric* vector.

> updateCel(pathname,indices=idxs,intensities=intens)
>
> I get the error
> Error in updateCel(pathname, indices = idxs, intensities = intens) :
>  Argument 'indices' is out of range [1,535824]: [1000,9999]

So, however odd the error message seems*, it picks up the fact that
you pass a character vector and not a numeric vector.  (*) We don't
validate type of the 'indices' argument in affxparser, but instead
trust the user to pass the correct data type (affxparser is designed
to be a low-level API for developers, not really end users).  We might
add it to the todo list.

> Using debug, the error is occuring here:
> debug: if (r[1] < 1 || r[2] > nbrOfCells) {
>    stop("Argument 'indices' is out of range [1,", nbrOfCells,
>        "]: ", "[", r[1], ",", r[2], "]")
> }
>
> When I set up r as in the code with r=range(idxs), I do get
>> r
> [1] "1000" "9999"
> And
>> min(idxs)
> [1] "1000"
>> max(idxs)
> [1] "9999"
> BUT, random sampling of idxs shows values outside that range -
>> idxs[45]
> [1] "57961"
>>> idxs[9000]
> [1] "46169"

Yes, this is all because you ask for min and max on *character*
vectors, so it is like comparing letters in an alphabet - they are not
dealt with as numeric values.  If those row-name strings really
contains cell (=probe) indices, you can *coerce* them to integers as:

idxs <- as.integer(idxs);

and it should work.

I recommend the following updates:

pathnames <- list.files(pattern="[.](CEL)$", full.names=TRUE)
sampleNames <- colnames(normed.1)
cells <- rownames(normed.1);
cells <- as.integer(cells);

pathname <- file.path(getwd(), basename(pathname[1]))

# or if you wish to update pathnames[1], just use:
pathname <- pathnames[1]

intensities <- normed.1[,1]
updateCel(pathname, indices=cells, intensities=intensities)

Things to be careful about:

1. Make sure you do not overwrite your raw/original CEL files, but
always create new ones (e.g. by copy the old ones to a new place).
2. Make sure you write to the correct cell indices (probably the same
as you used to read the data in the first place).

Hope this helps
/Henrik

> I may well be making a basic programming error, as I am apparently a much
> better molecular biologist than I am a programmer!  Any help you can give
> would be much appreciated!
>
> I will also check out aroma.affymetrix as you suggest.
>
> thanks again,
> Mikki
> On Wed, Jul 13, 2011 at 12:31 PM, Henrik Bengtsson <hb at biostat.ucsf.edu>wrote:
>
>> Hi,
>>
>> if you are trying to write a vector of probe intensities to a CEL
>> file, and you have the corresponding probe indices, then use
>> updateCel(), e.g.
>>
>>  updateCel(pathname, indices=idxs, intensities=y)
>>
>> This does not require knowing the CDF structure.  (FYI, any function
>> with "unit" in its name, such as updateCelUnits(), utilizes a CDF
>> structure for its reading/writing.)
>>
>> FYI, if the aroma.affymetrix package [http://www.aroma-project.org/]
>> provides a higher-level API on top of affxparser for dealing with
>> single or sets of CEL files.  It also provides a standard file
>> structure for hold data sets, so you/each person don't have to invent
>> their own each time.  ...and more.  You might find it useful.  If
>> you're developing custom preprocessing methods, they can be made so
>> they plug in transparently.
>>
>> /Henrik
>>
>> On Tue, Jul 12, 2011 at 6:59 AM, M Behnke <mkbehnke at gmail.com> wrote:
>> > Hello all,
>> >
>> > I have normalized data from hgu133a2 microarray experiments using a
>> custom
>> > algorithm, in the format of a matrix with rowname=probe position and
>> columns
>> > the normalized intensity values for each sample (125 samples). I am
>> trying
>> > to write the data back out to individual CEL files using affxparser.  I
>> > successfully created the 'empty' CEL files using createCel.  Because I
>> only
>> > have the PM probes, I believe I need to use the updateCelUnits function.
>> > Following the vignette, I set up my code for the first sample as such
>> (CDF
>> > file is in the working directory):
>> >
>> > pathname <- file.path(getwd(), basename(files[1]))
>> > intens=as.list(normed.1[,1])
>> > updateCelUnits(pathname, cdf=NULL, intens)
>> > and checked my variables-
>> >> pathname
>> > [1] "L:/Normed data/CIR_5-D-309.CEL"
>> >
>> >> intens[1:4]
>> > $`129981`
>> > [1] 7.566183
>> > $`212118`
>> > [1] 8.783549
>> > $`393384`
>> > [1] 8.812677
>> > $`84268`
>> > [1] 10.39683
>> > However, when I run the function I get the following error:
>> >> updateCelUnits(pathname, cdf=NULL, intens)
>> > Error in dirname(filename) : a character vector argument expected
>> >> is.character(pathname)
>> > [1] TRUE
>> >> is.vector(pathname)
>> > [1] TRUE
>> > Since I do appear to be giving it a character vector, I'm stumped.  Can
>> > someone help?
>> > My session info appears below.
>> >
>> > Thank you very much!
>> > Mikki Behnke
>> > Doctoral Student
>> > Virginia Commonwealth University
>> >
>> > R version 2.13.0 (2011-04-13)
>> > Platform: x86_64-pc-mingw32/x64 (64-bit)
>> > locale:
>> > [1] LC_COLLATE=English_United States.1252
>> > [2] LC_CTYPE=English_United States.1252
>> > [3] LC_MONETARY=English_United States.1252
>> > [4] LC_NUMERIC=C
>> > [5] LC_TIME=English_United States.1252
>> > attached base packages:
>> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> > other attached packages:
>> >  [1] affxparser_1.24.0          hgu133a2hsrefseq.db_14.1.0
>> >  [3] org.Hs.eg.db_2.5.0         RSQLite_0.9-4
>> >  [5] DBI_0.2-5                  AnnotationDbi_1.14.1
>> >  [7] hgu133a2cdf_2.8.0          simpleaffy_2.28.0
>> >  [9] gcrma_2.24.1               genefilter_1.34.0
>> > [11] affy_1.30.0                Biobase_2.12.1
>> > loaded via a namespace (and not attached):
>> > [1] affyio_1.20.0         annotate_1.30.0       Biostrings_2.20.0
>> > [4] IRanges_1.10.0        preprocessCore_1.14.0 splines_2.13.0
>> > [7] survival_2.36-9       tools_2.13.0          xtable_1.5-6
>> >>
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>>
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>