[BioC] converting Affy indices to x,y coordinates

Henrik Bengtsson hb at biostat.ucsf.edu
Wed Feb 16 08:57:11 CET 2011


Hi.

On Tue, Feb 15, 2011 at 6:21 PM, Mounts, William <Bill.Mounts at pfizer.com> wrote:
> From the Affymetrix documentation, the following are available for each cell (probe) in the cdf file.
>
> Cell information, repeated for each cell in the block:
>
> Atom number - integer
> X coordinate - unsigned short
> Y coordinate - unsigned short
> Index position (relative to sequence for CustomSeq, Genotyping, Copy Number, Polymorphic Marker, and Multichannel Marker units, for Expression units this value is the atom number) - integer
> Base of probe at substitution position - char
> Base of target at interrogation position - char
> Length of probe sequence - unsigned short (only available in version 2 and 3)
> Physical grouping of probe - unsigned short (only available in version 2 and 3)
>
> Index position is provided and examination of various cdf files shows that index = K*y + x.

Thanks for pointing this out.  You are correct that CDF files (only)
also contain an "index" field.  You are also correct that this
redundant CDF "index" field seems to be zero-based (at least in the
ASCII CDF files I've checked).  I've checked the code, and affxparser
completely ignores this field (because it is redundant) and operates
only via the (x,y) coordinates.  Indeed, none of the methods in
affxparser for reading CDF files allows you to read the "index"
values.
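
As a hedged illustration of why the two readings differ by one column,
here is a minimal Python sketch using the numbers from Todd's original
message (K = 1164 columns, value 1354890).  The variable names are made
up for this example and are not part of affxparser; the sketch only
replays the arithmetic discussed in this thread.

```python
# Hypothetical illustration of the arithmetic; not affxparser code.
K = 1164          # number of columns on the array, as quoted in the thread
value = 1354890   # the number listed for the probe in the CDF file

# Reading the value as the zero-based CDF "index": index = K*y + x
y_cdf, x_cdf = divmod(value, K)

# Reading the same value as a one-based cell index: index = K*y + x + 1
y_cell, x_cell = divmod(value - 1, K)

print((x_cdf, y_cdf))    # (1158, 1163): matches the (x,y) stored in the CDF
print((x_cell, y_cell))  # (1157, 1163): one column to the left

# The one-based cell index of the cell at (1158, 1163) is therefore:
cell_index = K * y_cdf + x_cdf + 1
print(cell_index)        # 1354891
```

In other words, the reported discrepancy disappears once the number
taken from the CDF file is treated as zero-based rather than as a
one-based cell index.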

Since it is tedious to address cells by spatial (x,y) coordinates,
linear indices are used instead. The convention in affxparser is to
use one-based indices, which we call "cell indices" as described in
[1].  All affxparser methods reading CDF files return the one-based
"cell indices" as calculated from the (x,y) coordinates (never the
above internal CDF "index" field).
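
The conversions described in [1] can be sketched as follows.  This is
a language-neutral illustration written in Python, assuming zero-based
(x,y) coordinates and a chip with `ncol` columns; the function names
are hypothetical and do not correspond to affxparser functions.

```python
# Hypothetical helper names; a sketch of the formulas, not affxparser's API.

def xy_to_cell_index(x, y, ncol):
    """One-based cell index from zero-based (x, y) coordinates."""
    return ncol * y + x + 1


def cell_index_to_xy(cell, ncol):
    """Zero-based (x, y) coordinates from a one-based cell index."""
    y, x = divmod(cell - 1, ncol)
    return x, y


def cdf_index_to_cell_index(cdf_index):
    """One-based cell index from the zero-based CDF "index" field."""
    return cdf_index + 1


# Round-trip check on a small 4-column, 3-row grid.
ncol, nrow = 4, 3
for y in range(nrow):
    for x in range(ncol):
        cell = xy_to_cell_index(x, y, ncol)
        assert 1 <= cell <= ncol * nrow
        assert cell_index_to_xy(cell, ncol) == (x, y)
```

The first cell, (x,y) = (0,0), gets cell index 1 and the last cell gets
index ncol * nrow, which is what makes these indices directly usable
for subsetting vectors in R.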

FYI, this made me go back to old email communication I had with other
affxparser authors back in 2006.  I forgot, but we then actually
discussed the above and eventually decided that the convention should
be one-based.  Early versions of affxparser did indeed use zero-based
indices (still calculated from (x,y) though).  Using zero-based
indices would be much(!) more error prone in R.  From affxparser's
NEWS file:

Version: 1.3.2 [2006-03-28]
o All cell and unit indices are now starting from one and not
  from zero.  This change requires that all code that have
  been using a previous version of this package have to be
  updated!

> Below, in point 5, you mention that "In R it is more convenient to use one-based indices instead of zero-based indices.  This is taken care of by affxparser."  Is this where the 1 comes from in the implementation in order to move the index values from 0-based to 1-based?

Correct.

In order to improve the affxparser documentation, I have added the
following section to the end of [1]:

 \section{Note on the zero-based "index" field of Affymetrix CDF files}{
   An Affymetrix CDF file provides information on which cells should be
   grouped together.  To identify these groups of cells, the cells
   are specified by their (x,y) coordinates, which are stored as
   zero-based coordinates in the CDF file.

   All methods of the \pkg{affxparser} package make use of these
   (x,y) coordinates, and some methods make it possible to read
   them as well.  However, it is much more common that the methods
   return cell indices \emph{calculated} from the (x,y) coordinates
   as explained above.

   In order to conveniently work with cell indices in \R, the
   convention in \pkg{affxparser} is to use \emph{one-based}
   indices.
   Hence the addition (and subtraction) of 1:s in the above equations.
   This is all taken care of by \pkg{affxparser}.

   Note that, in addition to (x,y) coordinates, a CDF file also contains
   a zero-based "index" for each cell.  This "index" is redundant to
   the (x,y) coordinate and can be calculated analogously to the
   above \emph{cell index} while leaving out the addition (subtraction)
   of 1:s.
   Importantly, since this "index" is redundant (and exists only in
   CDF files), we have decided to treat this field as an internal field.
   Methods of \pkg{affxparser} neither provide access to nor make
   use of this internal field.
 }

The other paragraphs on this help page should not need to be updated,
because nowhere else on that page do we discuss the content of a CDF
file.

I have also, where applicable, made it explicit in the help pages of
methods reading CDF files that the "cell indices" are one-based.  To
those help pages I have also added a short section:

 \section{Cell indices are one-based}{
   Note that in \pkg{affxparser} all \emph{cell indices} are by
   convention \emph{one-based}, which is more convenient to work
   with in \R.  For more details on one-based indices, see
   \code{\link{2. Cell coordinates and cell indices}}.
 }

I hope this will clarify things.  Any further feedback is appreciated.


Thanks for your help

Henrik


>
> On Mon, Feb 14, 2011 at 10:24 AM, Mounts, William <Bill.Mounts at pfizer.com> wrote:
>> Todd,
>>
>> It would appear that there is an error in affxparser.  Testing a
>> number of cdf files, it appears that index = K * y + x.
>
> I doubt that.  Could you please provide complete examples illustrating the problem?  Unless proven wrong, I stand firm on the claim that both the implementation and the documentation are correct.  As Kasper pointed out, it may be that the documentation is confusing or ambiguous, but that is not to say it's wrong.  I am happy to take suggestions on how to improve the documentation.
>
>
> CLARIFICATIONS:
>
> 1. The spatial (x,y) cell coordinates are zero-based [1].  This is at least the case if you access them via Affymetrix Fusion SDK, that is,
> via affxparser.   I cannot claim that all CDF files in history have
> had zero-based (x,y) coordinates, but it does not matter because through the Fusion SDK they are returned as such.  (Anecdotal
> evidence: Browsing through several of my (ASCII and binary) CDFs, they are indeed zero-based (x,y):s.)
>
> 2. A CDF file references the cells (probes) by their (x,y) coordinates only [2].
>
> 3. It is more convenient to access cells by their linear indices, which is why they are provided.
>
> 4. BTW, note also the last comment on that help page [1]: If you use the affxparser methods, you don't have to worry about (x,y) indices; everything is by default done using cell (probe) indices.
>
> 5. In R it is more convenient to use one-based indices instead of zero-based indices.  This is taken care of by affxparser.
>
> 6. The affxparser documentation [1] clearly says that spatial (x,y) cell coordinates are zero-based indices and the linear cell indices are one-based.
>
> 7. Do not confuse (Bioconductor) CDF annotation packages/environments with (Affymetrix) CDF *files*; affxparser deals with the latter only.
>
>
> I think Clarification (4) is one of the most important ones.  If you stick with affxparser, you are given a well-defined self-contained and consistent access to the content of CEL and CDF files (and some other Affymetrix file types too).
>
>
> REFERENCES:
> [1] help("2. Cell coordinates and cell indices", package="affxparser")
>
> [2] Section 'Affymetrix CDF Data File Format' part of 'File Formats Documentation', Affymetrix, October 2009
> (http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no)
>
>
> /Henrik
> (wrote most of [1])
>
>>
>> Bill
>>
>> -----Original Message-----
>> From: bioconductor-bounces at r-project.org
>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Todd Allen
>> Sent: Monday, February 14, 2011 11:19 AM
>> To: bioconductor at r-project.org
>> Subject: [BioC] converting Affy indices to x,y coordinates
>>
>> Hello all,
>>
>>   I have been reading the documentation portion of a package called
>> "affxparser."  In the documentation there is a description of the
>> formulas needed to seamlessly convert between Affymetrix probe indices
>> and the corresponding (x,y) coordinate of individual probes.
>>
>> Copying from the package documentation, the following information is
>> most relevant:
>>
>> 1. index = K * y + x + 1; where K is the number of columns on the chip
>> 2. y = floor((index - 1) / K)
>> 3. x = (index - 1) - K * y
>>
>> In my own work, I am processing a HGU133Plus 2 CDF file. The array
>> dimensions are (1164, 1164) and if I take the index of a specific
>> probe listed as 1354890, I calculate the coordinates as x = 1157 and y
>> = 1163 using the formulas above.
>>
>> The (x,y) coordinate reported from Affy's own CDF file for this probe
>> is actually x = 1158 (not 1157) and y = 1163.
>>
>> I am struggling to understand this discrepancy between the affxparser
>> documentation and the verbatim output from Affy's own CDF file.  Has
>> anyone run into this situation before?  Do you see any obvious problem
>> or explanation as to what is happening?
>>
>> Thank you!
>> Todd A
>> genesplicer28 at yahoo.com
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>


