[BioC] converting Affy indices to x,y coordinates

Wed Feb 16 17:18:55 CET 2011

Hi everyone,

    Thank you for an "enlightening" conversation.  Let me describe my own motivation for asking the original question, and that may reveal the cause of confusion.

    I am currently writing a tutorial on how to process Affymetrix data using Mathematica.  Part of my tutorial covers Affy CDF files, but working out the conversion between Affy (x,y) coordinates and the indices in Affy CDF files has been rather cumbersome.  When I came across the excellent affxparser package, I thought I had found what I needed, only to be confused by the zero- versus one-based index "issue".

    Sorry for the "stress", but I do believe there is real value in describing how Affymetrix handles the data versus how an independent package like affxparser handles it.  As a teacher myself, I like to err on the side of "explain too much."

    Thank you to all and I'm sorry if I caused undue concern.

Todd

--- On Wed, 2/16/11, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:

> From: Henrik Bengtsson <hb at biostat.ucsf.edu>
> Subject: Re: [BioC] converting Affy indices to x,y coordinates
> To: "Mounts, William" <Bill.Mounts at pfizer.com>
> Cc: "Todd Allen" <genesplicer28 at yahoo.com>, "bioconductor" <bioconductor at r-project.org>
> Date: Wednesday, February 16, 2011, 2:57 AM
> Hi.
> 
> On Tue, Feb 15, 2011 at 6:21 PM, Mounts, William <Bill.Mounts at pfizer.com> wrote:
> > From the Affymetrix documentation, the following are available for each cell (probe) in the cdf file.
> >
> > Cell information, repeated for each cell in the block:
> >
> > Atom number - integer
> > X coordinate - unsigned short
> > Y coordinate - unsigned short
> > Index position (relative to sequence for CustomSeq, Genotyping, Copy Number, Polymorphic Marker, and Multichannel Marker units, for Expression units this value is the atom number) - integer
> > Base of probe at substitution position - char
> > Base of target at interrogation position - char
> > Length of probe sequence - unsigned short (only available in version 2 and 3)
> > Physical grouping of probe - unsigned short (only available in version 2 and 3)
> >
> > Index position is provided and examination of various cdf files shows that index = K*y + x.
> 
> Thanks for pointing this out.  You are correct that CDF files (only) also contain an "index" field.  You are also correct that this redundant CDF "index" field seems to be zero-based (at least in the ASCII CDF files I've checked).  I've checked the code, and it is the case that affxparser completely ignores this field (because it is redundant) and operates only via the (x,y) coordinates.  Indeed, none of the methods in affxparser for reading CDF files allows you to read the "index" values.
> 
> Since it is tedious to address cells by spatial (x,y) coordinates, linear indices are used instead.  The convention in affxparser is to use one-based indices, which we call "cell indices" as described in [1].  All affxparser methods reading CDF files return the one-based "cell indices" as calculated from the (x,y) coordinates (never the above internal CDF "index" field).
> 
> FYI, this made me go back to old email communication I had with other affxparser authors back in 2006.  I had forgotten, but we actually discussed the above then and eventually decided that the convention should be one-based.  Early versions of affxparser did indeed use zero-based indices (still calculated from (x,y) though).  Using zero-based indices would be much(!) more error prone in R.  From affxparser's NEWS file:
> 
> Version: 1.3.2 [2006-03-28]
> o All cell and unit indices are now starting from one and not
>   from zero.  This change requires that all code that have
>   been using a previous version of this package have to be
>   updated!
> 
> > Below, in point 5, you mention that "In R it is more convenient to use one-based indices instead of zero-based indices.  This is taken care of by affxparser."  Is this where the 1 comes from in the implementation in order to move the index values from 0-based to 1-based?
> 
> Correct.
> 
> In order to improve the affxparser documentation, I have added the following section to the end of [1]:
> 
>  \section{Note on the zero-based "index" field of Affymetrix CDF files}{
>    An Affymetrix CDF file provides information on which cells should be
>    grouped together.  To identify these groups of cells, the cells
>    are specified by their (x,y) coordinates, which are stored as
>    zero-based coordinates in the CDF file.
> 
>    All methods of the \pkg{affxparser} package make use of these
>    (x,y) coordinates, and some methods make it possible to read
>    them as well.  However, it is much more common that the methods
>    return cell indices \emph{calculated} from the (x,y) coordinates
>    as explained above.
> 
>    In order to conveniently work with cell indices in \R, the
>    convention in \emph{affxparser} is to use \emph{one-based}
>    indices.
>    Hence the addition (and subtraction) of 1:s in the above equations.
>    This is all taken care of by \pkg{affxparser}.
> 
>    Note that, in addition to (x,y) coordinates, a CDF file also contains
>    a zero-based "index" for each cell.  This "index" is redundant to
>    the (x,y) coordinate and can be calculated analogously to the
>    above \emph{cell index} while leaving out the addition (subtraction)
>    of 1:s.
>    Importantly, since this "index" is redundant (and exists only in
>    CDF files), we have decided to treat this field as an internal field.
>    Methods of \pkg{affxparser} neither provide access to nor make
>    use of this internal field.
>  }
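
For readers working outside R, the relationship between the two conventions described in this section can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in this thread (zero-based (x,y), K columns, CDF "index" = K*y + x); the function names `cdf_index` and `cell_index` and the tiny 4-by-3 chip are made up for the example and are not part of affxparser or of any Affymetrix tool.

```python
def cdf_index(x, y, ncol):
    """Zero-based linear index, as observed for the CDF "index" field: K*y + x."""
    return ncol * y + x


def cell_index(x, y, ncol):
    """One-based cell index following the affxparser convention: K*y + x + 1."""
    return ncol * y + x + 1


# Hypothetical tiny chip: 4 columns by 3 rows, with zero-based (x, y).
NCOL, NROW = 4, 3
for y in range(NROW):
    for x in range(NCOL):
        # The two conventions differ by exactly one for every cell.
        assert cell_index(x, y, NCOL) == cdf_index(x, y, NCOL) + 1

print(cdf_index(0, 0, NCOL), cell_index(0, 0, NCOL))  # first cell: 0 1
print(cdf_index(3, 2, NCOL), cell_index(3, 2, NCOL))  # last cell: 11 12
```

The only difference between the two functions is the trailing `+ 1`, which is the point of the note above: the same (x,y) pair maps to two linear indices that differ by exactly one.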
> 
> Note that the other paragraphs on this help page should not need to be updated, since nowhere else on that page do we talk about the content of a CDF file.
> 
> I have also, where applicable, made it explicit in the help pages of methods reading CDF files that the "cell indices" are one-based.  To those help pages I have also added a short section:
> 
>  \section{Cell indices are one-based}{
>    Note that in \pkg{affxparser} all \emph{cell indices} are by
>    convention \emph{one-based}, which is more convenient to work
>    with in \R.  For more details on one-based indices, see
>    \code{\link{2. Cell coordinates and cell indices}}.
>  }
> 
> I hope this will clarify things.  Any further feedback is appreciated.
> 
> 
> Thanks for your help
> 
> Henrik
> 
> 
> >
> > On Mon, Feb 14, 2011 at 10:24 AM, Mounts, William <Bill.Mounts at pfizer.com> wrote:
> >> Todd,
> >>
> >> It would appear that there is an error in affxparser.  Testing a number of cdf files, it appears that index = K * y + x.
> >
> > I doubt that.  Could you please provide complete examples illustrating the problem?  Unless proven wrong, I stand firm on the claim that both the implementation and the documentation are correct.  As Kasper pointed out, it may be that the documentation is confusing or ambiguous, but that is not to say it's wrong.  I am happy to take suggestions on how to improve the documentation.
> >
> >
> > CLARIFICATIONS:
> >
> > 1. The spatial (x,y) cell coordinates are zero-based [1].  This is at least the case if you access them via the Affymetrix Fusion SDK, that is, via affxparser.  I cannot claim that all CDF files in history have had zero-based (x,y) coordinates, but it does not matter because through the Fusion SDK they are returned as such.  (Anecdotal evidence: Browsing through several of my (ASCII and binary) CDFs, they are indeed zero-based (x,y):s.)
> >
> > 2. A CDF file references the cells (probes) by their (x,y) coordinates only [2].
> >
> > 3. It is more convenient to access cells by their linear indices, which is why they are provided.
> >
> > 4. BTW, note also the last comment on that help page [1]: If you use the affxparser methods, you don't have to worry about (x,y) indices; everything is by default done using cell (probe) indices.
> >
> > 5. In R it is more convenient to use one-based indices instead of zero-based indices.  This is taken care of by affxparser.
> >
> > 6. The affxparser documentation [1] clearly says that spatial (x,y) cell coordinates are zero-based indices and the linear cell indices are one-based.
> >
> > 7. Do not confuse (Bioconductor) CDF annotation packages/environments with (Affymetrix) CDF *files*; affxparser deals with the latter only.
> >
> >
> > I think Clarification (4) is one of the most important ones.  If you stick with affxparser, you are given well-defined, self-contained and consistent access to the content of CEL and CDF files (and some other Affymetrix file types too).
> >
> >
> > REFERENCES:
> > [1] help("2. Cell coordinates and cell indices", package="affxparser")
> >
> > [2] Section 'Affymetrix CDF Data File Format' part of 'File Formats Documentation', Affymetrix, October 2009
> > (http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no)
> >
> >
> > /Henrik
> > (wrote most of [1])
> >
> >>
> >> Bill
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Todd Allen
> >> Sent: Monday, February 14, 2011 11:19 AM
> >> To: bioconductor at r-project.org
> >> Subject: [BioC] converting Affy indices to x,y coordinates
> >>
> >> Hello all,
> >>
> >>   I have been reading the documentation portion of a package called "affxparser."  In the documentation there is a description of the formulas needed to seamlessly convert between Affymetrix probe indices and the corresponding (x,y) coordinates of individual probes.
> >>
> >> Copying from the package documentation, the following information is most relevant:
> >>
> >> 1. index = K * y + x + 1; where K is the number of columns on the chip
> >> 2. y = floor((index - 1) / K)
> >> 3. x = (index - 1) - K * y
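
The three formulas quoted above can be transcribed almost literally. Below is a minimal Python sketch, assuming zero-based (x,y) and a one-based index as the formulas state; the function names `xy_to_index` and `index_to_xy`, the range checks, and the 5-column layout are illustrative choices and do not come from affxparser.

```python
def xy_to_index(x, y, ncol):
    """Formula 1: one-based index from zero-based (x, y), with K = ncol."""
    if not (0 <= x < ncol) or y < 0:
        raise ValueError("x must be in [0, ncol) and y must be >= 0")
    return ncol * y + x + 1


def index_to_xy(index, ncol):
    """Formulas 2 and 3: zero-based (x, y) from a one-based index."""
    if index < 1:
        raise ValueError("one-based indices start at 1")
    y = (index - 1) // ncol      # floor((index - 1) / K)
    x = (index - 1) - ncol * y   # offset within row y
    return x, y


# Round trip on a hypothetical 5-column layout.
K = 5
for index in range(1, 21):
    x, y = index_to_xy(index, K)
    assert xy_to_index(x, y, K) == index

print(index_to_xy(1, K))     # (0, 0)
print(index_to_xy(6, K))     # (0, 1)
print(xy_to_index(4, 3, K))  # 20
```

The round-trip loop is a quick check that formulas 2 and 3 really invert formula 1 for every index on the layout.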
> >>
> >> In my own work, I am processing a HGU133Plus 2 CDF file.  The array dimensions are (1164, 1164) and if I take the index of a specific probe, listed as 1354890, I calculate the coordinates as x = 1157 and y = 1163 using the formulas above.
> >>
> >> The (x,y) coordinate reported in Affy's own CDF file for this probe is actually x = 1158 (not 1157) and y = 1163.
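
The numbers in this example can be reproduced directly. The Python sketch below uses only the values quoted in the message (K = 1164 columns, index 1354890) and decodes the index twice: once as a one-based cell index, as in the documented formulas, and once as a zero-based index of the form K * y + x, which is how the replies at the top of this thread describe the CDF "index" field. The variable names are arbitrary.

```python
K = 1164          # number of columns on the chip, from the message above
index = 1354890   # index listed in the CDF file for the probe

# Reading the value as a one-based cell index (the documented formulas).
y1, x1 = divmod(index - 1, K)

# Reading the value as a zero-based index (index = K * y + x).
y0, x0 = divmod(index, K)

print((x1, y1))  # (1157, 1163)
print((x0, y0))  # (1158, 1163)
```

Only the zero-based reading reproduces the x = 1158 stored in the CDF file, which is consistent with the explanation that the value copied from the file was the zero-based "index" field rather than a one-based cell index.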
> >>
> >> I am struggling to understand this discrepancy between the affxparser documentation and the verbatim output from Affy's own CDF file.  Has anyone run into this situation before?  Do you see any obvious problem or explanation as to what is happening?
> >>
> >> Thank you!
> >> Todd A
> >> genesplicer28 at yahoo.com
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
>