[BioC] Biocore response to Affymetrix data format changes

Vincent Carey 525-2265 stvjc at channing.harvard.edu
Fri Jun 27 18:50:12 MEST 2003


D. Kulp of Affymetrix commented on the upcoming proprietary GeneChip
data formats in a Bioconductor mailing list post of 25 June 2003.
He notes that Windows/Java linkable libraries will be provided
for reading the binary GeneChip format, and that MAGE-ML
exports will be available.  He proposes that
 1) Bioconductor provide free compiled libraries using
the API and the Affymetrix linkable libraries, and
 2) Bioconductor applications use MAGE-ML, as data bloat is
not noteworthy and the export contains 'all the CEL data you
expect'.

Kulp comments that these observations show that the details
of the change are "fairly simple".  In fact, the change has
far-reaching implications for those who work with Bioconductor
software and Affymetrix data.

The Bioconductor project has adopted a policy of programming
only to public and open APIs.  Primary reasons:
 a) R is free software under the GPL.  Although we have made
an effort to release the main Bioc components under LGPL, as
a collaborative gesture towards commercial entities who wish
to use our tools, R itself is GPL.  It is not possible to
legally distribute tools that combine compilations of
non-free software with GPL software.
 b) Beyond the restrictions of the GPL in relation to R,
the Digital Millennium Copyright Act (DMCA) creates legal
complications for those who create compilations of mixed
free and proprietary software.  We have no resources to spend
on legal advice or on adapting our research to a complex
legal landscape.  Commitment to public and open APIs allows
us to carry on research in a natural and efficient way largely
independently of DMCA restriction and interpretation in
the complex area of reverse engineering.
 c) Commitment to public and open APIs leverages the user
community's capabilities to discover problems and to
fix them.  While distribution of compiled libraries with
open components as interfaces to proprietary formats may
SEEM consistent with open source software methodology,
this is an illusion.  We have benefited from user-contributed
bug fixes and would cease to do so under the regimen proposed
by Kulp, because users would lack access to key elements of
the interface.
 d) Commitment to public and open APIs sharply reduces
effort required to support multiple platforms.  When compiled
libraries are distributed one frequently encounters conflicts
with resident versions of supporting libraries and one
needs to introduce substantial technology for bridging
distributed objects to platforms whose resources may be
out of date or noncompliant with basic standards.  Time spent
on nonstandard portability methodology is time subtracted
from research on computational biology.  As researchers
we cannot accept this additional cost.
 e) Commitment to public and open APIs is the only approach
compatible with the recognition that microarray analysis
technology is immature and must be fully open to scrutiny
if science is to advance in an efficient way.  Comparisons
of MAS4, MAS5, Li and Wong's MBEI and RMA probe-level
analyses indicate that the procedures yield different results.
Users have a right to expect that results from different
methodologies can be fully rationalized, and this can only
occur with open implementations.

These five points respond to Kulp's suggestion that we
provide free binaries to the user community.  The suggestion
seems simple and positive, but it is not feasible.

Kulp's second suggestion is to employ the MAGE-ML format.
This does appear to constitute a public and open API,
and one that we could program to.  However, it also appears
that there will be significant information restrictions and
performance costs if we are forced to go in this direction.
We have one report of significant data bloat with the
current embodiments of this technology.  A 7 MB
CEL file had a 30 MB XML representation, and a 21 MB
CDF file had a 400 MB XML representation.  Kulp suggests
that XML bloat does not occur, and that may be due to
his access to newer forms of the transformation.  We
believe that compliant MAGE-ML representations will be
massive.  Requiring Bioconductor to work from MAGE-ML
will lead to additional burdens on users that will
impede research progress.
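
As a rough illustration of the scale involved, the expansion
factors implied by the single report cited above can be computed
directly.  This is only a sketch: the sizes are the ones quoted
in that report, the names used below are hypothetical, and newer
exports may behave differently.

```python
# Expansion factors implied by the reported file sizes (in MB).
# The figures come from the single report quoted above; the
# dictionary and function names are illustrative only.
reported_sizes_mb = {
    "CEL": (7, 30),    # (binary size, XML size)
    "CDF": (21, 400),
}


def expansion_factor(binary_mb, xml_mb):
    """Return how many times larger the XML form is than the binary form."""
    return xml_mb / binary_mb


for name, (binary_mb, xml_mb) in reported_sizes_mb.items():
    factor = expansion_factor(binary_mb, xml_mb)
    print(f"{name}: {factor:.1f}x larger as XML")
```

On these figures the XML form is roughly 4 times larger for the
CEL file and roughly 19 times larger for the CDF file.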

In summary, Bioconductor's commitment to open and public
APIs is dictated by legal and scientific considerations.
Affymetrix' transition to closed file formats is difficult
to understand.  No one questions the technical utility of
a change to a binary format.  Making it secret has no
utility that we can discern.  Bioconductor and its users
have provided R&D to Affymetrix essentially free of charge.
The upcoming Affymetrix GeneChip Microarray Low-Level Workshop
( http://eci-events.com/AffyGeneChip/ ) is proof that Affymetrix
appreciates and is open to these contributions.
Accommodating a non-public, non-open API for Affymetrix data
would constitute a precedent that might impact methods
adopted by other companies in this field.  We respectfully
ask that Affymetrix set a rather different precedent:
open the new file format to support and encourage research
and development in the microarray analysis domain.
An open format will clearly benefit both Affymetrix and
the scientific community.

Sincerely,
The Bioconductor Core Team

    * Douglas Bates, University of Wisconsin, USA.
    * Vince Carey, Harvard Medical School, USA.
    * Marcel Dettling, Federal Inst. Technology, Switzerland.
    * Sandrine Dudoit, Division of Biostatistics, UC Berkeley, USA.
    * Byron Ellis, Harvard Department of Statistics, USA.
    * Laurent Gautier, Technical University of Denmark, Denmark.
    * Robert Gentleman, Harvard Medical School, USA.
    * Jeff Gentry, Dana-Farber Cancer Institute, USA.
    * Kurt Hornik, Technische Universität Wien, Austria.
    * Torsten Hothorn, Institut fuer Medizininformatik, Biometrie und Epidemiologie, Germany.
    * Wolfgang Huber, DKFZ Heidelberg, Molecular Genome Analysis, Germany.
    * Stefano Iacus, University of Milan, Italy.
    * Rafael Irizarry, Department of Biostatistics (JHU), USA.
    * Friedrich Leisch, Technische Universität Wien, Austria.
    * Martin Maechler, Federal Inst. Technology, Switzerland.
    * Gordon Smyth, Walter and Eliza Hall Institute, Australia.
    * Anthony Rossini, University of Washington and the Fred Hutchinson Cancer Research Center, USA.
    * Gunther Sawitzki, Institut für Angewandte Mathematik, Germany.
    * Luke Tierney, University of Iowa, USA.
    * Jean Yee Hwa Yang, University of California, San Francisco, USA.
    * Jianhua (John) Zhang, Dana-Farber Cancer Institute, USA.


