[BioC] Re-mapped Affy CDF files
lgautier at altern.org
Wed Jan 18 17:59:13 CET 2006
The idea that the mappings provided by Affymetrix are to some
extent outdated has been in the air for quite some time.
The number of discrepancies between Affymetrix's mapping and one's own
mapping, done with a set of recent and curated reference sequences (such as
NCBI's RefSeq), has been reported, and attempts at quantifying the
differences have been made... but this is no trivial task.
Some of the differences I observed when looking into this
(now a couple of years ago) were merely anecdotal: probe sets
built to match formerly hypothetical genes (back in the days when the chips
were designed) that no longer made sense because the hypothetical gene
was later believed to be artefactual; individual probes
having matches all over the place; or the MM (mismatch) probe in a
probe pair matching a reference sequence while the PM (perfect match) probe
was not matching anything.
What appeared to be happening on a large scale was that a significant number
of probe sets in an alternative mapping are in fact merges of separate
probe sets in the Affymetrix mapping... which leads to the uncanny world of
alternative splicing, and its intricate complexities.
I ended up mostly discarding the probes found matching in several
places (which was a trivial task to automate), and curating the cases I
spotted while trying to figure out where the discrepancies between
the two mappings were.
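For illustration, the "discard probes matching in several places" step can
be sketched as below. This is a minimal sketch in Python rather than R, and
it is not the code I used back then: the function name and the input format
(pairs of probe identifier and reference sequence identifier, as produced by
some sequence-matching step) are assumptions made for the example. It also
assumes that "several places" means several distinct reference sequences.

```python
from collections import defaultdict


def discard_multi_matching(matches):
    """Group probes by target, dropping probes that hit several targets.

    `matches` is an iterable of (probe_id, target_id) pairs (a hypothetical
    format). Returns a dict mapping each target_id to the sorted list of
    probe_ids that match that target and no other.
    """
    # Collect the set of distinct targets hit by each probe.
    targets_by_probe = defaultdict(set)
    for probe_id, target_id in matches:
        targets_by_probe[probe_id].add(target_id)

    # Keep only the probes with exactly one target, grouped by that target.
    probe_sets = defaultdict(list)
    for probe_id, targets in targets_by_probe.items():
        if len(targets) == 1:
            (target_id,) = targets
            probe_sets[target_id].append(probe_id)

    return {target: sorted(probes) for target, probes in probe_sets.items()}
```

A probe reported twice against the same reference sequence is kept, since
only distinct targets are counted; whether that is the right rule depends
on the reference set used.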
To my knowledge, the closest thing to an experimental validation of the
relevance of new mappings has been carried out by Carter et al. Their work
suggests that newer mappings are indeed better.
Tools for building one's own alternative mapping have been in Bioconductor
for quite some time (I am thinking of the packages "altcdfenvs" and
"matchprobes"), but building one's own mapping for the larger recent chips
from Affymetrix is admittedly a computationally expensive task (several
days of CPU time).
Dai et al. have set up the automated building of such environments, and
save people interested in making use of alternative environments the
computing effort needed to obtain them.
However, this is not the end of it. By offering a complete
toolkit for building alternative mappings, the rationale behind the
'altcdfenvs' package was not only to make the building of alternative CDF
environments for mass consumption as easy as possible, but also to allow
customizations for specific contexts. One obvious example is the use of
stock Affymetrix chips designed for a particular species with a sample from
a slightly different species, or with a sample in which particular genomic
features are known to differ from the canonical case. An example I was
giving back then was the use of the E.coli chip, knowing that
there are quite a few E.coli strains around labs and that E.coli's genome
can be easily "engineered". Another example is when a sample is known
to be possibly a mixture of different cells and possible
cross-hybridization is not desirable (and therefore some probes are
discarded).
Handling different mappings introduces complexity (version number tracking,
etc.), and one way to handle that is through packages (a la annotation
packages). The trouble is that this requires careful operations
when replacing one mapping with another: accidents like using
a mapping for one chip type with another chip type will give
completely wrong results.
I had a stab at that by having classes for CDF environments
(as defined in the package 'altcdfenvs'), together with a rewrite of 'affy',
and putting it in the repository a little more than a year ago (should be in
the subversion repository under the name 'affyplus'). I am not certain
anyone has picked that up since then... (and some will say I should
give up the idea that someone will ;) ).
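To make the point about chip type accidents concrete, here is a minimal
sketch of a mapping object that carries its chip type and version, and
refuses to be used with data from another chip type. It is written in
Python rather than R, and it is not the 'altcdfenvs' or 'affyplus' API: the
class name, field names and function name are hypothetical and only serve
the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeMapping:
    """A probe-to-probe-set mapping tagged with its chip type and version."""
    chip_type: str   # e.g. the chip name recorded with the raw data
    version: str     # version label of the mapping (hypothetical format)
    probe_sets: dict = field(default_factory=dict)  # set id -> list of probes


def mapping_for(mapping, data_chip_type):
    """Return `mapping` if it was built for `data_chip_type`, else raise."""
    if mapping.chip_type != data_chip_type:
        raise ValueError(
            f"mapping built for chip type {mapping.chip_type!r} "
            f"(version {mapping.version}) cannot be used with data "
            f"from chip type {data_chip_type!r}"
        )
    return mapping
```

Failing loudly at the point where a mapping is attached to the data is the
design choice here: a silent mismatch would still produce numbers, only
meaningless ones.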
Just some thoughts,
Gautier et al., BMC Bioinformatics. 2004; 5: 111.
Carter et al., BMC Bioinformatics. 2005; 6: 107.
Dai et al., Nucleic Acids Res. 2005; 33(20)
> Hi all,
> I looked at the alternative mappings a few months ago after attending a
> seminar given by Stanley Watson, Director of Mental Health Research
> Institute at University of Michigan. He recommended that the alternative
> mappings always be used because of the large discrepancies they found
> between Affymetrix's mapping and their mappings of the probes. I don't
> know whether they have any documentation on whether their mappings yield
> results that are more often validated through alternative methodologies or
> not, but they do have quite a lot of documentation on what they did and
> why they did it - see the description of custom CDF files and their new
> paper from [...] on the page Jim put in his first post. Even if Ensembl or
> Affymetrix updates their annotation based on remapping, the CDFs aren't
> changed, so the summarization and statistical analysis are done using
> probes that may not all map to the same "gene" uniquely. What these
> alternative mappings do is to remap each probe, then redefine probe sets
> based on all the probes that map to a "gene", and it's these re-groupings
> that are most important. Many of the alternative mappings are subsets of
> other ones, like taking only the first 11 probes from the 3' end in cases
> where there are more than 11 probes, so there are not quite as many
> alternative mappings as it first appears.
> I do agree with Jim that coming up with a defensible rationale is
> important, as I was having trouble deciding which mapping might be the
> best to use. Stan Watson would argue that any of them are better than the
> outdated Affymetrix groupings. If Affy did theirs based on Unigene
> clustering, then the new mapping & grouping based on Unigene might be a
> defensible choice. In the end, I succumbed to historical inertia and went
> with Affymetrix's CDF, in part because I do analyses for many organisms,
> and MBNI only has alternative CDFs for human, mouse, and rat. However, I
> was able to get the alternative CDFs to work in Bioconductor with little
> trouble.
> As far as validating the genes on the magical "significant list", I did
> get some advice at a recent conference to ALWAYS first check the current
> mappings for those significant genes, then only concentrate on those
> that have most or all of their probes where they should be. Does anyone
> do this routinely? Should we, but we don't because it is too time
> consuming?
> At 08:51 AM 1/11/2006, James W. MacDonald wrote:
>>Sean Davis wrote:
>> > I'm not sure what their build process is, but doesn't Ensembl do some
>>Maybe. I couldn't find anything obvious in a cursory glance at their
>>Anyway, the main question for me is not the number or type of
>>alternative mappings that exist for Affy arrays (there are 19 different
>>CDFs that the MBNI folks produce, including several based on Ensembl
>>mappings). I am more concerned with being able to establish a defensible
>>rationale for using a particular mapping.
>>I guess what we do right now with the Affy CDFs isn't defensible except
>>on a historical basis, but the weight of history is pretty strong. For
>>instance, attributing significance at an alpha of < 0.05 has no
>>rationale AFAIK, but is pretty much written in stone due to precedent.
>>OTOH, most if not all microarray data are caveat emptor - it is
>>incumbent on the end user to take the magical list of differentially
>>expressed genes and validate them with an alternative methodology. Given
>>that state of affairs, is it not reasonable to choose the probe mappings
>>that one uses with the same logic that one uses for choosing the
>>preferred way of computing expression values?
>> > Sean
>>James W. MacDonald
>>Affymetrix and cDNA Microarray Core
>>University of Michigan Cancer Center
>>1500 E. Medical Center Drive
>>Ann Arbor MI 48109
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
> Jenny Drnevich, Ph.D.
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu