[BioC] observations on affyprobeminer
hl224 at georgetown.edu
Mon May 5 15:23:06 CEST 2008
Thanks for your input on AffyProbeMiner (APM). The following are
answers to some of your questions (after consulting with Dr. Barry Zeeberg).
First, the difference between Affy-CDFs and APM-CDFs in the number of
significant probe sets can be caused by the different number of probe
sets that each CDF defines.
Secondly, one of the motivations of our study is the inconsistency among
different microarray platforms. We hoped to use remapping to improve the
consistency (remapping did improve consistency between different
generations of Affymetrix chips). APM, like several other remapping
tools and resources, tries to make sure that probes measure the signal of
the intended transcripts: i.e., the probe sequence can be mapped to the
While our knowledge about splice variants in a specific tissue is still
limited, APM generates gene-consistent or transcript-consistent
probe sets relative to the global set of all known transcripts, which
were derived from RefSeq and GenBank and passed our QA (i.e., they must
align well with the genome: 95% aligned with 99% identity). Therefore, APM
will not be able to address probes measuring unknown splice variants.
You can check
my colleague, Mike Ryan's Splice Center,
analyze known splice variants for each gene.
Also, the probes that get discarded in APM are ones that do not fall into
a sufficiently large consistent probe set, or that do not map to any
gene at all. If the user adjusts the threshold to accept small probe
sets, then APM will throw away only probes that map to no genes at all.
If the user wants to use probe sets that are not consistent, then APM
should not be used at all, and the original mappings can be used.
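As a rough illustration of the discarding rule described above (this is not APM's actual code), the sketch below groups probes that map to exactly the same set of genes and keeps only groups that reach a minimum size. The input format, the function name `build_consistent_probe_sets`, the parameter `min_probes`, and the example probe and gene names are all hypothetical, and treating "consistent" as "maps to the same gene set" is an assumption.

```python
from collections import defaultdict


def build_consistent_probe_sets(probe_to_genes, min_probes=3):
    """Group probes by the exact set of genes they map to.

    probe_to_genes: dict of probe id -> list of gene ids (hypothetical format).
    min_probes: smallest group size that is kept as a probe set (assumed knob).
    Returns (kept_sets, discarded_probes).
    """
    groups = defaultdict(list)
    discarded = []
    for probe, genes in probe_to_genes.items():
        key = frozenset(genes)
        if not key:
            # Probe maps to no gene at all: always discarded.
            discarded.append(probe)
        else:
            groups[key].append(probe)

    kept = {}
    for key, probes in groups.items():
        if len(probes) >= min_probes:
            kept[key] = sorted(probes)
        else:
            # Group is consistent but too small under the chosen threshold.
            discarded.extend(probes)
    return kept, sorted(discarded)


# Toy mapping, invented for illustration only.
example = {
    "p1": ["geneA"],
    "p2": ["geneA"],
    "p3": ["geneA"],
    "p4": ["geneB"],
    "p5": [],
    "p6": ["geneA", "geneC"],
}

kept, discarded = build_consistent_probe_sets(example, min_probes=3)
kept_small, discarded_small = build_consistent_probe_sets(example, min_probes=1)
```

With `min_probes=1` only the probe that maps to no gene is dropped, which corresponds to lowering the threshold to accept small probe sets.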
We are not saying that all of the discarded probe sets are random. What
some of them measure may potentially contain contributions from an
inconsistent set of genes. This inconsistency may degrade the reliability
of the measurement, to a greater or lesser degree, depending on the
relative expression levels of the “good” probes in that set and the
“inconsistent” probes in that set. In any given tissue, some of the genes
may not be expressed, so some probe sets that would be inconsistent in a
global sense are not inconsistent relative to that specific tissue. We
cannot provide all possible tissue-specific CDFs, but we provide the
software for users to be able to produce such custom CDFs as desired.
I agree with Dr. Gautier that you need to try original affy-CDFs as well
as several custom CDFs together with a mixture of microarray data
After a dozen years of research, a lot of work is still needed.
For example, in the Affymetrix probe set definitions, the number of probes
per probe set is fairly consistent, while in remapped probe sets the
number of probes per probe set can vary dramatically.
So existing algorithms that work well with Affymetrix CDFs probably will
not work well with custom CDFs.
Secondly, if we randomly group probes into probe sets (11 probes per
probe set) and then follow the usual microarray data analysis, there will
probably still be a lot of significantly expressed probe sets.
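The random-grouping thought experiment can be tried on simulated data. The sketch below is only a toy under stated assumptions: log-scale intensities are drawn from a normal distribution, a fraction of probes is shifted in the treatment group, each array's probe-set value is the median of its 11 probes (a crude stand-in for a real summarization such as RMA), and groups are compared with a Welch t statistic. All names and parameter values are invented for illustration, and how many random sets come out significant depends entirely on these assumptions.

```python
import random
import statistics


def random_probe_sets(probe_ids, set_size=11, seed=0):
    """Shuffle probes and cut them into full sets of set_size (leftovers dropped)."""
    rng = random.Random(seed)
    shuffled = list(probe_ids)
    rng.shuffle(shuffled)
    n_full = len(shuffled) // set_size
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_full)]


def welch_t(a, b):
    """Welch t statistic for two samples with possibly unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)


def simulate(n_probes=1100, n_per_group=6, frac_changed=0.2, shift=1.0, seed=0):
    rng = random.Random(seed)
    probes = list(range(n_probes))
    # Assumed: a random fraction of probes truly differs between groups.
    changed = set(rng.sample(probes, int(frac_changed * n_probes)))
    control = {p: [rng.gauss(8.0, 0.3) for _ in range(n_per_group)] for p in probes}
    treated = {
        p: [rng.gauss(8.0 + (shift if p in changed else 0.0), 0.3)
            for _ in range(n_per_group)]
        for p in probes
    }
    sets = random_probe_sets(probes, seed=seed)
    t_values = []
    for probe_set in sets:
        # One summary value per array: median over the probes in the random set.
        ctrl = [statistics.median(control[p][j] for p in probe_set)
                for j in range(n_per_group)]
        trt = [statistics.median(treated[p][j] for p in probe_set)
               for j in range(n_per_group)]
        t_values.append(welch_t(trt, ctrl))
    return sets, t_values


sets, t_values = simulate()
# Count random sets whose |t| exceeds an arbitrary illustrative cutoff.
n_large = sum(1 for t in t_values if abs(t) > 3.0)
```

Changing `frac_changed`, `shift`, or the cutoff shows how sensitive the count of apparently significant random probe sets is to the underlying signal.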
Also, the biochemistry behind microarrays is beyond my comprehension. I
have no problem understanding that microarrays can be used to compare two
groups (control and treatment) and pick probe sets that are
significantly different between the groups. But I still have trouble
understanding what exactly the absolute measure of each probe set means,
and how people can identify a set of significantly expressed probe sets.
Best regards; further discussion on this topic is welcome.
Mark Kimpel wrote:
> I have recently explored the use of alternative CDFs from
> affyprobeminer (APM) on a 36-array dataset derived using the Affy
> rat2302 chipset. I used both the Affy cdf and the transcript-level
> affyprobeminer cdf. I preprocessed using RMA, filtered using an A/P
> filter, and statistically analyzed using an appropriate lme model
> followed by qvalue FDR correction. I set my FDR threshold at 5%. I
> eliminated duplicate genes by picking the one with the lowest p-value.
> Using the Affy cdf, I got ~2000 sig. genes, with APM ~1000. If I
> choose only those EntrezGene identifiers present on both cdfs, my
> number sig. with the APM cdf was ~1000 and there was a 90% overlap
> with the Affy sig. list. My conclusion from the latter observation is
> that I am measuring largely the same transcripts/genes with both CDFs.
> I was interested in the ~1000 genes which are annotated with the Affy
> CDF but not the APM cdf. Following the logic behind APM, I would
> assume that these would be largely incorrectly annotated probesets or
> probesets that are not really measuring any "real" transcript. This
> list should, then, consist largely of random genes. To test this
> hypothesis, I used the Category package to test for
> over-representation of GO and KEGG categories in my various lists.
> What I found was a huge degree of overlap between: 1. the affy genes
> also annotated with APM, 2. the affy genes not annotated with APM, 3.
> the genes derived solely from APM.
> My conclusion from this latest observation is that APM is not
> annotating a large number of genes/transcripts that are in fact real.
> Assuming that APM is correctly throwing out some "junk" probesets, is
> it throwing out the baby with the bathwater?
> I'd be interested to hear the thoughts and experiences of others. I've
> certainly run into occasions where Affy annotated probesets turn out
> to represent introns or something other than they purport to be, and I
> was hoping that APM would solve this problem, but I don't want to use
> it if it means a massive loss of truly significant data.
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
> 15032 Hunter Court, Westfield, IN 46074
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 663-0513 Home (no voice mail please)
Hongfang Liu, Ph.D.
Department of Biostatistics,
Bioinformatics, and Biomathematics
Georgetown University Medical Center