[BioC] observations on affyprobeminer

Mon May 5 15:23:06 CEST 2008

Dear Mark,

Thanks for your input on AffyProbeMiner (APM). Below are answers to
some of your questions, written after consulting with Dr. Barry Zeeberg.

First, the difference in the number of significant probe sets between
the Affy CDF and the APM CDF may be caused simply by the different
number of probe sets that each CDF defines.

Secondly, one of the motivations of our study is the inconsistency among
different microarray platforms. We hoped to use remapping to improve
consistency (remapping did improve consistency between different
generations of Affymetrix chips). APM, like several other remapping
tools and resources, tries to ensure that probes measure the signal of
the intended transcripts, i.e., that each probe sequence can be mapped
to its intended transcript.
Our knowledge about splice variants in any specific tissue is still
limited. APM therefore generates gene-consistent or transcript-consistent
probe sets relative to the global set of all known transcripts. Those
transcripts are derived from RefSeq and GenBank and must pass our QA,
i.e., they must align well with the genome (95% aligned with 99%
identity). As a result, APM cannot address probes that measure unknown
splice variants. To analyze the known splice variants for each gene, you
can check my colleague Mike Ryan's Splice Center:
http://www.tigerteamconsulting.com/SpliceCenter/SpliceOverview.jsp
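
The QA criterion described above can be sketched as a simple filter.
This is only an illustration and not APM's actual code: the Alignment
record, its field names, and the reading of "95% aligned with 99%
identity" as "at least 95% of the transcript length is aligned, with at
least 99% identity over the aligned part" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one transcript-to-genome alignment."""
    transcript_id: str
    transcript_length: int  # total bases in the transcript
    aligned_bases: int      # transcript bases covered by the alignment
    matching_bases: int     # aligned bases identical to the genome


def passes_qa(aln, min_coverage=0.95, min_identity=0.99):
    """Return True if the alignment meets both (assumed) QA thresholds."""
    if aln.transcript_length <= 0 or aln.aligned_bases <= 0:
        return False
    coverage = aln.aligned_bases / aln.transcript_length
    identity = aln.matching_bases / aln.aligned_bases
    return coverage >= min_coverage and identity >= min_identity


# Made-up example records, for illustration only.
alignments = [
    Alignment("tx_good", 1000, 980, 975),   # high coverage, high identity
    Alignment("tx_short", 1000, 900, 899),  # coverage below 95%
    Alignment("tx_noisy", 1000, 990, 950),  # identity below 99%
]
kept = [a.transcript_id for a in alignments if passes_qa(a)]
print(kept)
```

Only transcripts that pass such a filter would be used as the reference
set against which probe consistency is judged.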

Also, the probes that APM discards are those that do not fall into a
sufficiently large consistent probe set, or that do not map to any gene
at all. If the user adjusts the threshold to accept small probe sets,
then APM will throw away only probes that map to no gene at all. If the
user wants to use probe sets that are not consistent, then APM should
not be used at all, and the original mappings can be used instead.
We are not saying that all of the discarded probe sets are random. Some
of them may measure a signal that contains contributions from an
inconsistent set of genes. This inconsistency may degrade the
reliability of the measurement, to a greater or lesser degree, depending
on the relative expression levels of the “good” probes and the
“inconsistent” probes in that set. In any given tissue, some of the
splice variants may not be expressed, so some probe sets that would be
inconsistent in a global sense are not inconsistent relative to that
specific tissue. We cannot provide all possible tissue-specific CDFs,
but we provide the software so that users can produce such custom CDFs
as desired.
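
The discarding rule described above can be sketched roughly as follows.
This is a simplified illustration, not APM's implementation: it assumes
that probes mapping to exactly the same set of genes form one
"consistent" probe set, and the function name, the input dictionary and
the min_size default are hypothetical.

```python
from collections import defaultdict


def build_consistent_probe_sets(probe_to_genes, min_size=3):
    """Group probes by the exact set of genes they map to.

    Returns (kept, too_small, unmapped):
      kept      -- gene set -> probes, for groups with >= min_size probes
      too_small -- probes discarded because their group is too small
      unmapped  -- probes discarded because they map to no gene at all
    """
    groups = defaultdict(list)
    unmapped = []
    for probe, genes in probe_to_genes.items():
        if not genes:
            unmapped.append(probe)
        else:
            groups[frozenset(genes)].append(probe)

    kept = {}
    too_small = []
    for gene_set, probes in groups.items():
        if len(probes) >= min_size:
            kept[gene_set] = sorted(probes)
        else:
            too_small.extend(probes)
    return kept, sorted(too_small), sorted(unmapped)


# Made-up mapping, for illustration only.
mapping = {
    "p1": {"GeneA"}, "p2": {"GeneA"}, "p3": {"GeneA"},
    "p4": {"GeneA", "GeneB"},  # hits two genes: its own one-probe group
    "p5": set(),               # maps to no gene
}
kept, too_small, unmapped = build_consistent_probe_sets(mapping, min_size=3)
kept_all, small_all, unmapped_all = build_consistent_probe_sets(mapping, min_size=1)
print(len(kept), too_small, unmapped)
print(len(kept_all), small_all, unmapped_all)
```

With the size threshold lowered to 1, the only probes still discarded in
this sketch are the ones that map to no gene, which is the behavior
described above.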

I agree with Dr. Gautier that you need to try the original Affy CDFs as
well as several custom CDFs, together with a mixture of microarray data
analysis methods. After a dozen years of research, a lot of work is
still needed.

For example, in the Affymetrix probe set definitions the number of
probes per probe set is fairly consistent, while in remapped probe sets
the number of probes per probe set can vary dramatically. So existing
algorithms that work well with Affymetrix CDFs probably will not work
as well with custom CDFs.

Secondly, if we randomly grouped probes into probe sets (11 probes per
probe set) and then followed the usual microarray data analysis, there
would probably still be a lot of significantly expressed probe sets.
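
The random-grouping thought experiment above can be tried on simulated
data. The sketch below is a toy under stated assumptions, not an
analysis of real arrays: it reads "significant" as "different between a
control and a treatment group", uses a plain per-array mean as a crude
stand-in for RMA summarization, and every parameter (fraction of
responsive probes, shift, noise, t cutoff) is made up for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se


def simulate(n_probes=1100, set_size=11, n_arrays=6,
             frac_responsive=0.3, shift=2.0, noise_sd=0.5,
             t_cutoff=3.0, seed=0):
    """Count randomly assembled probe sets whose |t| exceeds t_cutoff."""
    rng = random.Random(seed)
    probes = []
    for _ in range(n_probes):
        # A responsive probe is shifted upward in the treatment group.
        effect = shift if rng.random() < frac_responsive else 0.0
        control = [rng.gauss(8.0, noise_sd) for _ in range(n_arrays)]
        treatment = [rng.gauss(8.0 + effect, noise_sd)
                     for _ in range(n_arrays)]
        probes.append((control, treatment))

    # Random grouping: shuffle, then cut into consecutive blocks.
    rng.shuffle(probes)
    sets = [probes[i:i + set_size]
            for i in range(0, n_probes, set_size)]

    flagged = 0
    for probe_set in sets:
        # Summarize each array by the mean over the probes in the set.
        ctrl = [statistics.mean([p[0][j] for p in probe_set])
                for j in range(n_arrays)]
        trt = [statistics.mean([p[1][j] for p in probe_set])
               for j in range(n_arrays)]
        if abs(welch_t(ctrl, trt)) > t_cutoff:
            flagged += 1
    return len(sets), flagged


n_sets, flagged = simulate()
_, flagged_null = simulate(frac_responsive=0.0)
print(f"{flagged} of {n_sets} random probe sets flagged")
print(f"{flagged_null} of {n_sets} flagged when no probe responds")
```

In this toy setting a random set of 11 probes usually contains a few
responsive probes, so its averaged signal can still differ between the
groups even though the set corresponds to no single gene.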

Also, the biochemistry behind microarrays is beyond my comprehension. I
have no problem understanding that microarrays can be used to compare
two groups (control and treatment) and to pick the probe sets that
differ significantly between the groups. But I still have trouble
understanding what exactly the absolute measure of each probe set
means, and how people can identify a set of significantly expressed
probe sets.

Best regards. I welcome more discussion on this topic.

Mark Kimpel wrote:
> I have recently explored the use of alternative CDFs from 
> affyprobeminer (APM) on a 36-array dataset derived using the Affy 
> rat2302 chipset. I used both the Affy cdf and the transcript-level 
> affyprobeminer cdf. I preprocessed using RMA, filtered using an A/P 
> filter, and statistically analyzed using an appropriate lme model 
> followed by qvalue FDR correction. I set my FDR threshold at 5%. I 
> eliminated duplicate genes by picking the one with the lowest p-value.
>
> Using the Affy cdf, I got ~2000 sig. genes, with APM ~1000. If I 
> choose only those EntrezGene identifiers present on both cdfs, my 
> number sig. with the APM cdf was ~1000 and there was a 90% overlap 
> with the Affy sig. list. My conclusion from the latter observation is 
> that I am measuring largely the same transcripts/genes with both CDFs.
>
> I was interested in the ~1000 genes which are annotated with the Affy 
> CDF but not the APM cdf. Following the logic behind APM, I would 
> assume that these would be largely incorrectly annotated probesets or 
> probesets that are not really measuring any "real" transcript. This 
> list should, then, consist largely of random genes. To test this 
> hypothesis, I used the Category package to test for 
> over-representation of GO and KEGG categories in my various lists. 
> What I found was a huge degree of overlap between: 1. the affy genes 
> also annotated with APM, 2. the affy genes not annotated with APM, 3. 
> the genes derived solely from APM.
>
> My conclusion from this latest observation is that APM is not 
> annotating a large number of genes/transcripts that are in fact real. 
> Assuming that APM is correctly throwing out some "junk" probesets, is 
> it throwing out the baby with the bathwater?
>
> I'd be interested to hear the thoughts and experiences of others. I've 
> certainly run into occasions where Affy annotated probesets turn out 
> to represent introns or something other than they purport to be, and I 
> was hoping that APM would solve this problem, but I don't want to use 
> it if it means a massive loss of truly significant data.
>
> Mark
>
>
>
> -- 
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 663-0513 Home (no voice mail please)
>
> ************************************************************** 

-- 
===========================
Hongfang Liu, Ph.D.
Department of Biostatistics,
Bioinformatics, and Biomathematics
Georgetown University Medical Center
Phone: 202-687-7933
Fax: 202-687-2581