[BioC] Affymetrix: RMA probe summarization for identical probe sequences

Wed Jul 20 15:54:53 CEST 2011

On Wed, Jul 20, 2011 at 9:45 AM, Gaj Stan (BIGCAT)
<stan.gaj at maastrichtuniversity.nl> wrote:
> Hello all,
>
> I have a very specific question regarding the working mechanisms behind the probe summarization step during RMA normalization. Let’s say that I have a reannotated probeset (customCDF) on an Affymetrix chip that looks like this: http://arrayanalysis.mbni.med.umich.edu/probeset/ps_pb.jsp?p=ENSG00000087076&c=HGU133Plus2_Hs_ENSG_13
>
> This reannotated probeset seems to contain several identical probes, but these are located at physically different positions on the chip (although they’re not that far away from each other). How are these identical probes handled during the probe summarization step? As independent measurements? Averaged prior to summarization? Or something else?
>
> In the end, what effect would these repeated sequences have on the calculated (median-polished) probeset intensity? If I understand the approach correctly, it will not have a drastic effect on the outcome, since the assumption is that all probes in a given probeset should measure the same intensity...
>
> Many thanks in advance!

When you do median polish, each probe you pass in through the CDF
environment is treated as an independent measurement.  If several
probes have exactly the same sequence, they simply give more weight to
that sequence's behaviour.  Note that in your example the copies of a
repeated sequence sit next to each other on the chip, which implies
that their intensities will be more similar than usual (less spatial
variation between them).  I think this layout is a bit weird.  I would
(perhaps) worry more if you had a probeset with sequence 1 spotted 10
times and sequences 2-5 each spotted once.  In that case, median
polish will put a lot of weight on sequence 1.
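
To make the weighting point concrete, here is a minimal sketch of
Tukey's median polish applied to a made-up probeset.  It is not the
code rma runs (it skips background correction and quantile
normalization, and the function name, iteration limit and tolerance
are my own assumptions), and all intensities are invented: sequence 1
is flat across three arrays, while sequences 2-5 each rise by one log2
unit per array.

```python
import numpy as np

def median_polish_summary(log_intensity, max_iter=10, tol=1e-8):
    """Simplified Tukey median polish on a probes x arrays matrix.

    Returns one summarized value per array (overall + column effect).
    Hypothetical helper for illustration only, not rma's implementation.
    """
    resid = np.array(log_intensity, dtype=float)
    n_probes, n_arrays = resid.shape
    overall = 0.0
    row_eff = np.zeros(n_probes)   # probe (affinity) effects
    col_eff = np.zeros(n_arrays)   # array (expression) effects
    for _ in range(max_iter):
        # Sweep out row medians, then re-centre the column effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        delta = np.median(col_eff)
        col_eff -= delta
        overall += delta
        # Sweep out column medians, then re-centre the row effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
        delta = np.median(row_eff)
        row_eff -= delta
        overall += delta
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall + col_eff

# Invented log2 intensities: rows are sequences 1-5, columns are 3 arrays.
unique = np.array([
    [9.0, 9.0, 9.0],    # sequence 1: flat across arrays
    [8.0, 9.0, 10.0],   # sequences 2-5: increase across arrays
    [8.5, 9.5, 10.5],
    [7.5, 8.5, 9.5],
    [8.2, 9.2, 10.2],
])

# Same probeset, but sequence 1 is present 10 times instead of once.
weighted = np.vstack([np.tile(unique[0], (10, 1)), unique[1:]])

unique_summary = median_polish_summary(unique)
weighted_summary = median_polish_summary(weighted)
print("each sequence once:   ", np.round(unique_summary, 3))
print("sequence 1 ten times: ", np.round(weighted_summary, 3))
```

With each sequence present once, the summary follows sequences 2-5
(about 8, 9, 10).  With sequence 1 repeated ten times, the summary is
flat (about 9, 9, 9), because the repeated rows dominate every column
median.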

You could argue that identical sequences should be filtered out of the
CDF file, but that would have to be done when you create the CDF
environment; it is not something rma does.  In fact, rma does not
"know" about the actual probe sequences.
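
As a rough illustration of what such a filter could look like before
the CDF environment is built, the sketch below keeps only the first
probe for each distinct sequence in a probeset.  The record layout
(probe_id, sequence, x, y), the function name and the sequences are
all hypothetical; this is not how any existing CDF-building tool
stores its probes.

```python
from collections import namedtuple

# Hypothetical probe record: an identifier, the 25-mer, and chip coordinates.
Probe = namedtuple("Probe", ["probe_id", "sequence", "x", "y"])

def drop_duplicate_sequences(probes):
    """Keep the first probe seen for each distinct sequence, in input order."""
    seen = set()
    kept = []
    for probe in probes:
        if probe.sequence not in seen:
            seen.add(probe.sequence)
            kept.append(probe)
    return kept

# Made-up probeset: p1 and p2 share a sequence but sit at adjacent positions.
probeset = [
    Probe("p1", "ACGTACGTACGTACGTACGTACGTA", 10, 20),
    Probe("p2", "ACGTACGTACGTACGTACGTACGTA", 11, 20),
    Probe("p3", "TTGCATTGCATTGCATTGCATTGCA", 300, 45),
]

filtered = drop_duplicate_sequences(probeset)
print([p.probe_id for p in filtered])
```

Keeping the first copy is an arbitrary choice; averaging the copies
into a single pseudo-probe would be another option, but either way it
has to happen upstream of rma.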

> In the end, what effect would these repeated sequences have on the calculated (median-polished) probeset intensity? If I understand the approach correctly, it will not have a drastic effect on the outcome, since the assumption is that all probes in a given probeset should measure the same intensity...

That is an assumption, and it may or may not hold in the real world.

Kasper