[BioC] Illumina Methylation. Normalization and statistics

Thu Nov 20 15:43:03 CET 2008

Sean Davis <sdavis2 at ...> writes:

> 
> On Wed, Nov 19, 2008 at 10:18 AM, Michael Walter <
> michael.walter <at> med.uni-tuebingen.de> wrote:
> 
> > Dear List,
> >
> > We ran our first slide of Illumina's Infinium methylation arrays. After
> > searching the archive, I still have some general questions about how best
> > to analyze the data.
> >
> > First of all, I would like to hear some opinions on normalization. In my
> > personal and probably simplistic view, normalization should not be
> > necessary, since the value you get from the array is a ratio that is sample
> > inherent (unlike a classical two-color expression array where you mix two
> > samples to generate the expression ratio). Is this assumption correct or am
> > I missing some important aspect?
> >
> 
> Unfortunately, there is a significant dye-bias issue.  That is, there is a
> propensity for one dye to be brighter than the other and it appears that
> Illumina does not adequately correct for this bias.
> 
> >

I agree that dye bias could be a problem, but in my case, confirmatory
bisulfite sequencing of interesting probes reported methylation values very
close to those from the methylation array, so I stopped worrying about this.

> > Anyway, I'd like to perform background normalization, which, as usual
> > with Illumina arrays, results in some negative values. Does anyone have a
> > neat solution for this problem, or shall I just skip the probes?
> >
> 
> I have just been ignoring those probes.
> 
> >
> > Do I have to correct for some dye effect, as for the GoldenGate
> > methylation assay? Since the probes for methylated and unmethylated DNA
> > incorporate the same dye, this shouldn't be an issue?
> >
> 
> See above.
> 
> >
> > My final question is basically the most pressing: What kind of statistical
> > test should I use? Since all the values are ratios between 0 and 1, I have a
> > really bad feeling about simply running some t-tests. And if a t-test is the
> > proper choice, shall I log-transform the data?
> >
> 
> The t-statistic should still be valid, I think.  The assumptions that go
> into statistics like the t-stat are not based on the distribution of the
> data, but on differences between values.  I think these assumptions probably
> still hold in practice for these data.  However, I have not tried to prove
> things one way or the other.  Of course, if you are concerned,
> non-parametric testing will alleviate these concerns.
> 
> Sean
> 

I too had the same dilemma over which statistic to use with the GoldenGate
Methylation array. As I understand it, for a t test to be valid, the underlying
population has to be normally distributed. This is manifestly not the case
for the majority of probes, at least on the GoldenGate methylation array, not
least because, as you say, the distribution is constrained between values of 0
and 1, with most probes being unmethylated (close to 0) or methylated (close to
1). The distribution is best described by a beta distribution or a mixture of
beta distributions, depending on whether the probe distribution is uni- or
bi-modal.
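
To make the beta-distribution point concrete, here is a minimal sketch of a
method-of-moments fit for a single, unimodal probe. The function name
fit_beta_moments and the example values are made up for illustration and are
not taken from any real array; a bimodal probe would need a mixture fit, which
this does not attempt.

```python
from statistics import mean, variance


def fit_beta_moments(betas):
    """Method-of-moments estimate of Beta(a, b) for one probe.

    `betas` is a list of methylation values strictly between 0 and 1.
    Assumes a unimodal probe; mixtures are not handled here.
    """
    m = mean(betas)
    v = variance(betas)  # sample variance (n - 1 denominator)
    # The moment equations only have a solution when 0 < v < m * (1 - m).
    if not 0 < v < m * (1 - m):
        raise ValueError("variance incompatible with a beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


# Hypothetical, mostly unmethylated probe (values invented for illustration).
example = [0.02, 0.05, 0.03, 0.10, 0.04, 0.06]
a, b = fit_beta_moments(example)
print(f"a = {a:.2f}, b = {b:.2f}")
```

For a probe bunched near 0 the fit gives a much smaller than b, i.e. a
strongly skewed distribution, which is the shape that makes the normality
assumption look doubtful.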

Therefore, I used a Mann-Whitney test followed by filtering on the basis of the
magnitude of difference in methylation value between clusters to identify
interesting probes.
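
As a rough sketch of that two-step procedure (not my actual script), the
following uses only the Python standard library. The p-value comes from the
normal approximation with tie and continuity corrections, which is crude for
very small groups. The function names, the 0.05 cutoff, the 0.2 minimum
difference, and the use of group means for the difference are all assumptions
chosen for illustration, and the probe IDs and values are invented.

```python
import math
from statistics import mean


def rank_with_ties(values):
    """Return 1-based average ranks and the tie term sum(t**3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation). Returns (U, p)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_term = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # all values identical, no evidence either way
        return u1, 1.0
    z = max((abs(u1 - mu) - 0.5) / math.sqrt(var), 0.0)
    return u1, math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


def interesting_probes(group_a, group_b, alpha=0.05, min_delta=0.2):
    """Keep probes that pass the test AND differ by at least min_delta."""
    hits = []
    for probe in group_a:
        _, p = mann_whitney_u(group_a[probe], group_b[probe])
        delta = mean(group_b[probe]) - mean(group_a[probe])
        if p < alpha and abs(delta) >= min_delta:
            hits.append((probe, delta, p))
    return hits


# Invented example data: probe ID -> methylation values per sample.
cluster_a = {"probe_1": [0.05, 0.08, 0.04, 0.07, 0.06, 0.09],
             "probe_2": [0.10, 0.11, 0.12, 0.13, 0.14, 0.15]}
cluster_b = {"probe_1": [0.85, 0.90, 0.88, 0.92, 0.87, 0.91],
             "probe_2": [0.16, 0.17, 0.18, 0.19, 0.20, 0.21]}
for probe, delta, p in interesting_probes(cluster_a, cluster_b):
    print(probe, round(delta, 3), p)
```

The second filter matters because a rank test can give a small p-value for a
shift that is perfectly consistent but too small to be of biological
interest. If SciPy is available, scipy.stats.mannwhitneyu could be used in
place of the hand-rolled test.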

Having said that, carrying out t tests did identify essentially the same set of
interesting probes...

Good luck with your analysis, 

and I'd be interested in hearing whether people agree/disagree with the t test
question.

Best wishes,

Ed Schwalbe

> >
> > Any input and shared experience with this type of array is highly
> > appreciated.
> >
> >
> > Best Regards,
> >
> >
> > Mike
> > --
> > Dr. Michael Walter