[BioC] expresso: performing RMA on NON-Affy data?

Robert Castelo robert.castelo at upf.edu
Tue Apr 28 08:17:28 CEST 2009

hi Jim,

On Mon, 2009-04-27 at 09:36 -0400, James W. MacDonald wrote:
> Hi Robert,
>
> Robert Castelo wrote:
> > hi Jim,
> >
> > the reason is that i'm teaching a course on microarray data analysis to
> > students who are not familiar with statistics beyond the basic
> > descriptive ones. in front of such an audience it has been helpful for me
> > to simulate some data and apply the corresponding analysis technique to
> > it when illustrating how the technique works (after that, we use it on
> > real data). by simulating data, people see explicitly the
> > assumptions made behind the mechanism generating these data so that a
> > fraction of them (which already makes me happy) gets to understand why a
> > particular method works better than another.
>
> That seems a bit backwards to me - there are no assumptions behind the
> mechanism generating these data. They just are what they are. The only
> assumptions being made would be that the data are of a certain
> distribution (or convolution of one or more distributions) when you were
> simulating.

oops, sorry, i meant "people see explicitly the assumptions made behind
the mechanism generating these *simulated* data". these assumptions, as
you point out, are that the data follow a certain distribution (or
convolution...), together with those about the (in)dependencies
among the random variables that are used to sample the data.

of course microarray data are just what they are, but the point i
try to make to my students is that if, for instance, they learn about
assessing differential expression with a Student's t-test (i know they
should use a modified one, etc.) then simulating data that meet the
Student's t-test assumptions means sampling from independent normal
densities, one for each gene.

this kind of exercise, i believe, helps them understand the
assumptions behind the Student's t-test and helps me illustrate
to them the possible pitfalls and disadvantages of that particular
technique. obviously none of this is necessary for people with a good
knowledge of statistics.
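
a minimal sketch of such an exercise in R follows. the number of genes, the group sizes, the number of truly changed genes and the shift `delta` are all arbitrary values chosen for illustration, not taken from any real experiment, and the equal-variance t-test is used only because it matches the simulated model exactly.

```r
set.seed(1)

# Hypothetical simulation settings (assumptions for illustration only)
n_genes     <- 1000  # genes on the simulated array
n_per_group <- 5     # replicates per condition
n_de        <- 50    # genes with a true change in mean
delta       <- 2     # size of that change, in standard deviations

# Independent normal draws: one row per gene, one column per replicate
ctrl <- matrix(rnorm(n_genes * n_per_group), nrow = n_genes)
trt  <- matrix(rnorm(n_genes * n_per_group), nrow = n_genes)

# Shift the mean of the first n_de genes in the treated group
de <- seq_len(n_de)
trt[de, ] <- trt[de, ] + delta

# Classical two-sample Student's t-test, gene by gene
pvals <- sapply(seq_len(n_genes), function(g) {
  t.test(trt[g, ], ctrl[g, ], var.equal = TRUE)$p.value
})

# Fraction of truly changed genes called at the 5% level (power)
power_de <- mean(pvals[de] < 0.05)
# Fraction of unchanged genes called at the 5% level (false positive rate)
fpr_null <- mean(pvals[-de] < 0.05)

c(power = power_de, false_positive_rate = fpr_null)
```

because the data are generated exactly under the test's assumptions, the false positive rate should sit close to the nominal 5%. replacing `rnorm` with a heavier-tailed generator, or making genes dependent, lets students see what changes when those assumptions are broken.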

> > in the particular case below i'd like to make the point on why the
> > median polish summarization method works better than taking the
> > arithmetic mean, illustrating somehow what you wisely said about the
> > mean being not robust to outliers but being uniformly more powerful for
> > Gaussian data, etc etc.
>
> Why not just show some examples of what real data look like? The
> Dilution series contains some of the cleanest data around (as it was a
> spike-in data set run as carefully as possible), and you can easily see
> what I am talking about just by randomly picking a probeset:
>
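> library(affy)
> library(affydata)
> library(lattice)   # for barchart()
> data(Dilution)
> # PM intensities for one probeset: rows are probes, columns are chips
> a <- pm(Dilution, "1007_s_at")
> boxplot(a)
> # overlay the per-chip means on the boxplots
> points(seq_len(ncol(a)), colMeans(a), pch = 20, col = "red", cex = 1.2)
> # reshape to long format for lattice
> nam <- factor(rep(colnames(a), each = nrow(a)))
> probes <- rep(seq_len(nrow(a)), ncol(a))
> dim(a) <- NULL
> b <- data.frame(Values = a, Chip = nam, Probes = factor(probes))
> barchart(Values ~ Probes | Chip, data = b)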

thanks for the example, we already work with spike-in data sets when
looking at why we need to do background correction.

i did not have any such strategy in mind for illustrating the different
approaches to summarization, and that's why i got interested when i saw the
email from Mark with some code illustrating the simulation of probe data for
its summarization.
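
to make the idea concrete, here is a rough sketch of the kind of simulation i have in mind. it is not Mark's code: the additive model (chip effect + probe affinity + noise on the log2 scale), the probe and chip counts, the noise level and the two contaminated probes are all assumptions made up for illustration. it compares the per-chip arithmetic mean with a median polish summary obtained from `medpolish()` in the stats package.

```r
set.seed(1)

# Hypothetical probeset: 11 probes measured on 4 chips (log2 scale)
n_probes  <- 11
n_chips   <- 4
true_expr <- c(8, 8.5, 9, 9.5)           # assumed true chip-level expression
affinity  <- rnorm(n_probes, sd = 1)     # assumed probe-specific affinities

# Additive model: probe affinity + chip effect + small Gaussian noise
y <- outer(affinity, true_expr, "+") +
  matrix(rnorm(n_probes * n_chips, sd = 0.1), nrow = n_probes)

# Contaminate two cells with large outliers (e.g. a scratch or bright spot)
y[3, 2] <- y[3, 2] + 4
y[7, 4] <- y[7, 4] + 4

# Summary 1: arithmetic mean over probes, chip by chip
mean_summary <- colMeans(y)

# Summary 2: median polish; the chip summary is overall + column effect
mp <- medpolish(y, trace.iter = FALSE)
mp_summary <- mp$overall + mp$col

# Compare estimated differences relative to chip 1 with the true ones,
# since the absolute level depends on how probe affinities are centred
rel_error <- function(s) max(abs((s - s[1]) - (true_expr - true_expr[1])))
err_mean <- rel_error(mean_summary)
err_mp   <- rel_error(mp_summary)

c(mean = err_mean, median_polish = err_mp)
```

with the two outliers in place the mean is pulled upwards on the affected chips, while median polish leaves them in the residuals. removing the two contamination lines gives the clean Gaussian case, where the mean is the more efficient of the two.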

cheers,
robert.

> >
> > i know i can download lots of real data, but i don't know how i could
> > demonstrate that one summarization method is better than another with
> > real data. using some QC technique (MA plots..) ?? i'll appreciate any