[BioC] Large # of significant genes with SAM

Tue May 10 18:59:58 CEST 2005

Dear all,

A few clarifications regarding my previous posting:

1- our cDNA data correlates at 0.72 (comparing >6000 genes averaged
over patients) with data from another completely independent group
using Affy U133 chips and different lab technicians, pathologists and
samples.

2- running SAM @ q<0.05, more than 30% of the genes are called
significant in *both* data sets. Affy data were normalised with MAS5,
I don't have access to the CEL files.

3- 7/7 genes with average fold-change >2.0, and high SAM rank, were
confirmed by RT-PCR in our data set

4- RT-PCR gave mixed results for two genes ranking high with SAM but
with fold-change 1.6. By mixed results I mean that the RT-PCR data are
clearly correlated with the microarray data but give lower fold-changes.

5- I searched for spatial biases with box-plots, and did find some,
but of a magnitude far too small to explain the 30% result.

6- we are talking about paired sample SAM comparisons. I call SAM with

cl <- rep(1, N) #paired samples
sam1 <- sam(exprs, cl, B=1000, rand=123, q.version=1)
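A note on the all-ones `cl`: as far as I understand the siggenes
documentation, this requests a one-class analysis, which amounts to a
paired analysis only if each column of `exprs` already holds a
within-pair log-ratio (tumour minus normal). The sketch below uses
simulated data and base R only; `tumour`, `normal` and `M` are
hypothetical names, not objects from my analysis. It merely checks that
a one-sample t-statistic on the differences equals the paired
t-statistic on the two original matrices.

```r
# Simulated data only; 'tumour' and 'normal' are hypothetical names.
set.seed(1)
n_genes <- 5
n_pairs <- 8
normal <- matrix(rnorm(n_genes * n_pairs), n_genes, n_pairs)
tumour <- normal + matrix(rnorm(n_genes * n_pairs, mean = 0.5),
                          n_genes, n_pairs)

# One-class view: one log-ratio per pair, tested against zero.
M <- tumour - normal
t_one <- apply(M, 1, function(m) t.test(m)$statistic)

# Paired two-class view on the original matrices.
t_paired <- sapply(seq_len(n_genes), function(g)
  t.test(tumour[g, ], normal[g, ], paired = TRUE)$statistic)

# Both views give the same statistic for every gene.
stopifnot(isTRUE(all.equal(unname(t_one), unname(t_paired))))
```

If `exprs` instead held separate tumour and normal columns, my
understanding is that the paired design would have to be encoded in
`cl` itself (see the `cl` argument in the siggenes help page).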

To summarize, the data seem correct. The questions are whether SAM is
appropriate for these and other data sets, whether q-values mean what
they are supposed to mean, how relevant it is to call a gene
regulated on a purely statistical basis, etc.

>Since SAM computes a regularised t-statistic, I think, you should
>also check that the normal-distribution assumption does at least
>approximately hold.

I thought SAM uses a permutation-based null distribution of the
moderated t-statistics precisely to avoid assumptions about this
distribution. Am I missing something here?
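To make clear what I mean by a permutation-based null, here is a
minimal base-R sketch for the one-class (paired) case, where, as I
understand it, "permuting" means randomly flipping the sign of whole
pairs. It is not the siggenes implementation: the data are simulated,
`sam_like` is a hypothetical helper, and the fudge factor `s0` is fixed
arbitrarily here, whereas SAM estimates it from the data.

```r
# Simulated log-ratios with no true signal; all names are illustrative.
set.seed(123)
n_genes <- 200
n_pairs <- 10
M <- matrix(rnorm(n_genes * n_pairs), n_genes, n_pairs)

# SAM-like moderated statistic: mean / (standard error + s0).
sam_like <- function(M, s0) {
  se <- apply(M, 1, sd) / sqrt(ncol(M))
  rowMeans(M) / (se + s0)
}

s0 <- 0.1                      # arbitrary fudge factor (assumption)
d_obs <- sam_like(M, s0)       # observed statistics

B <- 100                       # number of sign-flip "permutations"
d_null <- replicate(B, {
  # Flip the sign of whole columns (pairs), then recompute and sort.
  signs <- sample(c(-1, 1), n_pairs, replace = TRUE)
  sort(sam_like(sweep(M, 2, signs, `*`), s0))
})

# Expected ordered null statistics: average each order statistic over B.
d_expected <- rowMeans(d_null)
```

The sorted `d_obs` would then be compared with `d_expected`, so no
normal-theory reference distribution enters at any point.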

Thank you all for your input!

Vincent

On Tue, 10 May 2005, Sean Davis wrote:

> > settings or hybridisation protocols etc. I would check if after
> > normalisation such large differences between the groups are obvious by
> > using boxplots, Scatter-Plots etc. (many examples for such control
> > procedures can be found on the Bioconductor website , especially on
> > the pages containing material for courses and workshops). If so, you
> > might think about other methods for normalisation or combining the two
> > groups data in another way, if they happen to be too different.
> > Another reason for large differences could be that there might really
> > be huge biological differences between the two groups. For instance,
> > when analyzing T- versus B-lymphocytes, one usually observes large
> > percentages > 20% of differentially expressed genes, since in that
> > case we were comparing very different cell types with each other.
> > However, I would not expect such striking differences between a tumour
> > and the related physiological tissue.
>
> Vincent,
>
> Actually, having a large proportion of differentially-expressed genes
> between tumor and normal is certainly possible.  You got the same
> results with two different data sets if I read your original post
> correctly, so go back to check quality of data, statistical biases,
> etc., but it seems quite possible that your results are correct.  You
> will, of course, have to think about validation strategies, but....
>
> Sean
>

Vincent Detours, Ph.D.
IRIBHM
Bldg C, room C.4.116
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4220
Fax: +32-2-555 4655

E-mail: vdetours at ulb.ac.be

URL: http://homepages.ulb.ac.be/~vdetours/