[BioC] replicates and low expression levels

Mon Jun 2 10:25:57 MEST 2003

On Mon, 2 Jun 2003, Gordon Smyth wrote:

> 
> >On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> > > Hi,
> > > Just a quick question about low expression levels on Affy systems - I 
> > hope it's not too off-topic; it is about normalisation and data analysis...
> > > I've heard a lot of people advocating that it's a good idea to perform 
> > an initial filtering on either Present Marginal or Absent calls, or on 
> > gene-expression levels (so that only genes with an expression > 40, say, 
> > after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the 
> > further analysis). Firstly, am I right in thinking that this is to 
> > eliminate data that are too close to the background noise level of the system?
> > >
> > > I wanted to canvass opinion as to whether people feel we need to do this 
> > if we have replicates and are using statistical tests - rather than just 
> > fold-changes - to identify 'interesting' genes. Does the statistical 
> > testing do this job for us?
> >
> >Hi,
> >   In my opinion you should always do some sort of non-specific
> >   filtering. What you have described is one form of it, others include
> >   removing genes that show little or no variability across samples.
> >   I think of non-specific filtering as filtering without reference to
> >   phenotype (of any sort).
> >
> >   There are a number of reasons for doing this, some motivated by the
> >   biology and some by the statistics.
> >
> >   First off, especially for Affy, the chip is designed for all tissue
> >   types but a commonly held belief is that only about 40% of the genome
> >   is expressed in any specific tissue type. So, for any experiment you
> >   will have a pretty large number of probes for genes that are not
> >   expressed in the tissue you are looking at.
> >
> >   From a statistical perspective you need to be a little bit cautious
> >   if you are going to standardize genes across samples (this is pretty
> >   common). If you do not remove those genes that show little
> >   variability before standardization then you have just elevated the
> >   noise to the same status as the signal (and if the 40% estimate is
> >   right then you actually have more noise than signal - not too
> >   pleasant).
> >
> >   Using a test statistic (such as a t-test) does not help, since that
> >   measures the between-group differences relative to the variation. So
> >   if there is very little variation and a small difference in mean,
> >   you get an enormous t-statistic and a small p-value. Of course,
> >   in this case looking at the "fold-change" or the size of the effect
> >   will indicate a problem, but not many people check all the things
> >   that need checking (and what to check depends on the test that
> >   you have just carried out). It seems to me to be much easier to just
> >   filter those genes with no expression or little variation out at the
> >   very start.
> 
> All good points. One thing that does help though is to use a t-statistic 
> (or F or posterior odds or whatever) in which some form of shrinkage to a 
> common value has been applied to the standard deviations. This has the 
> effect of offsetting the smaller sample variances to be not less than a 
> certain size. We have found that empirical Bayes t-statistics do a good job 
> of eliminating the low-signal, low-variability genes without needing an 
> explicit filtering step.
> 
> I have also wondered about the biological argument that many genes might 
> not be represented in a particular sample, and whether this means that 
> non-specific filtering should be applied. I guess the reason that I don't 
> do it at the moment is that I'm somewhat uneasy about possible selection 
> bias in the filtered intensities and standard deviations. Another factor 
> which allows us to avoid non-specific filtering is the use of background 
> correction methods which ensure that the lower intensities are not 
> especially variable.

I agree with Gordon. Even if filtering is the best thing to do in theory,
there is a possibility that the statistical error introduced by filtering
makes the overall error worse. If you want to see this for yourself, run
Affymetrix P/A calls (or any other filtering technique you know of) on
the genes known to be spiked in in the Affymetrix Spike-In data. You will
see that a considerable number of those with low nominal concentrations get
A calls. Also, some of the probesets known not to be present
(spiked-in concentration = 0 pM) get P calls.
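
A minimal sketch of that check, in Python for illustration only. The
function names and the toy inputs are hypothetical, not real Spike-In
results; with real data you would supply the known nominal concentrations
and the calls your filter produced.

```python
from collections import Counter


def tabulate_calls(nominal_pm, calls):
    """Count detection calls (P, M, A) at each nominal concentration."""
    table = {}
    for conc, call in zip(nominal_pm, calls):
        table.setdefault(conc, Counter())[call] += 1
    return table


def filtering_errors(table):
    """Return (false_absent, false_present).

    false_absent:  spiked-in probesets (concentration > 0) called A.
    false_present: probesets with concentration 0 called P.
    """
    false_absent = sum(c["A"] for conc, c in table.items() if conc > 0)
    false_present = sum(c["P"] for conc, c in table.items() if conc == 0)
    return false_absent, false_present


# Hypothetical toy input, made up only to show the bookkeeping.
nominal_pm = [0.0, 0.0, 0.25, 0.25, 0.5, 1.0, 16.0, 128.0]
calls = ["A", "P", "A", "A", "M", "P", "P", "P"]

table = tabulate_calls(nominal_pm, calls)
false_absent, false_present = filtering_errors(table)
print("spiked-in but called A:", false_absent)
print("not spiked-in but called P:", false_present)
```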

When using MAS 5.0 signal and fold-change as a cut-off you absolutely need
the P/A calls. In my opinion this is a problem with the statistics behind
MAS 5.0 signal, which results in many large fold changes that shouldn't be
there for low-expressed genes. The suggestions given by Gordon, as well as
those given in my previous response to Crispin's message, solve this
problem not by using P/A calls but by using better pre-processing and/or
better statistical tests.
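
To make the shrinkage idea concrete, here is a rough sketch in Python. It
is an illustration of shrinking a gene's variance toward a common value,
not the actual empirical Bayes implementation: the prior variance and
prior degrees of freedom are assumed fixed numbers here, whereas an
empirical Bayes method would estimate them from all the genes. The
function names and the example values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_variance(x, y):
    """Ordinary pooled sample variance for two groups."""
    dx, dy = len(x) - 1, len(y) - 1
    return (dx * variance(x) + dy * variance(y)) / (dx + dy)


def t_statistics(x, y, prior_var, prior_df):
    """Return (ordinary_t, moderated_t) for a two-group comparison.

    The moderated version replaces the gene's own variance with a
    weighted average of that variance and prior_var, so a very small
    sample variance cannot produce a huge statistic on its own.
    """
    d = len(x) + len(y) - 2
    s2 = pooled_variance(x, y)
    s2_shrunk = (prior_df * prior_var + d * s2) / (prior_df + d)
    scale = math.sqrt(1 / len(x) + 1 / len(y))
    diff = mean(x) - mean(y)
    ordinary = diff / (math.sqrt(s2) * scale)
    moderated = diff / (math.sqrt(s2_shrunk) * scale)
    return ordinary, moderated


# Hypothetical log2 intensities for a low-variability gene with a tiny
# difference in means between two groups of three arrays.
x = [5.00, 5.01, 5.02]
y = [5.10, 5.11, 5.12]

# prior_var and prior_df are assumed values chosen for illustration.
ordinary, moderated = t_statistics(x, y, prior_var=0.04, prior_df=4)
print("ordinary t:", round(ordinary, 2))
print("moderated t:", round(moderated, 2))
```

With these made-up numbers the ordinary statistic is large in absolute
value while the moderated one is small, which is the effect Gordon
describes: the low-variability gene no longer stands out.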

> 
> Just some other thoughts.
> 
> Cheers
> Gordon
> 
> >   If they don't show any variation across samples they can't help to
> >   classify or to cluster (there is no information about any phenotype
> >   contained in them).
> >
> >   Robert
> >
> >
> > >
> > > Crispin
> > >
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
> >
> >--
> >+---------------------------------------------------------------------------+
> >| Robert Gentleman                 phone : (617) 632-5250                   |
> >| Associate Professor              fax:   (617)  632-2444                   |
> >| Department of Biostatistics      office: M1B20                            |
> >| Harvard School of Public Health  email: rgentlem at jimmy.harvard.edu     |
> >+---------------------------------------------------------------------------+
> 
>