[BioC] filtering

drnevich at uiuc.edu
Fri Jul 13 18:02:56 CEST 2007


Hi Lev,
  
>   I would like to make some further points regarding
>   filtering. Firstly, the bimodal behaviour of log
>   transformed signals shown in the plots that I have
>   posted (raw and filtered raw,
>   http://tmgarden.cloud.prohosting.com/images/) is
>   probably something specific to AB1700 and some other
>   platforms, not Affymetrix though. Therefore,
>   filtering of Affy data may not be a good idea.
>   Secondly, it just happens that by filtering on
>   signal-to-noise >=3 (threshold specified by ABI to
>   distinguish badly measured signals) I remove the
>   first peak of the distribution. I have observed this
>   phenomenon for many AB1700 datasets and thus think
>   that this first peak corresponding to low
>   signal-to-noise probes is artificial and does not
>   reflect real signal (I may be wrong here). 

Actually, a bimodal distribution is exactly what I would expect to see if a goodly percentage of probes on the array were not expressed in your particular sample. This is very common for whole-genome arrays, and I often see it on Affymetrix arrays, where the total percent present can be as low as 30-40%. Thus your two distributions are the unexpressed probes (effectively "zero" but measured with error) and the expressed probes, which might or might not have a normal distribution. I don't think this is particular to AB1700 datasets, and I don't think the peak is "artificial"; rather, it represents probes that are simply not expressed.
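
As a rough illustration (a toy simulation in R, not your data), a mixture of "background" probes and expressed probes gives exactly this kind of two-peaked histogram:

## Toy example: ~60% unexpressed probes measured with error near
## background, plus ~40% expressed probes spread over a wide range
set.seed(1)
n <- 20000
unexpressed <- rnorm(0.6 * n, mean = 4, sd = 0.5)   # "zeros" + measurement error
expressed   <- rnorm(0.4 * n, mean = 9, sd = 2)     # real signal
hist(c(unexpressed, expressed), breaks = 100,
     xlab = "log2 signal", main = "Unexpressed + expressed probes")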


>   Thirdly,
>   as I pointed out before, low signal-to-noise does not
>   always indicate low raw signal for a probe. My plots
>   clearly show this. Therefore, this is not a case
>   of discarding low-expressed probes from the
>   analysis. I understand that filtering might lead to
>   losing some interesting probes, but this is a
>   trade-off between false positive and false negative
>   results. So, it may save you some money and effort
>   during the validation stages.

Again, I would argue that you are throwing out "zeros", not low-expressed probes. If you were to count, for each probe, how many arrays it was below your filter criterion on, what you would probably find is an extreme bimodal distribution, where most probes are either above background on all arrays or below background on all arrays. I think it's fine to filter out (after normalization) those that are below background on ALL arrays, which can cut out a substantial chunk of probes and save on the FDR correction. Usually there is only a small percentage of probes that are above background on some arrays and below on others. To be conservative, I leave these in, because they will not affect the FDR calculations all that much and I don't want to lose probes that may be off in one treatment and on in another. Sorry I don't have a graph of a typical bimodal distribution of "present" calls to show you, but I'm at home today.
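
In case it helps, here is a minimal sketch of what I mean, assuming you have a matrix of normalized log2 signals (call it exprs.mat) and a logical matrix bad of the same dimensions that is TRUE wherever a probe failed your criterion (e.g. S/N < 3) on that array; both names are just placeholders for whatever objects you actually have:

## Count, per probe, on how many arrays it failed the filter
n.bad <- rowSums(bad)
table(n.bad)                    # typically piles up at 0 and ncol(bad)

## Keep a probe unless it failed on ALL arrays
keep <- n.bad < ncol(bad)
exprs.filt <- exprs.mat[keep, ]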


>   Also, it is often assumed that the log-transformed raw
>   signal is roughly Normal. Is this assumption
>   required for the normalization stage? If so, then
>   removing the peak corresponding to low
>   signal-to-noise should be advantageous.

The log transformation does help to compress the range of expression values and to decrease the mean-variance problem, but I can't remember it being said anywhere that the data should be normal after transformation. Furthermore, normality is not an assumption of normalization; the only requirement is that the distributions for each array should be the SAME, whatever the shape of that distribution. Unless there is something special about AB1700 arrays (I confess I don't have any experience with them), I think the bimodality represents real measured signal on all arrays, and it's better to use all available data for the pre-processing steps; after normalization it's fine to remove probes that fail a conservative filter on ALL arrays. Even if you want to use your filter of removing probes that have >50% "bad" signals within a treatment, use it only if the probe has >50% "bad" signals in ALL treatments.
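
If you do go the treatment-wise route, here is a sketch of that "drop only if >50% bad within EVERY treatment" rule, assuming the same bad matrix as above and a factor treatment giving the group of each array column (again, placeholder names):

## Fraction of "bad" arrays per probe within each treatment group
frac.bad <- t(apply(bad, 1, function(x) tapply(x, treatment, mean)))

## Drop a probe only if it is >50% bad in ALL treatments
drop <- apply(frac.bad > 0.5, 1, all)
exprs.filt2 <- exprs.mat[!drop, ]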

Cheers,
Jenny



>    
>    
>   Jenny Drnevich <drnevich at uiuc.edu> wrote:
>
>     Hi Lev,
>
>     There have been several discussions about when to
>     filter out data on
>     this list previously, and the consensus has been
>     to NOT filter until
>     after all pre-processing steps (e.g.,
>     normalization) have been done.
>     One reason is that one array may have had a higher
>     background than
>     others, and so more data values would be removed
>     in your scheme,
>     which can be problematic for many normalization
>     routines. I also
>     would caution you against removing "badly measured
>     signals" from your
>     data set even after pre-processing. While these
>     numbers may not be as
>     accurate as larger numbers, they represent very
>     low expression or no
>     expression. Would you remove all the zeros from
>     any set of data? My
>     rationale is that had there been distinct
>     expression, you would have
>     measured it; therefore, the low values near
>     background are valid, even if
>     not completely accurate. In the worst-case
>     scenario, you would
>     miss genes that weren't expressed in one treatment
>     but were expressed
>     in another treatment because you were throwing out
>     all the data from
>     the non-expressed treatment. If the signals were
>     "badly measured" in
>     ALL samples, then I would remove that entire probe
>     from the analysis
>     (after pre-processing), but not if they were badly
>     measured in only a
>     few samples.
>
>     That's my two cents,
>     Jenny
>
>     At 08:59 AM 7/12/2007, Lev Soinov wrote:
>     > Dear List,
>     > I have posted a similar question before, but
>     > would like to ask you again about filtering
>     > strategies. I have some AB1700 data and filter on
>     > signal-to-noise ratios before normalization. The
>     > rationale is to get rid of badly measured signals
>     > before actual processing of the data. Two jpg
>     > histograms of log2 signal distributions, before
>     > (raw.jpg) and after (filtered.jpg) filtering, can
>     > be seen in this location:
>     > http://tmgarden.cloud.prohosting.com/images/
>     > Could you please have a look at the distributions
>     > and comment on whether it is correct to filter
>     > before normalization, as this changes the
>     > distribution of the signals a lot?
>     > Thank you very much for your help.
>     > Lev.
>     >
>     >
>     >_______________________________________________
>     >Bioconductor mailing list
>     >Bioconductor at stat.math.ethz.ch
>     >https://stat.ethz.ch/mailman/listinfo/bioconductor
>     >Search the archives:
>     >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>     Jenny Drnevich, Ph.D.
>
>     Functional Genomics Bioinformatics Specialist
>     W.M. Keck Center for Comparative and Functional
>     Genomics
>     Roy J. Carver Biotechnology Center
>     University of Illinois, Urbana-Champaign
>
>     330 ERML
>     1201 W. Gregory Dr.
>     Urbana, IL 61801
>     USA
>
>     ph: 217-244-7355
>     fax: 217-265-5066
>     e-mail: drnevich at uiuc.edu
>
>
>
>     ------------------------------------------------
>
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801

ph: 217-244-7355
fax: 217-265-5066 
e-mail: drnevich at uiuc.edu


