[BioC] edgeR outlier question

Tue May 8 14:35:36 CEST 2012

Hi - actually this problem pops up even with as few as two replicates, and I
would tend to attribute it to a technical feature of NGS (at least the
platforms with an em-PCR step, such as 454 and SOLiD): over-amplification of
a certain sequence set in a single sample. This could be called shot noise, I
guess. I saw it in both SAGE and miRNA sequencing, in multiple samples.

I agree of course in principle on not throwing away genes for what happens
sporadically in one sample. However, in my experience these 'read shots'
always happen in the very grey area of few reads per sample, and if you
reason in cpm this will be the area below 10 counts per million. I don't
know if this is the same situation for you.

So, these are genes usually located in the area where biological variance is
well hidden below technical variance. I guess that these will not be your
most significant findings. The solution of reasoning in edgeR in terms of cpm
for the threshold selection, rather than read counts (even in normalized
libraries), worked nicely for my miRNAs when I went back to MDS plots to
explore the situation...
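
To make the cpm-based threshold concrete, here is a minimal sketch in plain
Python rather than edgeR itself. The gene names, counts, the 10 cpm cutoff and
the "at least 2 samples" rule are all made up for illustration, and the
helper names cpm() and filter_by_cpm() are hypothetical, not edgeR functions
(edgeR does provide its own cpm() in R).

```python
def cpm(counts, lib_sizes):
    """Convert a {gene: [count per sample]} table to counts per million."""
    return {
        gene: [1e6 * c / n for c, n in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }


def filter_by_cpm(counts, min_cpm=10.0, min_samples=2):
    """Keep genes reaching min_cpm in at least min_samples samples.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    # Library size = total reads per sample (column sums of the table).
    lib_sizes = [sum(col) for col in zip(*counts.values())]
    cpms = cpm(counts, lib_sizes)
    return [
        gene
        for gene, row in cpms.items()
        if sum(value >= min_cpm for value in row) >= min_samples
    ]


# Toy table: four samples, each with exactly one million reads in total.
counts = {
    "miR-A": [500_000, 480_000, 510_000, 495_000],
    "miR-B": [499_990, 519_995, 489_900, 504_992],
    "miR-shot": [2, 1, 100, 3],  # a 'read shot' in the third sample only
    "miR-low": [8, 4, 0, 5],
}

kept = filter_by_cpm(counts)
print(kept)
```

In this toy table a raw-count rule such as "at least one read in two samples"
would keep miR-shot, while the cpm rule drops it because only one sample
reaches the cutoff.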

This is only my experience, though, so I would be interested to know if this
'read shot noise' also happens in areas where there are large counts.

Regards,

Alessandro

On 5/8/2012 3:22 AM, Simon Melov wrote:
> Hi Alessandro,
> I don't think this helps me, as I'm not looking to eliminate an entire gene based on a single replicate. I mentioned in my original post that I had applied the filtering discussed at length in the guide (allowing genes with at least one read in a minimum of 8 samples was my filtering criterion). But this doesn't address the problem of a very high level of reads in a single sample. This issue of variance should be incorporated into the analysis, and not result in genes being listed as significant due to high levels in a single sample. This sort of problem is not unusual in the genomics world, and I think the microarray literature had numerous solutions to this sort of problem. I'm surprised it popped up so early in my analysis, as I thought this would have been "solved" by now. As a later poster alluded to, perhaps it's due to a relatively "high" number of biological replicates (N=10 per group). This number of replicates going forward is going to be commonplace as sequencing costs tumble. So some guidance as to how to deal with this in edgeR would be very welcome.
>
> thanks
>
> Simon.
>

-- 

Alessandro Guffanti - Head, Bioinformatics, Genomnia srl
  Via Nerviano, 31 - 20020 Lainate, Milano, Italy
     Ph: +39-0293305.702 Fax: +39-0293305.777
             http://www.genomnia.com
"When you're curious, you find lots of interesting things to do."
(Walt Disney)
