[BioC] Data filtering

Fri Oct 19 07:42:42 CEST 2012

Dear Anand,

One subtle thing: from the scale of the plot, my guess is that you ran plotMDS() on a matrix of raw counts?  Note that this can give a different result (and is computationally very different) from running plotMDS() on a DGEList object, which does a normalized, count-specific calculation.  Here is an example:

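library(edgeR)
# simulate 1000 genes x 6 samples of negative binomial counts
counts <- matrix(rnbinom(6000, size = 1/2, mu = 10), 1000, 6)
# shift the first 200 genes upward in samples 4-6 to create a group difference
counts[1:200, 4:6] <- counts[1:200, 4:6] + 10
y <- DGEList(counts)
cols <- rep(c("black", "blue"), each = 3)
par(mfrow = c(1, 2))
plotMDS(y, col = cols)         # left: DGEList method
plotMDS(y$counts, col = cols)  # right: raw count matrix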

I'm not sure how much that will change your result, but at least it's on a more appropriate scale to think about the decision to exclude samples and so on.
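
To make the difference a bit more concrete, here is a minimal sketch under one assumption: that the DGEList method (in recent edgeR versions, as far as I know) works from log2 counts-per-million rather than from the raw counts. The seed, the sample names and the use of cpm() with prior.count = 2 are choices made for this illustration only and are not taken from your analysis.

```r
library(edgeR)

set.seed(1)  # arbitrary seed, only so the simulated counts are reproducible
counts <- matrix(rnbinom(6000, size = 1/2, mu = 10), 1000, 6)
counts[1:200, 4:6] <- counts[1:200, 4:6] + 10
colnames(counts) <- paste0("S", 1:6)  # hypothetical sample names

y <- DGEList(counts)
y <- calcNormFactors(y)  # TMM normalization factors

# log2 counts-per-million, computed with the normalized library sizes
# (prior.count = 2 is an assumption made for this sketch)
logcpm <- cpm(y, log = TRUE, prior.count = 2)

cols <- rep(c("black", "blue"), each = 3)
par(mfrow = c(1, 3))
plotMDS(y, col = cols)         # DGEList method
title("DGEList")
plotMDS(logcpm, col = cols)    # explicit log-CPM matrix
title("log-CPM matrix")
plotMDS(y$counts, col = cols)  # raw counts, for contrast
title("raw counts")
```

If that assumption holds for your edgeR version, the first two panels should look much alike, and both will be on a log2 scale, whereas the raw-count panel has axes in count units and is dominated by library size and by the few most highly expressed genes.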

Mark

On 16.10.2012, at 00:50, Anand K S Rao wrote:

> 
> 
> On Wed, Oct 10, 2012 at 3:06 AM, Mark Robinson <mark.robinson at imls.uzh.ch> wrote:
> Hi Anand,
> 
> I've added a few "reactions" below; I hope it can help.
> 
> 
> > Greetings friends!
> >
> > I seek help with data that I have: 3 time points, 3 genotypes, and 3 replicates for each of these = 27 libraries.
> >
> > The goal is to find genes that have different time expression profiles amongst 2 or more genotypes.
> 
> > After our first round of data analysis (including TMM normalization), the time course graphs and box plots were so noisy, with high standard error at each time point, that it was hard to say whether the expression profile of one genotype overlapped with or was distinct from those of the other genotypes! R code is attached at the bottom of this post.
> 
> What did you actually plot?  What did an MDS plot look like?
> 
> 
> 
> Hello Mark,
> 
> Per your advice, we made the MDS plots. The attached MDS plot is for one genotype only, with 9 time points and 4 replicates per time point.
> We have not yet generated an MDS plot for the across-genotype data; that is our next step.
> 
> But even now, it looks like there is quite a bit of variability among replicate libraries.
> 
> It looks like for each time point we need to remove one or more libraries that are outliers. I suppose there are a few different ways to accomplish this:
> 
> 1. Remove just one library that is an outlier, like T0.2?
> 
> 2. Remove entire time points because of the scatter of the reps, like T0.1, T0.2, T0.3 and T0.4, which are all quite distant from each other on this MDS plot?
> 
> 3. Remove an entire replicate and retain the others? In our data I think replicate 2 is different from the other three reps, but I don't think this MDS plot shows that, does it? A simple hierarchical clustering of the 9 time points * 4 reps = 36 libraries is attached. There you can see behavior similar to that in the MDS plot, though the visualizations are different.
> 
> How do you reckon we should remove the 'noisy' data, if we should do it at all?
> 
> Thanks again.
> 
> - Anand
>  
> <MDS_plots_A17_9timepoints_4repseach.pdf><heatmap_gini_a17.pdf>