[BioC] EdgeR GOF non-linear

Sat Oct 20 02:59:08 CEST 2012

Dear Thomas,

Could you please upgrade to the current Bioconductor release?  Your 
questions might not go way, but you will find some improvements and, as a 
general principle, I don't like to spend time discussing older software.

I think you're over-interpreting the gof plots.  You worry that the second 
gof plot looks nonlinear, but to me it's beautifully linear, or as close 
to linear as real data gets.

The gof plots are intended to judge whether the prior.df parameter has 
been chosen too large or small.  They are not intended to judge whether 
you have outlier samples -- this is better done from the MDS plot.  They 
may help to judge whether you have outlier genes -- but this is more 
clearly done from the BCV plot.

The most notable feature of your data appears to be that there are five 
outlier genes, as shown clearly by the BCV plot.  To my mind, the 
appropriate next step would be to identify those genes, to look at the 
counts for these genes, to try to identify why they are outliers, and to 
note which samples they are outliers in.  In other words, the analysis has 
identified features of the biology and the experimental procedure that 
require more thought and attention.

It is these outliers that cause the slight nonlinearity on the far 
right on your gof plots.

We have some methods under development that can automatically identify and 
downweight outlier genes but, with the current software, an alternative 
strategy might be to simply filter out the 5 outlier transcripts and 
repeat the analysis without them.

You might also consider a smaller value for prior.df, although the default 
value might be fine with the outliers gone.

Best wishes
Gordon

PS. I feel that I have already answered your third question, see:

https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-October/048504.html

If you want me to comment more on this, you would need to respond to my 
earlier comments.

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Fri, 19 Oct 2012, Thomas Frederick Willems wrote:

> Hi Gordon,
>
> I've been trying to use edgeR to analyze some mRNA-seq data. I've begun 
> looking at a data set where the cells are either unstimulated or 
> stimulated with one single treatment and I have three replicates for 
> each class. When I use the GLM framework with tagwise dispersion to fit 
> a model for the effects of the single treatment and plot the GOF, it is 
> distinctly non-linear. When I remove one replicate from each class by 
> looking at the replicate most distant from its counterparts in the 
> plotMDS graph, I obtain a plot that appears more linear but still 
> deviates from the diagonal.

> I've attached these gof plot as well as the results from plotBCV for the 
> situation in which 3 replicates were used. I have a few questions 
> regarding this plot:

> 1) Do the combination of gof plots suggest that there are gross batch 
> differences between the replicates, or have you seen other RNA-seq 
> datasets with a fit similar to the first gof plot?

> 2) Is the fit sufficiently poor that any conclusions drawn from the GLM 
> coefficients are likely of little use?

> 3) In another dataset I'm analyzing, the line also lies below the y=x 
> plot but is linear and has the same slope. What sort of issue might this 
> suggest?
>
>
> My sequencing depths for the samples, as given by samples.libsize, are:
> 26693672 61963337 64385230 65093734 79193403 57877925
>
> My code is as follows:
>
> library(edgeR)
> # File containing file names where sample counts can be read
> targets <- read.delim("/home/twillems/Desktop/IL6-paired-end/GLM/files.txt", stringsAsFactors = FALSE)
>
> treatment <- c(0,0,0,1,1,1)
> design <- model.matrix(~treatment)
>
> y <- readDGE(targets)
> colnames(y) <- y$samples[['description']]
> min_counts_per_million <- 0.2
> min_samples <- 3
> y <- y[rowSums(cpm(y) > min_counts_per_million) >= nrow(y$samples), ]
> y$samples$lib.size <- colSums(y$counts)
> y <- calcNormFactors(y)
>
> y <- estimateGLMCommonDisp(y, design)
> y <- estimateGLMTrendedDisp(y, design)
> y <- estimateGLMTagwiseDisp(y,design)
> plotBCV(y)
>
> plotMDS(y)
>
> fit1 <- glmFit(y, design)
> g <- gof(fit1, plot=TRUE)
>
> And the result of sessionInfo():
> R version 2.14.1 (2011-12-22)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
> [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
> [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] edgeR_2.6.12 limma_3.10.3 rkward_0.5.6
>
> loaded via a namespace (and not attached):
> [1] tools_2.14.1
>
> Thanks for your help
>
> Thomas Willems
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}