[BioC] Checking RNAseq DESeq pipeline

nathalie nac at sanger.ac.uk
Tue Mar 6 16:54:06 CET 2012


Hi all,
I have done an differential expression analysis using DESeq on a set of 
data and got a lot of differential expressed genes at the end. I would 
like people to check the code and the assumptions I made.

A-Samples
I am analysing 6 samples in total from 3 different cell types (E, S and H),
  S1and S2 ( biological replicates) H1, H2 and H3 ( biological 
replicate) E (no replicate)
We took from one animal  cells called S ( sample S1) we passaged several 
time these cells and isolated samples we called S2 ( sample S2)
from the cell population S ,we took subclones which we cultivated with 
factors to obtain 3 independant cell lines of the same cell type which 
were derived independently from the S population, called H1 H2 and H3 
(biological replicates).
AT last, another cell line was derived from clone H1 which we call E1 
(only sample).

B-standard RNA seq was performed in these 6 samples
AFter sequencing, we aligned the samples using topHat and count reads 
with HTSEQ python script.
Duplicates haven't been removed in this process. I plan to do one 
analysis with removed duplicates.

C-then once in DESeq I have created my count table and followed the 
vignette-I have test the diff Exp of condition F vs H
 > conds
[1] E F F H H H
Levels: E F H
 > head(countsTable)
     E   F F.1    H H.1  H.2
1 789 195 235 1019 938 1048
2   1   0   0   11  17   31
3 543 325 460  818 883  644
4 103  89 103  114  87   96
5 281  41 102  328 245  387
6   1   2   1    9   8    7
 > cds <- newCountDataSet( countsTable, conds )

 > sizeFactors(cds)
         E         F       F.1         H       H.1       H.2
1.0417242 0.6740896 0.8994357 1.2268668 1.1499580 1.1831515

 > cds<-estimateDispersions(cds)
 > str( fitInfo(cds) )
List of 5
  $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 
0.055592 ...
  $ dispFunc       :function (q)
   ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
   .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
   ..- attr(*, "fitType")= chr "parametric"
  $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
  $ df             : int 3
  $ sharingMode    : chr "maximum"

 >head(fData(cds))
   disp_pooled
1  0.06096110
2  0.59234620
3  0.06124489
4  0.07657296
5  0.06688364

 > str( fitInfo(cds) )
List of 5
  $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 
0.055592 ...
  $ dispFunc       :function (q)
   ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
   .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
   ..- attr(*, "fitType")= chr "parametric"
  $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
  $ df             : int 3
  $ sharingMode    : chr "maximum"

 > res <- nbinomTest( cds, "H", "F" )
 > head(res)
   id   baseMean  baseMeanA  baseMeanB foldChange log2FoldChange         
pval
1  1 616.515373 844.007629 275.276988  0.3261546     -1.6163720 6.817493e-06
2  2   9.990057  16.650096   0.000000  0.0000000           -Inf 2.662914e-03
3  3 594.493135 659.634053 496.781759  0.7531172     -0.4090537 2.522510e-01
4  4  99.251992  83.237929 123.273085  1.4809725      0.5665449 1.666241e-01
5  5 196.343734 269.163821  87.113605  0.3236453     -1.6275146 6.472665e-05
6  6   4.857542   6.736313   2.039386  0.3027452     -1.7238239 2.449359e-01
           padj
1 4.775233e-05
2 1.088826e-02
3 5.047015e-01
4 3.642458e-01
5 3.807362e-04
6 4.934803e-01


  plotDE <- function( res )
    plot(
       res$baseMean,
       res$log2FoldChange,
       log="x", pch=20, cex=.3,
       col = ifelse( res$padj < .1, "red", "black" ) )


I have attached  As pdf the plotDE( res ) , hist(res$pval, breaks=100, 
col="skyblue", border="slateblue", main="") and plotDispEsts(cds) for a 
basic idea of the data.


I have ~10k genes which show a v low pvalue for  DExpression which seem 
a lot.

I have several questions,
1-Concerning the experimental design I have described in A , can the 
samples be considered as real bio replicate?
2-Would it be "Better" to get rid of duplicated reads prior to HTSeq
3-How can I be more stringent on the expression?



many thanks
Nat




-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.pdf
Type: application/pdf
Size: 307013 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120306/ffd12aef/attachment.pdf>


More information about the Bioconductor mailing list