[BioC] False positives due to GC content correction - DESeq2

Sat Aug 9 20:05:10 CEST 2014

Hi Michael,

Yes, the regions added after GC correction are mostly regions with very low read counts. Some correspond to genes/regions that I know beforehand are not different; others mark regions that show no difference in read count when I look at the bedgraph tracks.
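
One way to put a number on the "very low read count" observation is to compare the mean count of the regions called only after GC correction with the regions called in both runs. This is only a sketch on made-up values: res_df, sig_gc and sig_nongc are hypothetical stand-ins for a table built from the two results() objects, not the real data.

```r
# Hypothetical per-region summary; in a real analysis baseMean would come
# from results(dds) and the two flags from padj < 0.1 in each run.
res_df <- data.frame(
  baseMean  = c(3.2, 850, 5.1, 1200, 2.4, 40, 610, 4.8),
  sig_gc    = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
  sig_nongc = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  row.names = paste0("region", 1:8)
)

# Label each region by the run(s) in which it was called significant.
res_df$group <- with(res_df,
  ifelse(sig_gc & sig_nongc, "both",
  ifelse(sig_gc, "gc_only",
  ifelse(sig_nongc, "nongc_only", "neither"))))

# Median mean count per group; a much lower median in "gc_only" than in
# "both" would support the low-count observation.
med <- tapply(res_df$baseMean, res_df$group, median)
print(med)
```

If the gc_only group sits far below the others, the extra calls are concentrated in low-count regions, which is where a per-region normalization factor has the most leverage.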

Aditi
________________________________________
From: Michael Love [michaelisaiahlove at gmail.com]
Sent: Saturday, August 09, 2014 5:31 AM
To: QAMRA Aditi (GIS)
Cc: bioconductor at r-project.org
Subject: Re: False positives due to GC content correction - DESeq2

hi Aditi,

Your code looks correct to me. Also, the normalization factors are
correctly taking sequencing depth into account, which is what I wanted
to check by looking at scatterplots of normalized counts for pairs
of samples. I took a look at the results and, as you say, I also see
additional genes after using GC correction:

> res <- results(dds)
> res2 <- results(dds2_nongc)
> table(gc.correct=res$padj < .1, no.correct=res2$padj < .1)
          no.correct
gc.correct FALSE  TRUE
     FALSE 20810   143
     TRUE    368   472
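
To pull out the regions behind the off-diagonal cells of a table like this, the two padj vectors can be compared directly. The sketch below uses made-up adjusted p-values (padj_gc and padj_nongc are hypothetical stand-ins for res$padj and res2$padj). It treats NA as "not significant", since padj can be NA (for example for rows removed by independent filtering) and table() drops NA by default.

```r
# Hypothetical adjusted p-values, named by region/gene.
padj_gc    <- c(g1 = 0.01, g2 = 0.50, g3 = 0.04, g4 = NA,   g5 = 0.20, g6 = 0.03)
padj_nongc <- c(g1 = 0.02, g2 = 0.60, g3 = 0.30, g4 = 0.05, g5 = 0.01, g6 = NA)

# Treat NA explicitly as "not significant" so no row silently disappears.
sig_gc    <- !is.na(padj_gc)    & padj_gc    < 0.1
sig_nongc <- !is.na(padj_nongc) & padj_nongc < 0.1

# Same layout as the cross-tabulation above.
tab <- table(gc.correct = sig_gc, no.correct = sig_nongc)
print(tab)

# Regions gained and lost by applying the GC correction.
gained <- names(padj_gc)[sig_gc & !sig_nongc]
lost   <- names(padj_gc)[!sig_gc & sig_nongc]
```

Here gained corresponds to the TRUE/FALSE cell (368 in the real table) and lost to the FALSE/TRUE cell (143).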

Ideally, additional genes can show up as significant if we have
reduced technical noise by modeling the normalization factors with
technical covariates such as GC content. But you suspect these new
genes. Can you explain how you know that these are false positives?
And is it only the genes added after GC correction that are enriched
for false positives?
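
For readers following along, here is a minimal sketch of the arithmetic usually used to turn log-scale offsets into per-region, per-sample normalization factors, following the recipe in the DESeq2 vignette. The offsets matrix is made up, and this is not necessarily identical to the attached code.

```r
# Hypothetical log-scale offsets (regions x samples), standing in for the
# offset matrix produced by a GC-content normalization step.
set.seed(1)
offsets <- matrix(rnorm(12, sd = 0.3), nrow = 3,
                  dimnames = list(paste0("region", 1:3), paste0("s", 1:4)))

# Convert offsets to multiplicative factors.
norm_factors <- exp(-1 * offsets)

# Rescale each row to geometric mean 1, so the factors change the relative
# scaling between samples but not the overall mean count of a region.
norm_factors <- norm_factors / exp(rowMeans(log(norm_factors)))

# Check: row-wise geometric means should all equal 1.
geo_means <- exp(rowMeans(log(norm_factors)))
print(geo_means)

# With a DESeqDataSet 'dds' of matching dimensions (assumed, not built here):
# normalizationFactors(dds) <- norm_factors
```

Because the factors are specific to each region and sample, a region with only a handful of reads can have its normalized counts shifted noticeably, which is one reason to look closely at low-count regions among the new calls.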

Mike

On Fri, Aug 8, 2014 at 2:29 PM, QAMRA Aditi (GIS)
<qamraa99 at gis.a-star.edu.sg> wrote:
> Hi Mike,
>
> Sorry, it seems my message got cut off midway. What I was saying was that I don't understand how I can estimate what the source of these false positives could be. Yes, these are regions that I know are not differentially expressed.
>
> I've attached the code for the analysis as well as the dispersion plots.
>
> Session Info -
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] EDASeq_1.10.0           aroma.light_2.0.0       matrixStats_0.10.0
>  [4] ShortRead_1.22.0        GenomicAlignments_1.0.3 BSgenome_1.32.0
>  [7] Rsamtools_1.16.1        Biostrings_2.32.1       XVector_0.4.0
> [10] BiocParallel_0.6.1      Biobase_2.24.0          DESeq2_1.4.5
> [13] RcppArmadillo_0.4.320.0 Rcpp_0.11.2             GenomicRanges_1.16.3
> [16] GenomeInfoDb_1.0.2      IRanges_1.22.10         BiocGenerics_0.10.0
> [19] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
>  [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3
>  [4] BBmisc_1.7           bitops_1.0-6         brew_1.0-6
>  [7] checkmate_1.2        codetools_0.2-8      DBI_0.2-7
> [10] DESeq_1.16.0         digest_0.6.4         fail_1.2
> [13] foreach_1.4.2        genefilter_1.46.1    geneplotter_1.42.0
> [16] grid_3.1.0           hwriter_1.3          iterators_1.0.7
> [19] lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1
> [22] RColorBrewer_1.0-5   R.methodsS3_1.6.1    R.oo_1.18.0
> [25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.0
> [28] stats4_3.1.0         stringr_0.6.2        survival_2.37-7
> [31] tools_3.1.0          XML_3.98-1.1         xtable_1.7-3
> [34] zlibbioc_1.10.0
>
> ________________________________________
> From: Michael Love [michaelisaiahlove at gmail.com]
> Sent: Saturday, August 09, 2014 2:11 AM
> To: Aditi [guest]
> Cc: bioconductor at r-project.org; QAMRA Aditi (GIS)
> Subject: Re: False positives due to GC content correction - DESeq2
>
> hi Aditi,
>
> Please include all the code you used for EDASeq and DESeq2, and the
> sessionInfo()
>
> How do you know there are false positives? Are these genes which you
> know are not differentially expressed?
>
> Your dispersion plots didn't come through. You can email those
> attachments to my email address, and we will continue discussion on
> the Bioc list.
>
> Mike
>
> On Fri, Aug 8, 2014 at 1:54 PM, Aditi [guest] <guest at bioconductor.org> wrote:
>> Hi Mike,
>>
>> I have been trying to use DESeq2 for a differential analysis of ChIP-seq data using 8 T/N pairs. There is a lot of heterogeneity in the samples due to clinical differences (tumor stage etc.), total mapped reads (some samples are much better than others), and batch effects (they were processed at different times and not by the same person). I wanted to correct at least some of the biases, starting with GC content, so I used offsets from EDASeq as input to DESeq2 and introduced the batch variable in the model.
>>
>> What I don't understand is that when I corrected for GC bias in the samples, the final results tend to have a lot of false positives. I have attached the dispersion plots for both runs. I can't seem to figure why
>>
>>
>>  -- output of sessionInfo():
>>
>> -
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>
> -------------------------------
> This e-mail and any attachments are only for the use of the intended recipient and may be confidential and/or privileged. If you are not the recipient, please delete it or notify the sender immediately. Please do not copy or use it for any purpose or disclose the contents to any other person as it may be an offence under the Official Secrets Act.
> -------------------------------
