[BioC] DESeq2: Paired Test

Wed Apr 30 23:14:58 CEST 2014

hi Pankaj,

The output of summarizeOverlaps() is a SummarizedExperiment, which can
be used in object construction with:

dds <- DESeqDataSet(se, ~ condition)

for a SummarizedExperiment, se.

See the first part of either of the vignettes, and the manual page for
?DESeqDataSet.

Or if you can read in any arbitrary counts from a file (using
read.table, read.csv, etc.), then check the manual page for
?DESeqDataSetFromMatrix

Mike

On Wed, Apr 30, 2014 at 4:37 PM, Pankaj Agarwal <p.agarwal at duke.edu> wrote:
> Hi,
>
> I am trying to use DESeq2 for a set of counts obtained via
> summerizedOverlaps and have the counts for 4 samples in one file.  Earlier I
> had used DESeq2 with counts from htseq-count, which reports the counts for
> each samples in a separate file.  Now that I have these in one file, is
> there a method analogous to "DESeqDataSetFromHTSeqCount" for constructing
> the dds object.  With htseq-count I had used the following set of commands:
>
> "directory" contains the 4 count files.
>
>> sampleFiles=list.files(directory)
>> sampleFiles
>> [1] "abc.htseq.count"         "def.htseq.count"
>> [3] "pqr.htseq.count"          "xyz.htseq.count"
>>
>> sampleCondition=c("Tumor","Tumor","Normal","Normal")
>> samplePatient=c("A","B","A","B")
>> sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles,
>> condition=sampleCondition, patient=samplePatient)
>> dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
>> directory=directory, design= ~ patient + condition)
>
> Thanks,
>
> - Pankaj
> ________________________________
> From: Michael Love [michaelisaiahlove at gmail.com]
>
> Sent: Thursday, April 17, 2014 9:18 AM
> To: Pankaj Agarwal
> Subject: RE: [BioC] DESeq2: Paired Test
>
> Hi Pankaj
>
> For future questions to Bioconductor maintainers, always cc the mailing
> list. This helps avoids duplicate questions, which we get a lot of. Answer
> below,
>
> On Apr 17, 2014 8:49 AM, "Pankaj Agarwal" <p.agarwal at duke.edu> wrote:
>>
>> Thanks for your help and response.
>>
>> >> I did not get the normalization factors in the output, whereas it does
>> >> show up in the results in the vignette, so I am not sure if I missed
>> >> some step.
>>
>> >I'm confused, which section of the vignette are you referencing?
>>
>> I was referring to Section 1.5 which report "sizeFactor" as the third
>> column in the table, but when I execute the same function I don’t see that
>> column:
>
> You have to run DESeq or estimateSizeFactor first, see ?DESeq
>
>>
>>  colData(dds)
>>  DataFrame with 4 rows and 2 columns
>>                                      condition  patient
>>                                       <factor> <factor>
>>  abc.htseq.count             Tumor        A
>>  def.htseq.count              Tumor        B
>>  pqr.htseq.count              Normal      A
>>  xyz.htseq.countt            Normal       B
>>
>> Is the third column in Section 1.5 the output of  estimateSizeFactors( dds
>> )?
>
> Yes
>
>>
>> Thanks,
>>
>> - Pankaj
>>
>> -----Original Message-----
>> From: Michael Love [mailto:michaelisaiahlove at gmail.com]
>> Sent: Wednesday, April 16, 2014 4:37 PM
>> To: Pankaj Agarwal
>> Subject: Re: [BioC] DESeq2: Paired Test
>>
>> hi Pankaj,
>>
>> On Wed, Apr 16, 2014 at 3:59 PM, Pankaj Agarwal <p.agarwal at duke.edu>
>> wrote:
>> > Hi Michael,
>> >
>> > I followed the vignette for the section on htseq-count and your
>> > suggestion as best as I could and got some results.
>> > But I am not sure if I did everything right, so I would really
>> > appreciate if you could verify if the following steps is the correct
>> > way to analyze this data set:
>> >
>> > There are 4 samples for two patients:
>> > Patient A: abc.htseq.count (Tumor), pqr.htseq.count (Normal) Patient
>> > B: def.htseq.count (Tumor), xyz.htseq.count (Normal)
>> >
>> >> sampleFiles=list.files(directory)
>> >> sampleFiles
>> > [1] "abc.htseq.count"         "def.htseq.count"
>> > [3] "pqr.htseq.count"          "xyz.htseq.count"
>> >>
>> >> sampleCondition=c("Tumor","Tumor","Normal","Normal")
>> >> samplePatient=c("A","B","A","B")
>> >> sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles,
>> >> condition=sampleCondition, patient=samplePatient)
>> >> dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
>> >> directory=directory, design= ~ patient + condition)
>> >>
>> >> colData(dds)
>> > DataFrame with 4 rows and 2 columns
>> >                                     condition  patient
>> >                                      <factor> <factor>
>> > abc.htseq.count             Tumor        A
>> > def.htseq.count              Tumor        B
>> > pqr.htseq.count              Normal      A
>> > xyz.htseq.countt            Normal       B
>> >>
>> >> dds$condition
>> > [1] Tumor  Tumor  Normal Normal
>> > Levels: Normal Tumor
>> >>
>> >> dds$patient
>> > [1] A B A B
>> > Levels: A B
>> >>
>> >> dds = DESeq(dds)
>> > estimating size factors
>> > estimating dispersions
>> > gene-wise dispersion estimates
>> > mean-dispersion relationship
>> > final dispersion estimates
>> > fitting model and testing
>> >>
>> >> result=results(dds)
>> >>
>> >> resOrdered2=result[order(result$padj),]
>> >>
>> >> head(resOrdered2,20)
>> > DataFrame with 20 rows and 6 columns
>> >          baseMean log2FoldChange     lfcSE      stat       pvalue
>> > padj
>> >         <numeric>      <numeric> <numeric> <numeric>    <numeric>
>> > <numeric>
>> > GAL1     873531.57     -11.645190 0.7175710 -16.22863 3.164692e-59
>> > 5.627771e-55
>> > ...
>> > ...
>> >
>>
>> Your lines of code look correct.
>>
>> > I did not get the normalization factors in the output, whereas it does
>> > show up in the results in the vignette, so I am not sure if I missed
>> > some step.
>>
>> I'm confused, which section of the vignette are you referencing?
>>
>> >
>> > Also, would you expect to get meaningful results with just two matched
>> > pairs of samples with this analysis?  I am wondering if it is better
>> > to just report the FC rather than the p-values.
>>
>> The analysis has 4 samples and 3 parameters to estimate (intercept,
>> patient, condition), so you have one residual degree of freedom. So you can
>> estimate the dispersion and the p-values should be meaningful.
>>
>> One drawback with having only two matched pairs is that there are too few
>> samples to look for outliers. DESeq2 looks for outliers when there are at
>> least 3 samples per condition (actually, per unique combination of factor
>> variables in the design). I would take a look at the genes with the highest
>> counts and see if these should be filtered manually.
>> For example, the first gene in your ordered results table looks to me like
>> it just contains an outliers count. However with no replicates, you have to
>> perform such filtering manually.
>>
>> Mike
>>
>> >
>> > Thank you,
>> >
>> > - Pankaj
>> >
>> >
>> > ________________________________
>> > From: Michael Love [michaelisaiahlove at gmail.com]
>> > Sent: Thursday, April 10, 2014 4:08 PM
>> > To: Pankaj Agarwal
>> > Cc: bioconductor at r-project.org
>> > Subject: Re: [BioC] DESeq2: Paired Test
>> >
>> > hi Pankaj,
>> >
>> > You should follow the multifactor analysis section of the vignette.
>> > You would include the patient effect in the design like so:
>> >
>> > design(dds) <- ~ patient + condition
>> >
>> > Where 'patient' and 'condition' are factors which are columns in
>> > colData(dds), and condition has levels: normal, tumor (where normal is
>> > the base level).
>> >
>> > Then results(dds) will build a result table for tumor vs normal,
>> > controlling for the patient effect.
>> >
>> > Mike
>> >
>> >
>> > On Thu, Apr 10, 2014 at 3:52 PM, Pankaj Agarwal <p.agarwal at duke.edu>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> This query is in reference to the following post
>> >> https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html
>> >> where it was mentioned that "DESeq does not support paired tests
>> >> (yet)".
>> >> I have a need for performing rna-seq data analysis with matched
>> >> tumor/normal pairs of data and was wondering if this capability is
>> >> now available in the latest version of DESeq2.  I searched the list
>> >> site and also the latest vignette, but could not find anything
>> >> indicating this capability is now available in DEseq.  I would be
>> >> grateful if, in case this feature is not yet available, someone could
>> >> advice me how to do the analysis.  I have samples from two patients,
>> >> and from each patient there is a "normal" and a "tumor"
>> >> rna-seq sample data.
>> >>
>> >> Thanks,
>> >>
>> >> - Pankaj
>> >>
>> >>         [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> >