[BioC] Low number of replicates DESeq

Federico Gaiti f.gaiti at uq.edu.au
Tue Feb 25 02:18:36 CET 2014


Thanks Steve -- I'll have a look at the DESeq2 vignette.

Here is what I've done; let me know if anything is still unclear.

samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count -s no - gtfFile > stranded_counts_noS.txt 

head(Counts)

      1.1  1.2  1.3  1.4  2.1  2.2  2.3  2.4  3.1  3.2  3.3  3.4  4.1  4.2  4.3  4.4
XXXX    9    0   24   48   30    5    1    1   21   15    8    6   28   28   27   47
XXXX    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
XXXX   16    0    0    0   19    4    0    0   40    1    2    3   78    5    5    7
XXXX    0    8    0    7    5    5   19    7   14    4    4    7    9    4   12    1
......

and then in R:

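# Packages inferred from the functions used below
library(DESeq)       # newCountDataSet(), estimateSizeFactors(), counts()
library(Hmisc)       # rcorr()
library(corrgram)    # corrgram()
library(FactoMineR)  # PCA()

# Read the htseq-count table (genes in rows, samples in columns)
data<-read.table("Counts",header=TRUE,row.names=1)
head(data)

# One condition label per sample: 4 stages x 4 samples
Design=data.frame(
   row.names=colnames(data),
   condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4','4','4'))

# Estimate size factors and normalize the counts
cds=newCountDataSet(data,Design)
cds=estimateSizeFactors(cds)
sizeFactors(cds)
All_samples_normalized<-counts(cds,normalized=TRUE)

# Drop genes with zero counts in every sample
table(rowSums(All_samples_normalized)==0)
data_subset_matrix<-All_samples_normalized[rowSums(All_samples_normalized)>0,]

# Sample-to-sample correlations (rcorr() correlates the columns)
Spearman_normalized<-rcorr(data_subset_matrix,type="spearman")
A<-Spearman_normalized$r
write.table(A,file="Spearman_normalized_allsamples.txt")
Pearson_normalized<-rcorr(data_subset_matrix,type="pearson")
B<-Pearson_normalized$r
write.table(B,file="Pearson_normalized_allsamples.txt")

# Correlograms and PCA of the two correlation matrices
corrgram(B,order="PCA")
corrgram(A,order="PCA")
pca_spearman=PCA(A,graph=TRUE)
pca_pearson=PCA(B,graph=TRUE)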
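
For reference, the normalization step above relies on DESeq's median-of-ratios size factors: each sample's size factor is the median, over genes, of that sample's count divided by the gene's geometric mean across all samples. A minimal base-R sketch of that calculation follows. The toy matrix and the helper name median_ratio_size_factors are made up for illustration and are not part of DESeq or of the real data.

```r
# Toy count matrix: 4 genes x 3 samples (made-up numbers, not the real data).
toy_counts <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
      0,   3,   7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Hypothetical helper (not a DESeq function): median-of-ratios size factors.
median_ratio_size_factors <- function(counts) {
  # Log geometric mean of each gene across samples; -Inf if any count is 0.
  log_geo_means <- rowMeans(log(counts))
  # Genes with a zero count in any sample are left out of the median.
  usable <- is.finite(log_geo_means)
  apply(counts, 2, function(sample_counts) {
    exp(median(log(sample_counts[usable]) - log_geo_means[usable]))
  })
}

size_factors <- median_ratio_size_factors(toy_counts)

# Divide each column by its size factor to get normalized counts.
normalized <- sweep(toy_counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

In this toy example the three samples differ only in depth (ratios 1:2:4 on the genes without zeros), so dividing by the size factors puts those genes on the same scale in every sample.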

This analysis gave me a nice correlation, but I'd still like to use the stranded counts for the DGE.
The issue is that if I only use the stranded data, I don't have replicates. If I include the 3 unstranded replicates, I need to use the option -s no for the stranded data, because otherwise the stranded and unstranded data do not correlate.

So my idea was to use the unstranded data to estimate the level of variation to get a threshold for DE detection, but still use the stranded data as expression values.

Would it be possible? Or would it be better to stick to the -s no option for the DGE? 

Thanks
Federico




________________________________________
From: mailinglist.honeypot at gmail.com [mailinglist.honeypot at gmail.com] on behalf of Steve Lianoglou [lianoglou.steve at gene.com]
Sent: Tuesday, 25 February 2014 10:34 AM
To: Federico Gaiti
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

Hi,

Since you are just starting your analysis and are in the world of
DESeq, you should probably switch to DESeq2.

You mention things about "low correlation" but it's not clear what
conditions you are comparing where. Instead of describing your
analysis at a high level, showing the code that you used would be more
helpful.

That having been said, the first thing I would do is to perform the
steps outlined in the DESeq2 vignette under the section "Data quality
assessment by sample clustering and visualization" to see if your
replicate data cluster closely together in meaningful ways using the
heatmaps and PCA plots outlined there.
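
A minimal sketch of that kind of check, using only base R rather than the DESeq2 helpers. It assumes a plain log2(count + 1) transform as a stand-in for the transformations the vignette uses, and it runs on simulated counts (all values and names below are made up for illustration):

```r
set.seed(1)

# Simulated counts (made-up, for illustration): 200 genes x 8 samples,
# two conditions with 4 replicates each.
n_genes <- 200
condition <- rep(c("A", "B"), each = 4)
base_mean <- runif(n_genes, min = 50, max = 500)
fold <- ifelse(seq_len(n_genes) <= 50, 4, 1)  # first 50 genes are 4x higher in B

# Expected count per gene (rows) and sample (columns).
mu <- sapply(condition, function(cond) {
  if (cond == "B") base_mean * fold else base_mean
})
counts <- matrix(rpois(length(mu), mu), nrow = n_genes)
colnames(counts) <- paste0(condition, 1:4)

# Simple log transform with a pseudocount (stand-in for the vignette's transforms).
log_counts <- log2(counts + 1)

# Hierarchical clustering on Euclidean sample-to-sample distances.
sample_dist <- dist(t(log_counts))
hc <- hclust(sample_dist)
clusters <- cutree(hc, k = 2)

# PCA with samples as observations.
pca <- prcomp(t(log_counts))

# Cross-tabulate cluster membership against the known condition.
print(table(cluster = clusters, condition = condition))
```

Calling plot(hc) and plot(pca$x[, 1:2]) afterwards gives the dendrogram and the PCA scatter. On real data, replicates of the same stage should land in the same cluster and sit close together on the first components.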

HTH,
-steve

On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti at uq.edu.au> wrote:
> Hi all,
>
> I am using DESeq for a DGE analysis.
>
> I have STRANDED RNA-Seq data for 4 developmental stages with no replicates but I know that to have a more reliable DGE I should have replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.
>
> So my data situation at the moment is:
>
> STAGE 1     stranded
> STAGE 1.1  unstranded
> STAGE 1.2  unstranded
> STAGE 1.3  unstranded
> STAGE 2     stranded
> STAGE 2.1  unstranded
> STAGE 2.2  unstranded
> STAGE 2.3  unstranded
> STAGE 3     stranded
> STAGE 3.1  unstranded
> STAGE 3.2  unstranded
> STAGE 3.3  unstranded
> STAGE 4     stranded
> STAGE 4.1  unstranded
> STAGE 4.2  unstranded
> STAGE 4.3  unstranded
>
> Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples "cluster" together. If so, I thought to use the unstranded data for my DGE analysis to reach the final number of 4 replicates per stage.
>
> I mapped the raw reads to the genome using TopHat (v2.0.9) (fr-unstranded for the unstranded data and fr-secondstrand for the stranded data) and used htseq-count (HTSeq 0.5.4p5) to get the raw read counts for both data sets. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include the metadata and to normalize the counts, and I removed the genes that have a count of 0 in every sample. I then calculated the correlation, which was really low.
>
> I tried the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data with the option -s no, and this way I got a very similar number of total counts for the unstranded and stranded data, around 4-5M counts per stage (whereas in both earlier runs the stranded totals were double).
>
> I included the metadata
>
>
>> Design
>             condition
> ADULT        ADULT
> ADULT1       ADULT
> ADULT2       ADULT
> ADULT3       ADULT
> JUV            JUV
> JUV1           JUV
> JUV2           JUV
> JUV3           JUV
> COMP          COMP
> COMP1         COMP
> COMP2         COMP
> COMP3         COMP
> PRECOMP    PRECOMP
> PRECOMP1   PRECOMP
> PRECOMP2   PRECOMP
> PRECOMP3   PRECOMP
>
> and estimated the new size factors, normalized and calculated the new correlation. Pearson performed pretty well, confirmed by both a PCA and correlogram. So my initial idea was to do a DGE "treating" the stranded data as unstranded, having 4 replicates per stage. Though, I'd still like to figure out a way to use the stranded counts since I am not sure if I am losing some information running htseq-count using -s no on the stranded data.
>
>
> What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not.
>
>
> I would like to hear from you if you have any thoughts about this.
>
>
> Let me know if you need any further details to better understand the issue.
>
>
> Thanks in advance,
>
> Federico
>
> Federico Gaiti
> Ph.D. Candidate
> School of Biological Sciences
> University of Queensland
> St Lucia QLD 4072
> Australia
> f.gaiti at uq.edu.au



--
Steve Lianoglou
Computational Biologist
Genentech


