[BioC] EdgeR: general advice on using edgeR for sRNA analysis

Gordon K Smyth smyth at wehi.EDU.AU
Tue Aug 6 01:14:30 CEST 2013


Dear Ken,

> Date: Mon, 5 Aug 2013 04:56:09 +0000
> From: Kenlee Nakasugi <kenlee.nakasugi at sydney.edu.au>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: [BioC] EdgeR: general advice on using edgeR for sRNA analysis
>
> Hi,
>
> I was hoping someone would be able to provide me with some general 
> advice on using edgeR for some sRNA datasets I have received.
>
> I have 3 sRNA datasets, and I have calculated all abundances (just read 
> counts) of every sequence in each dataset. Unfortunately, there are no 
> replicates. The goal is to find specific sRNA sequences that are higher 
> in abundance in dataset1 and dataset2 compared to dataset3. As there are 
> no replicates, I understand that no statistical analysis can be done on 
> them with confidence, so I just want to get a 'general' indication first 
> of which sequences may be higher in abundance in datasets 1 and 2, and 
> follow up with other experiments.
>
> I have already generated a subset of 'common' sRNA sequences that are 
> present in dataset1, 2 and 3, along with their counts. Because the 
> original library sizes differ between the three, and there will also be 
> a high level of duplicate sequences as these are sRNA sequences,
>
> 1. I am not sure if I should just use the edgeR setting to calculate the 
> library sizes via the sum of the column of the read counts, or use the 
> actual library size of each dataset, prior to normalization. Because I 
> am working on just the 'common' subset of sRNA sequences between the 
> datasets, there may be highly abundant sRNA sequences unique to each 
> dataset that are missing, and which may have skewed the distribution of 
> sRNA abundances within each dataset.

You should recompute the lib.sizes from the column sums for the sequences 
that you are analysing, and then run calcNormFactors().
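
A minimal sketch of that step, assuming a count matrix with one column per 
dataset. The simulated counts, the group labels d1/d2/d3 and the filter 
threshold are all made up for illustration; only DGEList(), the 
keep.lib.sizes subsetting argument and calcNormFactors() are edgeR's own.

```r
library(edgeR)

# Hypothetical counts: 200 sRNA sequences (rows) in three datasets (columns).
set.seed(1)
counts <- matrix(rnbinom(600, mu = 50, size = 5), ncol = 3,
                 dimnames = list(paste0("seq", 1:200),
                                 c("dataset1", "dataset2", "dataset3")))

# With lib.size left unset, DGEList() takes the library sizes from the
# column sums of the counts it is given.
dge <- DGEList(counts = counts, group = c("d1", "d2", "d3"))

# If rows are filtered afterwards, keep.lib.sizes = FALSE recomputes the
# library sizes from the rows that remain. The threshold is arbitrary.
keep <- rowSums(dge$counts) >= 10
dge <- dge[keep, , keep.lib.sizes = FALSE]

# Scale normalization on the recomputed library sizes.
dge <- calcNormFactors(dge)
dge$samples
```

The lib.size column of dge$samples should then equal colSums(dge$counts).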

I am unclear why you are restricting to common sRNA sequences.  Doesn't 
this exclude the most differentially expressed sequences, those with zero 
counts in one or two libraries, which you might want to know about?
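
To illustrate the point, here is a small base R sketch comparing the 
'common' restriction with a filter on total count. The counts are invented 
and the threshold of 10 is an arbitrary choice, not a recommendation.

```r
# Hypothetical counts; rows are sRNA sequences, columns are the datasets.
counts <- matrix(
  c(120, 150, 40,
      0,   0, 85,
    300,   0,  0,
     12,   9, 15),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("seq", 1:4),
                  c("dataset1", "dataset2", "dataset3"))
)

# 'Common' subset: a non-zero count in every dataset.
common <- rowSums(counts > 0) == ncol(counts)

# Less restrictive filter: a minimum total count, zeros allowed.
keep <- rowSums(counts) >= 10

# Sequences kept by the total-count filter but lost by the 'common' one.
dropped <- rownames(counts)[keep & !common]
dropped
```

Here seq2 and seq3 are lost by the 'common' restriction, although they 
show the largest differences between datasets.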

> 2. What dispersion value should I use? These are plant sRNA sequences, 
> so can someone suggest a number from experience that I can start from?
>
> Apart from this, are there any other issues I need to be concerned about 
> when analyzing such data in edgeR?

I haven't analysed plant sRNA, so cannot give any general advice for this 
type of data.  You could try a few dispersion values and go from there.
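
Trying a few values could look like the sketch below. It assumes simulated 
counts and group labels d1/d2/d3; the BCV values (square root of the 
dispersion) are arbitrary placeholders, not recommendations for plant sRNA.

```r
library(edgeR)

# Hypothetical counts: 200 sequences in three unreplicated datasets.
set.seed(1)
counts <- matrix(rnbinom(600, mu = 50, size = 5), ncol = 3,
                 dimnames = list(paste0("seq", 1:200),
                                 c("dataset1", "dataset2", "dataset3")))
dge <- DGEList(counts = counts, group = c("d1", "d2", "d3"))
dge <- calcNormFactors(dge)

# Candidate BCV values; exactTest() takes the dispersion, i.e. BCV^2.
bcv_values <- c(0.1, 0.2, 0.4)

# Number of sequences with BH-adjusted p-value < 0.05 at each BCV,
# comparing dataset1 against dataset3 (the first group in pair is the baseline).
n_sig <- sapply(bcv_values, function(bcv) {
  et <- exactTest(dge, pair = c("d3", "d1"), dispersion = bcv^2)
  sum(p.adjust(et$table$PValue, method = "BH") < 0.05)
})
names(n_sig) <- paste0("BCV=", bcv_values)
n_sig
```

Comparing how the list of top sequences changes across values gives a 
rough idea of how sensitive the ranking is to the assumed dispersion.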

Alternatively, here is a conservative way to estimate the dispersion 
without replicates:

   dge2 <- dge
   dge2$samples$group <- rep(1,3)
   dge2 <- estimateDisp(dge2,robust=TRUE,winsor.tail.p=c(0.05,0.2))
   plotBCV(dge2)

This will estimate the dispersions allowing for about 20% of the 
sequences to be differentially expressed (treated as outliers).  Then

   results <- exactTest(dge, dispersion=dge2$trended.dispersion)

etc.

Best wishes
Gordon

> Any advice greatly appreciated!
> Best regards,
> Ken
>
> ---
> School of Molecular Biosciences
> University of Sydney
