[BioC] EdgeR: general advice on using edgeR for sRNA analysis

Mon Aug 5 06:56:09 CEST 2013

Hi, 

I was hoping someone could give me some general advice on using edgeR for some sRNA datasets I have received.

I have 3 sRNA datasets, and I have calculated all abundances (just read counts) of every sequence in each dataset. Unfortunately, there are no replicates. 
The goal is to find specific sRNA sequences that are higher in abundance in dataset1 and dataset2 compared to dataset3. As there are no replicates, I understand that no statistical analysis can be done on them with any confidence, so I just want a first, 'general' indication of which sequences may be higher in abundance in datasets 1 and 2, to follow up with other experiments. 

I have already generated a subset of 'common' sRNA sequences that are present in dataset1, 2 and 3, along with their counts. Because the original library sizes differ between the three datasets, and because there will be a high level of duplicate sequences (these being sRNA sequences), I have two questions:

1. I am not sure whether I should let edgeR calculate the library sizes as the column sums of the read counts, or supply the actual library size of each dataset, prior to normalization. Because I am working only with the 'common' subset of sRNA sequences, highly abundant sequences unique to each dataset are missing from the count table, and these may have skewed the distribution of sRNA abundances within each dataset. 

2. What dispersion value should I use? These are plant sRNA sequences, so could someone suggest a number from experience that I can use as a starting point?
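
To make question 1 concrete, here is a minimal sketch of how the choice of library size changes counts per million (CPM). It is plain Python arithmetic rather than edgeR itself, and every count, library size and name in it is made up for illustration; none of it is my real data.

```python
# Hypothetical counts for three sequences common to all datasets (made-up numbers).
counts = {
    "dataset1": [100, 300, 600],
    "dataset2": [200, 200, 600],
    "dataset3": [50, 150, 300],
}

# Hypothetical total reads per dataset, including the sequences
# that are NOT in the common subset (also made-up numbers).
total_reads = {
    "dataset1": 4_000_000,
    "dataset2": 2_000_000,
    "dataset3": 1_000_000,
}


def cpm(values, lib_size):
    """Scale raw counts to counts per million for a given library size."""
    return [v * 1e6 / lib_size for v in values]


# Option A: library size = column sum of the common subset only.
cpm_colsum = {name: cpm(v, sum(v)) for name, v in counts.items()}

# Option B: library size = actual number of reads in each dataset.
cpm_total = {name: cpm(v, total_reads[name]) for name, v in counts.items()}

for name in counts:
    print(name, "column-sum CPM:", cpm_colsum[name])
    print(name, "total-reads CPM:", cpm_total[name])
```

With these made-up numbers, the first sequence looks equally abundant in dataset1 and dataset3 when column sums are used, but lower in dataset1 when the actual library sizes are used, which is exactly the ambiguity I am worried about.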

Apart from this, are there any other issues I need to be concerned about when analyzing such data in edgeR? 

Any advice greatly appreciated! 
Best regards, 
Ken

--- 
School of Molecular Biosciences
University of Sydney