[BioC] Analysis of public GEO datasets - NGS

Jill [guest] guest at bioconductor.org
Sat Sep 22 09:06:13 CEST 2012

I am trying to analyse public NGS data from GEO, specifically datasets with only one sample per time point. The data (raw, but somewhat processed) consist of 3 samples at different time points, where ‘The read count at exon, splice-junction, transcript and gene levels were summarized and normalized to relative abundance in Fragments Per Kilobase of exon model per Million (FPKM) in order to compare transcription level among samples.’
The authors of this paper then identified the differentially expressed transcripts using the M-A based random sampling method implemented in the DEGseq package in Bioconductor (http://bioconductor.org/packages/2.5/bioc/html/DEGseq.html). The transcripts were further filtered at > 2-fold change and a minimum read count of 50 in either condition.
I have read through some earlier posts on this list where Gordon suggested a simple formula, which could also be applied in Excel, for computing log fold changes when you don’t have replicates:
   lib.size1 <- sum(y1)
   lib.size2 <- sum(y2)
   logFC <- log2((y1+0.5)/(lib.size1+0.5)/(y2+0.5)*(lib.size2+0.5))
Is this something I could apply to the current analysis? I have 3 files, one per sample, each listing gene IDs and counts. If a gene is not listed in a sample file, I assume its count is zero. Would you have any suggestions as to what to do with these zero counts?
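To make the question concrete, here is a minimal sketch in R of how I understand the calculation would go for two of the samples. The data frames s1 and s2, the gene names, the counts and the helper fill_counts are all made up for illustration, and the last step is only my reading of the authors' filter (> 2-fold change, at least 50 reads in either sample), not their actual code.

```r
# Hypothetical per-sample tables: gene ID plus read count (made-up values).
s1 <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 count = c(100, 0, 25), stringsAsFactors = FALSE)
s2 <- data.frame(gene = c("geneA", "geneC", "geneD"),
                 count = c(50, 30, 10), stringsAsFactors = FALSE)

# Union of all gene IDs seen in either sample.
genes <- union(s1$gene, s2$gene)

# Hypothetical helper: align one sample to the full gene list,
# assuming a gene missing from the file has a count of zero.
fill_counts <- function(tab, genes) {
  y <- tab$count[match(genes, tab$gene)]
  y[is.na(y)] <- 0
  names(y) <- genes
  y
}

y1 <- fill_counts(s1, genes)
y2 <- fill_counts(s2, genes)

# The formula quoted above: the 0.5 offset keeps zero counts finite.
lib.size1 <- sum(y1)
lib.size2 <- sum(y2)
logFC <- log2((y1 + 0.5) / (lib.size1 + 0.5) / (y2 + 0.5) * (lib.size2 + 0.5))

# My reading of the authors' filter: more than 2-fold change (|logFC| > 1)
# and at least 50 reads in either sample.
keep <- abs(logFC) > 1 & pmax(y1, y2) >= 50
```

Because of the 0.5 offset, genes with a zero count in one or both samples still get a finite logFC, so they would not need to be dropped before the calculation.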

I am trying to avoid learning to write scripts for the moment, just to see whether this analysis works. Obviously, when I come to more complicated public data with replicates, I will have to invest some time in learning Bioconductor!
Many thanks

 -- output of sessionInfo(): 


Sent via the guest posting facility at bioconductor.org.
