[BioC] Analysis of public GEO datasets - NGS

Sat Sep 22 09:06:13 CEST 2012

Hi

I am writing because I am trying to analyse public NGS data from GEO, specifically datasets with only one sample per time point. The raw (somewhat processed) data consists of 3 samples at different time points, for which the authors state: 'The read count at exon, splice-junction, transcript and gene levels were summarized and normalized to relative abundance in Fragments Per Kilobase of exon model per Million (FPKM) in order to compare transcription level among samples.'

The authors of this paper then report that the differentially expressed transcripts were identified using the M-A based random sampling method implemented in the DEGseq package in BioConductor (http://bioconductor.org/packages/2.5/bioc/html/DEGseq.html). The transcripts were further filtered at > 2-fold change and a minimum read count of 50 in either condition.
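For reference, the filtering step described by the authors could be sketched in R roughly as follows. This is only an illustration with made-up numbers, not the authors' actual code: the table `tab`, its column names and the log2 fold changes are hypothetical, and I am assuming that "a minimum read count of 50 in either condition" means at least 50 reads in at least one of the two conditions.

```r
# Hypothetical toy table: read counts in two conditions plus a
# precomputed log2 fold change for each transcript (made-up values).
tab <- data.frame(
  transcript = c("t1", "t2", "t3", "t4"),
  count1     = c(200, 30, 60, 10),
  count2     = c(40, 10, 55, 80),
  logFC      = c(2.3, 1.6, 0.1, -3.0),
  stringsAsFactors = FALSE
)

# A fold change greater than 2 in either direction is |log2 FC| > 1.
big_change <- abs(tab$logFC) > 1

# Assumption: "minimum read count of 50 in either condition" is read as
# at least 50 reads in at least one of the two conditions.
enough_reads <- tab$count1 >= 50 | tab$count2 >= 50

# Keep only the transcripts that pass both criteria.
filtered <- tab[big_change & enough_reads, ]
```

With these toy values, t2 is dropped for low counts and t3 for a small fold change.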

I have read through some of your posts, where Gordon suggested using a simple Excel formula to obtain fold changes when you don't have replicates:
   lib.size1 <- sum(y1)
   lib.size2 <- sum(y2)
   logFC <- log2((y1+0.5)/(lib.size1+0.5)/(y2+0.5)*(lib.size2+0.5))

Is this something I could apply to the current analysis? I have 3 files (one for each sample), each with gene IDs and counts. If a gene is not listed in a sample file, I assume its count is zero. Would you have any suggestions as to what to do with these zero-count genes?
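To make the question concrete, here is a minimal sketch in R of what I think this would involve. The three data frames are made-up stand-ins for my sample files (the gene names, counts and column names are hypothetical), and the helper functions `fill_counts` and `log_fc` are just names I have invented for illustration. It assumes that a gene missing from a file really does have a count of zero.

```r
# Made-up stand-ins for the three per-sample files (gene ID and count).
# Real files could be read in with read.delim() instead.
s1 <- data.frame(gene = c("geneA", "geneB", "geneC"), count = c(100, 20, 5),
                 stringsAsFactors = FALSE)
s2 <- data.frame(gene = c("geneA", "geneC", "geneD"), count = c(50, 10, 40),
                 stringsAsFactors = FALSE)
s3 <- data.frame(gene = c("geneB", "geneD"), count = c(30, 60),
                 stringsAsFactors = FALSE)

# Union of all gene IDs seen in any sample.
genes <- sort(unique(c(s1$gene, s2$gene, s3$gene)))

# Look up each gene's count in one sample; genes absent from that
# sample are assumed to have a count of zero.
fill_counts <- function(s, genes) {
  y <- s$count[match(genes, s$gene)]
  y[is.na(y)] <- 0
  names(y) <- genes
  y
}
y1 <- fill_counts(s1, genes)
y2 <- fill_counts(s2, genes)
y3 <- fill_counts(s3, genes)

# Log2 fold change between two samples, scaled by library size.
# The 0.5 offset keeps zero counts from producing log(0) or division by zero.
log_fc <- function(ya, yb) {
  log2((ya + 0.5) / (sum(ya) + 0.5) / (yb + 0.5) * (sum(yb) + 0.5))
}
logFC_1v2 <- log_fc(y1, y2)
logFC_1v3 <- log_fc(y1, y3)
```

Because of the 0.5 offset, genes with a zero count in one sample still get a finite (if large) fold change rather than an infinite one.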

I am trying to avoid learning how to write scripts for the moment, just to see whether this analysis works. Obviously, when I come to more complicated public data with replicates, I will have to invest some time in learning Bioconductor!

Many thanks

JILL

--
Sent via the guest posting facility at bioconductor.org.