[BioC] Differential expression analysis of CAGE data

Fri Sep 21 05:20:01 CEST 2012

Hi,

On Thu, Sep 20, 2012 at 10:17 PM, Makis Motakis <emotakis at hotmail.com> wrote:
> Hello list,
> My name is Makis, I am a bioinformatics post-doc fellow and, recently, I started working on a CAGE project. I am new on the field and trying to figure out some things, so I would much appreciate any comments on my problem. I am sorry if my question sounds a bit confusing...
>
> I have a Treatment vs Control (independent samples) experimental design with 3 replicates in each condition. Usually what people do is:
>
> 1. Use samtools to keep the reads that exhibit a high mapping quality score (here "-q 20") and convert the .bam files (containing the quality data) into bed format:
>
> samtools view -q 20 -F 768 -u input.bam | ~/bin/bamToBed -i stdin | gzip -c > output.bed.gz
>
> 2. summarizing all count data (generate transcription start sites clusters (CTSS))
>
> 3. use intersectBed to extract only promoter/gene overlapping entries from a known (e.g. Refseq) or a custom annotation. The counts of a set of ctss_ids (step 2) belonging to a specific gene region are summarized (for each sample). This is the (gene) expression to be analyzed.
>
> 4. Do differential expression analysis using the total counts for genes from step 3. This can be done by edgeR, DESeq etc.
>
>
> I would like to ask (i) is it (biologically or mathematically) meaningful to perform differential expression analysis for the ctss ids (data of step 2) by an appropriate method and then finding genes that contain many differentially expressed ctss_ids (with an enrichment test)? (ii) Does anyone have any idea how different my results would be from the ones obtained by the above (steps 1-4) pipeline (perhaps FDR adj p-values would be very different due to higher dimensionality...)? (iii) If my strategy is meaningful (DE on ctss_ids), should I use a version of cufflinks DE analysis (since I do not have total counts any more)? I am a bit confused because for CAGE data the gene length (isoform length) does not matter. The data I see are just transcription start sites.

I'm not sure what the standard CAGE analysis is, but if I get you
correctly, the difference between your analysis and what you describe
as the 'current' pipeline is that "currently" all transcript start
site clusters are "clustered even further" (due to proximity of
annotated TSS of a gene) to just measure expression of the downstream
gene, but you are rather interested in checking differential
expression of the cluster itself?

If that's the case:

(i) Sure, I think it can be biologically reasonable. I can imagine
setting up a supervised learning problem to try to identify why
certain TSS clusters are differentially expressed between conditions.

(ii) Impossible to say how different. For instance, tt could be that
some tss clusters are differentially expressed while the transcript
"in total" isn't, so ...

(iii) I don't see why you'd use cufflinks here, though. The TSS
clusters are still "regions of interest" and they should fit just fine
into the edgeR/DESeq mojo (some QC would of course be required) -- in
fact, I can imagine a DEXSeq-like approach can be appropriate here,
even.

HTH (and that I haven't misunderstood your goal),
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact