[BioC] goseq analysis

Thu Nov 1 13:41:16 CET 2012

Hi,

I'm not the author of this package, but I'll give you my best guess:

On Thu, Nov 1, 2012 at 5:18 AM, Dave Tang <davetingpongtang at gmail.com> wrote:
> Hello,
>
> In the vignette (17th March 2012) of the goseq package (page 6), a list of
> differentially expressed genes produced by edgeR is used as input into
> goseq. However if I were interested in over represented GO terms in either
> UP or DOWN regulated genes, I should just input genes that have a POSITIVE
> or NEGATIVE fold change (with an adjusted p-value < 0.05) into goseq? It
> sounds obvious, but I'm not sure.

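Right. Run goseq once per direction: flag only the up-regulated (or
only the down-regulated) genes that pass your cutoff as DE, and keep
every other assayed gene in the input vector as not DE.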
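
A minimal sketch of that split, written in Python rather than R only
so that it is self-contained. The table, the gene names and the
de_indicator helper are all made up for illustration. The one thing
to take from it is that genes not selected stay in the vector as 0s
rather than being dropped, since goseq expects a named 0/1 vector
over all measured genes:

```python
# Hypothetical DE table: gene -> (log fold change, adjusted p-value).
de_table = {
    "geneA": (2.1, 0.001),
    "geneB": (-1.7, 0.020),
    "geneC": (0.3, 0.600),
    "geneD": (-2.4, 0.004),
    "geneE": (1.2, 0.030),
}


def de_indicator(table, direction, alpha=0.05):
    """Return gene -> 1/0, flagging significant genes in one direction only."""
    sign = 1 if direction == "up" else -1
    return {
        gene: int(padj < alpha and sign * logfc > 0)
        for gene, (logfc, padj) in table.items()
    }


# One indicator vector per direction; each would be a separate goseq run.
up = de_indicator(de_table, "up")
down = de_indicator(de_table, "down")
```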

> Also I have some questions regarding the graph on page 9. The x-axis is
> bias.data, which according to the vignette is usually the "gene length" or
> "number of counts". I can understand "gene length" but I don't understand
> what "number of counts" refers to.

It most likely refers to the number of reads assigned to that gene
when it was assessed for DE. Judging by the structure of the
data.frame, you could just add up the gene's read counts across the
two conditions under test and put the total there. I haven't tried
this, but I'd expect the total to correlate (perhaps strongly) with
the gene's length.
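
Roughly what I have in mind, again as a Python sketch. The counts and
lengths below are invented toy numbers, so the correlation they give
says nothing about real data; the point is only the bookkeeping (sum
each gene's counts over all libraries, then compare with length):

```python
import math

# Hypothetical read counts: gene -> (condition 1 replicates, condition 2 replicates).
counts = {
    "geneA": ([10, 12], [30, 28]),
    "geneB": ([40, 45], [60, 55]),
    "geneC": ([100, 110], [120, 120]),
    "geneD": ([180, 220], [210, 190]),
}
# Hypothetical gene lengths in bp.
gene_length = {"geneA": 500, "geneB": 1500, "geneC": 3000, "geneD": 6000}

# Candidate bias.data: total reads per gene across both conditions.
total_counts = {g: sum(c1) + sum(c2) for g, (c1, c2) in counts.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


genes = list(counts)
r = pearson([gene_length[g] for g in genes], [total_counts[g] for g in genes])
```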

> I hand picked two genes and it seems that
> bias.data is the gene length for these two genes. Therefore my
> interpretation of the graph on page 9 is that longer genes are
> proportionally more differentially expressed; is this correct?

Also correct.
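
As far as I understand it, that plot is essentially genes binned by
bias.data with the fraction called DE shown per bin (goseq also fits
a smooth curve through it). A rough Python sketch of the binning
part, with invented genes and a made-up helper name, not goseq's
actual code:

```python
def fraction_de_by_length(genes, n_bins=2):
    """genes: list of (length, is_de) pairs.

    Returns one (mean_length, fraction_de) pair per bin, with bins of
    roughly equal gene count taken in order of increasing length.
    """
    if n_bins < 1 or n_bins > len(genes):
        raise ValueError("n_bins must be between 1 and the number of genes")
    ordered = sorted(genes)
    size = len(ordered) // n_bins
    result = []
    for i in range(n_bins):
        # The last bin absorbs any leftover genes.
        stop = (i + 1) * size if i < n_bins - 1 else len(ordered)
        chunk = ordered[i * size:stop]
        mean_length = sum(length for length, _ in chunk) / len(chunk)
        fraction_de = sum(is_de for _, is_de in chunk) / len(chunk)
        result.append((mean_length, fraction_de))
    return result


# Hypothetical genes: (length in bp, 1 if called DE else 0).
toy_genes = [(500, 0), (800, 0), (1200, 0), (3000, 1), (4500, 0), (6000, 1)]
bins = fraction_de_by_length(toy_genes, n_bins=2)
```

If the fraction climbs from the short-gene bins to the long-gene
bins, that is the length bias the vignette is showing.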

> And lastly I'm working with a list of differentially expressed features
> (CAGE tags), which can be annotated to genes based on genome mapping.
> However a small subset of these features cannot be annotated and I have
> discarded them from the analysis since they cannot be associated to GO
> terms. Is this potentially disastrous?

Depends on how you define disaster.

If I were you, I'd try to assign tags that are intergenic but only
slightly upstream (5') of an annotated gene to that downstream gene.
I'm leaving "slightly" undefined, though :-)
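
Something along these lines, as a strand-aware Python sketch. The
Gene record, the function name and the coordinates are hypothetical,
and the 500 bp default for max_dist is an arbitrary placeholder, not
a recommendation:

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def assign_upstream_tag(chrom, pos, strand, genes, max_dist=500):
    """Return the nearest same-strand gene whose TSS lies within
    max_dist bp downstream of the tag, or None if there is none."""
    best_name, best_dist = None, None
    for g in genes:
        if g.chrom != chrom or g.strand != strand:
            continue
        # Distance from tag to TSS; positive when the tag is 5' of the TSS.
        dist = g.start - pos if strand == "+" else pos - g.end
        if 0 < dist <= max_dist and (best_dist is None or dist < best_dist):
            best_name, best_dist = g.name, dist
    return best_name


# Hypothetical annotation.
genes = [
    Gene("geneA", "chr1", 1000, 5000, "+"),
    Gene("geneB", "chr1", 8000, 12000, "-"),
]
hit_plus = assign_upstream_tag("chr1", 800, "+", genes)
hit_minus = assign_upstream_tag("chr1", 12300, "-", genes)
too_far = assign_upstream_tag("chr1", 100, "+", genes)
```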

Also, a disaster you might want to avert is using goseq when it isn't
appropriate for your analysis to begin with.

If you are using goseq to correct for a length bias in detecting
differential expression, you might first check whether CAGE data is
subject to this bias at all. I think the "common" understanding is
that tag-based methods generally don't suffer from it; see:

Protocol Dependence of Sequencing-Based Gene Expression Measurements
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0019287

And in the discussion of:

Transcript length bias in RNA-seq data confounds systems biology
http://www.biology-direct.com/content/4/1/14

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact