[BioC] Measuring shifts in intronic expression independently of exons

Fri Aug 30 14:15:46 CEST 2013

On 08/29/2013 12:27 PM, James Perkins wrote:
> Hi Wolfgang, list
>
> Thank you very much for the helpful tip.
>
> I did exactly as you suggested, the code is below. I wanted to do it using
> the pasilla package in order to create an instructive example, and so that
> you could potentially recreate the analysis, but strangely sometimes the
> gene read counts from pasillaGenes are lower than the summed exon counts
> from the pasillaExon object. Or at least that's the case according to my
> code!
>
> If you (or anyone else who is reading this) can think of a good public data
> set for this please let me know, ideally one that is already in
> bioconductor and fairly easy to manipulate. Otherwise I am happy to submit
> my current dataset to bioc or make available some other way once its
> corresponding paper has been published

Just a plug for a different counting approach, where the gene model is 
transformed into a 'splicing graph' and reads are counted based on whether they 
overlap edges on the graph. This would for instance collapse several sequential 
exons belonging to a single transcript into a single counting bin,

Exon       1      2    3   4    5
GeneA:  ------   ---  --- --- -----
GeneB:  ------                -----

would end up with 3 edges consisting of exons 1, 2-4, and 5, with edge 2-4 being 
the only edge subject to differential expression analysis in DEXSeq. It's 
interesting to think of the intron complement of this figure, and the intron 
regions and reads that would support them.

The idea would be to provide the 'intron model' (which might require a bit of 
GRanges-foo to construct from a TxDb package or other annotation source, like 
the gtf's available in AnnotationHub) as input to the splicing graph algorithm.

This is implemented in the SplicingGraphs package, and illustrated using the 
experiment data package RNAseqData.HNRNPC.bam.chr14

Martin

>
> So, code below. Any comments or suggestions would be appreciated, if you
> think there are ways I could tune DEXSeq for this purpose:
>
> To start with I loaded in all the necessary files and made a counts object
> with Gend ID:E001 for intronic and GeneID:E002 for exonic.
>
> # c50tx - table of counts using reads mapping to anywhere in gene c50ex -
> reads mapping to exons only
> intMat <- as.data.frame(c50tx) - as.data.frame(c50ex)
> exMat <- as.data.frame(c50ex)
>
> # Remove genes for which intron count is below 100
> intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ]
> exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ]
> intMat <- intMat.2
>
> row.names(intMat) <- paste(row.names(intMat), ":E001", sep="")
> row.names(exMat) <- paste(row.names(exMat), ":E002", sep="")
> # Now I combine the different tables in order to get a combined table as so:
> both <- rbind(intMat, exMat)
> both <- both[order(row.names(both)), ]
> head(both)
>
> # Now I turn this into an exon count set
> gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2,
> byrow=TRUE)[,1]
> names(gIDs) <- row.names(both)
> gIDs <- as.factor(gIDs)
> eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2,
> byrow=TRUE)[,2]
> names(eIDs) <- row.names(both)
> des <- data.frame(condition=c(rep("case",3), rep("control", 3)))
> row.names(des) <- colnames(both)
> ecS <- newExonCountSet(
>    countData = both,
>    design = des,
>    geneIDs=gIDs,
>    exonIDs=eIDs)
>
> Once I had this I ran normalisation etc. and looked for genes with
> 'differential exon/intron usage':
>
> ecS <- estimateSizeFactors(ecS)
> ecS <- estimateDispersions(ecS)
> ecS <- fitDispersionFunction(ecS)
> head(fData(ecS)$dispBeforeSharing)
>
> ecS <- testForDEU(ecS)
> head(fData(ecS)[, c("pvalue", "padjust")])
> ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE)
> res2 <- DEUresultTable(ecS)
> sum(ecS$pad
>
>
> So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for at
> least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems quite
> a high number of genes, so I wonder if the method is being conservative
> enough.
>
> That said, tried swapping the sample labels a few times, which lead to 0
> genes being called as DEU, which is a good sign I think.
>
> Interestingly, comparing intron:exon for case and controls for the
> significant DEU genes shows a 4:1 ratio of genes that with an increased
> intronic expression in case samples, suggesting the experimental
> intervention is leading to increased intronic expression.
>
> So, does this approach seem valid? My main question is whether it is a
> little unconservative. I'm currently using ggbio to plot the read
> alignments for the significant genes and looking for interesting looking
> patterns of intronic expression that are consistent across replicates of
> the same class (and that differ between classes). I've found a few things
> that look interesting for follow up.
>
> Thanks a lot for your help,
>
> Jim
>
> On 9 August 2013 16:02, Wolfgang Huber <whuber at embl.de> wrote:
>
>> Dear James
>> have you already checked the DEXSeq package and the paper
>> http://genome.cshlp.org/content/22/10/2008.long
>> This does not apply 1:1 to what you are asking for, but afaIcs the main
>> modification would be for you to define "counting bins" (on which the input
>> to DEXSeq is computed by read overlap counting) that represent (i) exons
>> and (ii) introns and then check for changes in relative intro usage (the
>> 'ratios' you mention below).
>> Let me know how it goes
>>          Wolfgang.
>>
>> On 9 Aug 2013, at 10:02, James Perkins <j.perkins at ucl.ac.uk> wrote:
>>
>>> Dear list,
>>>
>>> I would like to know if an experimental treatment leads to a significant
>>> shift in intronic expression for some genes.
>>>
>>> Imagine I have an experiment with 6 biological replicates of a given
>>> tissue. I believe that the treatment might lead to an increased intronic
>>> expression for some genes, unrelated to exonic expression.
>>>
>>> 3 of these receive no treament, they are used as control. The other 3
>>> receive experimental treatment.
>>>
>>> I then sequence the mRNA from these samples (Illumina, single end reads,
>>> ~40 million reads per sample), to obtain 6 fastq files, I align these to
>> a
>>> refernce genome and get bam files.
>>>
>>> I was thinking that a fairly easy way to see if some genes show a
>>> consistent increased intronic expression following treatment would be to
>>> count intronically aligning reads for each gene (e.g. using
>> GenomicRanges)
>>> and use something like DESeq to look for genes showing a significant
>> change
>>> in intronic "expression".
>>>
>>> However, the problem is that this might be due to exonic expression, due
>> to
>>> premature mRNAs etc., so I might end up finding genes that are
>>> differentially expressed at the exon level, and as a result the increased
>>> exon expression has caused increased intronic expression as a by product.
>>> Obviously I am not so interested in these genes wrt this method, I can
>> find
>>> these using "traditional" DE methods.
>>>
>>> In addition, when I tried profiling intronic regions using reads mapping
>> to
>>> introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line was
>>> quite far above the minimum mean expression point, i.e. it didn't go
>>> through the middle of the clump of data points (if that makes sense). I
>>> wonder if this is due to the size factor calculation being based on
>>> intronic "expression" (well, reads mapping to introns), which is
>> generally
>>> much lower than exon expression and therefore being somewhat unreliable.
>>>
>>> So  I would like to take this exon expression out of the equation, so to
>>> speak.
>>>
>>> I thought that one way might be to compare the ratios of exonic to
>> intronic
>>> reads between samples, for each gene.
>>>
>>> For example one gene might have 30, 35 and 33 exonically mapping reads
>> and
>>> 10,11 and 12 intronically mapping reads for control samples
>>>
>>> For case samples it might have 33, 32 and 34 exonically mapping reads;
>>> 20,21 and 19 intronically mapping reads.
>>>
>>> So we could compare 10/30, 11/35 and 12/33 for control to 20/33, 21/32
>>   and
>>> 19/34 for case.
>>>
>>> Does this methodology sound reasonable? It is necessarily based on the
>>> assumption that intronic "expression" due to unspliced RNAs is correlated
>>> with exon expression.
>>>
>>> If it sounds reasonable, is there a test that is recommended to compare
>> the
>>> ratios in such a way, that takes into account the biological replication
>> of
>>> samples? I could do a simple test (chi squared) to compare the relative
>>> frequencies, but this wouldn't take into account the replicates.
>>>
>>> I realise this isn't really a specific bioconductor question, but
>> hopefully
>>> it might be of interest to some of the list subscribers.
>>>
>>> Many thanks,
>>>
>>> Jim
>>>
>>> --
>>> James R Perkins, PhD
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>>
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793