[BioC] Interspecies differential expression of orthologs with Edger

Tue Sep 2 14:10:09 CEST 2014

Thanks Tim ,  quickly going over the article, it looks like a kind of
cross-species-WGCNA ? very intriguing.

Does Edger DE analysis is built on the assumption that most genes are not
differentially expressed,
 and that only a small portion of them do (say <20%)  ?
I mean, in cross-species studies, or when comparing different tissues of
the same organism, if this assumption doesn't hold, should it be a serious
concern ?

Assaf

On Thu, Aug 28, 2014 at 7:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

> This is super helpful.  Just to be clear, the most robust solution is to
> use edgeR and offset for putative gene length, TMM & library size while
> using raw counts (not effective counts) estimated by e.g. RSEM, eXpress, or
> the like?
>
> Also re: cross-species comparisons, while my experience is that it is
> indeed a can of worms, Mark Gerstein's group recently published a method
> that might interest others working on non-model or incompletely annotated
> organisms:
>
> http://genomebiology.com/2014/15/8/R100
>
> Any thoughts on applicability of the method for kooky experiments such as
> comparing Drosophila hemocytes, zebrafish vascular endothelial progenitors
> and the same in mice?  Or for that matter, alligator differentiation.
>
> I never realized how hard RNAseq in non-model organisms was until I tried
> it.
>
>
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>
>
> On Thu, Aug 28, 2014 at 8:37 AM, assaf www <assafwww at gmail.com> wrote:
>
>> I checked, it's true, the results look the same.
>> As for FPKM, of course you are right.
>>
>> Thanks a lot
>> Assaf
>>
>>
>>
>> On Thu, Aug 28, 2014 at 2:47 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> wrote:
>>
>> > The code should have been:
>> >
>> >  offset <- expandAsMatrix(getOffset(y),dim(y))
>> >  offset <- offset + gl
>> >
>> > This should give same result as your code.
>> >
>> > rpkm() corrects for gene length as well as library size -- that's the
>> > whole purpose of RPKMs:
>> >
>> >   rpkm(y, gene.length=geneLength)
>> >
>> > should work for you without any modification.
>> >
>> > Gordon
>> >
>> >
>> > On Wed, 27 Aug 2014, assaf www wrote:
>> >
>> >  This is very helpful for me, thanks.
>> >>
>> >> A slight change that I made in the code you sent, to avoid some R
>> erros,
>> >> is
>> >>
>> >> # to replace:
>> >> offset = offset + gl
>> >> # with:
>> >> offset = sweep(gl,2,offset,"+")
>> >>
>> >> In addition to differential expression tests, I wanted also to extract
>> >> FPKMs values (and/or normalized CPM values), that would take into
>> account
>> >> all components of the offset (which if I'm not mistaken are:
>> library_size
>> >> +
>> >> TMM + gene_size).
>> >> I assume rpkm()/cpm() should correct only for library_size + TMM.
>> >> Is there a possibly "decent" solution for that ?
>> >>
>> >> all the best, and thanks,
>> >> Assaf
>> >>
>> >>
>> >>
>> >> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> >> wrote:
>> >>
>> >>  It works something like this:
>> >>>
>> >>>   library(edgeR)
>> >>>   y <- DGEList(counts=counts)
>> >>>   y <- calcNormFactors(y)
>> >>>
>> >>> # Column correct log gene lengths
>> >>> # Columns of gl should add to zero
>> >>>
>> >>>   gl <- log(geneLength)
>> >>>   gl <- t(t(gl)-colMeans(gl))
>> >>>
>> >>> # Combine library sizes, norm factors and gene lengths:
>> >>>
>> >>>   offset <- expandAsMatrix(getOffset(y))
>> >>>   offset <- offset + gl
>> >>>
>> >>> Then
>> >>>
>> >>>   y$offset <- offset
>> >>>   y <- estimateGLMCommonDisp(y,design)
>> >>>
>> >>> etc.
>> >>>
>> >>> Note that I have not tried this myself.  It should work in principle
>> from
>> >>> a differential expression point of view.
>> >>>
>> >>> On the other hand, there may be side effects regarding dispersion
>> trend
>> >>> estimation -- I do not have time to explore this.
>> >>>
>> >>> Gordon
>> >>>
>> >>> ---------------------------------------------
>> >>> Professor Gordon K Smyth,
>> >>> Bioinformatics Division,
>> >>> Walter and Eliza Hall Institute of Medical Research,
>> >>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> >>> http://www.statsci.org/smyth
>> >>>
>> >>> On Wed, 27 Aug 2014, assaf www wrote:
>> >>>
>> >>>  Probably wrong, but the reason I thought of using quantile
>> normalization
>> >>>
>> >>>> is
>> >>>> the need to correct both for the species-length, and library size.
>> >>>>
>> >>>>
>> >>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at wehi.edu.au>
>> >>>> wrote:
>> >>>>
>> >>>>  That doesn't look helpful to me.  I suggested that you incorporate
>> gene
>> >>>>
>> >>>>> lengths into the offsets, not do quantile normalization of cpms.
>> >>>>>
>> >>>>> Sorry, I just don't have time to develop a code example for you.  I
>> >>>>> hope
>> >>>>> someone else will help.
>> >>>>>
>> >>>>> The whole topic of interspecies differential expression is a can of
>> >>>>> worms.
>> >>>>> Even if you adjust for gene length, there will still be differences
>> in
>> >>>>> gc
>> >>>>> content and mappability between the species.
>> >>>>>
>> >>>>> Gordon
>> >>>>>
>> >>>>>
>> >>>>> On Wed, 27 Aug 2014, assaf www wrote:
>> >>>>>
>> >>>>>  Dear Gordon thanks,
>> >>>>>
>> >>>>>
>> >>>>>> Suppose I start with the following matrices:
>> >>>>>>
>> >>>>>> # 'counts' is the Rsem filtered counts
>> >>>>>>
>> >>>>>>  counts[1:4,]
>> >>>>>>
>> >>>>>>>
>> >>>>>>>                      h0  h1  h2  n0  n1  n2
>> >>>>>>>
>> >>>>>> ENSRNOG00000000021   36  17  20  10  25  38
>> >>>>>> ENSRNOG00000000024 1283 615 731 644 807 991
>> >>>>>> ENSRNOG00000000028   26  12  11  18  23  28
>> >>>>>> ENSRNOG00000000029   22  13  12  16  17  15
>> >>>>>>
>> >>>>>> # 'geneLength' is the species-specific gene lengths, for species
>> 'h'
>> >>>>>> and
>> >>>>>> 'n':
>> >>>>>>
>> >>>>>>  geneLength[1:3,]
>> >>>>>>
>> >>>>>>>
>> >>>>>>>                    h0.length h1.length h2.length n0.length
>> n1.length
>> >>>>>>>
>> >>>>>> n2.length
>> >>>>>> ENSRNOG00000000021      1200      1200      1200      1303
>> 1303
>> >>>>>> 1303
>> >>>>>> ENSRNOG00000000024      1050      1050      1050      3080
>> 3080
>> >>>>>> 3080
>> >>>>>> ENSRNOG00000000028      1047      1047      1047      1121
>> 1121
>> >>>>>> 1121
>> >>>>>>
>> >>>>>>
>> >>>>>> does the following code look correct, and may allow allows across
>> >>>>>> species
>> >>>>>> analysis ?:
>> >>>>>> (technically it works, I checked)
>> >>>>>>
>> >>>>>> # quantile normalization: (as in here:
>> >>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+
>> >>>>>> digital+gene+expression
>> >>>>>> )
>> >>>>>>
>> >>>>>> x1 = counts*1000/geneLength
>> >>>>>> library(limma)
>> >>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile")
>> >>>>>> offset = log(counts+0.1)-log(x2+0.1)
>> >>>>>>
>> >>>>>> ...
>> >>>>>>
>> >>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset)
>> >>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset)
>> >>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset)
>> >>>>>> fit <- glmFit(y,design,offset=offset)
>> >>>>>>
>> >>>>>> ...
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks in advance for any help..,
>> >>>>>> Assaf
>> >>>>>>
>> >>>>>>
>> >>>>>  ____________________________________________________________
>> >>> __________
>> >>> The information in this email is confidential and intended solely for
>> the
>> >>> addressee.
>> >>> You must not disclose, forward, print or use it without the
>> permission of
>> >>> the sender.
>> >>> ______________________________________________________________________
>> >>>
>> >>>
>> >>
>> > ______________________________________________________________________
>> > The information in this email is confidential and inte...{{dropped:10}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>

	[[alternative HTML version deleted]]