[BioC] EdgeR GLM
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Dec 15 23:43:01 CET 2011
Dear Shona,
One last suggestion. Try typing
contrasts(age)
This tells you how the coefficients are determined.
I would actually suggest you simply put age in as a numericl column
(1=young,2=middle,3=old) into your design matrix, instead of defining an
ordinal factor.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Fri, 16 Dec 2011, Gordon K Smyth wrote:
> Dear Shona,
>
> Of course linear regression would use all the data points if you had 50 ages,
> but you have in fact only three.
>
> If there are an odd number of equally spaced ages, the middle one will not
> contribute to the coefficient. That's just how the math works.
>
> Polynomial regression isn't an invention of the edgeR package, and how it
> works in R is determined by the base functions in R, not those in edgeR. If
> you have further questions about linear or quadratic regression itself, it
> might be best to read separately, or else to ask for help on a more general
> list, probably the main R help list.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Thu, 15 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me and giving me those explanations.
>
> The only thing that is still confusing me is the linear coefficient. If it
> compares young and old without regard to middle isn't it basically the same
> as a pairwise comparison? Also it seems strange that it only uses the two
> extremities - I thought linear regression should use all the data points. For
> example if we had 50 different ages would 48 data points be excluded to
> compare the first and last only?
>
> Thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 13 December 2011 23:18
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: RE: EdgeR GLM
>
> Dear Shona,
>
> I can only give you brief pointers.
>
> There are only three time points in your data. The linear coefficient in
> the model simply compares old with young, without regard to middle. The
> quadratic term compares middle with the average of young and old. So if
> expression goes 2-4-2 for a particular gene over the three times then you
> get a positive quadratic coefficient and zero linear coefficient.
>
> It is perfectly possible for the likelihood ratio test of both linear
> and quadratic together:
>
> glmLRT(d, glmfit, coef=2:3)
>
> to detect some genes not in the separate linear or quadratic lists,
> because the linear or quadratic terms might be not quite significant
> separately, but might be so when considered together. For example, the
> expression might go 2-4-3. Is this linear or quadratic? Not quite sure.
> But we might be confident that the three times are not all equal.
>
> If you only want to report genes that have increasing expression with age,
> then I would have thought you want to fit the linear term only, since the
> quadratic effect can never be guaranteed to be increasing.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Tue, 13 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me. I am not a statistician so forgive me if
> I am asking basic questions. I was just reading about quadratic equations.
> Does a significant hit in the quadratic term mean that the rate of change
> first increases and then decreases over time (or visa versa)? Therefore
> hits which require the quadratic term are not a linear up or down with age
> they are only "linear" after middle age?
>
> If this is the case what does the logFC for age.Q represent, is it the
> fold change between middle age and old? And how is it calculated? Does the
> gene expression at young age affect the quadratic calculation? For example
> if the expression at young age is 2 and then 4 at middle and then 1.99 at
> old would the fact that old age is basically back to the young expression
> level stop it being a significant quadratic result or would it be
> significant because of the change from middle age.
>
> I have done the quadratic term separately as suggested but I also did the
> linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the
> results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some
> significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in
> either of the separate quadratic or linear tests - why would this be?
>
> Finally if i want to report changes with increasing age would it be best
> to present the results from the quadratic and linear tests separately or
> would it be better to present the combined test (glmLRT(d, glmfit, coef=2,
> 3)) and point out which ones rely on the quadratic term?
>
> Many thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 10 December 2011 07:10
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: EdgeR GLM
>
> Dear Shona,
>
> Yes, age.L is the log2-fold-change for increasing age.
>
> However your topTags table is testing for the linear and quadratic terms
> together, so it is testing the significance of any differences between the
> three age groups. I'd suggest you test whether the quadratic term is
> required using
>
> glmLRT(d, glmfit, coef=3)
>
> I'd also generally recommend you go on to estimate the trended and tagwise
> dispersions, not just the common dispersion.
>
> Best wishes
> Gordon
>
>> Date: Thu, 8 Dec 2011 17:41:07 +0000
>> From: "Wood, Shona" <Shona.Wood at liverpool.ac.uk>
>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: [BioC] EdgeR GLM
>>
>> Hi,
>>
>> I am trying to look at gene expression changes with increasing age. I
>> have three time points young, middle age and old. I have been using the
>> following, ordered factors and coefficient in order to get a fold change
>> and FDR for a linear change in gene expression with increasing age. Can
>> you look over my code and see if I have done this right? Is age.L the
>> fold change in gene expression with increasing age?
>>
>>> library(edgeR)
>>> library(limma)
>>> targets<-read.delim (file="coding_targets.txt")
>>> targets$age<-factor(ordered (targets$age))
>>> targets$sample<-factor(targets$sample)
>>> targets
>> X files age sample
>> 1 p1 p1_coding_counts.txt a-young 1
>> 2 p2 p2_coding_counts.txt a-young 2
>> 3 p3 p3_coding_counts.txt a-young 3
>> 4 p7 p7_coding_counts.txt b-middle 7
>> 5 p8 p8_coding_counts.txt b-middle 8
>> 6 p9 p9_coding_counts.txt b-middle 9
>> 7 p4 p4_coding_counts.txt c-old 4
>> 8 p5 p5_coding_counts.txt c-old 5
>> 9 p6 p6_coding_counts.txt c-old 6
>>
>>> d<-readDGE(targets, skip=1, comment.char='#')
>>> colnames(d)<-row.names(targets)
>>> d<- calcNormFactors(d)
>>> d
>> An object of class "DGEList"
>> $samples
>> X files age sample lib.size norm.factors
>> 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724
>> 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573
>> 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044
>> 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660
>> 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893
>> 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573
>> 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553
>> 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109
>> 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077
>>
>> $counts
>> 1 2 3 4 5 6 7 8 9
>> ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960
>> ENSRNOG00000000008 0 0 0 0 1 0 0 1 0
>> ENSRNOG00000000009 0 0 0 3 0 0 1 0 0
>> ENSRNOG00000000010 38 405 50 18 105 405 372 42 282
>> ENSRNOG00000000012 0 0 0 0 0 0 0 0 0
>> 22932 more rows ...
>>
>>> d<-d[rowSums(d$counts)>9,]
>>> design<- model.matrix(~ age, data = targets)
>>> design
>> (Intercept) age.L age.Q
>> 1 1 -7.071068e-01 0.4082483
>> 2 1 -7.071068e-01 0.4082483
>> 3 1 -7.071068e-01 0.4082483
>> 4 1 -7.850462e-17 -0.8164966
>> 5 1 -7.850462e-17 -0.8164966
>> 6 1 -7.850462e-17 -0.8164966
>> 7 1 7.071068e-01 0.4082483
>> 8 1 7.071068e-01 0.4082483
>> 9 1 7.071068e-01 0.4082483
>> attr(,"assign")
>> [1] 0 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$age
>> [1] "contr.poly"
>>
>>> d<-estimateGLMCommonDisp(d, design)
>>> names(d)
>> [1] "samples" "counts" "common.dispersion"
>>> glmfit<- glmFit(d, design, dispersion=d$common.dispersion)
>>> results<- glmLRT(d, glmfit, coef=c(2,3))
>>> topTags(results)
>> Coefficient: age.L age.Q
>> logConc age.L age.Q LR P.Value
>> ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24
>>
>> Many Thanks
>>
>> Shona
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list