[BioC] EdgeR GLM gene expression analysis of 3 groups

Davis McCarthy dmccarthy at wehi.EDU.AU
Fri Mar 25 01:18:44 CET 2011


Hi Shona

I'll do my best to help with your query. From looking at the output you gave there are a few problems. 

First, I am concerned that you have non-integer counts. The methods in edgeR are designed to work on raw counts, that is the raw number of mapped reads that align to your features of interest (whether they be genes, exons or other genomic regions of interest). Where do your counts come from in this dataset? You should absolutely not do any RPKM normalization or anything like that before differential expression  (DE) analysis in edgeR.

Second, the model will not fit because your design matrix is overspecified. With that design matrix you are actually trying to fit more parameters (columns of the design matrix, 11), than you have samples (9). I hope you can recognise that this would be problematic! You should not be trying to fit a parameter for each sample. All you need to do here is drop the sample indicator from your call to model.matrix. So you would have:
R> design <- model.matrix(~ age, data=targets)

This fits has a parameter for each of your age groups (so three parameters/columns in design matrix, one of which is the intercept term). Do you have any other covariate information? If so, this could be included in the design matrix, but the design simply with age will allow you to perform the analysis that you have said you want to do.

Third, you were worried that you seemed to be missing "middle" in the design matrix. This is actually correct. It is to do with the default way in which R forms the design matrix, using "treatment contrasts" (google-able). See the help for model.matrix and contrasts. From the contrasts help file: "contr.treatment contrasts each level with the baseline level (specified by base): the baseline level is omitted." This means that your design matrix is set up to contrast (compare) each age with the baseline level, which here is "middle". So the "ageold" column of the design contrasts "old" with "middle" and "ageyoung" contrasts "young" with "middle". When you have an intercept in the model, as you do here, the "middle" baseline level is absorbed into that and so 'disappears' from the design matrix. It is nothing to worry about.

So, to test the effect of having three different ages as opposed to having no different ages, you would form the design matrix as I've described above and then call the following to estimate the tagwise dispersions.
R> d <-estimateCRDisp(d, design)  ## NB: this function is being deprecated in the next release of edgeR, coming up in about 2 weeks---we will have new, better GLM methods for RNA-seq analysis! ##

Apologies for the uninformative error message here - we're working hard on the package to improve it.

To carry out your actual test, you need to fit the full model:
R> glmfit <- glmFit(d, design, dispersion=d$CR.tagwise.dispersion)
Then you can conduct the likelihood-ratio test (chi-square on 2 degrees of freedom):
R> results <- glmLRT(d, glmfit, contrast=c(0,1,1))
Then you can look at the top DE genes:
R> topTags(results)

And so on.

Hope this helps with your analysis.

Best wishes
Davis



On Mar 24, 2011, at 10:35 PM, Shona Wood wrote:

> Hi,
> 
> I have been trying to compare RNA-seq data 3 groups; young, middle aged and old. 
> I have done pairwise comparisons between these groups but i would like to 
> compare all three together. I have been trying to use the GLM method to do this, 
> following the tutorial, 11 Case study: Oral carcinomas vs matched normal tissue. 
> i understand that this only compares two conditions but i tried to adapt it to 
> my needs. Being new at this I am unsure if I am doing it right and I get an 
> error when trying to estimateCRdisp. Though I think the problem is actually to 
> do with the way i have specified the design because "middle" doesnt seem to be 
> in the design. Please see below:
> 
> R version 2.12.2 (2011-02-25)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: i386-pc-mingw32/i386 (32-bit)
> 
> library(edgeR)
> setwd ("/Users/swood/Documents/rna-seq analysis")
> targets<-read.delim (file="targets.txt",
> + stringsAsFactors = FALSE, row.names = "label")
> Error in data[[rowvar]] : attempt to select less than one element
> targets<-read.delim (file="targets.txt")
> targets$age<-factor(targets$age)
> targets$sample<-factor(targets$sample)
> targets
>   X  files    age sample
> 1 p1 p1.txt  young      1
> 2 p2 p2.txt  young      2
> 3 p3 p3.txt  young      3
> 4 p4 p4.txt    old      4
> 5 p5 p5.txt    old      5
> 6 p6 p6.txt    old      6
> 7 p7 p7.txt middle      7
> 8 p8 p8.txt middle      8
> 9 p9 p9.txt middle      9
> d<-readDGE(targets, skip=1, comment.char='#')
> colnames(d)<-row.names(targets)
> d<-calcNormFactors(d)
> d
> An object of class "DGEList"
> $samples
>   X  files    age sample lib.size norm.factors
> 1 p1 p1.txt  young      1 674816.7    0.7598895
> 2 p2 p2.txt  young      2 648856.9    0.9095203
> 3 p3 p3.txt  young      3 449797.3    1.2774867
> 4 p4 p4.txt    old      4 699036.0    0.8532141
> 5 p5 p5.txt    old      5 441649.7    1.1919854
> 6 p6 p6.txt    old      6 670755.9    0.8147330
> 7 p7 p7.txt middle      7 699630.1    1.0262252
> 8 p8 p8.txt middle      8 505795.8    1.2979023
> 9 p9 p9.txt middle      9 478527.1    1.0262468
> 
> $counts
>                          1        2        3       4        5       6       7
> ENSRNOG00000000007 38.23700 37.89900 34.85820 43.6703 40.81260 43.4009 42.4969
> ENSRNOG00000000008  0.00000  0.00000  0.00000  0.0000  0.00000  0.0000  0.0000
> ENSRNOG00000000009  0.00000  0.00000  0.00000  0.0000  0.00000  0.0000  0.0000
> ENSRNOG00000000010  1.40963  8.28797  2.09491 17.1853  1.76635 14.7567  1.0840
> ENSRNOG00000000012  0.00000  0.00000  0.00000  0.0000  0.00000  0.0000  0.0000
>                          8       9
> ENSRNOG00000000007 45.81270 48.6937
> ENSRNOG00000000008  0.00000  0.0000
> ENSRNOG00000000009  0.00000  0.0000
> ENSRNOG00000000010  5.44889 17.7295
> ENSRNOG00000000012  0.00000  0.0000
> 29510 more rows ...
> 
> d<-d[rowSums(d$counts)>9,]
> design <- model.matrix(~ sample + age, data = targets)
> design
>  (Intercept) sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
> 1           1       0       0       0       0       0       0       0       0
> 2           1       1       0       0       0       0       0       0       0
> 3           1       0       1       0       0       0       0       0       0
> 4           1       0       0       1       0       0       0       0       0
> 5           1       0       0       0       1       0       0       0       0
> 6           1       0       0       0       0       1       0       0       0
> 7           1       0       0       0       0       0       1       0       0
> 8           1       0       0       0       0       0       0       1       0
> 9           1       0       0       0       0       0       0       0       1
>  ageold ageyoung
> 1      0        1
> 2      0        1
> 3      0        1
> 4      1        0
> 5      1        0
> 6      1        0
> 7      0        0
> 8      0        0
> 9      0        0
> attr(,"assign")
> [1] 0 1 1 1 1 1 1 1 1 2 2
> attr(,"contrasts")
> attr(,"contrasts")$sample
> [1] "contr.treatment"
> 
> attr(,"contrasts")$age
> [1] "contr.treatment"
> 
> d<-estimateCRDisp(d, design)
> Error in beta[i, ] <- z %*% X : 
>  number of items to replace is not a multiple of replacement length
> 
> 
> If you could help me out with the design or suggest a reason for the error i 
> would really appreciate it.
> 
> Thanks!
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
dmccarthy at wehi.edu.au
http://www.wehi.edu.au




______________________________________________________________________
The information in this email is confidential and intended solely for the addressee.
You must not disclose, forward, print or use it without the permission of the sender.
______________________________________________________________________


More information about the Bioconductor mailing list