[BioC] edgeR : input and design matrix help

Mark Robinson mark.robinson at imls.uzh.ch
Fri May 18 22:31:35 CEST 2012



On 18.05.2012, at 12:11, KJ Lim wrote:

> Dear Prof Mark Robinson,
> 
> Good day.
> 
>> treat <-
> 
>> 
>> factor(c("C1","C2","3h1","3h2","24h1","24h2","48h1","48h2","C1","C2","3h1","3h2","24h1","24h2","48h1","48h2"))
>> 
>> Maybe it's better to call this 'time' and you'll need to change this to
>> something like:
>> time <- factor( rep(c("C","3h","24h","48h"), each=4) )
>> 
> 
> I assigned the time factor as suggested and it showed me as:
> 
>> time
>  [1] C  C  C  C  3h 3h 3h 3h 24h 24h 24h 24h 96h 96h 96h 96h
>  Levels: 24h 3h 96h C
> 
> However, may I ask, does this time factor vector matches the read counts of
> time in the input file?
> The read counts of time in the input file is: C  C  3h 3h 24h 24h 96h 96h
> C  C  3h 3h 24h 24h 96h 96h
> 
> Should I make some adjustment in my input file in order to suit the time
> factor vector?  Please forgive me, if this is a stupid question.


Apologies, my mistake … change that to:

 time <- factor( rep(rep(c("C","3h","24h","48h"),each=2),2) )

Best,
Mark






> If I understand your description, you are interested primarily in the
>> differences in genotypes.  I would suggest starting with:
>> 
>> design <- model.matrix(~time+geno)
>> 
> 
> I would like to study the differentially expressed genes in both 2 HS and
> LS genotypes of tress across time course experiment.
> 
> If I would like to study the differentially expressed genes in HS genotype
> tress across time course, can I have the design matrix as below?
> 
> design <- model.matrix(~time, data=rawdata)
> 
> Thank you very much for your time and help.
> 
> Best regards,
> KJ Lim
> 
> 
> 
> On 18 May 2012 11:32, Mark Robinson <mark.robinson at imls.uzh.ch> wrote:
> 
>> Hi KJ Lim,
>> 
>>> trees <-
>>> 
>> factor(c("HS","HS","HS","HS","HS","HS","HS","HS","LS","LS","LS","LS","LS","LS","LS","LS"))
>> 
>> This one seems ok, although could be written more compactly and I'll give
>> it a different name:
>> 
>> geno <- factor( rep(c("HS","LS"),each=8))
>> 
>>> treat <-
>>> 
>> factor(c("C1","C2","3h1","3h2","24h1","24h2","48h1","48h2","C1","C2","3h1","3h2","24h1","24h2","48h1","48h2"))
>> 
>> Maybe it's better to call this 'time' and you'll need to change this to
>> something like:
>> 
>> time <- factor( rep(c("C","3h","24h","48h"), each=4) )
>> 
>> These two factor vectors match the columns of your count table, so you
>> should be ok there.
>> 
>> If I understand your description, you are interested primarily in the
>> differences in genotypes.  I would suggest starting with:
>> 
>> design <- model.matrix(~time+geno)
>> 
>>> design
>>  (Intercept) time3h time48h timeC genoLS
>> 1            1      0       0     1      0
>> 2            1      0       0     1      0
>> 3            1      0       0     1      0
>> 4            1      0       0     1      0
>> 5            1      1       0     0      0
>> 6            1      1       0     0      0
>> 7            1      1       0     0      0
>> 8            1      1       0     0      0
>> 9            1      0       0     0      1
>> 10           1      0       0     0      1
>> 11           1      0       0     0      1
>> 12           1      0       0     0      1
>> 13           1      0       1     0      1
>> 14           1      0       1     0      1
>> 15           1      0       1     0      1
>> 16           1      0       1     0      1
>> attr(,"assign")
>> [1] 0 1 1 1 2
>> attr(,"contrasts")
>> attr(,"contrasts")$time
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$geno
>> [1] "contr.treatment"
>> 
>> Then, you can follow the usual steps as per the edgeR user's guide --
>> estimate dispersion, fit the glm with glmFit() and do the LR testing with
>> glmLRT() and so on.
>> 
>> Hope that gets you started.
>> 
>> Best,
>> Mark
>> 
>> 
>> 
>> 
>> On 18.05.2012, at 05:15, KJ Lim wrote:
>> 
>>> Dear edgeR users,
>>> 
>>> I am not a statistician nor R programming geek, please forgive me if I
>> ask
>>> stupid question.
>>> 
>>> I have RNA-seq data for 2 different genotype of trees with different time
>>> points 0hour(control),3hr,24hours,and 48hours. Each time point has two
>>> replicates.  The experiment design like following:
>>> 
>>>                        Sample harvested after treatment at
>>> Tree H1     Ctrl    3hrs  24hrs  48hrs
>>> Tree H2     Ctrl    3hrs  24hrs  48hrs
>>> 
>>> Tree L1     Ctrl    3hrs  24hrs  48hrs
>>> Tree L2     Ctrl    3hrs  24hrs  48hrs
>>> 
>>> I would like to study genes that are differentially expressed (DE)
>>> throughout the time points in these 2 genotype of trees with edgeR. I
>> read
>>> from the edgeR user guide, the suitable DE analysis method for my
>> expriment
>>> is GLM likelihood ratio test.
>>> 
>>> After read case study in the user guide, I have the RNA-Seq counts in a
>>> file as below in order to input into the edgeR package.
>>> 
>>> Ref Tags H1_C H2_C H1_3H H2_3H H1_1D H2_1D H1_2D H2_2D L1_C L2_C L1_3H
>>> L2_3H L1_1D L2_1D L1_2D L2_2D
>>> AA212259 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
>>> AB216460 1 0 0 1 2 1 6 2 1 1 1 0 5 1 3 1
>>> AC221873 3 0 1 0 3 0 0 2 1 0 1 0 2 1 2 0
>>> AD235900 3 1 6 0 5 4 4 4 7 2 4 3 3 0 8 0
>>> 
>>> I used the following commands to input the counts file into edgeR:
>>> 
>>> rawdata <- read.delim("file.txt", sep="\t", check.name=FALSE,
>>> stringsAsFactor=FALSE)
>>> trees <-
>>> 
>> factor(c("HS","HS","HS","HS","HS","HS","HS","HS","LS","LS","LS","LS","LS","LS","LS","LS"))
>>> treat <-
>>> 
>> factor(c("C1","C2","3h1","3h2","24h1","24h2","48h1","48h2","C1","C2","3h1","3h2","24h1","24h2","48h1","48h2"))
>>> 
>>> I'm not good in R programming, thus, having this file input into the
>> edgeR
>>> and assign the factors as well as design-matrix is a challenge. I'm stuck
>>> how to tell the edgeR about my design matrix.
>>> 
>>> Could you guys kindly help me on this? Have I input my RNA-Seq counts
>>> correctly? Please correct me if there is any mistake I have done in my
>>> edgeR input.
>>> 
>>> I appreciate very much for your help and time. Have a nice day.
>>> 
>>> Best regards,
>>> KJ Lim
>>> 
>>>      [[alternative HTML version deleted]]
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> ----------
>> Prof. Dr. Mark Robinson
>> Bioinformatics
>> Institute of Molecular Life Sciences
>> University of Zurich
>> Winterthurerstrasse 190
>> 8057 Zurich
>> Switzerland
>> 
>> v: +41 44 635 4848
>> f: +41 44 635 6898
>> e: mark.robinson at imls.uzh.ch
>> o: Y11-J-16
>> w: http://tiny.cc/mrobin
>> 
>> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list