[BioC] edgeR design matrix

Fri Oct 14 17:13:36 CEST 2011

Hi Gordon,

I try it like you said, but I get a mistake, I do not know how to solve 
it??

Thanks in advance

Laura

 > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
+ )
 > design <-model.matrix(~cultivar)
 > design
    (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7
1            1         0         0         0         0         0         0
2            1         0         0         0         0         0         0
3            1         1         0         0         0         0         0
4            1         1         0         0         0         0         0
5            1         0         1         0         0         0         0
6            1         0         1         0         0         0         0
7            1         0         0         1         0         0         0
8            1         0         0         1         0         0         0
9            1         0         0         0         1         0         0
10           1         0         0         0         1         0         0
11           1         0         0         0         0         1         0
12           1         0         0         0         0         1         0
13           1         0         0         0         0         0         1
14           1         0         0         0         0         0         1
15           1         0         0         0         0         0         0
16           1         0         0         0         0         0         0
    cultivar8
1          0
2          0
3          0
4          0
5          0
6          0
7          0
8          0
9          0
10         0
11         0
12         0
13         0
14         0
15         1
16         1
attr(,"assign")
[1] 0 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cultivar
[1] "contr.treatment"

 > d <- estimateGLMCommonDisp(d, design)
 > d
An object of class "DGEList"
$samples
                   files group description lib.size norm.factors
1  s_1_CE_1_1counts.txt   CER   Cervil_CE 54428965            1
2  s_2_CE_1_2counts.txt   CER   Cervil_CE 47946147            1
3  s_6_CE_6_1counts.txt   CRI  Criollo_CE 49025258            1
4  s_7_CE_6_2counts.txt   CRI  Criollo_CE 43322654            1
5 s_1_CE_10_1counts.txt   FER    Ferum_CE 53574521            1
11 more rows ...

$counts
                        1   2   3   4   5   6   7   8   9  10  11  12  
13  14
GATCAAGAAAATAAAATGAA 203 136 344 195 286 182 438 492 418 273  97 208 256 345
GATCTCTGTCTCATCTTTCG 122  87 113 131 210 332 384 399 221 309 138 127 329 272
GATCTTCTTTTGAAGTAAAC 116 141 125  52 158 248 170 141 137 135 166 196 111 108
GATCAGCCAGGTCCAAGGCC  56  15  41  44  36  47  40  39  80  56  26  46  31  46
GATCCTCCGGAACCACAAGA 370 226  11   4  12   0  23   5 129   1  38  22   5   9
                       15 16
GATCAAGAAAATAAAATGAA 154 69
GATCTCTGTCTCATCTTTCG 112 94
GATCTTCTTTTGAAGTAAAC 134 74
GATCAGCCAGGTCCAAGGCC  81 66
GATCCTCCGGAACCACAAGA 151  7
73310 more rows ...

$common.dispersion
[1] 0.1259328

 > fit <- glmFit(d,design,dispersion = d$common.dispersion)
 > lrt <- glmLRT(d,fit,design,coef=2:8)
Error in glmfit$coefficients %*% contrast : non-conformable arguments

On 10/14/2011 07:00 AM, Gordon K Smyth wrote:
> Dear Laura,
>
>> Date: Wed, 12 Oct 2011 14:23:26 +0200
>> From: lpascual <Laura.pascual at avignon.inra.fr>
>> To: Bioconductor at r-project.org
>> Subject: [BioC] edgeR design matrix
>>
>> Hi all,
>>
>> I am new with RNA-seq data analysis. I have data from illumina DGE 
>> experiments from 8 different cultivars (duplicates). I also have 4 
>> extra samples, the F1 hybrids (duplicates), obtained from crossing 
>> the 8 cultivars as follows (1x2, 3x4, 5x6, 7x8). In total I have 24 
>> DGE illumina reactions, where all the samples came from the same 
>> developmental stage and treatment. So the only different factor I 
>> have is the genotype.
>>
>> I am trying to analyze the data with the edgeR package, however I am 
>> not sure if I can use the GML function for multiple testings or how 
>> should I made my design matrix.
>>
>> To find the differences among the 8 parental lines, I know I can made 
>> two ways testing between the different samples, but I was wondering, 
>> if I could be able to do a test were the null hypothesis will be 
>> there is no difference between the cultivars, and the alternative 
>> hypothesis, at least there is difference for one cultivar.
>>
>> To do that, it is possible to use one design matrix like these:
>>
>> Cultivar1 Cultivar2 Cultivar3 cultivar4 Cultivar5 Cultivar6 Cultivar7 
>> Cultivar8
>> 1-rep1 1 0 0 0 0 0 0 0
>> 1-rep2 1 0 0 0 0 0 0 0
>> 2-rep1 0 1 0 0 0 0 0 0
>> 2-rep2 0 1 0 0 0 0 0 0
>> 3-rep1 0 0 1 0 0 0 0 0
>> 3-rep2 0 0 1 0 0 0 0 0
>> 4-rep1 0 0 0 1 0 0 0 0
>> 4-rep2 0 0 0 1 0 0 0 0
>> 5-rep1 0 0 0 0 1 0 0 0
>> 5-rep2 0 0 0 0 1 0 0 0
>> 6-rep1 0 0 0 0 0 1 0 0
>> 6-rep2 0 0 0 0 0 1 0 0
>> 7-rep1 0 0 0 0 0 0 1 0
>> 7-rep2 0 0 0 0 0 0 1 0
>> 8-rep1 0 0 0 0 0 0 0 1
>> 8-rep2 0 0 0 0 0 0 0 1
>>
>> I have try it, but to calculate the null hypothesis the gmlLRT seams
>> just to remove the last column??
>
> You could use that design matrix, but it's inconvenient.  If you want 
> to find transcripts that are different between any of the cultivars, 
> you would use:
>
>   design <- model.matrix(~Cultivar)
>   fit <- glmFit(y,design)
>   lrt <- glmLRT(y,fit,design,coef=2:8)
>   topTags(lrt)
>
> This gives you a test on 7 degrees of freedom (one for each 
> coefficient being tested) for differences between the 8 cultivars.
>
> Of course you have to estimate the dispersions as well, but I assume 
> you know how to do that.
>
>
>> Besides I also want to test the differences between two parent lines and
>> the F1 that I have derived from them I have try a design matrix as 
>> follows,
>>
>>
>> Cultivar1 Cultivar2 F1
>> 1-rep1 1 0 0
>> 1-rep2 1 0 0
>> 2-rep1 0 1 0
>> 2-rep2 0 1 0
>> F1-rep1 0 0 1
>> F1-rep2 0 0 1
>>
>> But I again face the above mentioned problem.
>
> The above example should give you a clue as to how to do this.
>
>
>> Besides I was wondering if I can employ a design matrix where I will be
>> modeling how many alleles of each cultivar have my sample, doing
>> something like these:
>>
>> Cultivar1 Cultivar2
>> 1-rep1 2 0
>> 1-rep2 2 0
>> 2-rep1 0 2
>> 2-rep2 0 2
>> F1-rep1 0 1
>> F1-rep2 0 1
>>
>> I will appreciate any advice about the proper way to face these 
>> analysis.
>
> You can put a covariate for allele number as a column in your design 
> matrix, but you obviously need a column for the intercept term as 
> well. Basically you need to start using the model.matrix() function to 
> build your design matrices.
>
>
>> An just an extra question, in the manual it is recommended to eliminate
>> all the tags with low counts, as they are no going to be sadistically
>> significance and they will slow the process. First it is said that the
>> tags with less than 5 counts are not useful, but then all the tags with
>> lees than one tag per million (TPM) are removed, however in my case
>> having 50 million sequences that is all the tags with less than 50 
>> counts.
>>
>> In fact when I do it my tags decrease a lot
>>
>> > dim(d)
>> [1] 2029802 16
>>
>> > d <-d[rowSums(1e+06 * d$counts/expandAsMatrix(d$samples$lib.size,
>> dim(d)) >1) >= 2, ]
>
> You can just use cpm(d) to compute the counts-per-million.
>
>> > dim (d)
>>
>> [1] 73310 16
>>
>> That is normal? Why did you choose to remove less than 1TPM?
>
> It depends on your library size.  A cutoff of 1 cpm was appropriate 
> for the case study, but if you library sizes are smaller, then you 
> will need to use a smaller cutoff.
>
> Best wishes
> Gordon
>
>> Thanks in advance
>> Laura Pascual
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}