[BioC] EdgeR repeated measurment

Fri Apr 4 10:06:47 CEST 2014

I have a miRNA data from an experiment, which contain repeated measurements. The experiment set up is as follows: 12 indviduals were assigned into two groups, Good and Low based on previous observations; where Good means those produce good quality and Low means those produce low quality. Samples were taken from each individual in a week interval for a month. The miRNA-seq data has three of these time points (0-day, 14-day and 28-day) for the eleven individuals, unfortunetly we lost one of the samples from Good-0-day group. Our aim is to check whether miRNA are involved in determining the quality and whether the time of sampling affect the quality through miRNAs. 
Hypotheses of test are:

1.Mean miRNA expression is the same regardless of quality at any time point 
2.Mean miRNA expression is the same at different time point within the same quality group  

Below is the detail of the analysis, which I used in edgeR. 

My question is, am I in the write track? How can I check whether the assumption of repeated measurments met or not?  
As you can see from the head of the data table individual S4 looks an outlier, which is also clear after normalization in a plot (plot is not shown). My question is that should I remove it? Is there any effect in case of removing or missing samples? Does sample imbalance affect the model?   

R version 3.1.0 beta (2014-03-28 r65330) 
Platform: x86_64-pc-linux-gnu (64-bit)

> y <- read.table(file="salmon",row.names="miRNA", header=TRUE)
> head(y)
           S2 S12 S13 S19 S20  S4 S6 S9 S15 S23 S30 S42 S47 S52 S53 S59 S60 S44
let-7a 50  17   7  13  18 570 40 25  89 126  36  22  12  21  23  59   6  27
let-7b  3   3   1   2   0  54  6  2  11   7   7   0   4   5   3  22   0   8
let-7d  0   1   0   0   1  21  0  2   0   0   1   0   0   1   0   6   0   1
let-7e  0   0   1   3   1 491  4  2  15   8  11   0   6   1   0   3   0   2
let-7g 40   1   1   1   1 162  3  4  10  57  13   2   1   8   2   8   0   7
let-7h  0   0   0   0   0 271  1  2   0   1   2   0   1   0   0   2   2   0
           S46 S49 S55 S43 S50 S82 S87 S92 S93 S99 S100 S84 S85 S89 S95 S83 S90
let-7a 110  18   5  42   1  24  14   2  11  47    5  48  28  30  26   6  18
let-7b  15   1   4   5   1   3   0   0   2  11    0  30   5   1   1   0   3
let-7d   1   0   0   0   0   0   2   0   0   1    0   0   2   0   0   0   0
let-7e  12   2   2  33   0   7   5   0   3  21    3   1  18  12   2   0  11
let-7g   4   1   0  11   2   9   6   0   0  19    5  10   7   5   4   2   1
let-7h   0   0   1   9   0   0   0   0   2   3    1   0   0   0   0   0   3
> time <- factor(c("0day", "0day", "0day","0day","0day","0day","0day","0day","0day","0day","0day","14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day"))
> quality <- factor(c("Good", "Good","Good","Good","Good","Low","Low","Low","Low","Low","Low","Good","Good","Good","Good","Good","Good","Low","Low","Low","Low","Low","Low","Good", "Good","Good","Good","Good","Good","Low","Low","Low","Low","Low","Low"))
> indv <- factor(c("F427","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448","F427","F432","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448","F427","F432","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448"))
> dim(y)
[1] 140  35
> dge <- DGEList(counts=y)
> dge <- calcNormFactors(dge)
> m <- 1e6 * t(t(dge$counts)/dge$samples$lib.size)
> keep <- rowSums(m > 1) >= 3
> dge <- dge[keep,]
> dim(dge)
[1] 135  35
> indv  <- factor(c("1","2","3","4","5","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6"))
> targets <- data.frame(quality,indv,time)
> targets
   quality indv  time
1     Good    1  0day
2     Good    2  0day
3     Good    3  0day
4     Good    4  0day
5     Good    5  0day
6      Low    1  0day
7      Low    2  0day
8      Low    3  0day
9      Low    4  0day
10     Low    5  0day
11     Low    6  0day
12    Good    1 14day
13    Good    2 14day
14    Good    3 14day
15    Good    4 14day
16    Good    5 14day
17    Good    6 14day
18     Low    1 14day
19     Low    2 14day
20     Low    3 14day
21     Low    4 14day
22     Low    5 14day
23     Low    6 14day
24    Good    1 28day
25    Good    2 28day
26    Good    3 28day
27    Good    4 28day
28    Good    5 28day
29    Good    6 28day
30     Low    1 28day
31     Low    2 28day
32     Low    3 28day
33     Low    4 28day
34     Low    5 28day
35     Low    6 28day
> quality <- factor(targets$quality, levels=c("Good","Low"))
> time <- factor(targets$time, levels=c("0day","14day","28day"))
> design <- model.matrix(~quality+quality:indv+quality:time) 
> dge <- estimateGLMCommonDisp(dge,design)
> dge <- estimateGLMTrendedDisp(dge,design)
> dge <- estimateGLMTagwiseDisp(dge,design)
> fit <- glmFit(dge, design)
> lrt <- glmLRT(fit, coef="qualityGood:time14day")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_14dayvs0day.csv")

> lrt <- glmLRT(fit, coef="qualityGood:time28day")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_28dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_28dayvs14day.csv")

> lrt <- glmLRT(fit, contrast=c(0,-1,0,0,0,0,0,0,0,0,0,0,0,1,0,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_14dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_28dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_28dayvs14day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="28day_LowvsGood.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,-1,1,0,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="14day_LowvsGood.csv")

> lrt <- glmLRT(fit, coef="qualityLow")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="0day_LowvsGood.csv")
I have a miRNA data from an experiment, which contain repeated measurements. The experiment set up is as follows: 12 indviduals were assigned into two groups, Good and Low based on previous observations; where Good means those produce good quality and Low means those produce low quality. Samples were taken from each individual in a week interval for a month. The miRNA-seq data has three of these time points (0-day, 14-day and 28-day) for the eleven individuals, unfortunetly we lost one of the samples from Good-0-day group. Our aim is to check whether miRNA are involved in determining the quality and whether the time of sampling affect the quality through miRNAs. 
Hypotheses of test are:

1.Mean miRNA expression is the same regardless of quality at any time point 
2.Mean miRNA expression is the same at different time point within the same quality group  

Below is the detail of the analysis, which I used in edgeR. 

My question is, am I in the write track? How can I check whether the assumption of repeated measurments met or not?  
As you can see from the head of the data table individual S4 looks an outlier, which is also clear after normalization in a plot (plot is not shown). My question is that should I remove it? Is there any effect in case of removing or missing samples? Does sample imbalance affect the model?   

R version 3.1.0 beta (2014-03-28 r65330) 
Platform: x86_64-pc-linux-gnu (64-bit)

> y <- read.table(file="salmon",row.names="miRNA", header=TRUE)
> head(y)
           S2 S12 S13 S19 S20  S4 S6 S9 S15 S23 S30 S42 S47 S52 S53 S59 S60 S44
let-7a 50  17   7  13  18 570 40 25  89 126  36  22  12  21  23  59   6  27
let-7b  3   3   1   2   0  54  6  2  11   7   7   0   4   5   3  22   0   8
let-7d  0   1   0   0   1  21  0  2   0   0   1   0   0   1   0   6   0   1
let-7e  0   0   1   3   1 491  4  2  15   8  11   0   6   1   0   3   0   2
let-7g 40   1   1   1   1 162  3  4  10  57  13   2   1   8   2   8   0   7
let-7h  0   0   0   0   0 271  1  2   0   1   2   0   1   0   0   2   2   0
           S46 S49 S55 S43 S50 S82 S87 S92 S93 S99 S100 S84 S85 S89 S95 S83 S90
let-7a 110  18   5  42   1  24  14   2  11  47    5  48  28  30  26   6  18
let-7b  15   1   4   5   1   3   0   0   2  11    0  30   5   1   1   0   3
let-7d   1   0   0   0   0   0   2   0   0   1    0   0   2   0   0   0   0
let-7e  12   2   2  33   0   7   5   0   3  21    3   1  18  12   2   0  11
let-7g   4   1   0  11   2   9   6   0   0  19    5  10   7   5   4   2   1
let-7h   0   0   1   9   0   0   0   0   2   3    1   0   0   0   0   0   3
> time <- factor(c("0day", "0day", "0day","0day","0day","0day","0day","0day","0day","0day","0day","14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "14day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day", "28day"))
> quality <- factor(c("Good", "Good","Good","Good","Good","Low","Low","Low","Low","Low","Low","Good","Good","Good","Good","Good","Good","Low","Low","Low","Low","Low","Low","Good", "Good","Good","Good","Good","Good","Low","Low","Low","Low","Low","Low"))
> indv <- factor(c("F427","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448","F427","F432","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448","F427","F432","F437","F438","F445","F446","F429","F431","F434","F440","F447","F448"))
> dim(y)
[1] 140  35
> dge <- DGEList(counts=y)
> dge <- calcNormFactors(dge)
> m <- 1e6 * t(t(dge$counts)/dge$samples$lib.size)
> keep <- rowSums(m > 1) >= 3
> dge <- dge[keep,]
> dim(dge)
[1] 135  35
> indv  <- factor(c("1","2","3","4","5","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6","1","2","3","4","5","6"))
> targets <- data.frame(quality,indv,time)
> targets
   quality indv  time
1     Good    1  0day
2     Good    2  0day
3     Good    3  0day
4     Good    4  0day
5     Good    5  0day
6      Low    1  0day
7      Low    2  0day
8      Low    3  0day
9      Low    4  0day
10     Low    5  0day
11     Low    6  0day
12    Good    1 14day
13    Good    2 14day
14    Good    3 14day
15    Good    4 14day
16    Good    5 14day
17    Good    6 14day
18     Low    1 14day
19     Low    2 14day
20     Low    3 14day
21     Low    4 14day
22     Low    5 14day
23     Low    6 14day
24    Good    1 28day
25    Good    2 28day
26    Good    3 28day
27    Good    4 28day
28    Good    5 28day
29    Good    6 28day
30     Low    1 28day
31     Low    2 28day
32     Low    3 28day
33     Low    4 28day
34     Low    5 28day
35     Low    6 28day
> quality <- factor(targets$quality, levels=c("Good","Low"))
> time <- factor(targets$time, levels=c("0day","14day","28day"))
> design <- model.matrix(~quality+quality:indv+quality:time) 
> dge <- estimateGLMCommonDisp(dge,design)
> dge <- estimateGLMTrendedDisp(dge,design)
> dge <- estimateGLMTagwiseDisp(dge,design)
> fit <- glmFit(dge, design)
> lrt <- glmLRT(fit, coef="qualityGood:time14day")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_14dayvs0day.csv")

> lrt <- glmLRT(fit, coef="qualityGood:time28day")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_28dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Good_28dayvs14day.csv")

> lrt <- glmLRT(fit, contrast=c(0,-1,0,0,0,0,0,0,0,0,0,0,0,1,0,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_14dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_28dayvs0day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="Low_28dayvs14day.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="28day_LowvsGood.csv")

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,-1,1,0,0))
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="14day_LowvsGood.csv")

> lrt <- glmLRT(fit, coef="qualityLow")
> topTags(lrt)
> tab <- topTags(lrt, n=Inf)
> write.table(tab, file="0day_LowvsGood.csv")

 -- output of sessionInfo(): 

R version 3.1.0 beta (2014-03-28 r65330)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.4.2   limma_3.18.13

loaded via a namespace (and not attached):
[1] tools_3.1.0

--
Sent via the guest posting facility at bioconductor.org.