[BioC] edgeR GLM : family based experimental design

Wed Feb 13 21:17:52 CET 2013

Hi,

I would be grateful for any comments/criticism/suggestions on my approach to the following datasets.

I have RNA-seq data from four families, with incomplete genotype. One member of each family is a proband. I am interested in Proband vs Control kind on analysis. I am using the latest release of edgeR.
=====================================================
Samples structure as follow:
Family 1 (COMPLETE)
Father, Mother, Sibling (Male), Proband

Family 2 (COMPLETE)
Father, Mother, Proband

Family 3 (INCOMPLETE)
Sibling (Female), Proband

Family 4 (INCOMPLETE)
Mother, Proband
=====================================================
FIRST CASE
In the first batch of data, I have RNA-seq from three clones each individual. That makes all together 33 samples. Family 1 and Family 2 were processed in  two different flow cells. Most of family 3 and 4 were together on the same flow cell. Therefore there is overlap between flow cells (or batches) and families. No females proband, so likely some sex bias.

I added all the clones, and ended up with 11 samples, 7 normal controls and 4 probands. Samples are obviously not independent. Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 4 

Family <- c(1,1,1,1,2,2,2,3,3,4,4)
Group <- c("C","C","C","P","C","C","P","C","P","C","P")

Defined a design matrix to account for family blocks:
design <- model.matrix(~Family+Group)

and ran the analysis
data <- estimateGLMCommonDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion)
de.MPtag <- glmLRT(d.glmfit)

data <- estimateGLMTrendedDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion)
de.MPtag <- glmLRT(d.glmfit)

data <- estimateGLMTagwiseDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion)
de.MPtag <- glmLRT(d.glmfit)

and got my diff expr calls for common, trended and tagwise dispersion estimates.
=====================================================
SECOND CASE
In the first batch of data, I have RNA-seq from one clones each individual. But cells have been taken at different days of culture: 15, 26, 46. I have incomplete data and family genotypes each day. Here I would also be interested in day by day differences or sort of time course, but I am describing only the Proband vs Control analysis.

Day 15
Family 1: Sibling, Proband
Family 2: Father, Mother, Proband

Day 26
Family 2: Father, Mother, Proband
Family 3: Mother, Proband
Family 4: Sibling (Female), Proband

Day 46
Family 2: Father, Mother, Proband
Family 3: Mother, Proband
Family 4: Sibling (Female), Proband

That makes all together 19 samples, 11 normal controld and 8 probands. Samples are again not independent. Cannot comment on batches at this point. No females proband, so likely some sex bias.

Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 8 

Family <- c(1,1,1,1,2,2,2,3,3,4,4)
Day <- c(15,26,46)
Group <- c("C","P","C","C","P","C","C","P","C","P","C","P","C","C","P","C","P","C","P")

Defined a design matrix to account for family and day blocks:
design <- model.matrix(~Family+Day+Group)

and ran the analysis
data <- estimateGLMCommonDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion)
de.MPtag <- glmLRT(d.glmfit)

data <- estimateGLMTrendedDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion)
de.MPtag <- glmLRT(d.glmfit)

data <- estimateGLMTagwiseDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion)
de.MPtag <- glmLRT(d.glmfit)

and got my diff expr calls for common, trended and tagwise dispersion estimates.

NOTE
Here I need to add that I put all together, in this specific case, as I am thinking I will have more power to get DEG. That said, given the I am also expecting biological differences from day to day, I am probably getting only what is common to the different stages, in spite of the Day blocking.
=====================================================

QUESTIONS:
Is there anything fundamentally wrong in what I did ?

Is the design matrix reasonable ?

I used two values for CUTOFF: 0.01 and 1. I get quite more DEG with CUTOFF=0.01 than with CUTOFF=1, while I would expect the opposite, since in the first case I am processing about 28000 genes, while in the second case I am processing about 18000 genes. Unless, the many more lowly (or zero) expressed genes affect the dispersion estimate. Is this the case ?

Thanks
Gianfilippo