[BioC] Data filtering
Anand K.S.Rao [guest]
guest at bioconductor.org
Tue Oct 9 09:42:14 CEST 2012
Greetings friends!
I seek help with data that I have: 3 time points, 3 genotypes, 3 replicates for each combination = 27 libraries.
The goal is to find genes whose expression profiles over time differ amongst 2 or more genotypes.
After our 1st round of data analysis (including TMM normalization), the time-course graphs and box plots were so noisy, with high standard errors at each time point, that it was hard to say whether the expression profile of one genotype overlapped with or was distinct from those of the other genotypes! R code is attached at the bottom of this post.
So in short, we now need to employ data filters to check for and reduce noise in our data. Some ideas (sketched in code just after this list) are:
removing genes that have low expression (count) levels
removing genes that have high variance across replicates
removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
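For concreteness, here is a rough sketch of how these three filters might look, assuming the f.dat count matrix and the S / Time factors defined in the code at the bottom of this post, and edgeR's cpm() helper; all cutoffs (1 cpm in 3 libraries, the 10% quantiles) are arbitrary placeholders, not recommendations.
<CODE>
library(edgeR)
cpm.mat = cpm(as.matrix(f.dat))
#$#$ counts per million, so library size is already accounted for
keep.expr = rowSums(cpm.mat > 1) >= 3
#$#$ filter 1: keep genes with > 1 cpm in at least 3 libraries
grp = paste(S, Time, sep = "_")
rep.var = sapply(unique(grp), function(g) apply(cpm.mat[, grp == g, drop = FALSE], 1, var))
keep.repvar = rowMeans(rep.var) <= quantile(rowMeans(rep.var), 0.9)
#$#$ filter 2: drop the 10% of genes with the highest average variance across replicates
time.means = sapply(levels(Time), function(t) rowMeans(cpm.mat[, Time == t, drop = FALSE]))
keep.timevar = apply(time.means, 1, var) >= quantile(apply(time.means, 1, var), 0.1)
#$#$ filter 3: drop the 10% of genes whose per-time-point means vary least across time
f.dat.filtered = f.dat[keep.expr & keep.repvar & keep.timevar, ]
#$#$ keep only genes passing all three filters
</CODE>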
So my question to you is: at what stage of my analysis do I employ these filters?
On the raw data itself, prior to normalization?
Or should I perform the TMM normalization, use the normalization factors to transform my data into non-integer normalized counts, and then filter (in which case I think I can no longer fit them with a negative binomial model, right?)
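One option I am considering is to filter on cpm values (which already account for library size) before calcNormFactors(), keeping the raw integer counts for the negative binomial model since the TMM factors enter as offsets rather than by transforming the counts. A minimal sketch of that ordering (the 1-cpm-in-3-libraries cutoff is just an example):
<CODE>
library(edgeR)
y = DGEList(counts = as.matrix(f.dat), group = S)
#$#$ raw integer counts, genotype as group
keep = rowSums(cpm(y) > 1) >= 3
#$#$ filter on cpm (library-size aware) rather than on transformed counts
y = y[keep, ]
y$samples$lib.size = colSums(y$counts)
#$#$ recompute library sizes after filtering
y = calcNormFactors(y)
#$#$ TMM factors computed on the filtered raw counts
</CODE>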
<CODE>
count = read.table("Input.txt", sep="\t", header=T)
#$#$ read in raw count mapped data
f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]
#$#$ keep only genes with total read count > 27 across all 27 libraries
f.dat = f.count[,-c(1,ncol(count))]
#$#$ select only read count, not rest of data frame
S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))
#$#$ define group
Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))
#$#$ define time
Time.rep = rep(1:3,9)
#$#$ define replicate
Group = paste(S,Time,Time.rep,sep="_")
#$#$ define group_time_replicate
library(edgeR)
#$#$ load edgeR package
f.factor = data.frame(files = names(f.dat), S = S , Time = Time, lib.size = c(apply(f.dat,2,sum)),norm.factors = calcNormFactors(as.matrix(f.dat)))
#$#$ sample information: file names, genotype, time, library sizes, and TMM normalization factors
count.d = new("DGEList", list(samples = f.factor, counts = as.matrix(f.dat)))
#$#$ construct the DGEList object from the sample information and count matrix
design = model.matrix(~ Time + S)
#$#$ design matrix with time and genotype main effects (no interaction)
count.d = calcNormFactors(count.d)
#$#$ Normalize TMM
glmfit.d = glmFit(count.d, design, dispersion = 0.1)
#$#$ fit negative binomial generalized linear models (dispersion fixed at 0.1)
lrt.count = glmLRT(count.d, glmfit.d)
#$#$ Likelihood ratio tests
result.count = data.frame(f.count, lrt.count$table)
#$#$ combining raw data and results from edgeR
result.count$FDR = p.adjust(result.count$p.value,method="BH")
#$#$ calculating the False Discovery Rate
write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)
#$#$ saving the combined data set
</CODE>
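Since the real goal is genes whose time profiles differ between genotypes, a possible next step is an interaction model with estimated dispersions instead of the fixed dispersion = 0.1. This is only a sketch: the design.int / fit.int names are new, and glmLRT() here is called with coef= as in recent edgeR releases.
<CODE>
design.int = model.matrix(~ S * Time)
#$#$ genotype and time main effects plus genotype-by-time interaction
count.d = estimateGLMCommonDisp(count.d, design.int)
count.d = estimateGLMTrendedDisp(count.d, design.int)
count.d = estimateGLMTagwiseDisp(count.d, design.int)
#$#$ estimate dispersions from the data rather than fixing them at 0.1
fit.int = glmFit(count.d, design.int)
#$#$ fit the negative binomial GLM with the interaction design
int.coefs = grep(":", colnames(design.int))
lrt.int = glmLRT(fit.int, coef = int.coefs)
#$#$ jointly test all interaction terms: genes whose time profile differs between genotypes
top.int = topTags(lrt.int, n = Inf)
#$#$ results table with FDR-adjusted p-values
</CODE>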
-- output of sessionInfo():
.
--
Sent via the guest posting facility at bioconductor.org.