[BioC] Data filtering

Mark Robinson mark.robinson at imls.uzh.ch
Wed Oct 10 12:06:38 CEST 2012

Hi Anand,

I've added a few "reactions" below; I hope they help.

> Greetings friends!
> I seek help with data that I have : 3 time points, 3 genotypes, 3 replicates for each of these = 27 libraries
> The goal is to find genes that have different time expression profiles amongst 2 or more genotypes.

> After our 1st round of data analysis, (including TMM normalization), the time course graphs and box plots were so noisy in terms of high std error at each time point, that it was hard to say if expression profile of one genotype was overlapping or distinct from that for the other genotypes! R code attached at bottom of this post.

What did you actually plot?  What did an MDS plot look like?

> So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are 
> removing genes that have low expression (count) levels
> removing genes that have high variance across replicates
> removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)

I understand the first one (removing low counts) and would recommend it, but the statistics should largely take care of highlighting which genes are differential for the contrast of interest that you specify.  So, are your second and third filters really necessary?

> So my question to you is what stage of my analysis do I employ these filters?
> On the raw data itself, prior to normalization?
> Or should I perform the TMM normalization, use the norm factors to transform my data to non-integer normalized counts and then filter (in which case I think I cannot fit them into negative binomial model, right?)
> <CODE>
> count = read.table("Input.txt", sep="\t", header=T)                     					
> #$#$ read in raw count mapped data
> f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]                               
> #$#$ filter out genes with total read count <= 27 across all libraries

I'm not sure where this cutoff comes from.  The edgeR manual shows something similar, using the cpm() function rather than total read count.  It's still a somewhat arbitrary cutoff, but see "3.3.4 Normalization and Filtering".
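
A minimal sketch of a CPM-based filter of the kind mentioned above. The count matrix is simulated as a stand-in for the real data, and both thresholds (1 count per million, at least 3 libraries) are illustrative assumptions, not values taken from this thread:

```r
library(edgeR)

# Simulated stand-in for the real table: 1000 genes x 27 libraries,
# with the first 200 genes expressed at a very low level.
set.seed(1)
mu <- rep(c(0.01, 50), c(200, 800))
counts <- matrix(rnbinom(1000 * 27, mu = mu, size = 2), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("lib", 1:27)))

# Keep genes with more than 1 count per million in at least 3 libraries.
# Both thresholds are illustrative assumptions.
keep <- rowSums(cpm(counts) > 1) >= 3
counts.filtered <- counts[keep, , drop = FALSE]

# Number of genes before and after filtering.
c(before = nrow(counts), after = nrow(counts.filtered))
```

Filtering on CPM rather than on total read count makes the cutoff comparable across libraries of different sizes.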

> f.dat = f.count[,-c(1,ncol(count))]                                                         
> #$#$ select only read count, not rest of data frame
> S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))                                           
> #$#$ define group
> Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))         								
> #$#$ define time
> Time.rep = rep(1:3,9)                                                                        
> #$#$ define replicate
> Group = paste(S,Time,Time.rep,sep="_")                                                         
> #$#$ define group_time_replicate
> library(edgeR)                                                                              
> #$#$ load edgeR package
> f.factor = data.frame(files = names(f.dat), S = S , Time = Time, lib.size = c(apply(f.dat,2,sum)),norm.factors = calcNormFactors(as.matrix(f.dat)))  
> #$#$  make data for edgeR method
> count.d = new("DGEList", list(samples = f.factor, counts = as.matrix(f.dat)))               
> #$#$  make data for edgeR method

I *think* your manual construction is ok here, but in practice I would suggest just using the standard workflow:

d <- DGEList(counts=<>,group=<>)
d <- d[<mysubset>,]
d <- calcNormFactors(d)

You can always annotate the d$samples data.frame afterwards if necessary ... but you don't even need to, since none of it is used in the (design matrix) construction below.
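
For concreteness, the three-step workflow above might look like this once the placeholders are filled in. This is only a sketch: the counts are simulated, the CPM filter is an assumed choice of subset, and `keep.lib.sizes = FALSE` assumes a reasonably recent edgeR release:

```r
library(edgeR)

# Simulated stand-in for the real count table (500 genes x 27 libraries).
set.seed(1)
counts <- matrix(rnbinom(500 * 27, mu = 30, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("lib", 1:27)))

# Same factor layout as in the quoted code.
S    <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), times = 3))

# 1. Build the DGEList from raw integer counts.
d <- DGEList(counts = counts, group = interaction(S, Time))

# 2. Subset the genes (here an assumed CPM filter) on the raw counts.
keep <- rowSums(cpm(d) > 1) >= 3
d <- d[keep, , keep.lib.sizes = FALSE]

# 3. Compute TMM normalization factors on the filtered object.
d <- calcNormFactors(d)

# Optional annotation of the sample table.
d$samples$S <- S
d$samples$Time <- Time
head(d$samples)
```

The counts stay as raw integers throughout; the normalization factors are stored in `d$samples` and used by the model fitting rather than applied to the counts.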

Note also that, in general, rowSums(.)/colSums(.) is preferred to apply(.,1,sum)/apply(.,2,sum) ... faster and (for me at least) easier to read.
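
A small base-R illustration of that equivalence on a toy matrix (the matrix is made up for the example):

```r
# Toy 3 x 4 matrix, filled column-wise with 1..12.
m <- matrix(1:12, nrow = 3)

# Row totals two ways: both give 22, 26, 30.
apply(m, 1, sum)
rowSums(m)

# Column totals two ways: both give 6, 15, 24, 33.
apply(m, 2, sum)
colSums(m)
```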

> design = model.matrix(~ Time + S)                                                           
> #$#$  make design data for edgeR method
> count.d = calcNormFactors(count.d)                                                          
> #$#$  Normalize TMM

Maybe do your MDS plot here before proceeding?
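
A sketch of such an MDS plot, using simulated counts in place of the real data. The labelling and colouring by genotype are illustrative choices:

```r
library(edgeR)

# Simulated stand-in for the real count table (500 genes x 27 libraries).
set.seed(1)
counts <- matrix(rnbinom(500 * 27, mu = 30, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("lib", 1:27)))
S    <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), times = 3))

d <- calcNormFactors(DGEList(counts = counts))

# Label each library by genotype and time point, colour by genotype.
# Replicates of the same condition would be expected to sit close together.
plotMDS(d, labels = paste(S, Time, sep = "_"), col = as.integer(S))
```

If replicates do not cluster by condition, that is worth understanding before any model fitting.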

> glmfit.d = glmFit(count.d, design, dispersion = 0.1)                                        
> #$#$  Fit the Negative Binomial Gen Lin Models

You have replicates and a good number of degrees of freedom.  Why not use the data to estimate dispersion instead of hard-coding it like this?  And, where did 0.1 come from?
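
A sketch of estimating the dispersion from the data instead of fixing it. This assumes an edgeR release that provides `estimateDisp()`; the counts are simulated and the design is the additive one from the quoted code:

```r
library(edgeR)

# Simulated stand-in for the real count table (500 genes x 27 libraries).
set.seed(1)
counts <- matrix(rnbinom(500 * 27, mu = 30, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("lib", 1:27)))
S    <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), times = 3))

d <- calcNormFactors(DGEList(counts = counts))
design <- model.matrix(~ Time + S)

# Estimate common, trended and gene-wise dispersions from the replicates.
d <- estimateDisp(d, design)
d$common.dispersion

# glmFit picks up the estimated dispersions stored in d,
# so no hard-coded value is needed.
fit <- glmFit(d, design)
```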

> lrt.count = glmLRT(count.d, glmfit.d)                                                       
> #$#$  Likelihood ratio tests

Here, you need to be careful about what glmLRT is actually testing.  By default, it tests the last column of the design matrix (read ?glmLRT).  Is this what you intend?
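
A sketch of making the tested coefficients explicit. It assumes a recent edgeR release, where glmLRT() takes the fitted object as its first argument (the quoted code uses an older calling form), and simulated counts. The interaction design at the end is only one possible reading of the stated goal (different time profiles between genotypes), not something prescribed in this thread:

```r
library(edgeR)

# Simulated stand-in for the real count table (500 genes x 27 libraries).
set.seed(1)
counts <- matrix(rnbinom(500 * 27, mu = 30, size = 5), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("lib", 1:27)))
S    <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), times = 3))
d <- calcNormFactors(DGEList(counts = counts))

# Additive design from the quoted code; inspect its columns first.
design <- model.matrix(~ Time + S)
print(colnames(design))

d.add <- estimateDisp(d, design)
fit <- glmFit(d.add, design)
lrt.default  <- glmLRT(fit)              # last column only (gen3 vs gen1)
lrt.genotype <- glmLRT(fit, coef = 4:5)  # both genotype columns jointly

# Assumed alternative: genotype-by-time interaction terms, which capture
# differences in time profile between genotypes.
design.int <- model.matrix(~ S * Time)
int.cols <- grep(":", colnames(design.int))
d.int <- estimateDisp(d, design.int)
fit.int <- glmFit(d.int, design.int)
lrt.int <- glmLRT(fit.int, coef = int.cols)
topTags(lrt.int)
```

Printing colnames(design) before testing is a cheap way to confirm which column, or set of columns, each `coef` value refers to.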

Hope that helps.

Best regards,

> result.count = data.frame(f.count, lrt.count$table)                                         
> #$#$  combining raw data and results from edgeR
> result.count$FDR = p.adjust(result.count$p.value,method="BH")                               
> #$#$  calculating the False Discovery Rate
> write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)           
> #$#$  saving the combined data set
> </CODE>

Prof. Dr. Mark Robinson
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich

v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y11-J-16
w: http://tiny.cc/mrobin

