[BioC] [bioc] correlation analysis of RNA-seq

Wolfgang Huber whuber at embl.de
Mon Jul 1 21:57:32 CEST 2013


Dear Makis

with 300 samples, it should not matter much which correlation coefficient (Kendall, Pearson, Spearman, some other definition) you use, unless you really want to depend on the behaviour of the data at the extreme tails and have the measure be dominated by few (outlier?) samples. However, it will be necessary to work on normalised counts [others are more qualified to tell you how to do that in edgeR; in DESeq2, look at 'counts', 'rlogTransformation' or 'varianceStabilizingTransformation.] 

I am not sure I got the point you were trying to make with the below code example - can you expand on that?

Also, note that the quantity you compute from summary(glm.nb(gene2 ~ gene1)) would not be symmetric under exchange gene1 and gene2, which might hinder its interpretation as a measure of correlation.

Best wishes
	Wolfgang



On Jul 1, 2013, at 6:48 am, Makis Motakis <emotakis at hotmail.com> wrote:

> 
> 
> 
> Dear list,
> 
> I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example,
> 
> gene1<-rnbinom(300,mu=50,size=5)
> gene2<-rnbinom(300,mu=50,size=5)
> plot(gene1,gene2)
> cor.test(gene1,gene2,method="kendall")
> summary(glm.nb(gene2 ~ gene1))
> 
> I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size *  normalization factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R"  function  (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset)
> 
> In my case, would it be correct if I estimated the offset by 
> 
> offset <- getOffset() 
> 
> and then I estimated the gene / gene associations by glm using something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is there any reason for this model to follow the edgeR dispersion estimation method or simply glm.nb would do what I want?
> 
> Thank you in advance,
> Makis
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 		 	   		  
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list