[BioC] limma-voom

Gordon K Smyth smyth at wehi.EDU.AU
Wed Feb 5 23:38:20 CET 2014


> Hi, Dr. Gordon and all:
>
> Just follow up on my RNAseq analysis on the TCGA RSEM processed 
> so-called raw counts using limma-voom package as Dr. Gordon suggested 
> due to the issue of so-called RSEM raw counts but not true raw counts 
> that edgeR required.

edgeR will work OK if you simply round the RSEM expected counts to 
integers before running edgeR.  However voom is slightly preferable 
because it can use the fractional counts as they are.

> just try to get your opinion on it. Here are the 
> commands I used (only show most relevant ones)
>
>> library(edgeR);
>> library(limma)
>> show(head(targets));
>  SampleName      Subject   Type
> 1  T4626_01A TCGA_38_4626  Tumor
> 2  N4626_11A TCGA_38_4626 Normal
> 3  T2668_01A TCGA_44_2668  Tumor
> 4  N2668_11A TCGA_44_2668 Normal
>> Patient<-factor(targets$Subject);
>> Type<-factor(targets$Type);
>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>> show(dim(y));
> [1] 20531    58
>
> ##Keep genes with least 1 count-per-million reads (cpm) in at least 20 samples:
>> isexpr <- rowSums(cpm(y)>1) >= 20;
>> y2 <- y[isexpr,]
>> show(dim(y2));
> [1] 14735    58
>> #Recompute the library sizes:
>> y2$samples$lib.size <- colSums(y2$counts)
>> #Use Entrez Gene IDs as row names:
>> rownames(y2$counts) <- rownames(y2$genes) <- y2$genes$entrezID;
>
>> ##Apply TMM normalization
>> y2 <- calcNormFactors(y2,method=c("TMM"));
>> plotMDS(y2,main="y2");
>
>> show(data.frame(Sample=colnames(y2),Patient,Type));
>      Sample      Patient   Type
> 1  T4626_01A TCGA_38_4626  Tumor
> 2  N4626_11A TCGA_38_4626 Normal
> 3  T2668_01A TCGA_44_2668  Tumor
> 4  N2668_11A TCGA_44_2668 Normal
> ...
>> design<-model.matrix(~Patient+Type);
>> rownames(design) <- colnames(y2);
>> show(design[1:6,]);
>          (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627
> T4626_01A           1                   1                   0
> N4626_11A           1                   1                   0
> T2668_01A           1                   0                   0
> ...
>> v <- voom(y2,design,plot=TRUE);
>> plotMDS(v, top=50,main="common for v of y2",labels=paste(substr(Type,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),gene.selection="common");
>> plotMDS(v, top=50,main="pairwise for v of y2",labels=paste(substr(Type,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),gene.selection="pairwise");
>> fit <- lmFit(v,design)
>> fit <- eBayes(fit)
>> options(digits=3)
>> Rslt<-topTable(fit,coef="TypeTumor",n=Inf,sort="p");
>> show(summary(de <- decideTests(fit)));
>   (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 PatientTCGA_38_4632
> -1         126                   1                   3                   1
> 0         1805               14732               14725               14720
> 1        12804                   2                   7                  14
> ...
>   PatientTCGA_91_6836 TypeTumor
> -1                   4      3917
> 0                14718      5426
> 1                   13      5392
>
>
> Here are my questions:
>
> 1. I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the 
> filtering (based on one example in the package guide). I have 29 paired 
> of matched tumor and normal samples (total 58 samples), is this 
> reasonble to use for filtering, I asked this last time as well? I have a 
> bit concern, since after filtering, the number of genes went from 20531 
> to 14735.

Why would this be a concern?  One wouldn't expect the whole genome to be 
expressed at worthwhile levels in each particular cell type.  Some genes 
are specific to cell types.

> Shall I use something like y3<-y[rowSums(y$counts)>=50,] (y3 
> filtering) to filter (this filtering leaves about 18215 genes, about 4k 
> more genes than the 14735 from the other filtering way)?
>
> 2. I did plot with plotMDS(y2,main="y2"); v <- 
> voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all 
> show very nice separation of tumors vs normals in dim1.

I have analysed similar data from TCGA, and the separation is really much 
too good.  Furthermore there is little evidence of matching.  It would 
appear that the matched normal cells are not really comparable cells to 
the tumours.  The matched normal samples may actually be profiles of whole 
blood, a cell type which has a radically different expression profile to 
epithelial cells.  I really wonder what one can hope to learn about cancer 
by comparing epithelial tumours to blood.

> However, the v <- voom(y2,design,plot=TRUE) plot slightly different 
> between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering 
> with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got 
> similar plot as the one on page 116 of the user guide, but the second 
> way seems got bending toward to the down side around log2(count size 
> +0.5) between 0 to 5, and the data points also shift to 0 side due to 
> more of lower reads data points (the first option only have very few 
> data points between 0 to 5 in x-axis). Does this suggest the first 
> filtering way (y2) is better?

Yes, but one will good results either way.

> 3. the above command show(summary(de <- decideTests(fit))); showed y2 
> filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas 
> the y3 filtering (similar commands as y2, not shown) got final 
> 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many 
> DEGs, any issues here?

This is what I would expect to get if I compared epithelial tumour cells 
to normal blood.  If this is "too many DEGs", then the problem is with the 
data rather than the analysis.

> 4. using edgeR and the same data, I got more than 13k DEGs (total 18215 
> genes in the dataset after filtering) with FDR <0.05 (filtering like 
> y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs 
> at FDR5%) as shown in my question 3, although slight less than edgeR. 
> Any potenital issue?

The edgeR analysis on fractional counts is not correct.  edgeR is a 
likelihood based package.  It evaluates the negative binomial likelihood, 
which is always exactly zero at non-integers.  So there is no point in 
comparing the edgeR results to the voom results.

Best wishes
Gordon

> Thanks so much in advance for your input and advice!
>
> Ming
>
>
>
>
>
>
>> From: yi02 at hotmail.com
>> To: smyth at wehi.edu.au
>> Date: Thu, 30 Jan 2014 03:07:56 +0000
>> CC: bioconductor at r-project.org
>> Subject: Re: [BioC] limma-voom
>>
>> Hi, Dr. Smyth:
>>
>> Thanks for the info. I used limma a lot but not much on voom. I will give a shot on voom. Thanks again and best Ming
>>
>>> Date: Thu, 30 Jan 2014 13:05:57 +1100
>>> From: smyth at wehi.EDU.AU
>>> To: yi02 at hotmail.com
>>> CC: bioconductor at r-project.org
>>> Subject: limma-voom
>>>
>>> Hi Ming,
>>>
>>> voom is part of the limma package. There is a voom case study in the
>>> limma User's Guide with complete working code. It is the last case study
>>> in the users guide.
>>>
>>> voom works fine with either counts, or fractional counts, or scaled
>>> counts.
>>>
>>> We are currently finalizing additional voom code to detect differential
>>> splicing. That code isn't in the public package, but will be soon.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> On Thu, 30 Jan 2014, Ming Yi wrote:
>>>
>>>> I found the following documents
>>>>
>>>> http://stuff.mit.edu/afs/athena/software/r/r_v2.15.1/lib/R/library/limma/html/voom.html
>>>> http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf
>>>> http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
>>>>
>>>> But I could not find the real limma-voom package document. Is voom just
>>>> as normalization function prior to using limma linear model? Any good
>>>> examples of case study of working commands available as example?
>>>>
>>>> Thanks again!
>>>> Ming
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:9}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list