[BioC] warning with lmFit command in limma

Thu Aug 7 04:00:34 CEST 2014

Dear Shruti Marwaha,

The problem is that more than half of the GSE38932 data consists of 
missing values. This seems to me to be very unfortunate, but limma can 
only analyse the data that it is given and give you a warning to keep you 
informed of what is happening.  Given the enormous number of missing 
values, there may be no available data for many of the blocks, and hence 
the coefficients for the blocks become NA.

Another problem is the log2 transformation code that you are using, which 
is unnecessary and incorrect.  I realize that you are using R code from 
the GEO site itself:

  http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE38932

However it is clear from the description of the data processing that the 
expression values are already log2-ratios produced by limma, and hence do 
not need to be further processed.

After the step,

  gset38932 <- gset38932[,ConventionalMicroarrayData]

you can already pass the data object to lmFit() without any extra 
processing except for row filtering.

If you the unnecessary re-processing, the tests of the treatment effect 
from lmFit() etc should still be correct despite the warnings.  limma is 
making the best use of the non-missing data.  However, personally, I would 
download the raw gpr files from the GEO site and re-normalize the data 
without introducing missing values.

This is a dataset where I think that somewhat less processing would have 
resulted in better results.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

> From: "Shruti Marwaha [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, marwahsi at gmail.com
> Sent: Wednesday, 6 August, 2014 7:31:13 AM
> Subject: [BioC] warning with lmFit command in limma
>
> Dear List & Gordon,
>
> Firstly thanks to you and your team for the wonderful limma package and 
> its well dcoumented manual.

> I have used it earlier for analysing both one colored and two colored 
> data. But with one of the datasets, I get the following warning with 
> lmFit command in limma package.
>
>> fit <- lmFit(gset38932, design38932)
> Warning message:
> Partial NA coefficients for 28837 probe(s)
>
> Background of experiment:
>
>    It is a two color experiment with common reference.
>    Paired samples (cancerous and adjacent noncancerous tissues were obtained from 12 patients)
>
> I have followed the instructions under section 9.4.1 "Paired Samples" of 
> limma manual (last revised on 17 june 2014) and also goggled for this 
> warning but I am unable to understand it. My another concern is that if 
> I ignore the warning and proceed with limma analysis, I get only 84 
> differentially expressed genes (genes with adj.P.Value < 0.05), which 
> appears too small a number for a microarray experiment.
>
> Complete code:
>> library(Biobase)
>> library(GEOquery)
>> library(limma)
>
>> gset38932 <- getGEO("GSE38932", GSEMatrix =TRUE)
>> GPLid <- levels((gset38932[[1]])$platform_id)
>> if (length(gset38932) > 1)
> + {
> +   #idx <- grep("GPL5936", attr(gset38932, "names"))
> +   idx <- grep(GPLid, attr(gset38932, "names"))
> + } else
> + {
> +   idx <- 1
> + }
>> gset38932 <- gset38932[[idx]]
>
>> ConventionalMicroarrayData <- which(pData(gset38932)$characteristics_ch1.10 == "strategy: conventional microarray")
>> ## remove data with "Indirect strategy." and keep ones with "Direct strategy."
>> gset38932 <- gset38932[,ConventionalMicroarrayData]
>
>> # log2 transform
>> ex38932 <- exprs(gset38932)
>> qx <- as.numeric(quantile(ex38932, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
>> LogC <- (qx[5] > 100) ||
> +   (qx[6]-qx[1] > 50 && qx[2] > 0) ||
> +   (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
>> if (LogC) { ex38932[which(ex38932 <= 0)] <- NaN
> +             exprs(gset38932) <- log2(ex38932) }
>
>> PatientInfo <- pData(gset38932)[,c(2,10:20,27)]
>> ## remove anything that is not a digit
>> PatientID <- as.numeric(sub(pattern="\D+", replacement="", x=PatientInfo$characteristics_ch1.9,ignore.case=T))
>> TissueType <- sub(pattern="tissue type: ", replacement="", x=PatientInfo$characteristics_ch1.8,ignore.case=T)
>> TissueType <- make.names(TissueType)
>> Block <- factor(PatientID)
>> Treatment <- factor(TissueType,levels=c("adjacent.non.tumor","tumor")) ## order of levels is important
>> design38932 <- model.matrix(~Block+Treatment)
>
>> fit <- lmFit(gset38932, design38932)
> Warning message:
> Partial NA coefficients for 28837 probe(s)
>
>> fit2 <- eBayes(fit, proportion=0.01) #proportion: assumed proportion of genes which are differentially expressed
>>
>> ## to get the no.of genes in each array
>> nrow38932 <- nrow(ex38932)
>> LastColumn <- dim(design38932)[2]
>> tT <- topTable(fit2, coef=colnames(design38932)[LastColumn], adjust="fdr", sort.by="B",number=nrow38932)
>> dim(subset(tT,adj.P.Val<0.05,))
> [1] 84 15
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_3.20.8        GEOquery_2.30.1     Biobase_2.24.0      BiocGenerics_0.10.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.95-4.3 tools_3.1.0    XML_3.98-1.1

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}