[BioC] warning with lmFit command in limma
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Aug 7 04:00:34 CEST 2014
Dear Shruti Marwaha,
The problem is that more than half of the GSE38932 data consists of
missing values. This seems to me to be very unfortunate, but limma can
only analyse the data that it is given and give you a warning to keep you
informed of what is happening. Given the enormous number of missing
values, there may be no available data for many of the blocks, and hence
the coefficients for the blocks become NA.
Another problem is the log2 transformation code that you are using, which
is unnecessary and incorrect. I realize that you are using R code from
the GEO site itself:
http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE38932
However it is clear from the description of the data processing that the
expression values are already log2-ratios produced by limma, and hence do
not need to be further processed.
After the step,
gset38932 <- gset38932[,ConventionalMicroarrayData]
you can already pass the data object to lmFit() without any extra
processing except for row filtering.
If you the unnecessary re-processing, the tests of the treatment effect
from lmFit() etc should still be correct despite the warnings. limma is
making the best use of the non-missing data. However, personally, I would
download the raw gpr files from the GEO site and re-normalize the data
without introducing missing values.
This is a dataset where I think that somewhat less processing would have
resulted in better results.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
> From: "Shruti Marwaha [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, marwahsi at gmail.com
> Sent: Wednesday, 6 August, 2014 7:31:13 AM
> Subject: [BioC] warning with lmFit command in limma
>
> Dear List & Gordon,
>
> Firstly thanks to you and your team for the wonderful limma package and
> its well dcoumented manual.
> I have used it earlier for analysing both one colored and two colored
> data. But with one of the datasets, I get the following warning with
> lmFit command in limma package.
>
>> fit <- lmFit(gset38932, design38932)
> Warning message:
> Partial NA coefficients for 28837 probe(s)
>
> Background of experiment:
>
> It is a two color experiment with common reference.
> Paired samples (cancerous and adjacent noncancerous tissues were obtained from 12 patients)
>
> I have followed the instructions under section 9.4.1 "Paired Samples" of
> limma manual (last revised on 17 june 2014) and also goggled for this
> warning but I am unable to understand it. My another concern is that if
> I ignore the warning and proceed with limma analysis, I get only 84
> differentially expressed genes (genes with adj.P.Value < 0.05), which
> appears too small a number for a microarray experiment.
>
> Complete code:
>> library(Biobase)
>> library(GEOquery)
>> library(limma)
>
>> gset38932 <- getGEO("GSE38932", GSEMatrix =TRUE)
>> GPLid <- levels((gset38932[[1]])$platform_id)
>> if (length(gset38932) > 1)
> + {
> + #idx <- grep("GPL5936", attr(gset38932, "names"))
> + idx <- grep(GPLid, attr(gset38932, "names"))
> + } else
> + {
> + idx <- 1
> + }
>> gset38932 <- gset38932[[idx]]
>
>> ConventionalMicroarrayData <- which(pData(gset38932)$characteristics_ch1.10 == "strategy: conventional microarray")
>> ## remove data with "Indirect strategy." and keep ones with "Direct strategy."
>> gset38932 <- gset38932[,ConventionalMicroarrayData]
>
>> # log2 transform
>> ex38932 <- exprs(gset38932)
>> qx <- as.numeric(quantile(ex38932, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
>> LogC <- (qx[5] > 100) ||
> + (qx[6]-qx[1] > 50 && qx[2] > 0) ||
> + (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
>> if (LogC) { ex38932[which(ex38932 <= 0)] <- NaN
> + exprs(gset38932) <- log2(ex38932) }
>
>> PatientInfo <- pData(gset38932)[,c(2,10:20,27)]
>> ## remove anything that is not a digit
>> PatientID <- as.numeric(sub(pattern="\D+", replacement="", x=PatientInfo$characteristics_ch1.9,ignore.case=T))
>> TissueType <- sub(pattern="tissue type: ", replacement="", x=PatientInfo$characteristics_ch1.8,ignore.case=T)
>> TissueType <- make.names(TissueType)
>> Block <- factor(PatientID)
>> Treatment <- factor(TissueType,levels=c("adjacent.non.tumor","tumor")) ## order of levels is important
>> design38932 <- model.matrix(~Block+Treatment)
>
>> fit <- lmFit(gset38932, design38932)
> Warning message:
> Partial NA coefficients for 28837 probe(s)
>
>> fit2 <- eBayes(fit, proportion=0.01) #proportion: assumed proportion of genes which are differentially expressed
>>
>> ## to get the no.of genes in each array
>> nrow38932 <- nrow(ex38932)
>> LastColumn <- dim(design38932)[2]
>> tT <- topTable(fit2, coef=colnames(design38932)[LastColumn], adjust="fdr", sort.by="B",number=nrow38932)
>> dim(subset(tT,adj.P.Val<0.05,))
> [1] 84 15
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.20.8 GEOquery_2.30.1 Biobase_2.24.0 BiocGenerics_0.10.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.95-4.3 tools_3.1.0 XML_3.98-1.1
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list