[BioC] [Limma] lmFit handle missing value in phenotype

James W. MacDonald jmacdon at uw.edu
Thu Apr 5 15:38:33 CEST 2012


Hi Chen,

On 4/5/2012 9:12 AM, yao chen wrote:
> Thanks, Moshe and James
>
> Any question about confounding factor if you or anybody can help me.
>
> Actually, I have many different phenotype of samples here, such as 
> weight, age, cholesterol. I want to find the differential expressed 
> genes between case and control (say:high blood pressure vs health ). I 
> found some significant genes when I adjust the age, weight. However, 
> when I adjust the cholesterol level, these genes became not 
> significant (/P/ values larger). So either cholesterol is confounding 
> factor which I should control , or these differential expressed genes 
> control the cholesterol level which I should not adjust. I am not sure 
> which assumption is right. I am thinking it should be the last one.

I'm not sure I completely understand what you are saying here. It is 
conventional in this field to state that 'genes are significant', but 
that is an incomplete statement. The real statement is that the gene 
expression appears to be significantly differentially expressed between 
(e.g., treatment and control). So are you saying that the case vs 
control contrast is no longer significant if you add cholesterol level 
to the model?

If you add cholesterol level to the model, you have moved from an ANOVA 
model to an ANCOVA model in which you are fitting a linear model between 
the gene expression and cholesterol level, and allowing different 
intercepts between the different factor levels, but assuming equal 
slopes. If the contrasts between factor levels are no longer 
significant, that implies that the intercepts aren't different.

In this case, the cholesterol level 'explains' more about gene 
expression than the treatment. Whether or not the genes control the 
cholesterol level or the cholesterol level is affecting gene expression 
cannot be determined. But what can be said is that once you control for 
cholesterol level, the differences between the factor levels are no 
longer significant.

Also note that fitting a linear model (as compared to ANOVA) is 
trickier. For ANOVA, you really just assume that the distribution of the 
data for the different factor levels is relatively close to normal (or 
at least 'hump shaped'). With linear regression you can run into 
problems if your residuals are not correctly distributed, if you have 
high leverage data points, etc, which are not things you can check for 
thousands of models. In addition, you are making the assumption that the 
slopes of the regression lines are parallel, which might not always be 
valid. You may need to introduce an interaction term to allow for 
diverging slopes.

Best,

Jim


>
> 2012/4/4 Moshe Olshansky <olshansky at wehi.edu.au 
> <mailto:olshansky at wehi.edu.au>>
>
>     You are right.  My mistake. Sorry...
>
>     > Hi, Moshe,
>     >
>     > I still did not get it. How to use your method?
>     >
>     > lmFit<-(x,design) which x allow missing values, but design didn't.
>     >
>     > usually, I want to adjust batch effects, I will include it in design
>     > matrix, i.e.
>     >                case  control  batch
>     >                  1        0           1
>     >                   0       1           1
>     >                   1       0           2
>     > which x is expression matrix, and then I will get the differential
>     > expressed gene between case and control with adjusting batch.
>     >
>     > In this case, I just want to simply use "weight" instead of
>      "batch" in
>     > design matrix. But unfortunately, you can't have missing values
>     in design
>     > matrix which I think it's very often if we have many samples.
>     >
>     > 2012/4/4 Moshe Olshansky <olshansky at wehi.edu.au
>     <mailto:olshansky at wehi.edu.au>>
>     >
>     >> Why do you want to include weight in the design matrix?
>     >> It may be more reasonable to include weight in the linear
>     model, i.e.
>     >> expression ~ condition + weight and then your design matrix
>     will have
>     >> two
>     >> columns (if there are two conditions): the first column will
>     contain 0's
>     >> and 1's depending on the condition (group) of the patient while the
>     >> second
>     >> one will be identically 1. And then the weights of the patients
>     will be
>     >> in
>     >> the values (i.e. the second argument to lmFit) which I believe
>     allows
>     >> missing values.
>     >> But this implicitly assumes that the weight (or some function
>     of it -
>     >> you
>     >> can use any transformation you like) contributes additively (and
>     >> linearly)
>     >> to the expression (or it's logarithm).
>     >>
>     >> Moshe.
>     >>
>     >> > Hi,
>     >> >
>     >> > I am using Limma to get differential expressed genes between
>     case and
>     >> > control samples. In addition, I want to adjust the "weight" of
>     >> patients
>     >> by
>     >> > including it in the "design" matrix. However, one weight
>     value of a
>     >> > patient
>     >> > is missing, so I have a "NA" in design matrix. When I run
>     lmFit, I got
>     >> the
>     >> > error message :Error in qr.default(x) : NA/NaN/Inf in foreign
>     function
>     >> > call
>     >> > (arg 1).
>     >> >
>     >> > Can limma handle the missing value in design matrix or I have to
>     >> remove
>     >> > this sample ?
>     >> >
>     >> > Thanks,
>     >> >
>     >> > Chen
>     >> >
>     >> >       [[alternative HTML version deleted]]
>     >> >
>     >> > _______________________________________________
>     >> > Bioconductor mailing list
>     >> > Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>     >> > Search the archives:
>     >> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>     >> >
>     >>
>     >>
>     >>
>     >>
>     ______________________________________________________________________
>     >> The information in this email is confidential and intended
>     solely for
>     >> the
>     >> addressee.
>     >> You must not disclose, forward, print or use it without the
>     permission
>     >> of
>     >> the sender.
>     >>
>     ______________________________________________________________________
>     >>
>     >
>
>
>
>     ______________________________________________________________________
>     The information in this email is confidential and intended solely
>     for the addressee.
>     You must not disclose, forward, print or use it without the
>     permission of the sender.
>     ______________________________________________________________________
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list