[BioC] [Limma] lmFit handle missing value in phenotype

Thu Apr 5 16:12:06 CEST 2012

Hi Chen,

On 4/5/2012 10:03 AM, yao chen wrote:
> Thanks, James.
>
> You are right. I find expression of some genes are significant between 
> treatment and control: I use lmFit(x,design), in which design only 
> include treatment vs control
>
>  and the case vs control contrast is no longer significant when I add 
> cholesterol level to the model. In this case, cholesterol level is in 
> the design matrix, as
>               case  control   cholesterol
>                  1        0           130
>                   0       1           120
>                   1       0           110
> My question is, the cholesterol explain gene expression more may be 
> the reflection of treatment . That is treatment change the expression 
> of some genes and then cause the cholesterol difference. Therefore 
> it's not the confounding factor. But I don't know how to distinguish them.

That is the point I tried to make in my last response. All you know at 
this point is that the difference in gene expression correlates with 
cholesterol level. The model only tells you correlation, not causation.

If you want to know if the gene causes differences in cholesterol levels 
or if cholesterol levels cause differential gene expression, you will 
have to design and perform an experiment to test for the causal factor.

Best,

Jim

>
> Any suggestion?
>
>
> Chen
>
>
>
> 2012/4/5 James W. MacDonald <jmacdon at uw.edu <mailto:jmacdon at uw.edu>>
>
>     Hi Chen,
>
>
>     On 4/5/2012 9:12 AM, yao chen wrote:
>
>         Thanks, Moshe and James
>
>         Any question about confounding factor if you or anybody can
>         help me.
>
>         Actually, I have many different phenotype of samples here,
>         such as weight, age, cholesterol. I want to find the
>         differential expressed genes between case and control
>         (say:high blood pressure vs health ). I found some significant
>         genes when I adjust the age, weight. However, when I adjust
>         the cholesterol level, these genes became not significant (/P/
>         values larger). So either cholesterol is confounding factor
>         which I should control , or these differential expressed genes
>         control the cholesterol level which I should not adjust. I am
>         not sure which assumption is right. I am thinking it should be
>         the last one.
>
>
>     I'm not sure I completely understand what you are saying here. It
>     is conventional in this field to state that 'genes are
>     significant', but that is an incomplete statement. The real
>     statement is that the gene expression appears to be significantly
>     differentially expressed between (e.g., treatment and control). So
>     are you saying that the case vs control contrast is no longer
>     significant if you add cholesterol level to the model?
>
>     If you add cholesterol level to the model, you have moved from an
>     ANOVA model to an ANCOVA model in which you are fitting a linear
>     model between the gene expression and cholesterol level, and
>     allowing different intercepts between the different factor levels,
>     but assuming equal slopes. If the contrasts between factor levels
>     are no longer significant, that implies that the intercepts aren't
>     different.
>
>     In this case, the cholesterol level 'explains' more about gene
>     expression than the treatment. Whether or not the genes control
>     the cholesterol level or the cholesterol level is affecting gene
>     expression cannot be determined. But what can be said is that once
>     you control for cholesterol level, the differences between the
>     factor levels are no longer significant.
>
>     Also note that fitting a linear model (as compared to ANOVA) is
>     trickier. For ANOVA, you really just assume that the distribution
>     of the data for the different factor levels is relatively close to
>     normal (or at least 'hump shaped'). With linear regression you can
>     run into problems if your residuals are not correctly distributed,
>     if you have high leverage data points, etc, which are not things
>     you can check for thousands of models. In addition, you are making
>     the assumption that the slopes of the regression lines are
>     parallel, which might not always be valid. You may need to
>     introduce an interaction term to allow for diverging slopes.
>
>     Best,
>
>     Jim
>
>
>
>         2012/4/4 Moshe Olshansky <olshansky at wehi.edu.au
>         <mailto:olshansky at wehi.edu.au> <mailto:olshansky at wehi.edu.au
>         <mailto:olshansky at wehi.edu.au>>>
>
>
>            You are right.  My mistake. Sorry...
>
>         > Hi, Moshe,
>         >
>         > I still did not get it. How to use your method?
>         >
>         > lmFit<-(x,design) which x allow missing values, but design
>         didn't.
>         >
>         > usually, I want to adjust batch effects, I will include it
>         in design
>         > matrix, i.e.
>         >                case  control  batch
>         >                  1        0           1
>         >                   0       1           1
>         >                   1       0           2
>         > which x is expression matrix, and then I will get the
>         differential
>         > expressed gene between case and control with adjusting batch.
>         >
>         > In this case, I just want to simply use "weight" instead of
>             "batch" in
>         > design matrix. But unfortunately, you can't have missing values
>            in design
>         > matrix which I think it's very often if we have many samples.
>         >
>         > 2012/4/4 Moshe Olshansky <olshansky at wehi.edu.au
>         <mailto:olshansky at wehi.edu.au>
>         <mailto:olshansky at wehi.edu.au <mailto:olshansky at wehi.edu.au>>>
>
>         >
>         >> Why do you want to include weight in the design matrix?
>         >> It may be more reasonable to include weight in the linear
>            model, i.e.
>         >> expression ~ condition + weight and then your design matrix
>            will have
>         >> two
>         >> columns (if there are two conditions): the first column will
>            contain 0's
>         >> and 1's depending on the condition (group) of the patient
>         while the
>         >> second
>         >> one will be identically 1. And then the weights of the patients
>            will be
>         >> in
>         >> the values (i.e. the second argument to lmFit) which I believe
>            allows
>         >> missing values.
>         >> But this implicitly assumes that the weight (or some function
>            of it -
>         >> you
>         >> can use any transformation you like) contributes additively
>         (and
>         >> linearly)
>         >> to the expression (or it's logarithm).
>         >>
>         >> Moshe.
>         >>
>         >> > Hi,
>         >> >
>         >> > I am using Limma to get differential expressed genes between
>            case and
>         >> > control samples. In addition, I want to adjust the
>         "weight" of
>         >> patients
>         >> by
>         >> > including it in the "design" matrix. However, one weight
>            value of a
>         >> > patient
>         >> > is missing, so I have a "NA" in design matrix. When I run
>            lmFit, I got
>         >> the
>         >> > error message :Error in qr.default(x) : NA/NaN/Inf in foreign
>            function
>         >> > call
>         >> > (arg 1).
>         >> >
>         >> > Can limma handle the missing value in design matrix or I
>         have to
>         >> remove
>         >> > this sample ?
>         >> >
>         >> > Thanks,
>         >> >
>         >> > Chen
>         >> >
>         >> >       [[alternative HTML version deleted]]
>         >> >
>         >> > _______________________________________________
>         >> > Bioconductor mailing list
>         >> > Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>>
>
>         >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>         >> > Search the archives:
>         >> >
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>         >> >
>         >>
>         >>
>         >>
>         >>
>          
>          ______________________________________________________________________
>         >> The information in this email is confidential and intended
>            solely for
>         >> the
>         >> addressee.
>         >> You must not disclose, forward, print or use it without the
>            permission
>         >> of
>         >> the sender.
>         >>
>          
>          ______________________________________________________________________
>         >>
>         >
>
>
>
>          
>          ______________________________________________________________________
>            The information in this email is confidential and intended
>         solely
>            for the addressee.
>            You must not disclose, forward, print or use it without the
>            permission of the sender.
>          
>          ______________________________________________________________________
>
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099