[BioC] covariate information

W. Evan Johnson wej at bu.edu
Fri Oct 26 15:04:12 CEST 2012


Chol-hee, 

So I think I now better understand your issue. Why do you have individual ID in the model as a covariate? Do you have repeated measures in your dataset? If not, you should NOT be including individual ID in as a covariate. If it is a repeated measures experiment, then is seems that individual is nested within your treatments/covariates (do you only have one individual ID per treatment combination?). Also, note that you should be treating age as a numerical covariate. 

I would strongly advise to you that you go meet with a statistician with experience in linear models and discuss with her/him what covariates should be included in your model (and why!). It seems that your question is more than just a ComBat question but rather a design and linear modeling question. Any good statistician should be able to help you with this.

Thanks!

Evan



On Oct 26, 2012, at 12:46 AM, Cholhee Jung wrote:

> Dear Evan,
> 
> Thank you for your reply.
> 
> So, my understanding is that you suggest my first attempt as the more
> appropriate way to run ComBat than the second attempt.
> However, removing covariate one after another until I don't get the
> singularity error spared just one covariate.
> 
> While the only remaining covariate is the most important one (as it
> is the individual ID), I wonder if it's really OK to lose all other
> covariate information, such as sample preparation methods, age,
> disease etc, for the technical artefact removal.
> 
> I'm just worried about losing actual signals coming up between these
> covariates by ComBat.
> 
> Regards,
> Chol-hee
> 
> 
> On Fri, Oct 26, 2012 at 12:19 AM, W. Evan Johnson <wej at bu.edu> wrote:
>> Chol-hee,
>> 
>> Notice the simple example:
>> 
>>> x=as.factor(c(1,0,1,0));y=as.factor(c(1,2,1,2));z=rnorm(4)
>> 
>> Notice that x and y are the same covariate. Now:
>> 
>>> design=model.matrix(z~x+y)
>>> design
>>  (Intercept) x1 y2
>> 1           1  1  0
>> 2           1  0  1
>> 3           1  1  0
>> 4           1  0  1
>> attr(,"assign")
>> [1] 0 1 2
>> attr(,"contrasts")
>> attr(,"contrasts")$x
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$y
>> [1] "contr.treatment"
>> 
>>> solve(t(design)%*%design)
>> Error in solve.default(t(design) %*% design) :
>>  Lapack routine dgesv: system is exactly singular
>> 
>> You get the singularity error because your covariates are exactly the same
>> (or not linearly independent). Now if you concatenate the variables like you
>> did:
>> 
>>> xy=as.factor(paste(x,y,sep='.'))
>>> xy
>> [1] 1.1 0.2 1.1 0.2
>> Levels: 0.2 1.1
>> 
>> Which is clearly different than the original x+y. Now:
>> 
>>> design=model.matrix(z~xy)
>>> design
>>  (Intercept) xy1.1
>> 1           1     1
>> 2           1     0
>> 3           1     1
>> 4           1     0
>> attr(,"assign")
>> [1] 0 1
>> attr(,"contrasts")
>> attr(,"contrasts")$xy
>> [1] "contr.treatment"
>> 
>>> solve(t(design)%*%design)
>>            (Intercept) xy1.1
>> (Intercept)         0.5  -0.5
>> xy1.1              -0.5   1.0
>> 
>> Which now works. However, note that I wouldn't use the latter. You need to
>> find out which of your six covariates are linearly dependent with each other
>> and remove one or more so they are NOT linearly dependent. This will be
>> different from your second attempt but will be equivalent to what you were
>> trying to accomplish in your first attempt
>> 
>> Let me know if this doesn't work!
>> 
>> Thanks!
>> 
>> Evan
>> 
>> --
>> W. Evan Johnson
>> Assistant Professor
>> Division of Computational Biomedicine
>> Boston University School of Medicine
>> 72 East Concord St., E-645
>> Boston, MA 02118
>> Phone: (617) 638-2541
>> 
>> 
>> 
>> On Oct 25, 2012, at 6:00 AM, bioconductor-request at r-project.org wrote:
>> 
>> ------------------------------
>> 
>> Message: 8
>> Date: Wed, 24 Oct 2012 13:40:33 -0400
>> From: Achilleas Pitsillides <anp4r at virginia.edu>
>> To: Cholhee Jung <jung.cholhee at gmail.com>, Bioconductor mailing list
>> <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] Fwd: covariate information
>> Message-ID:
>> <CABDy-6=dP5-Wu+Zqo4-UpMBPX51zeCRPQCwdSV5yGYHg2Z821A at mail.gmail.com>
>> Content-Type: text/plain
>> 
>> Dear Chol-hee,
>> 
>> The short answer is that the two model matrices are different and they have
>> different dimensions; you can verify this by using the dim(mod_mat) to see
>> the dimension of the model matrix.
>> 
>> Here is my understanding: If you have a factor  f1 with 3 levels and a
>> factor f2 with 2 levels (where all the possible level combinations exist),
>> then f1:f2 is all the combinations of f1 and f2 (an equivalent  factor
>> with 6 levels) and  the  model matrix ~f1:f2 would have six columns (i.e.
>> fit a model with 6 coefficients).
>> However, the model matrix ~f1+f2 will have 4 columns ( i.e. fit 4
>> coefficients: constant, two for f1 and one for f2).
>> The model ~f1*f2 will fit 6 coefficients and have a model matrix with the
>> same column space as the model matrix for ~f1:f2.
>> 
>> 
>> I hope this helps,
>> 
>> cheers,
>> Achilleas
>> 
>> 
>> On Tue, Oct 23, 2012 at 10:10 PM, Cholhee Jung
>> <jung.cholhee at gmail.com>wrote:
>> 
>> 
>> 
>> Dear users,
>> 
>> 
>> Below is the question I posted originally in the 'ComBat user forum' of
>> 
>> Google group.
>> 
>> But, as I was suggested to forward my question to Bioconductor mailing
>> 
>> list, I'm doing it now.
>> 
>> 
>> Please find my question, below.
>> 
>> 
>> 
>> Regards,
>> 
>> Chol-hee
>> 
>> 
>> On Tuesday, October 23, 2012 10:13:26 AM UTC+11, Cholhee Jung wrote:
>> 
>> 
>> 
>> 
>> Dear users,
>> 
>> 
>> I was trying ComBat on ~1,000 samples.
>> 
>> Samples are spread over 12 batches and each batch contains 4 technical
>> 
>> replicates that are identical across all batches.
>> 
>> The number of covariates is 5, and I was using the ComBat implemented in
>> 
>> the 'sva' package.
>> 
>> 
>> I tried ComBat with two model matrix built from the same covariate
>> 
>> information.
>> 
>> 
>> First model matrix was constructed as below:
>> 
>> mod_mat = model.matrix(~as.factor(cov1) + as.factor(cov2) +
>> 
>> as.factor(cov3) + as.factor(cov4) + as.factor(cov5), data=pheno_data )
>> 
>> 
>> Second one was built as below:
>> 
>> mod_mat = model.matrix(~as.factor(paste(pheno_data$cov1,
>> 
>> pheno_data$cov2, pheno_data$cov3, pheno_data$cov4, pheno_data$cov5,
>> 
>> sep=":")))
>> 
>> 
>> Basically, covariates were concatenated into one string for the the
>> 
>> second model matrix.
>> 
>> 
>> ComBat with the first model matrix raised the 'singular' error like
>> 
>> below:
>> 
>> 
>> Error in solve.default(t(design) %*% design) :
>> 
>>  Lapack routine dgesv: system is exactly singular
>> 
>> 
>> But, ComBat run without error with the second model matrix.
>> 
>> 
>> 
>> Now I wonder if the two different model matrices are same?
>> 
>> 
>> Regards,
>> 
>> Chol-hee
>> 
>> 
>> 
>> _______________________________________________
>> 
>> Bioconductor mailing list
>> 
>> Bioconductor at r-project.org
>> 
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> 
>> Search the archives:
>> 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>> 
>> [[alternative HTML version deleted]]
>> 
>> 
> 



More information about the Bioconductor mailing list