[BioC] covariate information

W. Evan Johnson wej at bu.edu
Fri Oct 26 15:04:12 CEST 2012

```Chol-hee,

So I think I now better understand your issue. Why do you have individual ID in the model as a covariate? Do you have repeated measures in your dataset? If not, you should NOT be including individual ID in as a covariate. If it is a repeated measures experiment, then is seems that individual is nested within your treatments/covariates (do you only have one individual ID per treatment combination?). Also, note that you should be treating age as a numerical covariate.

I would strongly advise to you that you go meet with a statistician with experience in linear models and discuss with her/him what covariates should be included in your model (and why!). It seems that your question is more than just a ComBat question but rather a design and linear modeling question. Any good statistician should be able to help you with this.

Thanks!

Evan

On Oct 26, 2012, at 12:46 AM, Cholhee Jung wrote:

> Dear Evan,
>
>
> So, my understanding is that you suggest my first attempt as the more
> appropriate way to run ComBat than the second attempt.
> However, removing covariate one after another until I don't get the
> singularity error spared just one covariate.
>
> While the only remaining covariate is the most important one (as it
> is the individual ID), I wonder if it's really OK to lose all other
> covariate information, such as sample preparation methods, age,
> disease etc, for the technical artefact removal.
>
> I'm just worried about losing actual signals coming up between these
> covariates by ComBat.
>
> Regards,
> Chol-hee
>
>
> On Fri, Oct 26, 2012 at 12:19 AM, W. Evan Johnson <wej at bu.edu> wrote:
>> Chol-hee,
>>
>> Notice the simple example:
>>
>>> x=as.factor(c(1,0,1,0));y=as.factor(c(1,2,1,2));z=rnorm(4)
>>
>> Notice that x and y are the same covariate. Now:
>>
>>> design=model.matrix(z~x+y)
>>> design
>>  (Intercept) x1 y2
>> 1           1  1  0
>> 2           1  0  1
>> 3           1  1  0
>> 4           1  0  1
>> attr(,"assign")
>> [1] 0 1 2
>> attr(,"contrasts")
>> attr(,"contrasts")\$x
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")\$y
>> [1] "contr.treatment"
>>
>>> solve(t(design)%*%design)
>> Error in solve.default(t(design) %*% design) :
>>  Lapack routine dgesv: system is exactly singular
>>
>> You get the singularity error because your covariates are exactly the same
>> (or not linearly independent). Now if you concatenate the variables like you
>> did:
>>
>>> xy=as.factor(paste(x,y,sep='.'))
>>> xy
>> [1] 1.1 0.2 1.1 0.2
>> Levels: 0.2 1.1
>>
>> Which is clearly different than the original x+y. Now:
>>
>>> design=model.matrix(z~xy)
>>> design
>>  (Intercept) xy1.1
>> 1           1     1
>> 2           1     0
>> 3           1     1
>> 4           1     0
>> attr(,"assign")
>> [1] 0 1
>> attr(,"contrasts")
>> attr(,"contrasts")\$xy
>> [1] "contr.treatment"
>>
>>> solve(t(design)%*%design)
>>            (Intercept) xy1.1
>> (Intercept)         0.5  -0.5
>> xy1.1              -0.5   1.0
>>
>> Which now works. However, note that I wouldn't use the latter. You need to
>> find out which of your six covariates are linearly dependent with each other
>> and remove one or more so they are NOT linearly dependent. This will be
>> different from your second attempt but will be equivalent to what you were
>> trying to accomplish in your first attempt
>>
>> Let me know if this doesn't work!
>>
>> Thanks!
>>
>> Evan
>>
>> --
>> W. Evan Johnson
>> Assistant Professor
>> Division of Computational Biomedicine
>> Boston University School of Medicine
>> 72 East Concord St., E-645
>> Boston, MA 02118
>> Phone: (617) 638-2541
>>
>>
>>
>> On Oct 25, 2012, at 6:00 AM, bioconductor-request at r-project.org wrote:
>>
>> ------------------------------
>>
>> Message: 8
>> Date: Wed, 24 Oct 2012 13:40:33 -0400
>> From: Achilleas Pitsillides <anp4r at virginia.edu>
>> To: Cholhee Jung <jung.cholhee at gmail.com>, Bioconductor mailing list
>> <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] Fwd: covariate information
>> Message-ID:
>> <CABDy-6=dP5-Wu+Zqo4-UpMBPX51zeCRPQCwdSV5yGYHg2Z821A at mail.gmail.com>
>> Content-Type: text/plain
>>
>> Dear Chol-hee,
>>
>> The short answer is that the two model matrices are different and they have
>> different dimensions; you can verify this by using the dim(mod_mat) to see
>> the dimension of the model matrix.
>>
>> Here is my understanding: If you have a factor  f1 with 3 levels and a
>> factor f2 with 2 levels (where all the possible level combinations exist),
>> then f1:f2 is all the combinations of f1 and f2 (an equivalent  factor
>> with 6 levels) and  the  model matrix ~f1:f2 would have six columns (i.e.
>> fit a model with 6 coefficients).
>> However, the model matrix ~f1+f2 will have 4 columns ( i.e. fit 4
>> coefficients: constant, two for f1 and one for f2).
>> The model ~f1*f2 will fit 6 coefficients and have a model matrix with the
>> same column space as the model matrix for ~f1:f2.
>>
>>
>> I hope this helps,
>>
>> cheers,
>> Achilleas
>>
>>
>> On Tue, Oct 23, 2012 at 10:10 PM, Cholhee Jung
>> <jung.cholhee at gmail.com>wrote:
>>
>>
>>
>> Dear users,
>>
>>
>> Below is the question I posted originally in the 'ComBat user forum' of
>>
>>
>> But, as I was suggested to forward my question to Bioconductor mailing
>>
>> list, I'm doing it now.
>>
>>
>> Please find my question, below.
>>
>>
>>
>> Regards,
>>
>> Chol-hee
>>
>>
>> On Tuesday, October 23, 2012 10:13:26 AM UTC+11, Cholhee Jung wrote:
>>
>>
>>
>>
>> Dear users,
>>
>>
>> I was trying ComBat on ~1,000 samples.
>>
>> Samples are spread over 12 batches and each batch contains 4 technical
>>
>> replicates that are identical across all batches.
>>
>> The number of covariates is 5, and I was using the ComBat implemented in
>>
>> the 'sva' package.
>>
>>
>> I tried ComBat with two model matrix built from the same covariate
>>
>> information.
>>
>>
>> First model matrix was constructed as below:
>>
>> mod_mat = model.matrix(~as.factor(cov1) + as.factor(cov2) +
>>
>> as.factor(cov3) + as.factor(cov4) + as.factor(cov5), data=pheno_data )
>>
>>
>> Second one was built as below:
>>
>> mod_mat = model.matrix(~as.factor(paste(pheno_data\$cov1,
>>
>> pheno_data\$cov2, pheno_data\$cov3, pheno_data\$cov4, pheno_data\$cov5,
>>
>> sep=":")))
>>
>>
>> Basically, covariates were concatenated into one string for the the
>>
>> second model matrix.
>>
>>
>> ComBat with the first model matrix raised the 'singular' error like
>>
>> below:
>>
>>
>> Error in solve.default(t(design) %*% design) :
>>
>>  Lapack routine dgesv: system is exactly singular
>>
>>
>> But, ComBat run without error with the second model matrix.
>>
>>
>>
>> Now I wonder if the two different model matrices are same?
>>
>>
>> Regards,
>>
>> Chol-hee
>>
>>
>>
>> _______________________________________________
>>
>> Bioconductor mailing list
>>
>> Bioconductor at r-project.org
>>
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>

```