[BioC] RE: Design matrix with multiple genotypes + quantified variables (+cor/regression)

Mon Aug 23 16:33:02 CEST 2004

Again, sorry for initially posting without much investigation, but I have
a lot on (haven't we all) and I was hoping someone's experience could save
me a lot of time. So here's an update.

There are 2 basic questions:
1. Are the design and contrast matrices below correct? Is there a better
way to design it? My hypothesis is that treatment N - treatment A will
be similar between genotypes, but the genotypes will be different from
each other. I'm looking for the global treatment contrast, but don't
want the genotype differences getting in the way. Is this already taken
care of in the design below, or does the design need to be different? I.e.,
is the lm contrast comparing (ConA, MutA, Mut2A) vs. (ConN, MutN, Mut2N)
or averaging (ConA-ConN, MutA-MutN, Mut2A-Mut2N)?

2. How is it best to find genes that correlate with a quantified
variable? I've done a fair bit on this now but still need some pointers. The
obvious thing to do was a genewise Pearson correlation. However, in 'Intro stats
with R' there is the statement: "The reader should be warned that there
are many incorrect uses of correlation coefficients, particularly when
they are used in regression-type settings". Well, I'm duly warned, but not
sure what a regression-type setting is. Also, it seems that regression
and Pearson correlation give the same result.
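
On question 1, a minimal base-R sketch (simulated data for a single gene, all numbers made up) suggests the two readings are the same thing. In a cell-means model, "pooled A groups minus pooled N groups" and "average of the within-genotype differences" are the same linear combination of the group means. The residual variance is estimated within groups, so genotype differences should not inflate the standard error of the contrast.

```r
# Simulated log-expression for ONE gene; group means are invented for illustration.
set.seed(1)
lev <- c("ConA", "ConN", "MutA", "MutN", "Mut2A", "Mut2N")
group <- factor(rep(lev, each = 4), levels = lev)

# Genotype baselines differ a lot; the A - N shift is -1 in every genotype.
true_mean <- c(ConA = 5, ConN = 6, MutA = 8, MutN = 9, Mut2A = 2, Mut2N = 3)
y <- true_mean[as.character(group)] + rnorm(length(group), sd = 0.2)

# Cell-means model: one coefficient per genotype x treatment group.
fit <- lm(y ~ 0 + group)
m <- coef(fit)
names(m) <- levels(group)

# (a) The contrast as written: pooled A means minus pooled N means.
pooled <- (m["ConA"] + m["MutA"] + m["Mut2A"] -
           m["ConN"] - m["MutN"] - m["Mut2N"]) / 3

# (b) The average of the three within-genotype differences.
averaged <- mean(c(m["ConA"] - m["ConN"],
                   m["MutA"] - m["MutN"],
                   m["Mut2A"] - m["Mut2N"]))

all.equal(unname(pooled), averaged)   # TRUE: identical linear combination
```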

For the correlation I used cor, and it then suggests testing whether the
correlation is significantly different from zero using cor.test. From
comparing these it seems that there is a strict relationship between the
p-value and the Pearson coefficient that only varies with sample number (#
of arrays). The p-value just indicates which Pearson coefficients are
significant, but surely you don't need to compute it for every gene, as it
seems to depend only on the coefficient and the sample number?
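
The relationship described above can be checked directly. For the Pearson test, the p-value is a function of r and n only, via t = r * sqrt(n - 2) / sqrt(1 - r^2) on n - 2 degrees of freedom. The sketch below uses simulated data; r_crit is a hypothetical helper name, not a function from any package.

```r
set.seed(2)
n <- 12                      # number of arrays (made up)
x <- rnorm(n)                # stand-in for the measured variable
y <- 0.6 * x + rnorm(n)      # stand-in for one gene's expression

# p-value computed by hand from r and n alone
r <- cor(x, y)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

all.equal(p_manual, cor.test(x, y)$p.value)   # TRUE

# Hypothetical helper: the |r| cutoff that corresponds to a given p-value cutoff
r_crit <- function(n, alpha = 0.05) {
  tc <- qt(1 - alpha / 2, df = n - 2)
  tc / sqrt(tc^2 + n - 2)
}
r_crit(12)   # any gene with |r| above this has p < 0.05 when n = 12
```

So when every gene has the same number of arrays and no missing values, ranking genes by p-value and ranking them by |r| give the same order.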

So I then proceeded with regression analysis using lm(). The output
values that appear to be useful are the p-value and R-squared. The former is
the same as from cor.test, and the latter is the squared Pearson
coefficient, which I've just discussed. Am I missing something, or is
there a better way?
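
A small check of this equivalence on simulated data (the variable names growth and expr are placeholders): for a single predictor, the slope p-value from lm() matches cor.test, and R-squared is the squared Pearson coefficient. What the regression adds is the slope itself, and the option of including further terms in the model.

```r
set.seed(3)
n <- 12
growth <- rnorm(n, mean = 10, sd = 2)              # made-up growth measure
expr   <- 7 + 0.3 * growth + rnorm(n, sd = 0.5)    # made-up expression for one gene

fit <- lm(expr ~ growth)
s <- summary(fit)

# Slope p-value from the regression vs. the correlation test
slope_p <- s$coefficients["growth", "Pr(>|t|)"]
all.equal(slope_p, cor.test(growth, expr)$p.value)   # TRUE

# R-squared vs. squared Pearson coefficient
all.equal(s$r.squared, cor(growth, expr)^2)          # TRUE

# Extra information from lm(): change in expression per unit of growth
coef(fit)["growth"]
```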

Finally, as limma uses lm functions, can I do the regression with it, to
get access to the other tools such as eBayes, classifyTests or
toptable? Or are they fundamentally different?
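
One way this might look, as a sketch only: put the quantified variable into the design matrix as a numeric column, then run the usual lmFit and eBayes steps. The data are simulated, exprs_mat and growth are placeholder names, and topTable is used on the assumption that a limma version providing it is installed. The limma calls are skipped when the package is absent.

```r
set.seed(4)
n_genes <- 100
n_arrays <- 12
growth <- rnorm(n_arrays, mean = 10, sd = 2)   # made-up covariate, one value per array

# Simulated expression matrix (genes in rows, arrays in columns)
exprs_mat <- matrix(rnorm(n_genes * n_arrays), nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Plant one gene that tracks growth
exprs_mat[1, ] <- exprs_mat[1, ] + 0.8 * (growth - mean(growth))

# Numeric covariate in the design: intercept plus a slope for growth
design <- model.matrix(~ growth)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::lmFit(exprs_mat, design)
  fit <- limma::eBayes(fit)
  # Genes ranked by the moderated t-statistic for the growth slope
  print(limma::topTable(fit, coef = "growth", number = 5))
}
```

The per-gene slope from this fit is the same as from a gene-by-gene lm(); the difference is that eBayes moderates the variances across genes.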

Thanks for your time,
Matt

-----Original Message-----
From: Matthew Hannah 
Sent: Thursday, 19 August 2004 14:56
To: 'bioconductor at stat.math.ethz.ch'
Subject: Design matrix with multiple genotypes + quantified variables

Hi,

After I asked before, this design and contrast matrix was suggested, and it
worked well. But now it gets complicated.
2 genotypes - Con, Mut
2 treatments - A, N.
4 replicates

treatments <- factor(c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4))
design <- model.matrix(~ 0+treatments)
colnames(design) <- c("ConA","ConN","MutA","MutN")
fit <- lmFit(esetgcrma, design)

cont.matrix <- makeContrasts(ConA-MutA, ConN-MutN,
Gen=(ConN+ConA-MutN-MutA)/2, ConA-ConN, MutA-MutN,
treatment=(ConA+MutA-ConN-MutN)/2,levels=design)
con.fit <- contrasts.fit(fit, cont.matrix)

So what if I add a third genotype, Mut2?
Is it the obvious: add treatments <- .....5,5,5,5,6,6,6,6)) and then for
the contrasts treatment=(ConA+MutA+Mut2A-ConN-MutN-Mut2N)/3?
Or am I misunderstanding how to design contrasts? Is there an easier way
of writing this when you have more genotypes?
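
For more genotypes, one possible shortcut is to build the contrast matrix programmatically rather than typing each term. The sketch below uses base R only and assumes the same layout as above (4 replicates per group, groups ordered A then N within each genotype). The resulting numeric matrix has one row per design column, which is the shape contrasts.fit expects.

```r
genotypes <- c("Con", "Mut", "Mut2")   # extend this vector for more genotypes

# Group names in the order ConA, ConN, MutA, MutN, Mut2A, Mut2N
groups <- as.vector(rbind(paste0(genotypes, "A"), paste0(genotypes, "N")))

treatments <- factor(rep(groups, each = 4), levels = groups)
design <- model.matrix(~ 0 + treatments)
colnames(design) <- groups

# Global treatment contrast: +1/k on every A group, -1/k on every N group
k <- length(genotypes)
treatment <- setNames(rep(0, length(groups)), groups)
treatment[paste0(genotypes, "A")] <-  1 / k
treatment[paste0(genotypes, "N")] <- -1 / k

# Per-genotype A - N contrasts, one column per genotype
per_genotype <- sapply(genotypes, function(g) {
  v <- setNames(rep(0, length(groups)), groups)
  v[paste0(g, "A")] <-  1
  v[paste0(g, "N")] <- -1
  v
})

cont.matrix <- cbind(per_genotype, treatment = treatment)
cont.matrix
```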

Also, logically the lm is treating all samples as independent when they
are not; does this matter? Is it possible to fit the original lm using a
design that takes genotype and treatment into account? Would this be a
better approach, especially if you have more genotypes (e.g. 5-10)?
What would the design matrix then look like?
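
As an illustration of what such a design matrix could look like, here is a base-R sketch with genotype and treatment as separate factors. The factor names and the choice of N as the reference level are assumptions made for the example. In the additive model the single treatmentA column is the A - N effect adjusted for genotype, and adding genotypes only adds one column each.

```r
# 3 genotypes x 2 treatments x 4 replicates, same sample order as the grouped design
genotype  <- factor(rep(c("Con", "Mut", "Mut2"), each = 8),
                    levels = c("Con", "Mut", "Mut2"))
treatment <- factor(rep(rep(c("A", "N"), each = 4), times = 3),
                    levels = c("N", "A"))   # N is the reference level

# Additive model: genotype baselines plus one common treatment effect
design_add <- model.matrix(~ genotype + treatment)
colnames(design_add)
# "(Intercept)" "genotypeMut" "genotypeMut2" "treatmentA"

# With interaction: lets the treatment effect differ between genotypes
design_int <- model.matrix(~ genotype * treatment)
colnames(design_int)
```

With a balanced layout like this one, the treatmentA estimate from the additive model is the same number as the averaged A - N contrast from the grouped design.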

Finally, what if you have a quantified variable for each genotype, like a
measure of growth before and after the treatment? Can you specify this
in any way (in the design matrix?) so that it is taken into account during
the fit? I thought this was possible using lm or rlm, or am I confusing
something? Alternatively, does anyone have a different approach, such as
an efficient way of doing a gene-by-gene regression or correlation
analysis against the growth measure, and extracting the genes that
correlate best with the growth measure?
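
For the gene-by-gene correlation, a vectorised sketch in base R avoids looping over genes with cor.test. The data are simulated, and exprs_mat stands in for the expression matrix of the real data set (genes in rows, arrays in columns). The p-values use the standard t transformation of Pearson's r and assume no missing values.

```r
set.seed(5)
n_genes <- 200
n_arrays <- 12
growth <- rnorm(n_arrays, mean = 10, sd = 2)   # made-up growth measure per array

exprs_mat <- matrix(rnorm(n_genes * n_arrays), nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Plant one gene that follows growth closely
exprs_mat[1, ] <- 2 * growth + rnorm(n_arrays, sd = 0.1)

# One Pearson r per gene in a single call
r <- cor(t(exprs_mat), growth)[, 1]

# Matching two-sided p-values, plus a multiple-testing adjustment
t_stat <- r * sqrt(n_arrays - 2) / sqrt(1 - r^2)
p <- 2 * pt(-abs(t_stat), df = n_arrays - 2)
res <- data.frame(r = r, p = p, adj.p = p.adjust(p, method = "BH"))

# Genes that correlate best with growth, in either direction
head(res[order(-abs(res$r)), ], 5)
```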

Perhaps there is a good (biologist-simple?) book that would cover
the design and contrasts of lms; does anyone know of one?

Thanks again,
Matt