[BioC] Limma issues.
smyth at wehi.edu.au
Wed Nov 12 02:02:24 MET 2003
Firstly, I think you'd find it helpful to read Section 7.3 in the Limma
User's Guide which does address some of your questions. Everything in
Section 7 was new in Limma 1.3.0 so you may not have seen it yet.
Secondly, it is generally best to analyse all of your data at once. It is
no surprise that you get more differentially expressed (DE) genes when
using all of the data. Obviously with more data you can have more
confidence in your results, hence a greater number of genes pass the
threshold of being judged DE. What is happening specifically is that the
gene-wise standard deviation is estimated more reliably when using all of
your data, so there are more degrees of freedom associated with your tests,
hence more results significant.
Some more specific responses below.
At 07:51 AM 12/11/2003, Marcus wrote:
>Hi limma users.
>I have some general questions regarding the use of dupcor.series and
>I use an experimental design that compares three different conditions to a
>fourth (which I will call a reference).
>So the design of the experiment looks something like:
>Cond 1 Cond2 Cond2
>(There are supposed to be double arrows between the different conditions
>and the reference.)
>The experiment is done in four (and between some conditions, five)
>biological replicates, so in total I have made 4 times 6 (dye-swap) = 24
>hybridizations which compares all the conditions with the reference, plus
>four additional hybridizations which is a dye-swap between cond2 and
>reference plus cond3 and the reference. This gives in total 27 slides
>since one slide was excluded due to a terrible hybridization. Each slide
>has a duplicate spot.
>This data has been filtered using different filter cut-offs and the
>remaining data has been put into a matrix of M-values. The data has been
>normalized and dye-swapped.
> From this point I have made two different analysis paths.
>One analysis path was to divide the M-matrix into three different matrices
>which contains data from the different conditions and I continued using
>the matrices with the following
>Where the design was something like: design <- c(1,1,1,1,1,1,1,1)
>The correlation is later used in the function: gls.series(M.matrix.cond1,
>and continuing with: eb<-ebayes(fit)
>the data from ebayes is later used in the toptable
>When I make this analysis I get a list of 82 genes for cond1 that has a
>B-value larger than zero.
>But if I use another strategy where I use all the slides in the same
>analysis and use a design matrix which looks something like:
> Con1 Con2 Con3
>1 1 0 0
>2 1 0 0
>3 1 0 0
>4 1 0 0
>5 1 0 0
>6 1 0 0
>7 1 0 0
>8 1 0 0
>9 0 1 0
>10 0 1 0
>11 0 1 0
>12 0 1 0
>13 0 1 0
>14 0 1 0
>15 0 1 0
>16 0 1 0
>17 0 1 0
>18 0 1 0
>19 0 0 1
>20 0 0 1
>21 0 0 1
>22 0 0 1
>23 0 0 1
>24 0 0 1
>25 0 0 1
>26 0 0 1
>27 0 0 1
>And make the function calls like the procedures above except the toptable
>where I specify the coef number to be the condition which I want to
>investigate. The thing is, when I use this approach I get more genes that
>have a b-value greater than zero, namely 110.
>My guess is that when I run the dupcor.series the correlations is based on
>all the slides in some way, since corr$cor is a lonely number and not a
>correlation for every comparison in the experimental design. The cor
>numbers from the different analysis pathways are not equal either.
>When I run the gls.series function I get a fit$sigma that is based on all
>the slides or?
>What is actually happening when the sigma is calculated. The sigma is also
>generally smaller when all the slides are included compared to the
>analysis path where I used one condition at a time, which I guess should
>be the case if all slides are included.
I don't think this can be correct. Sigma measures the spot to spot
variability of measured log-ratios for a given gene. The whole-data
estimate for each gene should be intermediate between the individual
>The lods which is calculated in the ebayes function is larger then the
>lods which originates from the analysis path where I examined one
>condition at a time, so this result in more genes that is larger than zero.
>So to my questions (sorry for the long introduction):
>Is the design matrix correct when the experiment is performed the way it
>is? According to the user guide, this is an optional way when you have a
>reference design but since the result is deviating from the one comparison
>at a time I am not sure.
You have dye-swaps but your design matrices do not reflect this. Apart from
this, your design may be correct.
>If the design is correct, which of the two analysis pathways is the
Almost always, the better approach is to use all your data.
>I dont consider myself as a statistician, more like a biologist, but I
>reckon the sigma factor as a crucial variable in these analyses. I changed
>the fit$sigma for the "one condition" analysis path to the sigma that
>originates from the analysis that contains all the slides and the number
>of genes that reach above B-value zero increased to 111 which is quite
>like the 110 that I got from the other analysis pathway. So I guess that
>the sigma factor is quite responsible for the B-value spectra.
It is one factor that is used in the computation.
>If I would like to compare, lets say the cond2 with cond3 can I then add a
>another column to the design matrix, and if so, how would it look like.
No, you do this using a contrast matrix. Please see Section 7.3 in the
>During these analysis I have used R-version 1.8 and limma version 1.3.1
>I would be very grateful for some response.
>Marcus Gry Björklund
>Royal Institute of Technology
>AlbaNova University Center
>Stockholm Center for Physics, Astronomy and Biotechnology
>Department of Molecular Biotechnology
>106 91 Stockholm, Sweden
>Phone (office): +46 8 553 783 45
>Fax: + 46 8 553 784 81
>Visiting adress: Roslagstullsbacken 21, Floor 3
>Delivery adress: Roslagsvägen 30B
More information about the Bioconductor