[BioC] edgeR, logFC calculation in factor combination

Mon May 19 11:47:09 CEST 2014

Dear Mike,

On Mon, 19 May 2014, Mike Miller wrote:

> Dear Gordon,
>
>
> Thank you for your help.
>
> The contrast was: (control_1.localization_A) minus 
> (control_1.localization_B). It was not a simple comparison between 
> groups, the complete design looks like (in previous email I tried to 
> simplify the question, for that reason, I excluded some factors in the 
> question):

Unfortunately, the phenomenon that you say you have observed could not 
have occured in the simplified situation you reported, which made 
answering the question difficult.

> design_combi=model.matrix(~0+ ctrl_loc + Gender+Sample_blocking, data=combi)
>
>> colnames(design_combi)
> [1] "control_0.localization_A"       "control_0.localization_B"
> "control_1.localization_A"
> [4] "control_1.localization_B"       "Gender1"
> "Sample_blocking15"
> [7] "Sample_blocking16" "Sample_blocking17" "Sample_blocking18"
> [10] "Sample_blocking19" "Sample_blocking2"  "Sample_blocking20"
> [13] "Sample_blocking21" "Sample_blocking22" "Sample_blocking23"
> [16] "Sample_blocking24" "Sample_blocking27" "Sample_blocking28"
> [19] "Sample_blocking3"  "Sample_blocking33" "Sample_blocking34"
> [22] "Sample_blocking35" "Sample_blocking36" "Sample_blocking37"
> [25] "Sample_blocking38" "Sample_blocking39" "Sample_blocking6"
> [28] "Sample_blocking8"
>
> Factor "ctrl_loc" contains combinations of 2 factors (thus has 4 levels):
> 1. disease_state (levels: control_0 and control_1)
> 2. localization (levels: A and B)
> Levels: control_0.localization_A, control_1.localization_A,
> control_0.localization_B, control_1.localization_B.
>
> Factor "Gender" has 2 levels: 0 and 1.
> Factor "Sample_blocking" is here to account for the batch effect (some
> samples are coming from the same patient).
> Since the "Design matrix not of full rank" I excluded 2 columns from the
> "design_combi" (columns number 25 and 26) to be able to compute dispersions.

This means that you are treating blocks 38 and 39 as if they were the same 
as block 1.  You seem to be trying to adjust for too many things.

> After fitting the model,  I defined contrasts:
> my_contrasts=makeContrasts(disease.A_minus_disease.B=control_1.localization_A
> - control_1.localization_B, levels=design_combi).
>
> lrt=glmLRT( fit, contrast=my_contrasts[, "disease.A_minus_disease.B"])
>
> I was wondering how is logFC computed in this case (in this contrast), and
> I wanted just to confirm that when the logFC >0, that means that the gene
> is up-regulated in control_1.localization_A compared to
> control_1.localization_B.

What it means is that the gene would be higher in localization A than in 
B, if both localizations were applied to cases of the same gender and the 
same block.

Given that the localizations A and B might actually be unbalanced by 
gender and block, the observed marginal differences between A and B could 
be quite different.

> To check that, I tried to manually calculate just for one gene the logFC 
> value, but obviously I do not know how it is computed in this case.

You can't compute it manually.  The whole point of a package like edgeR is 
to do computations you can't do easily by yourself.

Best wishes
Gordon

> Thank you very much in advance!
>
> Best regards,
> Mike
>
>
> On Sun, May 18, 2014 at 5:55 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Mike,
>>
>> edgeR does compute logFC correctly, and a positive logFC does obviously
>> mean that the counts are higher, other things being equal, in the second
>> group than the first group.
>>
>> If you are making the comparison B-A (B minus A), then a positive logFC
>> means expression is higher in B than in A.  The topTable() function always
>> output the contrast you are making at the top of the table, so that the
>> table is unambiguous.
>>
>> I can't tell from your email what comparison you actually made.  What is
>> the non-ascii character is your email between "control_1.localization_A"
>> and "control_1.localization_B"?  Was it supposed to be a minus sign or
>> something else?
>>
>> I don't know whether you are making a simple comparison between two
>> groups, or whether you have fitted a more complex linear model.  This can
>> affect the interpration of logFCs.
>>
>> Best wishes
>> Gordon
>>
>>
>>  Date: Fri, 16 May 2014 12:06:14 +0200
>>> From: Mike Miller <mike.bioc32 at gmail.com>
>>> To: <bioconductor at stat.math.ethz.ch>
>>> Subject: [BioC] edgeR, logFC calculation in factor combination
>>> Message-ID:
>>>         <CANkSkzosYxfQKn3y=iTKQ4Qj7DPP+Da+umgPJkU0DE98RUU5bg at mail.
>>> gmail.com>
>>> Content-Type: text/plain
>>>
>>> Dear All,
>>>
>>>
>>> I am using edgeR for an RNASeq experiment (~30 samples), where I need to
>>> explore the influence of 2 factors with 2 levels each (there are actually
>>> more factors, but for simplicity I put only 2 here):
>>>
>>> 1. disease_state (levels: control_0 and control_1)
>>>
>>> 2. localization (levels: A and B)
>>>
>>>
>>> I combined the factors and got 4 combinations:
>>>
>>> control_0.localization_A, control_1.localization_A,
>>> control_0.localization_B, control_1.localization_B.
>>>
>>> I am specifically interested in a list of diff. expressed genes between
>>> these 2 combinations: control_1.localization_A â€“
>>> control_1.localization_B.
>>>
>>> When separating up- and down-regulated genes, I did it simply according to
>>> logFC column (after filtering for genes with padj < 0.05). If the logFC<0,
>>> could I infer that the gene is DOWN-regulated in control_1.localization_A?
>>> I assume that this is true.
>>>
>>> However, when I take the raw counts for one of the diff. expressed genes
>>> (for which logFC<0), in control_1.localization_A and in
>>> control_1.localization_B conditions, it turned out to be the following:
>>>
>>> mean(control_1.localization_A) / mean(control_1.localization_B) = 4.2
>>>
>>> (Formula explained in words: mean of raw counts for the diff. expressed
>>> gene X in samples which are control_1 and localization_A, divided by mean
>>> of raw counts for the diff. expressed gene X in samples which are
>>> control_1
>>> and localization_B)
>>>
>>> This is contrary to the conclusion I got from logFC, since according to
>>> raw counts that gene is UP-regulated in control_1.localization_A.
>>>
>>> I know that logFC is calculated differently, but shouldn't the ratio
>>> (whether the gene is up- or down-regulated) stay nevertheless the same (if
>>> library sizes are very similar in all samples, which is the case here)?
>>>
>>> Maybe a more precise question would be: how is logFC calculated in this
>>> case, when we have a combination of different factors?
>>>
>>> Thank you very much in advance for any piece of clarification!
>>>
>>> Mike
>>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}