[BioC] edgeR uneven group sizes

Gordon K Smyth smyth at wehi.EDU.AU
Sat Jul 6 01:01:03 CEST 2013


Dear Charles,

The link you give is to a user question.  I replied to that post 
explaining how to solve the problem without removing samples.  Have you 
not read my reply?

https://stat.ethz.ch/pipermail/bioconductor/2012-November/049087.html

The advice that I gave there applies also to your data.

The problem is that the model.matrix() function in R adds superfluous 
columns to the design matrix that have to be removed manually.  In your 
case you have to remove the design columns for disease patients 3 and 4, 
because there are no such patients.  It is beyond the scope of the edgeR 
package to rewrite the model.matrix() function, which is maintained by R 
core, so I can only advise on work-arounds.
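
As a rough sketch of that work-around (not necessarily the exact code from 
the linked reply), the example below rebuilds the sample_data table from 
the original question, forms the design matrix with the formula Charles 
describes, and drops the columns that are entirely zero.  The object names 
(sample_data, design_full, design) and the explicit factor level order are 
assumptions made for illustration; only base R is used.

```r
# Rebuild the 18-sample layout from the question:
# 4 control subjects and 2 disease subjects, each measured at 3 time points.
sample_data <- data.frame(
  Condition = factor(rep(c("control", "Disease"), times = c(12, 6)),
                     levels = c("control", "Disease")),
  Subject   = factor(c(rep(1:4, each = 3), rep(1:2, each = 3))),
  Time      = factor(rep(c("0hr", "1hr", "2hr"), times = 6))
)

# model.matrix() creates a column for every Condition:Subject combination,
# including disease subjects 3 and 4, who do not exist in the data.
design_full <- model.matrix(~ Condition + Condition:Subject + Condition:Time,
                            data = sample_data)

# Columns that are zero for every sample correspond to those
# non-existent subjects and make the matrix rank deficient.
empty <- colSums(design_full != 0) == 0
print(colnames(design_full)[empty])

# Remove the empty columns by hand and confirm the result is of full rank.
design <- design_full[, !empty, drop = FALSE]
print(c(columns = ncol(design), rank = qr(design)$rank))
```

If the number of columns equals the rank, the reduced design should be 
accepted by estimateGLMCommonDisp(dge, design).  Dropping all-zero columns 
only handles this particular cause of rank deficiency; other designs may 
need different columns removed.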

Best wishes
Gordon

On Fri, 5 Jul 2013, Charles Determan Jr wrote:

> Gordon,
>
> The reason I ask is that when I use the design formula
> (~group + group:subject + group:time) and run
> estimateGLMCommonDisp(dge, design), I get the error:
>
> Error in glmFit.default(y, design = design, dispersion = dispersion,
> offset = offset,  :
>  Design matrix not of full rank.  The following coefficients not estimable:
>
>
> The mailing list post I am referring to, with the same error, is at the
> following link:
> https://stat.ethz.ch/pipermail/bioconductor/2012-November/049055.html
>
> Am I simply writing the design formula incorrectly to still account for the
> subject variation?
>
> Regards,
> Charles
>
>
>
>
> On Thu, Jul 4, 2013 at 6:49 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Charles,
>>
>> There is no requirement in edgeR for equal group sizes, and never has
>> been.  I am puzzled why you might think there is such an assumption. edgeR
>> always allows you to use all the available data that is scientifically
>> meaningful.
>>
>> You say that you read "the initial posting that led to this section of
>> the manual and it said to drop the samples that don't have equal numbers"
>> but I do not know what you are referring to.  I have never seen such advice.
>>
>> Best wishes
>> Gordon
>>
>>  Date: Wed, 3 Jul 2013 09:49:30 -0500
>>> From: Charles Determan Jr <deter088 at umn.edu>
>>> To: Bioconductor mailing list <bioconductor at r-project.org>
>>> Subject: [BioC] edgeR uneven group sizes
>>>
>>> Hello,
>>>
>>> I recently had a question regarding repeated measures RNA-seq analysis.
>>> This has been thoroughly answered through an extension of the edgeR manual
>>> section 3.5. However, this has led me to another question as I
>>> attempted to extend these concepts to another experiment in which the sample
>>> size in each group is different.  For example, here is a dataframe modified
>>> from the edgeR user manual concerning between and within subjects
>>> comparisons (Section 3.5) and another containing specific time points to
>>> explain my point, both dataframes re-numbered as recommended by the manual.
>>>
>>> > targets
>>>    Disease   Patient  Treatment
>>> 1  Healthy   1        None
>>> 2  Healthy   1        Hormone
>>> 3  Healthy   2        None
>>> 4  Healthy   2        Hormone
>>> 5  Healthy   3        None
>>> 6  Healthy   3        Hormone
>>> 7  Disease1  1        None
>>> 8  Disease1  1        Hormone
>>> 9  Disease1  2        None
>>> 10 Disease1  2        Hormone
>>> 11 Disease2  1        None
>>> 12 Disease2  1        Hormone
>>> 13 Disease2  2        None
>>> 14 Disease2  2        Hormone
>>> 15 Disease2  3        None
>>> 16 Disease2  3        Hormone
>>>
>>> > sample_data
>>>    Condition  Subject  Time
>>> 1  control    1        0hr
>>> 2  control    1        1hr
>>> 3  control    1        2hr
>>> 4  control    2        0hr
>>> 5  control    2        1hr
>>> 6  control    2        2hr
>>> 7  control    3        0hr
>>> 8  control    3        1hr
>>> 9  control    3        2hr
>>> 10 control    4        0hr
>>> 11 control    4        1hr
>>> 12 control    4        2hr
>>> 13 Disease    1        0hr
>>> 14 Disease    1        1hr
>>> 15 Disease    1        2hr
>>> 16 Disease    2        0hr
>>> 17 Disease    2        1hr
>>> 18 Disease    2        2hr
>>>
>>> I have read the initial posting that led to this section of the 
>>> manual and it said to drop the samples that don't have equal numbers. 
>>> Now this doesn't seem to be a big deal if only dropping a sample or 
>>> two from one group, but it could be a problem in a case such as the 
>>> one above, where dropping four or six samples seems more of a 
>>> sacrifice.  I can imagine experiments (assuming repeated/dependent 
>>> samples) in which group numbers vary even more because of difficulty 
>>> acquiring samples. Are there any recommendations from the community 
>>> regarding such a situation?  All I have found assumes that the number 
>>> of samples in each group is equal.
>>>
>>> Regards,
>>> --
>>> Charles Determan
>>> Integrated Biosciences PhD Candidate
>>> University of Minnesota
