[R] How to test a difference in ratios of count data in R

David Winsemius dwinsemius at comcast.net
Fri Sep 30 20:37:01 CEST 2016


> On Sep 30, 2016, at 9:40 AM, Shuhua Zhan <szhan at uoguelph.ca> wrote:
> 
> 
> Thank you, David and Greg for your help!
> I drew the conclusion that treatment B significantly increases the ratio of the x group (X/n) based on the p-values from the treatmentB line of the outputs of the logistic and Poisson regressions (p=6.11e-07, logistic; p=0.000152, Poisson). I'm wondering whether the significance of the (Intercept) line in both outputs affects my conclusion.
> Logistic reg.  Fixed effects:
>             Estimate Std. Error z value Pr(>|z|)    
> (Intercept)  -1.7442     0.5438  -3.208  0.00134 ** 

As in most regression situations, the p-value on the (Intercept) line is calculated against a null hypothesis that the intercept (the linear predictor at the baseline assemblage of covariates) is zero on the transformed scale of analysis. For logistic regression this implies odds of 1.0, i.e. that the baseline covariates are associated with a 50:50 split of events and non-events. This is often not a hypothesis of particular interest, so the general advice is to ignore that p-value.
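
If it helps to see what that null corresponds to on the probability scale, here is a small illustration using the estimates quoted in this thread (plogis() is the inverse logit in base R; the printed values are approximate):

plogis(-1.7442)  # baseline (treatment A) proportion implied by the logistic intercept, roughly 0.15
plogis(0)        # the null being tested: intercept of 0 on the logit scale, i.e. a 50:50 split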


> treatmentB    2.3647     0.4741   4.988 6.11e-07 ***
> 
> Poisson reg. Fixed effects:
>             Estimate Std. Error z value Pr(>|z|)    
> (Intercept)  -1.7875     0.3372  -5.301 1.15e-07 ***

For Poisson regression the Intercept is the log(rate) at covariate values of zero for continuous variables and at the baseline level for factors. The null hypothesis tested is that the log(rate) is zero, i.e. that the rate is 1. This is often not a meaningful hypothesis (especially if age is a covariate, as it usually is for events in medical settings).
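
As a parallel illustration with the Poisson estimates from your output (again, the printed values are approximate), exponentiating recovers the rate per unit of the offset, here the rate per count of n:

exp(-1.7875)           # treatment A rate, roughly 0.17
exp(-1.7875 + 1.3365)  # treatment B rate, roughly 0.64
exp(0)                 # the null being tested: log(rate) = 0, i.e. a rate of 1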


> treatmentB    1.3365     0.3529   3.787 0.000152 ***
> 
> I think I may use the command below to obtain the mean ratios of the x group in treatments A and B for the logistic regression, but I have not yet figured it out for the Poisson regression.
> 
> > tapply(predict(fit2, type="response"), test$treatment, mean)
>         A         B 
> 0.1620254 0.6404239
> 
> I'll appreciate it if you know the command for the Poisson regression.

It should be the same call, but the interpretation is different, as mentioned above. The ?predict.glm help page says "the alternative "response" is on the scale of the response variable", and that is not limited to logistic regression, despite the fact that the next sentence on that page applies the principle to the logistic case.
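
Note that for the Poisson mixed model the response-scale predictions are expected counts (the offset is included), so to get ratios comparable to the binomial fit you would divide by n before averaging. A sketch, assuming fit2 here is the Poisson glmer fit that includes the offset(log(n)) term:

tapply(predict(fit2, type = "response")/test$n, test$treatment, mean)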

This is knowledge one would probably get in a third- or fourth-semester stats course, at least based on my experience 30 years ago. It is really not what the R-help mailing list is set up to provide (nor am I necessarily the most qualified to provide online statistics education), so I'm suggesting you pose any further questions to a forum designed to address such matters. Go to http://stats.stackexchange.com/ or find an outlet designed for self-learning of statistical concepts.

This is material that I learned through both formal education and study of Breslow and Day's two-volume text "Statistical Methods in Cancer Research", which I believe is a classic and still a useful resource. It used GLIM as the coding platform, but I found transferring from GLIM to R rather easy. I can also recommend Venables and Ripley's "Modern Applied Statistics with S", which in its 4th edition had R-specific annotations (earlier editions were titled "... with S-PLUS"). It covers a wide range of applications, but does assume significant prior training.

I encourage other respondents who have recommendations for more recent texts suitable for self-study of generalized linear models with replication in R to chime in here, since I think such recommendations _are_ on-topic. I hesitate to recommend Pinheiro and Bates's "Mixed-Effects Models in S and S-PLUS" because so much has changed in more recent versions of the lme4 package that I find it error-prone to follow along with. I wish the effort to write a replacement had continued, but my searches suggest it has not. I'd be happy to be corrected.


Best,

David.



> Joshua
> 
> 
> From: David Winsemius <dwinsemius at comcast.net>
> Sent: Wednesday, September 28, 2016 4:54:46 PM
> To: Shuhua Zhan
> Cc: r-help at R-project.org; Greg Snow
> Subject: Re: [R] How to test a difference in ratios of count data in R
>  
> 
> > On Sep 28, 2016, at 9:49 AM, Greg Snow <538280 at gmail.com> wrote:
> > 
> > There are multiple ways of doing this, but here are a couple.
> > 
> > To just test the fixed effect of treatment you can use the glm function:
> > 
> > test <- read.table(text="
> > replicate treatment n X
> > 1 A 32 4
> > 1 B 33 18
> > 2 A 20 6
> > 2 B 21 18
> > 3 A 7 0
> > 3 B 8 4
> > ", header=TRUE)
> > 
> > fit1 <- glm( cbind(X,n-X) ~ treatment, data=test, family=binomial)
> > summary(fit1)
> > 
> > Note that the default baseline value may differ between R and SAS,
> > which would result in a reversed sign on the slope coefficient (and
> > different intercept).
> > 
> > To include replicate as a random effect you need an additional
> > package, here I use lme4 and the glmer function:
> > 
> > library(lme4)
> > fit2 <- glmer( cbind(X, n-X) ~ treatment + (1|replicate), data=test,
> > family=binomial)
> > summary(fit2)
> > 
> > 
> > 
> > On Tue, Sep 27, 2016 at 9:03 PM, Shuhua Zhan <szhan at uoguelph.ca> wrote:
> >> Hello R-experts,
> >> I am interested in determining whether the ratio of counts from two groups differs across two distinct treatments. For example, we have three replicates of treatment A and three replicates of treatment B. For each treatment, we have counts X from one group and counts Y from another group. My understanding is that the GLIMMIX procedure in SAS can test whether the ratio of counts in one group (X/(X+Y)) significantly differs between treatments.
> >> 
> >> I think this is the way you do it in SAS. The replicate and treatment variables are self-explanatory. The first number (n) refers to the total counts X + Y; the second number (X) refers to the counts X. Is there a way to do this in R? Since we have 20,000 datasets to be tested, it may be easier to retrieve the significance test for a dataset like the one below, along with its Pr > F value and the mean ratios of the treatments, in R than in SAS.
> >> 
> >> 
> >> data test;
> >> input replicate treatment$ n X;
> >> datalines;
> >> 1 A 32 4
> >> 1 B 33 18
> >> 2 A 20 6
> >> 2 B 21 18
> >> 3 A 7 0
> >> 3 B 8 4
> >> ;
> >> 
> 
> Greg has already shown you how that is done in R and how to do logistic regression:
> 
> #  I usually think of Poisson regression when I hear a desire to model ratios of counts that have a denominator. The log(sample_size) is supplied as an offset to correct for the variation in size of the subsamples.
> 
> 
> fit1 <- glm( X ~ treatment+offset(log(n)), data=test, family=poisson)
> summary(fit1)
> 
> #  And the lme4 analogue with replication:
> 
> library(lme4)
> fit2 <- glmer( X ~ treatment + offset(log(n))+ (1|replicate), data=test,
> family=poisson)
> summary(fit2)
> #----output----
> Generalized linear mixed model fit by maximum likelihood (Laplace  Approximation)
>  [glmerMod]
>  Family: poisson  ( log )
> Formula: X ~ treatment + offset(log(n)) + (1 | replicate)
>    Data: test
> 
>      AIC      BIC   logLik deviance df.resid 
>     31.9     31.3    -13.0     25.9        3 
> 
> Scaled residuals: 
>     Min      1Q  Median      3Q     Max 
> -1.0504 -0.4146 -0.3487  0.3956  1.0791 
> 
> Random effects:
>  Groups    Name        Variance Std.Dev.
>  replicate (Intercept) 0.03159  0.1777  
> Number of obs: 6, groups:  replicate, 3
> 
> Fixed effects:
>             Estimate Std. Error z value Pr(>|z|)    
> (Intercept)  -1.7875     0.3372  -5.301 1.15e-07 ***
> treatmentB    1.3365     0.3529   3.787 0.000152 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Correlation of Fixed Effects:
>            (Intr)
> treatmentB -0.838
> 
> Compare with the binomial model:
> #============
> 
> 
>  fitBin <- glmer( cbind(X,n-X) ~ treatment +  (1|replicate), data=test,
>  family=binomial)
>  coef(fitBin)
> #----
> $replicate
>   (Intercept) treatmentB
> 1  -2.0487694   2.364695
> 2  -0.9908556   2.364695
> 3  -2.1844435   2.364695
> 
> attr(,"class")
> [1] "coef.mer"
> #-----
>  summary(fitBin)
> #---------
> Generalized linear mixed model fit by maximum likelihood (Laplace  Approximation)
>  [glmerMod]
>  Family: binomial  ( logit )
> Formula: cbind(X, n - X) ~ treatment + (1 | replicate)
>    Data: test
> 
>      AIC      BIC   logLik deviance df.resid 
>     30.1     29.4    -12.0     24.1        3 
> 
> Scaled residuals: 
>      Min       1Q   Median       3Q      Max 
> -0.88757 -0.35065 -0.03137  0.26897  0.67505 
> 
> Random effects:
>  Groups    Name        Variance Std.Dev.
>  replicate (Intercept) 0.4123   0.6421  
> Number of obs: 6, groups:  replicate, 3
> 
> Fixed effects:
>             Estimate Std. Error z value Pr(>|z|)    
> (Intercept)  -1.7442     0.5438  -3.208  0.00134 ** 
> treatmentB    2.3647     0.4741   4.988 6.11e-07 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Correlation of Fixed Effects:
>            (Intr)
> treatmentB -0.568
> 
> The binomial model has a logit link. Your GLIMMIX procedure appears to assume a gaussian/normal distribution and an identity link by default. If we run the model with those assumptions in lme4::glmer we get the results below (with a warning that we can overlook in this case, since the results from lmer turn out to be identical).
> #--------
>  fitNorm <- glmer( I(X/n) ~ treatment +  (1|replicate), data=test,
>                   family=gaussian)
> 
> #-------
> Warning message:
> In glmer(I(X/n) ~ treatment + (1 | replicate), data = test, family = gaussian) :
>   calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
> > coef(fitNorm); summary(fitNorm)
> $replicate
>   (Intercept) treatmentB
> 1 0.091096925  0.4925325
> 2 0.324579602  0.4925325
> 3 0.009323473  0.4925325
> 
> attr(,"class")
> [1] "coef.mer"
> Linear mixed model fit by REML ['lmerMod']
> Formula: I(X/n) ~ treatment + (1 | replicate)
>    Data: test
> 
> REML criterion at convergence: -4.2
> 
> Scaled residuals: 
>     Min      1Q  Median      3Q     Max 
> -0.7864 -0.4278 -0.1152  0.5143  0.8246 
> 
> Random effects:
>  Groups    Name        Variance Std.Dev.
>  replicate (Intercept) 0.027895 0.16702 
>  Residual              0.002356 0.04854 
> Number of obs: 6, groups:  replicate, 3
> 
> Fixed effects:
>             Estimate Std. Error t value
> (Intercept)  0.14167    0.10042   1.411
> treatmentB   0.49253    0.03963  12.427
> 
> Correlation of Fixed Effects:
>            (Intr)
> treatmentB -0.197
> 
> That's (probably) the model to compare to your SAS results if my reading of the SAS Proc GLIMMIX manual page is correct.
> 
> -- 
> David.
> 
> >> proc glimmix data=test;
> >> class replicate treatment;
> >> model X/n = treatment / solution;
> >> random intercept / subject=replicate;
> >> run;
> >> 
> >> ods select lsmeans;
> >> proc glimmix data=test;
> >> class replicate treatment;
> >> model X/n = treatment / solution;
> >> random intercept / subject=replicate;
> >> lsmeans treatment / cl ilink;
> >> run;
> >> 
> >> I appreciate your help in advance!
> >> Joshua
> >> 
> >> 
> > 
> > 
> > 
> > -- 
> > Gregory (Greg) L. Snow Ph.D.
> > 538280 at gmail.com
> > 
> 
> David Winsemius
> Alameda, CA, USA

David Winsemius
Alameda, CA, USA


