[R] repeated measures: multiple comparisons with pairwise.t.test and multcomp disagree

Bert Gunter bgunter.4567 at gmail.com
Wed Jun 24 23:55:03 CEST 2015


I meant Andrew Gelman's "Working Through Some Issues" and the two
Letters to the Editor that follow it, all responding to the editorial
decision to ban P values from the journal Basic and Applied Social
Psychology (BASP). You may also wish to read ASA President David
Morganstein's reflexive and entirely predictable reaction (P-values are
OK; it's their abuse/misuse that is the problem) in the June 2015
Amstat News.

While I have lots of personal opinions on this, this is not the venue
to (further?) air them. If you wish to engage me -- pro or con; I
welcome both -- please respond privately. I will not comment further
on list.

Cheers,
Bert
Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Wed, Jun 24, 2015 at 2:37 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Bert, can you be more specific about which article you mean, for those of us who don't subscribe?
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On June 24, 2015 12:13:05 PM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>I would **strongly** recommend that you speak with a local statistical
>>expert before proceeding further. Your obsession with statistical
>>significance is very dangerous (see the current issue of Significance
>>for some explanation).
>>
>>Cheers,
>>Bert
>>Bert Gunter
>>
>>"Data is not information. Information is not knowledge. And knowledge
>>is certainly not wisdom."
>>   -- Clifford Stoll
>>
>>
>>On Wed, Jun 24, 2015 at 10:30 AM, Denis Chabot <denis.chabot at me.com>
>>wrote:
>>> Thank you, Thierry. And yes, Bert, it turns out that it is more of a
>>> statistical question after all, but again, since my question used
>>> specific R functions, R experts are well placed to help me.
>>>
>>> As pairwise.t.test was recommended in a few tutorials about
>>> repeated-measures ANOVAs, I assumed it took into account the fact
>>> that the measures were indeed repeated, so thank you for pointing out
>>> that it does not.
>>>
>>> But my reason for not accepting the result of multcomp went further
>>> than this. Before deciding to test 4 different durations, I had
>>> tested only two of them, corresponding to sets 1 and 2 of my example,
>>> using a paired t test (i.e., a t test for paired samples). The effect
>>> was highly significant: the mean of the differences calculated for
>>> each subject was significantly different from zero.
>>>
>>> After adding two other durations and switching from my paired t test
>>> to a repeated-measures design, these same 2 sets are no longer
>>> different. I think the explanation is a lack of homogeneity of
>>> variances. I thought a log transformation of the raw data had been
>>> sufficient to fix this, and a Levene test on the variances of the 4
>>> sets found no problem in this regard.
>>>
>>> But maybe it is the variance of all the possible differences (set 1
>>> vs 2, etc., for a total of 6 differences calculated for each subject)
>>> that matters; this is the sphericity assumption of repeated-measures
>>> designs. I just calculated these differences and their variances
>>> range from 1.788502e-05 to 1.462171e-03. A Levene test on these 6
>>> "groups" showed that their variances were heterogeneous.
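>>>
>>> Here is a minimal sketch of that check, assuming the wide data frame
>>> Data has one column per set (the column names Set1 to Set4 and the
>>> log10 scale are taken from my code further down):
>>>
>>> logD   <- log10(Data[, c("Set1", "Set2", "Set3", "Set4")])
>>> combos <- combn(names(logD), 2)      # the 6 possible pairs of sets
>>> diffs  <- apply(combos, 2, function(p) logD[[p[1]]] - logD[[p[2]]])
>>> colnames(diffs) <- apply(combos, 2, paste, collapse = "-")
>>> apply(diffs, 2, var)                 # variances of the 6 differences
>>> # Levene test on the 6 groups of differences
>>> library(car)
>>> stacked <- stack(as.data.frame(diffs))
>>> leveneTest(values ~ ind, data = stacked)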
>>>
>>> I think I'll stay away from the "repeated measures followed by
>>> multiple comparisons" approach and just report my 6 t tests for
>>> paired samples, correcting the p-level for the number of comparisons
>>> with, say, the Sidak method (the per-comparison threshold for
>>> significance is then 1 - (1 - 0.05)^(1/6), about 0.0085).
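>>>
>>> A minimal sketch of that fallback, assuming Data_Long as built in my
>>> code further down:
>>>
>>> alpha_sidak <- 1 - (1 - 0.05)^(1/6)  # per-comparison threshold
>>> # unadjusted paired t tests; compare each p-value to alpha_sidak
>>> with(Data_Long,
>>>      pairwise.t.test(SMR, Duration, p.adjust.method = "none",
>>>                      paired = TRUE))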
>>>
>>> Thanks for your help.
>>>
>>> Denis
>>>
>>>> On 2015-06-23 at 08:15, Thierry Onkelinx
>>>> <thierry.onkelinx at inbo.be> wrote:
>>>>
>>>> Dear Denis,
>>>>
>>>> It's not multcomp which is too conservative, it is the pairwise
>>>> t-test which is too liberal. The pairwise t-test doesn't take the
>>>> random effect of Case into account.
>>>>
>>>> Best regards,
>>>> ir. Thierry Onkelinx
>>>> Instituut voor natuur- en bosonderzoek / Research Institute for
>>>> Nature and Forest
>>>> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>>> Kliniekstraat 25
>>>> 1070 Anderlecht
>>>> Belgium
>>>>
>>>> To call in the statistician after the experiment is done may be no
>>>> more than asking him to perform a post-mortem examination: he may be
>>>> able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>> The combination of some data and an aching desire for an answer does
>>>> not ensure that a reasonable answer can be extracted from a given
>>>> body of data. ~ John Tukey
>>>>
>>>>
>>>> 2015-06-23 5:17 GMT+02:00 Denis Chabot <denis.chabot at me.com>:
>>>>> Hi,
>>>>>
>>>>> I am working on a problem which I think can be handled as a
>>>>> repeated-measures analysis, and I have read many tutorials about how
>>>>> to do this with R. This part goes well, but I get stuck with the
>>>>> multiple comparisons I'd like to run afterward. I tried two methods
>>>>> that I have seen in my readings, but their results are quite
>>>>> different and I don't know which one to trust.
>>>>>
>>>>> The two approaches are pairwise.t.test() and multcomp; the latter
>>>>> is not available after a repeated-measures aov model, but it is
>>>>> after an lme fit.
>>>>>
>>>>> I have a physiological variable measured frequently on each of 67
>>>>> animals. These measurements are then summarized with a quantile for
>>>>> each animal. To check the effect of experiment duration, I
>>>>> recalculated the quantile for each animal 4 times, using different
>>>>> subsets of the data (so the shortest subset is part of all other
>>>>> subsets, the second subset is included in the 2 others, etc.). I
>>>>> handle this as 4 repeated (non-independent) measurements for each
>>>>> animal, and want to see if the average value (for 67 animals) differs
>>>>> among the 4 durations.
>>>>>
>>>>> Because animals with high values for this physiological trait have
>>>>> larger differences between the 4 durations than animals with low
>>>>> values, the observations were log-transformed.
>>>>>
>>>>> I attach the small data set (Rda format), but it can also be
>>>>> obtained from this link if the attachment gets stripped:
>>>>> <https://dl.dropboxusercontent.com/u/612902/RepMeasData.Rda>
>>>>>
>>>>> The data.frame is simply called Data.
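>>>>>
>>>>> In case both the attachment and the link fail, a hypothetical toy
>>>>> frame with the same shape (67 animals, one column per duration; the
>>>>> values are made up) can be simulated as follows:
>>>>>
>>>>> set.seed(1)
>>>>> smr  <- rlnorm(67, meanlog = 0, sdlog = 0.3)  # one value per animal
>>>>> Data <- data.frame(Case = factor(1:67),
>>>>>                    Set1 = smr,
>>>>>                    Set2 = smr * rlnorm(67, 0, 0.01),
>>>>>                    Set3 = smr * rlnorm(67, 0, 0.01),
>>>>>                    Set4 = smr * rlnorm(67, 0, 0.01))
>>>>>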
>>>>> My code is:
>>>>>
>>>>> library(reshape2)  # for melt()
>>>>> library(nlme)      # for lme()
>>>>> library(multcomp)  # for glht() and mcp()
>>>>>
>>>>> load("RepMeasData.Rda")  # or use the simulated toy frame above
>>>>> # reshape to long format: one row per Case x Duration combination
>>>>> Data_Long <- melt(Data, id = "Case")
>>>>> names(Data_Long) <- c("Case", "Duration", "SMR")
>>>>> Data_Long$SMR <- log10(Data_Long$SMR)
>>>>>
>>>>> # I only show essential code to reproduce my opposing results
>>>>> # random intercept per animal (Case), fixed effect of Duration
>>>>> mixmod <- lme(SMR ~ Duration, data = Data_Long, random = ~ 1 | Case)
>>>>> anova(mixmod)
>>>>> # all pairwise (Tukey-style) comparisons of the Duration levels
>>>>> posthoc <- glht(mixmod, linfct = mcp(Duration = "Tukey"))
>>>>> summary(posthoc)
>>>>>        Simultaneous Tests for General Linear Hypotheses
>>>>>
>>>>> Multiple Comparisons of Means: Tukey Contrasts
>>>>>
>>>>>
>>>>> Fit: lme.formula(fixed = SMR ~ Duration, data = Data_Long,
>>>>>      random = ~1 | Case)
>>>>>
>>>>> Linear Hypotheses:
>>>>>                 Estimate Std. Error z value Pr(>|z|)
>>>>> Set2 - Set1 == 0 -0.006135   0.003375  -1.818    0.265
>>>>> Set3 - Set1 == 0 -0.002871   0.003375  -0.851    0.830
>>>>> Set4 - Set1 == 0  0.015395   0.003375   4.561   <1e-04 ***
>>>>> Set3 - Set2 == 0  0.003264   0.003375   0.967    0.768
>>>>> Set4 - Set2 == 0  0.021530   0.003375   6.379   <1e-04 ***
>>>>> Set4 - Set3 == 0  0.018266   0.003375   5.412   <1e-04 ***
>>>>> ---
>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>>> (Adjusted p values reported -- single-step method)
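>>>>>
>>>>> # As an aside, the same glht object also gives simultaneous
>>>>> # confidence intervals for these contrasts:
>>>>> confint(posthoc)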
>>>>>
>>>>> with(Data_Long, pairwise.t.test(SMR, Duration,
>>>>>      p.adjust.method = "holm", paired = TRUE))
>>>>>       Pairwise comparisons using paired t tests
>>>>>
>>>>> data:  SMR and Duration
>>>>>
>>>>>    Set1    Set2    Set3
>>>>> Set2 < 2e-16 -       -
>>>>> Set3 0.11118 0.10648 -
>>>>> Set4 0.00475 7.9e-05 0.00034
>>>>>
>>>>> P value adjustment method: holm
>>>>>
>>>>> So the difference between sets 1 and 2 goes from non-significant to
>>>>> very significant, depending on the method.
>>>>>
>>>>> I have other examples with essentially the same type of data, and
>>>>> sometimes the two approaches differ in the opposite way: in the
>>>>> example shown here multcomp was more conservative, while in some
>>>>> others it yielded a larger number of significant differences.
>>>>>
>>>>> I admit to not mastering all the intricacies of multcomp, but I have
>>>>> used multcomp and other methods of doing multiple comparisons many
>>>>> times before (though never with a repeated-measures design), and I
>>>>> always found the results very similar. When there were small
>>>>> differences, I trusted multcomp. This time, I get rather large
>>>>> differences and I am worried that I am doing something wrong.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Denis Chabot
>>>>> Fisheries & Oceans Canada
>>>>>
>>>>> sessionInfo()
>>>>> R version 3.2.0 (2015-04-16)
>>>>> Platform: x86_64-apple-darwin13.4.0 (64-bit)
>>>>> Running under: OS X 10.10.3 (Yosemite)
>>>>>
>>>>> locale:
>>>>> [1] fr_CA.UTF-8/fr_CA.UTF-8/fr_CA.UTF-8/C/fr_CA.UTF-8/fr_CA.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] multcomp_1.4-0  TH.data_1.0-6   survival_2.38-1 mvtnorm_1.0-2   nlme_3.1-120    car_2.0-25      reshape2_1.4.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Rcpp_0.11.5      magrittr_1.5     splines_3.2.0    MASS_7.3-40      lattice_0.20-31  minqa_1.2.4      stringr_1.0.0
>>>>> [8] plyr_1.8.2       tools_3.2.0      nnet_7.3-9       pbkrtest_0.4-2   parallel_3.2.0   grid_3.2.0       mgcv_1.8-6
>>>>> [15] quantreg_5.11    lme4_1.1-7       Matrix_1.2-0     nloptr_1.0.4     codetools_0.2-11 sandwich_2.3-3   stringi_0.4-1
>>>>> [22] SparseM_1.6      zoo_1.7-12
>>>
>>
>


