[R] Data analysis: normal approximation for binomial

Joshua Wiley jwiley.psych at gmail.com
Sun Nov 20 07:00:54 CET 2011


Hi Colin,

I have never heard of a binomial distribution z statistic with (or
without for that matter) a continuity correction, but I am not a
statistician.  Other's may have some ideas there.  As for other ways
to analyze the data, I skimmed through the article and brought the
data and played around with some different analyses and graphs.  I
attached a file headache.txt with all the R script (including the data
in an Rish format).  It is really a script file (i.e., .R) but for the
listservs sake I saved it as a txt.  There are quite a few different
things I tried in there so hopefully it gives you some ideas.
Regardless of the analysis type used and whether one considers
proportion that "significantly improved" or the raw frequency or
intensity scores, I would say that concluding the treatment was
effective is a good conclusion.  The only real concern could be that
people would naturally get better on their own (a control group would
be needed to bolster the causal inference drawn from a pre/post
measurement).  However, given at least what I know about migraines, it
is often a fairly chronic condition so over a relatively short time
period, it seems implausible to conclude that as many people would be
improving as this study reported.

Cheers,

Josh

On Sat, Nov 19, 2011 at 7:43 PM, Colstat <colstat at gmail.com> wrote:
> hey, Joshua
> I was reading this paper, in attachment, and reproducing the results.
>
> I was really confused when he said in the paper "The results were then
> statistically analyzed using binomial distribution z statistics with
> continuity correction."  The data is binomial?  To me, this is a paired
> t-test.
>
> What command should I use to get those results (the first paragraph in
> Results section)?  Basically, it's a pre and post treatment problem.
>
> What other graphical analysis do you think is appropriate? reshape package?
> lattice package, namely conditional graph?
>
> I know this might be too much, but I do really appreciate it if you do take
> a look at it.
>
> Thanks,
> Colin
>
>
> On Sat, Nov 19, 2011 at 10:15 PM, Joshua Wiley <jwiley.psych at gmail.com>
> wrote:
>>
>> Hi,
>>
>> I am not clear what your goal is.  There is a variety of data there.
>> You could look at t-test differences in preIntensity broken down by
>> sex, you could use regression looking at postIntensity controlling for
>> preIntensity and explained by age, you could....
>>
>> Why are you analyzing data from an article?  What did the article do?
>> What you mention---some sort of z statistic (what exactly this was of
>> and how it should be calculated did not seem like was clear even to
>> you), histogram, t-test, lattice, are all very different things that
>> help answer different questions, show different things, and in one is
>> a piece of software.
>>
>> Without a clearer question and goal, my best advice is here are a
>> number of different functions some of which may be useful to you:
>>
>> ls(pos = "package:stats")
>>
>> Cheers,
>>
>> Josh
>>
>> On Sat, Nov 19, 2011 at 3:01 PM, Colstat <colstat at gmail.com> wrote:
>> > Dear R experts,
>> >
>> > I am trying to analyze data from an article, the data looks like this
>> >
>> > Patient Age Sex Aura preCSM preFreq preIntensity postFreq postIntensity
>> > postOutcome
>> > 1 47 F A 4 6 9 2 8 SD
>> > 2 40 F A/N 5 8 9 0 0 E
>> > 3 49 M N 5 8 9 2 6 SD
>> > 4 40 F A 5 3 10 0 0 E
>> > 5 42 F N 5 4 9 0 0 E
>> > 6 35 F N 5 8 9 12 7 NR
>> > 7 38 F A 5 NA 10 2 9 SD
>> > 8 44 M A 4 4 10 0 0 E
>> > 9 47 M A 4 5 8 2 7 SD
>> > 10 53 F A 5 3 10 0 0 E
>> > 11 41 F N 5 6 7 0 0 E
>> > 12 49 F A 4 6 8 0 0 E
>> > 13 48 F A 5 4 8 0 0 E
>> > 14 63 M N 4 6 9 15 9 NR
>> > 15 58 M N 5 9 7 2 8 SD
>> > 16 53 F A 4 3 9 0 0 E
>> > 17 47 F N 5 4 8 1 4 SD
>> > 18 34 F A NA  5 9 0 0 E
>> > 19 53 F N 5 4 9 5 7 NR
>> > 20 45 F N 5 5 8 5 4 SD
>> > 21 30 F A 5 3 8 0 0 E
>> > 22 29 F A 4 5 9 0 0 E
>> > 23 49 F N 5 9 10 0 0 E
>> > 24 24 F A 5 5 9 0 0 E
>> > 25 63 F N 4 19 7 10 7 NR
>> > 26 62 F A 5 8 9 11 9 NR
>> > 27 44 F A 5 3 10 0 0 E
>> > 28 38 F N 4 8 10 1 3 SD
>> > 29 38 F N 5 3 10 0 0 E
>> >
>> > How do I do a binomial distribution z statistics with continuity
>> > correction? basically normal approximation.
>> > Could anyone give me some suggestions what I (or R) can do with these
>> > data?
>> > I have tried tried histogram, maybe t-test? or even lattice?  what else
>> > can
>> > I(or can R) do?
>> > help please, thanks so much.
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> Programmer Analyst II, ATS Statistical Consulting Group
>> University of California, Los Angeles
>> https://joshuawiley.com/
>
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
-------------- next part --------------
#### Data (provided via dput)
dat <- structure(list(patient = 1:29, age = c(47L, 40L, 49L, 40L, 42L,
35L, 38L, 44L, 47L, 53L, 41L, 49L, 48L, 63L, 58L, 53L, 47L, 34L,
53L, 45L, 30L, 29L, 49L, 24L, 63L, 62L, 44L, 38L, 38L), sex = structure(c(1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), nonaura = structure(c(1L, 2L, 3L,
1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L, 1L, 3L,
3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 3L, 3L), .Label = c("A", "A/N",
"N"), class = "factor"), csm = c(4L, 5L, 5L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 4L, 5L, 4L, 5L, 4L, 5L, NA, 5L, 5L, 5L, 4L, 5L,
5L, 4L, 5L, 5L, 4L, 5L), freq0 = c(6L, 8L, 8L, 3L, 4L, 8L, NA,
4L, 5L, 3L, 6L, 6L, 4L, 6L, 9L, 3L, 4L, 5L, 4L, 5L, 3L, 5L, 9L,
5L, 19L, 8L, 3L, 8L, 3L), inten0 = c(9L, 9L, 9L, 10L, 9L, 9L,
10L, 10L, 8L, 10L, 7L, 8L, 8L, 9L, 7L, 9L, 8L, 9L, 9L, 8L, 8L,
9L, 10L, 9L, 7L, 9L, 10L, 10L, 10L), freq1 = c(2L, 0L, 2L, 0L,
0L, 12L, 2L, 0L, 2L, 0L, 0L, 0L, 0L, 15L, 2L, 0L, 1L, 0L, 5L,
5L, 0L, 0L, 0L, 0L, 10L, 11L, 0L, 1L, 0L), inten1 = c(8L, 0L,
6L, 0L, 0L, 7L, 9L, 0L, 7L, 0L, 0L, 0L, 0L, 9L, 8L, 0L, 4L, 0L,
7L, 4L, 0L, 0L, 0L, 0L, 7L, 9L, 0L, 3L, 0L), btxa = structure(c(2L,
1L, 2L, 1L, 1L, 3L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 2L,
1L, 3L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L), .Label = c("E",
"SD", "NR"), class = "factor")), .Names = c("patient", "age",
"sex", "nonaura", "csm", "freq0", "inten0", "freq1", "inten1",
"btxa"), row.names = c(NA, -29L), class = "data.frame")

table(dat$btxa)

## note p = .1 is 10% (much higher than the 1% they said they used)

## eliminated + significantly reduced
binom.test(x = 16 + 8, n = 16 + 8 + 5, p = .1)

## eliminated only
binom.test(x = 16, n = 16 + 8 + 5, p = .1)

## unclear what they compare the 8 SDs against
## perhaps *just* the no responses??
binom.test(x = 8, n = 8 + 5, p = .1)
## even against everyone (terribly unfair)
## do not approach p < .04
binom.test(x = 8, n = 16 + 8 + 5, p = .1)


#### Examination of baseline characteristics
t.test(freq0 ~ sex, data = dat)
t.test(inten0 ~ sex, data = dat)
t.test(csm ~ sex, data = dat)
t.test(age ~ sex, data = dat)
## note how you treat the single A/N case
## will affect these results
t.test(freq0 ~ I(nonaura == "N"), data = dat)
t.test(inten0 ~ I(nonaura == "N"), data = dat)
t.test(csm ~ I(nonaura == "N"), data = dat)
t.test(age ~ I(nonaura == "N"), data = dat)

###### Tests of pre post frequency ######
## (paired) wilcox test for frequency (with and without correction)
with(dat, wilcox.test(x = freq0, y = freq1, paired = TRUE,
  conf.int = TRUE, correct = TRUE, exact = FALSE))

with(dat, wilcox.test(x = freq0, y = freq1, paired = TRUE,
  conf.int = TRUE, correct = FALSE, exact = FALSE))

## paired t-test
with(dat, t.test(x = freq0, y = freq1, paired = TRUE))

## see if age predicts post controlling for pre
## center freq0 and age so intercept is the expected
## post frequency (freq1) for an individual with average
## pre frequency and age
dat$cfreq0 <- with(dat, freq0 - mean(freq0, na.rm = TRUE))
dat$cage <- with(dat, age - mean(age, na.rm = TRUE))

summary(lm(freq1 ~ cfreq0 + cage, data = dat))
## age appears to be related to frequency of headaches

###### Tests of pre post intensity ######
## wilcox test for intensity (with and without correction)
with(dat, wilcox.test(x = inten0, y = inten1, paired = TRUE,
  conf.int = TRUE, correct = TRUE, exact = FALSE))

with(dat, wilcox.test(x = inten0, y = inten1, paired = TRUE,
  conf.int = TRUE, correct = FALSE, exact = FALSE))

with(dat, t.test(x = inten0, y = inten1, paired = TRUE))

## see if age predicts post controlling for pre
## center inten0 so intercept is the expected
## post intensity (inten) for an individual with average
## pre intensity and age
dat$cinten0 <- with(dat, inten0 - mean(inten0, na.rm = TRUE))

summary(lm(inten1 ~ cinten0 + cage, data = dat))
## age appears to be related to intensity of headaches

## see if pre frequency predicts post intensity
## or if pre intensity predicts post frequency
summary(lm(inten1 ~ cinten0 + cage + cfreq0, data = dat))
summary(lm(freq1 ~ cinten0 + cage + cfreq0, data = dat))


########### Some data visualization #################

require(reshape)
require(ggplot2)

## categorical variables
catvars <- c("patient", "sex", "nonaura", "btxa")
## continuous variables (ordered, and ignoring the centered ones we made)
## (ordered so the plots put freq0/1 and inten0/1 next to each other
contvars <- c("age", "csm", "freq0", "freq1", "inten0", "inten1")

## scatter plot matrix of continuous variables
plot(dat[contvars])
## correlation matrix
cor(dat[contvars], use = "pairwise.complete.obs")

mdat <- melt(dat[, c(catvars, contvars)], id = c("patient", "sex", "nonaura", "btxa"))

ggplot(data = mdat, aes(x = variable, y = value)) +
  geom_boxplot() +
  geom_jitter()

## age overwhelms things a bit
## we could create standardized variables
## some care is required because we want to be able to compare
## how frequency and intensity change, so they must be standardized together
## of course we could also make it a long data set with a time (pre/post) indicator
## but that makes it trickier to plot the nonchanging variables

## get mean/sd for overall freq and inten
desc <- with(dat, c(mf = mean(c(freq0, freq1), na.rm = TRUE),
  sdf = sd(c(freq0, freq1), na.rm = TRUE),
  mi = mean(c(inten0, inten1)), sdi = sd(c(inten0, inten1))))

## standardize
stdat <- within(dat, {
  age <- scale(age)
  csm <- scale(csm)
  freq0 <- (freq0 - desc["mf"])/desc["sdf"]
  freq1 <- (freq1 - desc["mf"])/desc["sdf"]
  inten0 <- (inten0 - desc["mi"])/desc["sdi"]
  inten1 <- (inten1 - desc["mi"])/desc["sdi"]
})

## melt data for plotting
stdmdat <- melt(stdat[, c(catvars, contvars)],
  id = c("patient", "sex", "nonaura", "btxa"))

## colour jittered points based on aura
ggplot(data = stdmdat, aes(x = variable, y = value)) +
  geom_boxplot() +
  geom_jitter(aes(colour = nonaura))

## it looks a bit like highest post frequencies are
## over represented in nonauras

summary(lm(freq1 ~ I(nonaura == "N"), data = dat))
summary(lm(freq1 ~ I(nonaura == "N") + freq0, data = dat))
## nonsignificant after account for pre frequency
##

## create long data just for freq
tmp <- melt(dat[, c("patient", "nonaura", "freq0", "freq1")],
            id = c("patient", "nonaura"))
## fit linear mixed effects model
## random intercept by patient
## value is freq0 and freq1
## variable indicates freq1 or freq0
## nonaura seems to predict overall frequency
## the negative effect of variable indicates that
## frequency decreases from freq0 to freq1
require(lme4)
summary(lmer(value ~ variable + I(nonaura=="N") + (1|patient), data = tmp))


More information about the R-help mailing list