[R] paired t-test with bootstrap

Tue Jul 13 15:23:44 CEST 2004

On Tue, 2004-07-13 at 07:28, Petr Pikal wrote:
> Hi
> 
> On 13 Jul 2004 at 12:28, luciana wrote:
> 
> > Dear Sirs,
> > 
> > I am a beginning R user. Using R, I would like to apply the
> > bootstrap to my data in order to test cost differences between
> > independent or paired samples of people affected by a certain
> > disease.
> > 
> > My problem is that even though I am reading the book by Efron
> > (An Introduction to the Bootstrap), looking at the examples on the
> > internet and those available in R, and learning a lot of theory
> > about the bootstrap, I can't apply the bootstrap to my data in R
> > because of many doubts and difficulties. This is why I have decided
> > to ask the experts for help.
> > 
> > 
> > 
> > I have a sample of diabetic people, matched (by age and sex) with a
> > control sample. The variable I would like to compare is their
> > monthly drug and hospital cost. The cost variable is very far from
> > Gaussian, but I still need to compare the means of the two groups.
> > So, in the specific case of a paired-sample t-test, I want to test
> > whether the mean difference in cost is close to 0. What is the best
> > way to do that?
> > 
> > 
> > 
> > Another question is that sometimes I have missing data in my dataset
> > (for example, I have the cost for a patient but not for the matched
> > control). If I enter NA or a dot, R doesn't compute the statistic I
> > need (for instance the mean). To overcome this problem I have
> > replaced the missing data with the mean computed from the remaining
> > data. However, I think R can actually compute the mean even in the
> > presence of missing data. Is that right? What can I do?
> 
> your.statistic(your.data, na.rm=TRUE)
> 
> e.g.
> mean(your.data, na.rm=TRUE)
> 
> or look at ?na.action, e.g. mean(na.omit(your.data))
> 
> Cheers
> Petr Pikal

A couple of other thoughts here with respect to the use of a paired
t-test for the comparison.

As Luciana notes above, cost data is typically highly skewed, raising
doubt as to the use of a simple parametric test to compare the two
groups.
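
As a minimal sketch of what a bootstrap on the paired differences
could look like in base R (the data below are simulated and the names
cost_case, cost_control and B are hypothetical, not Luciana's actual
data or a definitive analysis): resample the within-pair differences
with replacement, take a percentile interval for the mean difference,
and obtain a p-value by re-centring the differences so that the null
hypothesis of zero mean difference holds.

```r
# Simulated, skewed monthly costs for matched pairs (assumption: NOT real data)
set.seed(1)
n <- 50
cost_case    <- rlnorm(n, meanlog = 5.0, sdlog = 1)  # diabetic patients
cost_control <- rlnorm(n, meanlog = 4.5, sdlog = 1)  # matched controls

# Work on within-pair differences, which preserves the pairing
d <- cost_case - cost_control

# Resample the differences with replacement; recompute the mean each time
B <- 2000
boot_means <- replicate(B, mean(sample(d, replace = TRUE)))

# Percentile 95% interval for the mean paired difference
ci <- quantile(boot_means, c(0.025, 0.975))

# Bootstrap test of H0: mean difference = 0.
# Centre the differences so that H0 is true in the resampled population,
# then see how often a resampled mean is as extreme as the observed one.
d0 <- d - mean(d)
boot_null <- replicate(B, mean(sample(d0, replace = TRUE)))
p_value <- mean(abs(boot_null) >= abs(mean(d)))

print(mean(d))
print(ci)
print(p_value)
```

The same resampling could also be done with the boot package; the
base R version is used here only to keep each step visible.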

One of the many reasons such data is skewed is that there are notable
differences in the populations that are not accounted for when only
simple characteristics are used for matching, as is done here. What
makes a patient an "outlier" with respect to cost, and how does the
distribution of these patients differ between the two groups and
within the individual pairs?

For example, are all the patients in both groups insulin dependent or
are some controlled with oral agents or diet alone? If all are using
insulin, are some using self-administered injections while others are
using implanted infusion pumps? What is the interval from disease onset?
Have any had pancreas or islet cell transplants? Do the matched patients
have similar diabetes-related sequelae such as diabetic retinopathy,
neuropathy, vasculopathy, renal dysfunction and others? If not, the
costs to treat these other issues, such as dialysis and wound care
alone, can dramatically alter the cost profile for patients even when
matched by age and gender.

If you are not considering these issues (e.g., through
inclusion/exclusion criteria), you risk significant challenges to your
conclusions with respect to the comparison of costs for these two
groups. I would raise similar concerns about using a sample mean as
the imputed value for missing data.
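
On the missing data point, a small sketch of one alternative to mean
imputation for paired data is to drop any pair in which either member
is missing before taking differences. The data frame costs and its
values are made up for illustration, and whether dropping pairs is
appropriate depends on why the values are missing.

```r
# Hypothetical paired costs with missing values (made-up numbers)
costs <- data.frame(
  case    = c(120, 340,  NA,  95, 410),
  control = c( 80,  NA,  60,  70, 150)
)

# mean() returns NA unless missing values are removed explicitly
mean(costs$case)                 # NA
mean(costs$case, na.rm = TRUE)   # mean of the observed case costs only

# For a paired analysis, keep only rows where both members are observed,
# so that every difference comes from an intact pair
complete <- costs[complete.cases(costs), ]
d <- complete$case - complete$control

nrow(complete)   # number of intact pairs
mean(d)          # mean paired difference over intact pairs
```

Note that na.rm = TRUE applied to each column separately would average
over different sets of patients, which is why the pairs are filtered
first.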

If you have not done so already, a Medline search of the literature
would be in order to better understand what others have done in this
area for diabetic treatment costs and the pros and cons of their
respective approaches. I suspect that others here will have additional
recommendations.

HTH,

Marc Schwartz