[R] Bug in t.test?

Thomas Lumley tlumley at u.washington.edu
Sat Aug 14 18:07:14 CEST 2010


On Sat, 14 Aug 2010, Ted.Harding at manchester.ac.uk wrote:

> Hi Thomas,
> I'm not too sure about your interpretation. Consider:

It seems hard to interpret "The formula interface is only applicable for the 2-sample tests." any other way.

>
> Johannes' original query was about differences when there
> are NAs, corresponding to different settings of "na.action".
> It is perhaps possible that 'na.action="na.pass"' and
> 'na.action="na.exclude"' result in different pairings in the
> case "paired=TRUE". However, it seems to me that the differences
> he observed are, shall we say, obscure!

No, they are perfectly straightforward.  Johannes's data had two missing values, one in each group, but not in the same pair.

With na.omit or na.exclude, model.frame() removes the NAs. If there are the same number of NAs in each group, this leaves the same number of observations in each group. t.test.formula() splits these according to the group variable and passes them to t.test.default(). Because of the (invalid) paired=TRUE argument, t.test.default() assumes these are nine pairs and gets bogus answers.

On the other hand, with na.pass, model.frame() does not remove NAs. t.test.formula() passes two sets of ten observations (including missing observations) to t.test.default().  Because of the paired=TRUE argument, t.test.default() assumes these are ten pairs, which happens to be true in this case, and after deleting the two pairs with missing observations it gives the right answer.
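
The two paths can be reproduced by hand, which may make the difference easier to see. The following is only a sketch: the numbers are made up (they are not Johannes's data), and the model.frame()/split() steps are my approximation of what t.test.formula() does internally, written out so that the example does not depend on the formula method accepting paired=TRUE.

```r
# Made-up data: ten pairs, one NA in each group, in different pairs (3 and 7).
before <- c(5.1, 4.8, NA, 5.6, 6.0, 5.3, 4.9, 5.7, 5.2, 6.1)
after  <- c(5.9, 5.0, 5.8, 6.1, 6.4, 5.2, NA, 6.3, 5.5, 6.8)
d <- data.frame(y = c(before, after),
                g = factor(rep(c("before", "after"), each = 10),
                           levels = c("before", "after")))

# Path 1: na.omit drops the two NA rows before the split,
# leaving nine values per group that no longer line up as pairs.
mf_omit <- model.frame(y ~ g, data = d, na.action = na.omit)
parts_omit <- split(mf_omit$y, mf_omit$g)
bad <- t.test(parts_omit$before, parts_omit$after, paired = TRUE)

# Path 2: na.pass keeps the NA rows, so the split gives ten values per
# group in the original order; t.test.default() then drops incomplete pairs.
mf_pass <- model.frame(y ~ g, data = d, na.action = na.pass)
parts_pass <- split(mf_pass$y, mf_pass$g)
good <- t.test(parts_pass$before, parts_pass$after, paired = TRUE)

# Reference: pair explicitly and keep only the complete pairs.
ok <- complete.cases(before, after)
ref <- t.test(before[ok], after[ok], paired = TRUE)

bad$parameter   # degrees of freedom from nine misaligned "pairs"
good$parameter  # degrees of freedom from the eight genuine complete pairs
```

Here `good` agrees with `ref`, while `bad` is computed from nine differences, most of which compare values from different subjects.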

Regardless of the details, however, t.test.formula() can't reliably work with paired=TRUE because the user interface provides no way to specify which observations are paired. It would be possible (though a bad idea in my opinion) to specify that paired=TRUE is allowed and that the pairing is done in the order the observations appear in the data. The minimal change would be to stop doing missing-value removal in t.test.formula(), although that would be undesirable if a user wanted to supply some sort of na.impute() option.

I would strongly prefer having an explicit indication of pairing, e.g. paired=variable.name, or even better, paired=~variable.name. Relying on data frame ordering seems a really bad idea.
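
To illustrate what explicit pairing could look like, here is a rough sketch. The function paired_t_by_id() and its arguments are hypothetical (nothing like it exists in R as far as this message is concerned), and the data are made up. It matches observations on an id column with merge() and only then calls t.test() on the aligned vectors, so the row order of the data frame does not matter.

```r
# Hypothetical helper, not part of R: pair observations by an explicit id.
# `response`, `group`, and `id` are column names given as strings.
paired_t_by_id <- function(data, response, group, id) {
  g <- factor(data[[group]])
  stopifnot(nlevels(g) == 2)
  first  <- data[which(g == levels(g)[1]), c(id, response)]
  second <- data[which(g == levels(g)[2]), c(id, response)]
  # merge() aligns the two groups on the id, whatever the row order was.
  both <- merge(first, second, by = id, suffixes = c(".1", ".2"))
  # t.test.default() drops pairs with a missing value on either side.
  t.test(both[[paste0(response, ".1")]],
         both[[paste0(response, ".2")]],
         paired = TRUE)
}

# Made-up data: ten subjects, one NA in each group, in different pairs.
before <- c(5.1, 4.8, NA, 5.6, 6.0, 5.3, 4.9, 5.7, 5.2, 6.1)
after  <- c(5.9, 5.0, 5.8, 6.1, 6.4, 5.2, NA, 6.3, 5.5, 6.8)
d <- data.frame(id = rep(1:10, 2),
                g  = rep(c("before", "after"), each = 10),
                y  = c(before, after))

in_order <- paired_t_by_id(d, "y", "g", "id")

# Reversing the rows changes nothing, because pairing uses the id column.
reversed <- paired_t_by_id(d[rev(seq_len(nrow(d))), ], "y", "g", "id")
```

Both calls use the same eight complete pairs, so they give the same test statistic.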

    -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle


