[R] Bug in t.test?

Sat Aug 14 01:50:01 CEST 2010

Hi Thomas,
I'm not too sure about your interpretation. Consider:

  set.seed(54321)
  X <- rnorm(10) ; Y <- rnorm(10)
  XY <- c(X,Y); group<-c(rep(0,10),rep(1,10))

  t.test(X,Y,paired=TRUE)
  #         Paired t-test
  # data:  X and Y 
  # t = -1.5265, df = 9, p-value = 0.1612
  # 95 percent confidence interval:
  #  -2.1521562  0.4178885 
  # sample estimates: mean of the differences 
  #                  -0.8671339 

  t.test(XY~group,paired=TRUE)
  #         Paired t-test
  # data:  XY by group 
  # t = -1.5265, df = 9, p-value = 0.1612
  # 95 percent confidence interval:
  #  -2.1521562  0.4178885 
  # sample estimates: mean of the differences
  #                  -0.8671339 

  t.test(X,Y,paired=FALSE)
  #         Welch Two Sample t-test
  # data:  X and Y 
  # t = -1.4865, df = 17.956, p-value = 0.1545
  # 95 percent confidence interval:
  #  -2.0928836  0.3586159 
  # sample estimates: mean of x  mean of y
  #                  -0.2646042  0.6025297 

  t.test(XY~group,paired=FALSE)
  #         Welch Two Sample t-test
  # data:  XY by group 
  # t = -1.4865, df = 17.956, p-value = 0.1545
  # 95 percent confidence interval:
  #  -2.0928836  0.3586159 
  # sample estimates: mean in group 0 mean in group 1
  #                  -0.2646042       0.6025297 

So in each case t.test(XY~group,...) has done the same as
the corresponding t.test(X,Y,...); thus it would not seem
that "The formula interface is only applicable for the
2-sample tests" excludes specifying the two samples by
a formula with a 2-level factor; and it would seem that
the pairing has been correctly carried out (i.e. in the
formula case t.test assumes that it is by sequential
position within the respective group, without having to
be told as you suggest). In other words, it acts as though
giving a formula with a 2-group factor on the right, and
equal numbers in the two groups, is equivalent to giving
the two samples separately.

Johannes' original query was about differences when there
are NAs, corresponding to different settings of "na.action".
It is perhaps possible that 'na.action="na.pass"' and
'na.action="na.exclude"' result in different pairings in the
case "paired=TRUE". However, it seems to me that the differences
he observed are, shall we say, obscure!

Ted.

On 13-Aug-10 22:31:45, Thomas Lumley wrote:
> Thanks for the clear example. However, if there is a bug it is
> only that t.test.formula doesn't throw an error when given the
> paired=TRUE option.
> 
> The documentation says "The formula interface is only applicable
> for the 2-sample tests.",  but there probably should be an explicit
> check -- I didn't see that in the documentation until I realized
> that t.test.formula couldn't work for paired tests because you don't
> tell it which observations are paired.
>     -thomas
> 
> On Fri, 13 Aug 2010, Johannes W. Dietrich wrote:
>> Hello all,
>>
>> due to unexplained differences between statistical results from
>> collaborators 
>> and our lab that arose in the same large proteomics dataset we
>> reevaluated 
>> the t.test() function. Here, we found a weird behaviour that is also 
>> reproducible in the following small test dataset:
>>
>> Suppose, we have two vectors with numbers and some missing values that
>> refer 
>> to the same individuals and that should therefore be evaluated with a
>> paired 
>> t-test:
>>
>>>  testdata.A <- c(1.15, -0.2, NA, 1, -2, -0.5, 0.1, 1.2, -1.4, 0.01);
>>>  testdata.B <- c(1.2, 1.1, 3, -0.1, 3, 1.1, 0, 1.3, 4, NA);
>>
>> Then
>>
>>>  print(t.test(testdata.A, testdata.B, paired=TRUE,
>>>  alternative="two.sided", 
>>> na.action="na.pass"))
>>
>> and
>>
>>>  print(t.test(testdata.A, testdata.B, paired=TRUE,
>>>  alternative="two.sided", 
>>> na.action="na.exclude"))
>>
>> deliver the same p value (0.1162, identical to Excel's result).
>>
>> However, after combining the two vectors with
>>
>>>  testdata <- c(testdata.A, testdata.B);
>>
>> and defining a criterion vector with
>>
>>>  criterion <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1);
>>
>> (that is the type of data layout we have in our proteomics project) we
>> get a 
>> different p-value (0.01453) with
>>
>>>  print(t.test(testdata ~ criterion, paired=TRUE,
>>>  alternative="two.sided", 
>>> na.action="na.exclude")) .
>>
>> The statement
>>
>>>  print(t.test(testdata ~ criterion, paired=TRUE,
>>>  alternative="two.sided", 
>>> na.action="na.pass"))
>>
>> however, delivers a p-value of 0.1162 again.
>>
>> With
>>
>>>  print(t.test(testdata[criterion==0], testdata[criterion==1],
>>>  paired=TRUE, 
>>> alternative="two.sided", na.action="na.exclude"))
>>
>> that imitates the first form, we get again a p value of 0.1162.
>>
>> What is the reason for the different p values? Should not all calls to
>> t.test 
>> that exlude missing values be equivalent and therefore deliver the
>> same 
>> results?
>>
>> Excel, StatView and KaleidaGraph all display a p-value of 0.1162.
>>
>> J. W. D.
>> -- 
>> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>> -- --
>> -- Dr. Johannes W. Dietrich, M.D.
>> -- Laboratory XU44, Endocrine Research
>> -- Medical Hospital I, Bergmannsheil University Hospitals
>> -- Ruhr University of Bochum
>> -- Buerkle-de-la-Camp-Platz 1, D-44789 Bochum, NRW, Germany
>> -- Phone: +49:234:302-6400, Fax: +49:234:302-6403
>> -- eMail: "j.w.dietrich at medical-cybernetics.de"
>> -- WWW: http://medical-cybernetics.de
>> -- WWW: http://www.bergmannsheil.de
>> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>> -- --
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> Thomas Lumley
> Professor of Biostatistics
> University of Washington, Seattle
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Aug-10                                       Time: 00:49:57
------------------------------ XFMail ------------------------------