[R] strange behaviour of median

(Ted Harding) Ted.Harding at manchester.ac.uk
Thu Feb 4 11:53:22 CET 2010


On 04-Feb-10 09:58:36, Petr PIKAL wrote:
> Hi
> so do you think I shall fire a bug announcement? I think I rather
> wait to see if there is some reaction from others. Maybe, there
> is some reason behind such behaviour. Those simple statistics tend
> to behave differently when operating on data.frames so median is
> not such a huge surprise.
> 
> see
> 
> sd(df1), var(df1), mean(df1), max(df1), min(df1), range(df1)
> 
> Produced results are usually clearly documented,

Yes, in the case of sd() and mean() it is clearly stated what
happens if the argument is a dataframe: it is the value of the
function as applied to each column separately. For var() it is
also clearly stated that when applied to a matrix it returns
the covariances between columns (with, presumably, dataframes
implicitly converted to matrix). For max() and min() it is
clearly stated "the maximum or minimum of _all_ the values
present in their arguments"; for range() it is not so clear
but is similar: "a vector containing the minimum and maximum
of all the given arguments", and you have to experiment to
verify that it is apparently intended to be the same as
c(min(...),max(...)):

  range(c(1,4,7),c(2,5,8),c(3,6,9))
  # [1] 1 9
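The other documented behaviours can be checked in the same
experimental way. A minimal sketch, assuming the 4 x 8 data frame
df1 from the example quoted further down, and assuming an R version
in which var(), max(), min() and range() accept an all-numeric
data frame:

```r
# Same construction as the 4 x 8 example quoted later in this message.
mat <- matrix(1:32, 4, 8)
df1 <- data.frame(mat)

# var() treats the columns as variables and returns their covariance matrix.
v <- var(df1)
dim(v)
# [1] 8 8

# max(), min() and range() pool every value in the data frame.
max(df1)
# [1] 32
min(df1)
# [1] 1
range(df1)
# [1]  1 32
```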

However, for median there is no such statement, compared with
what is stated for mean():

  '?mean':
  "For a data frame, a named vector with the appropriate
   method being applied column by column."

  '?median':
  "The default method returns a length-one object of the
   same type as 'x'"

(which is a bit cryptic).

It is possible that the behaviour of mean() with dataframes
is intended as an "add-on": if mean() is applied to a matrix,
you get the mean of all the values in the matrix. For dataframes,
there seems to be a special "mean" method which causes the
standard mean() to be applied separately to each column. This
is not the case with any of the other functions above.
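
A sketch of the distinction, written with mean() on the matrix and
colMeans() on the data frame so that it does not depend on a
data-frame method for mean() being present in the installed version
of R (an assumption worth checking locally):

```r
mat <- matrix(1:32, 4, 8)
df1 <- data.frame(mat)

# On a matrix, mean() pools all the values.
mean(mat)
# [1] 16.5

# Column-by-column means, which is what ?mean describes for a data frame.
colMeans(df1)
#   X1   X2   X3   X4   X5   X6   X7   X8 
#  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5 
```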

Quite why mean() was specially designed in this way in the first
place is another question (presumably to match up with the
behaviour of sd(), so that you can represent each column of
a dataframe by its (mean,sd) pair??); but it was, and there it is,
and it is useful.

> however for novice it is rather mysterious why using those functions
> on vector produce easily understandable results but using them on
> data.frame (which is most common structure of data) is far from
> consistent and intuitive.
> 
> But I agree with you that mean and median in best case shall give
> similar results regarding results structure.

Absolutely! Mean and median are, from the interpretative point
of view, essentially the same: a "measure of central tendency",
albeit computed in different ways and with somewhat different
properties. But any user will expect that whenever a mean (or a
set of means) can be computed using mean(), a similar median
(or set of medians) would result from using median().

Of course, one way round this gross anomaly between mean() and
median() would be to ignore the special behaviour of mean()
when applied to dataframes, and simply use an appropriate
"apply", just as one would for sd(), var() (if interested
in the variance of each column), max(), min() and range().
And this would then work for median().
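
A minimal sketch of that approach, using sapply() (one of several
possible "apply" variants) on the 4 x 8 data frame from the quoted
example:

```r
mat <- matrix(1:32, 4, 8)
df1 <- data.frame(mat)

# A data frame is a list of columns, so sapply() visits each column in turn.
sapply(df1, median)
#   X1   X2   X3   X4   X5   X6   X7   X8 
#  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5 

sapply(df1, sd)      # one standard deviation per column
sapply(df1, range)   # 2 x 8 matrix: row 1 holds minima, row 2 maxima
```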

But, despite all that, the fact that median() produces so
meaningless a result for a dataframe is undoubtedly a bug,
in my opinion. Either median() should produce the median
of all the values present (like max(), min()), or it should
behave like mean() and sd(). I would prefer the latter.
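
The two candidate behaviours can be sketched as follows for the
odd-column case. The helper names median_all and median_by_column
are hypothetical, introduced only for illustration; neither is part
of R:

```r
mat <- matrix(1:28, 4, 7)   # the 4 x 7 case from the quoted example
df1 <- data.frame(mat)

# Hypothetical helper: pool every value, as max() and min() do.
median_all <- function(d) median(unlist(d))

# Hypothetical helper: one median per column, as documented for mean().
median_by_column <- function(d) sapply(d, median)

median_all(df1)
# [1] 14.5
median_by_column(df1)
#   X1   X2   X3   X4   X5   X6   X7 
#  2.5  6.5 10.5 14.5 18.5 22.5 26.5 
```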

However, like you, I prefer to wait for comments from others
before a bug report is filed -- it is just possible that
there is an important reason why median() should behave
as it does, though I cannot imagine what it might be!

Ted.

> Regards
> Petr
> 
> r-help-bounces at r-project.org napsal dne 04.02.2010 10:28:16:
> 
>> Well, I get the same as Petr with  R version 2.10.0 (2009-10-26)
>> on Linux.
>> 
>> To me, this suggests that median is broken! Any user would,
>> a priori, expect that median() should operate in exactly
>> the same way as mean(). To extend Petr's example:
>> 
>>   mat <- matrix(1:32, 4,8)
>>   df1 <- data.frame(mat)
>>   mean(df1)
>>   #   X1   X2   X3   X4   X5   X6   X7   X8 
>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5 
>>   median(df1)
>>   # [1] 14.5 18.5
>> 
>> so (as in Petr's original example, but more clearly) median()
>> returns the medians of the two "central" columns X4 and X5 of df1.
>> 
>> But that is with an even number of columns. Now look at what
>> happens with an odd number:
>> 
>>   mat <- matrix(1:28, 4,7)
>>   df1 <- data.frame(mat)
>>   mean(df1)
>>   #   X1   X2   X3   X4   X5   X6   X7 
>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 
>>   median(df1)
>>   #   structure(c("13", "14", "15", "16"), class = "AsIs")
>>   # 1                                                   13
>>   # 2                                                   14
>>   # 3                                                   15
>>   # 4                                                   16
>> 
>> Wow!!!!!!!!!!
>> 
>> This does suggest a tie-in with Petr's observation about "As.Is",
>> and there is no doubt at all that the above result is rubbish.
>> It is certainly not what a user would expect, and in the context
>> of Petr's intention to present R lessons to a class, I could
>> foresee students turning their backs on R if they came up with
>> such a result in their early encounters!
>> 
>> Ted.
>> 
>> On 04-Feb-10 08:59:59, Mario Valle wrote:
>> > Linux 2.9.0 gives:
>> > 
>> >> median(df1)
>> > [1] 34
>> > 
>> > Ever stranger...
>> >               mario
>> > 
>> > Petr PIKAL wrote:
>> >> During some experimentation in preparing R lessons I encountered
>> >> this behaviour which I can not explain fully
>> >> 
>> >> mat <- matrix(1:16, 4,4)
>> >> df1 <- data.frame(mat)
>> >> 
>> >>> mean(df1)
>> >>   X1   X2   X3   X4 
>> >>  2.5  6.5 10.5 14.5 
>> >> 
>> >> Expected, documented
>> >> 
>> >>> median(df1)
>> >> [1]  6.5 10.5
>> >> 
>> >> Rather weird, AFAIK there shall not be an issue with data frame at
>> >> least I did not find any in help page. I tracked it down probably
>> >> to an As.Is operation with object and subsequent sorting in
>> >> median.default.
>> >> 
>> >> I know other (*apply) ways how to compute median for data frames so
>> >> I just would like to hear an opinion about this behaviour from more
>> >> experienced people.
>> >> 
>> >> Thank you
>> >> Best regards
>> >> 
>> >> Petr
>> >> 
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> > 
>> > -- 
>> > Ing. Mario Valle
>> > Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
>> > Swiss National Supercomputing Centre (CSCS)      | Tel:  +41 (91) 610.82.60
>> > v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax:  +41 (91) 610.82.82
>> > 
>> 
>> --------------------------------------------------------------------
>> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
>> Fax-to-email: +44 (0)870 094 0861
>> Date: 04-Feb-10                                       Time: 09:28:13
>> ------------------------------ XFMail ------------------------------
>> 
> 

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Feb-10                                       Time: 10:53:19
------------------------------ XFMail ------------------------------
