[R] Odd behaviour of mean() with a numeric column in a tibble

Ista Zahn istazahn at gmail.com
Wed Dec 7 01:33:36 CET 2016


On Tue, Dec 6, 2016 at 5:10 PM, Chris Evans <chrishold at psyctc.org> wrote:
> {{SIGH}}
>
> You are absolutely right.
>
> I wonder if I am losing some cognitive capacities that are needed to be part of the evolving R community. It seems to me that if a tibble is designed to be an enhanced replacement for a dataframe then it shouldn't quite so radically change things.

Well, there are some things about data frames that are darn annoying,
and tibbles exist partly as an attempt to eliminate some of the
inconsistencies with data.frames. That necessarily means changing
things.

>
> I notice that the documentation on tibble says "[ Never simplifies (drops), so always returns data.frame"
> That is much less explicit than I would have liked and actually doesn't seem to be true. In fact, as you rightly say, it generally, but not quite always, returns a tibble. In fact it can be fooled into a vector of length 1.

Really? How?

>
>> tmpTibble[[1,]]
> Error in `[[.data.frame`(tmpTibble, 1, ) :
> argument "..2" is missing, with no default

That doesn't have anything to do with tibbles:

as.data.frame(tmpTibble)[[1, ]]

gives the same thing.

>
>> tmpTibble[1]
> # A tibble: 26 × 1
> ID
> <chr>
> 1 a
> 2 b
> 3 c
> 4 d
> 5 e
> 6 f
> 7 g
> 8 h
> 9 i
> 10 j
> # ... with 16 more rows

Again, just what you expect from a data.frame (except for the print method).

>> tmpTibble[,1]
> # A tibble: 26 × 1
> ID
> <chr>
> 1 a
> 2 b
> 3 c
> 4 d
> 5 e
> 6 f
> 7 g
> 8 h
> 9 i
> 10 j
> # ... with 16 more rows

That is different, and by design as you noted. It is different from
data.frame indexing, but the data.frame behavior is needlessly
complicated. Sometimes you get a vector, sometimes a data.frame. That
hardly seems worth it given that we already have $ or [[ if you really
wanted a vector.
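
For what it's worth, here is a small base R sketch of the inconsistency
I mean. It uses a plain data.frame called df as a stand-in for tmpTibble
(df is just an illustrative name, not something from your example), so it
runs without tibble loaded:

```r
# Base R only; `df` is a hypothetical stand-in for tmpTibble
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

one_col  <- df[, 1]                # a single column is dropped to a vector
two_cols <- df[, 1:2]              # two columns stay a data.frame
kept     <- df[, 1, drop = FALSE]  # drop = FALSE keeps the data.frame

is.data.frame(one_col)   # FALSE
is.data.frame(two_cols)  # TRUE
is.data.frame(kept)      # TRUE

# When a vector is what you want, ask for it explicitly
identical(df[[1]], df$ID)  # TRUE
```

With drop = FALSE the data.frame gives the same kind of result that
tmpTibble[,1] gives above, which is essentially the difference you ran
into.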

>> tmpTibble[1,]
> Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 3 is a matrix/data frame of 26 rows, need 1
> In addition: Warning messages:
> 1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 1 has 26 rows to replace 1 rows
> 2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 2 has 26 rows to replace 1 rows

That's not what I get.

> tmpTibble[1,]
# A tibble: 1 × 2
    ID   num
 <chr> <int>
1     a     1

works just as I would expect here.

>> tmpTibble[1,1:26]
> Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

Other than providing more information about what went wrong this is
the same as data.frame:

> as.data.frame(tmpTibble)[1,1:26]
Error in `[.data.frame`(as.data.frame(tmpTibble), 1, 1:26) :
 undefined columns selected

>> tmpTibble[[1,2]]
> [1] 1

Same as data.frame (and not at odds with the documentation, which
says that [ (not [[ ) always returns a data.frame).

>> str(tmpTibble[[1,2]])
> int 1
>> str(tmpTibble[[1:2,2]])
> Error in col[[i, exact = exact]] :
> attempt to select more than one element in vectorIndex

Same behavior as data.frame.

>>
>> tmpTibble[[1,1:2]]
> [1] "b"
>>

Same behavior as data.frame.
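
My understanding of why you get "b" here (this is my reading of how
`[[.data.frame` works in base R, so treat it as a sketch rather than
documented behaviour) is that the column index 1:2 is used for recursive
list indexing: take column 1, then element 2 of that column. Again df is
only an illustrative base R stand-in for tmpTibble:

```r
# Base R only; `df` is a hypothetical stand-in for tmpTibble
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

res <- df[[1, 1:2]]

# A length-2 index to [[ on a list is applied recursively:
# column 1 first, then element 2 of that column
identical(df[[c(1, 2)]], df[[1]][[2]])  # TRUE

# So the result is the second ID, not row 1 of column 2
identical(res, df$ID[2])                # TRUE
```

If that reading is right, it is not the same thing as [[1,2]], which
gave the integer 1 above.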
>
> So [[a,b]] works if a and b are legal with the dimensions of the tibble and if a is of length 1 but returns NOT a tibble but a vector of length 1 (I think), I can see that's logical but not what it says in the documentation.

In what documentation? The documentation that says [ always returns a
data.frame? Note that [ and [[ are not the same, and only [ is
documented to always return a data.frame.
>
> [[a]] and [[,a]] return the same result, that seems excessively tolerant to me.

Not for me:

> tmpTibble[[1]]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> tmpTibble[[, 1]]
Error in `[[.data.frame`(tmpTibble, , 1) :
 argument "..1" is missing, with no default

(this is the same thing that happens with a data.frame)
>
> [[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a tibble.

That is weird, but not different from data.frame. See above regarding
"NOT a tibble".

>
> And row subsetting/indexing has gone.

Whatever do you mean?

> tmpTibble[tmpTibble$ID == "d", ]
# A tibble: 1 × 2
    ID   num
 <chr> <int>
1     d     4

>
> Why create replacement for a dataframe that has no row indexing and so radically redefines column indexing, in fact redefines the whole of indexing and subsetting?

It has row indexing, and besides [, x] not dropping dimension it works
pretty much the same.
>
> OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. Perhaps Prof. Wickham or someone can spell out a bit less tersely, and I think incompletely, than the tibble documentation does, why all this is good.

Most of the things you identify here are issues inherited from
data.frame, and not due to differences between tibbles and
data.frames.
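
To tie this back to your original mean() question, here is a base R
sketch that reproduces the same results without tibble. The names df,
col_df and col_vec are mine, and the one-column data.frame made with
drop = FALSE is only standing in for what tmpTibble[,2] returns:

```r
# Base R only; `df` is a hypothetical stand-in for tmpTibble
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

col_df  <- df[, 2, drop = FALSE]  # one-column data.frame, like tmpTibble[,2]
col_vec <- df[[2]]                # plain integer vector

min_df <- min(col_df)   # works on a data.frame: 1

# mean() warns and returns NA when handed a data.frame
mean_df <- suppressWarnings(mean(col_df))

# median() stops with an error when handed a data.frame
median_fails <- inherits(try(median(col_df), silent = TRUE), "try-error")

# Both are fine once the column is extracted as a vector
mean_vec   <- mean(col_vec)    # 13.5
median_vec <- median(col_vec)  # 13.5
```

So the fix is to extract the column with [[ or $ before passing it to
mean() or median().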

Best,
Ista

>
> Thanks anyway Ista, you certainly hit the issue!
>
> Very best all,
>
> Chris
>
>> From: "Ista Zahn" <istazahn at gmail.com>
>> To: "Chris Evans" <chrishold at psyctc.org>
>> Cc: "r-help at r-project.org" <r-help at r-project.org>
>> Sent: Tuesday, 6 December, 2016 21:40:41
>> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble
>
>> Not at a computer to check right now, but I believe single bracket indexing a
>> tibble always returns a tibble. To extract a vector use [[
>
>> On Dec 6, 2016 4:28 PM, "Chris Evans" < chrishold at psyctc.org > wrote:
>
>>> I hope I am obeying the list rules here. I am using a raw R IDE for this and
>> > running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)
>
>> > Here is a reproducible example. Code only first
>
>> > require(tibble)
>> > tmpTibble <- tibble(ID=letters,num=1:26)
>> > min(tmpTibble[,2]) # fine
>> > max(tmpTibble[,2]) # fine
>> > median(tmpTibble[,2]) # not fine
>> > mean(tmpTibble[,2]) # not fine
>
>> I think you want
>
>> mean(tmpTibble[[2]])
>
>> > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>> > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>> > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>> > newMedianFun(tmpTibble[,2]) # ditto
>> > str(tmpTibble[,2])
>
>> > ### then I tried this to make sure it wasn't about having fed in integers
>
>> > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>> > tmpTibble2
>> > mean(tmpTibble2[,3]) # not fine, not about integers!
>
>
>>> ### before I just created tmpTibble2 I found myself trying to add a column to
>> > tmpTibble
>> > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>> > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>> > ### and oddly enough ...
>> > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>
>> > Now here it is with the output:
>
>> > > require(tibble)
>> > Loading required package: tibble
>> > > tmpTibble <- tibble(ID=letters,num=1:26)
>> > > min(tmpTibble[,2]) # fine
>> > [1] 1
>> > > max(tmpTibble[,2]) # fine
>> > [1] 26
>> > > median(tmpTibble[,2]) # not fine
>> > Error in median.default(tmpTibble[, 2]) : need numeric data
>> > > mean(tmpTibble[,2]) # not fine
>> > [1] NA
>> > Warning message:
>> > In mean.default(tmpTibble[, 2]) :
>> > argument is not numeric or logical: returning NA
>> > > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>> > > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>> > [1] 13.5
>> > > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>> > > newMedianFun(tmpTibble[,2]) # ditto
>> > [1] 13.5
>> > > str(tmpTibble[,2])
>> > Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 26 obs. of 1 variable:
>> > $ num: int 1 2 3 4 5 6 7 8 9 10 ...
>
>> > > ### then I tried this to make sure it wasn't about having fed in integers
>
>> > > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>> > > tmpTibble2
>> > # A tibble: 26 × 3
>> > ID num num2
>> > <chr> <int> <dbl>
>> > 1 a 1 0.1
>> > 2 b 2 0.2
>> > 3 c 3 0.3
>> > 4 d 4 0.4
>> > 5 e 5 0.5
>> > 6 f 6 0.6
>> > 7 g 7 0.7
>> > 8 h 8 0.8
>> > 9 i 9 0.9
>> > 10 j 10 1.0
>> > # ... with 16 more rows
>> > > mean(tmpTibble2[,3]) # not fine, not about integers!
>> > [1] NA
>> > Warning message:
>> > In mean.default(tmpTibble2[, 3]) :
>> > argument is not numeric or logical: returning NA
>
>
>>> > ### before I just created tmpTibble2 I found myself trying to add a column to
>> > > tmpTibble
>> > > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>> > > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>> > > ### and oddly enough ...
>> > > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>> > Error: Each variable must be a 1d atomic vector or list.
>> > Problem variables: 'newNum'
>
>
>
>>> I discovered this when I hit odd behaviour after using read_spss() from the
>>> haven package for the first time as it seemed to be offering a step forward
>>> over good old read.spss() from the excellent foreign package. I am reporting it
>>> here not directly to Prof. Wickham as the issues seem rather general though I'm
>>> guessing that it needs to be fixed with a fix to tibble. Or perhaps I've
>> > completely missed something.
>
>> > TIA,
>
>> > Chris
>
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>


