[R] "[.data.frame" and lapply

Mon Mar 30 06:09:42 CEST 2009

Folks:

I do not wish to agree or disagree with the criticisms of either the speed
or possible design flaws of "[". But let's at least see what the docs say
about the issues, using the simple example you provided:

    m = matrix(1:9, 3, 3)
    md = data.frame(m)

    md[1]
    # the first column
## as documented. This is because a data frame is a list of 3 columns of
## identical length, and this is how [ works for lists

    m[1]
    # the first element (i.e., m[1,1])
## as documented. A matrix is just a vector with a dim attribute and 
## this is how [ works for vectors

    md[,i=3]
    # third row
## See below

    m[,i=3]
    # third column
##  Correct, as documented in ?"[" for matrices, to wit:
"Note that these operations do not match their index arguments in the
standard way: argument names are ignored and positional matching only is
used. So m[j=2,i=1] is equivalent to m[2,1] and not to m[1,2]. "

## Note that the next lines immediately following say:

"This may not be true for methods defined for them; for example it is not
true for the data.frame methods described in [.data.frame. 

To avoid confusion, do not name index arguments (but drop and exact must be
named). "
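The quoted sentence is easy to check. The following is only a sketch, and it assumes a plain matrix with no class attribute, so that the default method is the one being used:

```r
m <- matrix(1:9, 3, 3)

# For a plain matrix the names on the index arguments are ignored;
# only their position counts.
a   <- m[j = 2, i = 1]   # read positionally: row 2, column 1
b   <- m[2, 1]
c12 <- m[1, 2]

identical(a, b)      # the names played no role
identical(a, c12)    # and the result is not m[1, 2]
```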

So, while it may be fair to characterize the md[,i=3] as a design flaw, it
is both explicitly pointed out and warned against. Note that, of course:

md[,3]
## 3rd column, good practice
md[,j=3]
## also 3rd column .. but warned against as bad practice
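For anyone who wants to see the three data frame cases side by side, here is a small sketch. The suppressWarnings() calls are my addition, on the assumption that some versions of R may warn about named index arguments; the variable names are only illustrative.

```r
m  <- matrix(1:9, 3, 3)
md <- data.frame(m)

col_unnamed <- md[, 3]                        # third column, good practice
col_named_j <- suppressWarnings(md[, j = 3])  # named j: still a column
row_named_i <- suppressWarnings(md[, i = 3])  # named i: a row, not a column

identical(col_named_j, col_unnamed)
identical(row_named_i, md[3, ])

# the matrix method ignores the name and gives the third column
identical(m[, i = 3], m[, 3])
```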

Whether a behavior should be considered a "bug" if it is explicitly warned
against in the docs, I leave for others to decide. Too deep for me. 

Cheers,
Bert

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Wacek Kusnierczyk
Sent: Friday, March 27, 2009 2:28 PM
To: Romain Francois fran; r-devel at r-project.org
Cc: R help
Subject: Re: [R] "[.data.frame" and lapply

redirected to r-devel, because there are implementational details of
[.data.frame discussed here.  spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
>
> Hi,
>
> This is a bug I think. [.data.frame treats its arguments differently
> depending on the number of arguments.

you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad.  r is a carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.

treating the arguments differently depending on their number is actually
(if clearly...) documented:  if there is one index (the 'i'), it selects
columns.  if there are two, 'i' selects rows.

however, not all seems fine, there might be a design flaw:

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
        data.frame(row.names=paste('row', 1:3, sep='.'),
           matrix(1:9, 3, 3)))

    d[1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as i, no j given

    d[,1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[,i=1:2]
    # correctly selects two first rows
    # 1:2 passed to [.data.frame as i, j given the missing argument
    # value (note the comma)

    d[j=1:2,]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[i=1:2]
    # correctly (arguably) selects the first two columns
    # 1:2 passed to [.data.frame as i, no j given

    d[j=1:2]
    # wrong: returns the whole data frame
    # does not recognize the index as i because it is explicitly named 'j'
    # does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is.  the r language definition (!) [1, sec. 3.4.3 p. 18] says:

"   The most important example of a class method for [ is that used for
data frames. It is not be described in detail here (see the help page for
[.data.frame, but in broad terms, if two indices are supplied (even if one
is empty) it creates matrix-like indexing for a structure that is basically
a list of vectors of the same length. If a single index is supplied, it is
interpreted as indexing the list of columns -- in that case the drop
argument is ignored, with a warning."

it does not say what happens when only one *named* index argument is
given.  from the above, it would indeed seem that there is a *bug*
here:  in the last example above only one index is given, and yet
columns are not selected, even though the *language definition* says
they should.  (so it's not a documented feature, it's a
contra-definitional misfeature -- a bug?)

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading;  just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

    m = matrix(1:9, 3, 3)
    md = data.frame(m)

    md[1]
    # the first column
    m[1]
    # the first element (i.e., m[1,1])

    md[,i=3]
    # third row
    m[,i=3]
    # third column

the quote above refers to the ?'[.data.frame' for details.
unfortunately, the help page is a lump of explanations for various
'['-like operators, and it is *not* a definition of any sort.  it does
not provide much more detail on '[.data.frame' -- it hardly serves as a
design specification.  in particular, it does not explain the issue of
named arguments to '[.data.frame' at all.

> `[.data.frame` only is called with two arguments in the second case, so
> the following condition is true:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>
> And then, the function assumes the argument it has been passed is i, and
> eventually calls NextMethod("[") which I think calls
> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>
> I think we can replace the condition with this one:
>
> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>
> or this:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>        if(has.j) i <- j
>

indeed, for a moment i thought a trivial fix somewhere there would
suffice.  unfortunately, the code for [.data.frame [2, lines 500-641] is
so clean and readable that i had to give up reading it, forget fixing. 
instead, i wrote a new version of '[.data.frame' from scratch.  it
fixes (or at least seems to fix, as far as my quick assessment goes) the
problem.  the function subdf (see the attached dataframe.r) is the new
version of '[.data.frame':

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
        data.frame(row.names=paste('row', 1:3, sep='.'),
           matrix(1:9, 3, 3)))

    d[j=1:2]
    # incorrect: the whole data frame

    subdf(d, j=1:2)
    # correct, only the first two columns

otherwise, subdf returns results equivalent (sensu all.equal;  see
below) to those returned by [.data.frame on the same input, modulo some
more or less minor details.  for example, i think the dropped-drop
warnings go wrong in the original:

    d[1, drop=FALSE]
    # warning: drop argument will be ignored

which suggests that dimensions will be dropped, while the intention is
that the actual argument will be ignored and the value will be FALSE
instead (while the default is TRUE, since i is specified).  well, it's
just one more confusing bit in r.  the rewritten version warns about
dropped drop only if it is explicitly TRUE:

    subdf(d, 1, drop=FALSE)
    # no warning
    subdf(d, 1, drop=TRUE)
    # warning
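
the warning from the original method can be captured rather than just read off the console.  a sketch, using only the base method (subdf is not needed here);  the exact wording of the message may differ between versions of r, so it is stored rather than compared against a fixed string:

```r
d <- data.frame(matrix(1:9, 3, 3))

# capture the text of the warning raised by a single index plus drop
msg <- tryCatch(d[1, drop = FALSE],
    warning = function(w) conditionMessage(w))

# the value that is returned once the warning is set aside
res <- suppressWarnings(d[1, drop = FALSE])

msg
class(res)    # the dimensions are kept despite the warning
```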

another issue that differs in my version is that i don't see much sense
in being able to select rows by indexing with NA:

    d[NA,1]
    # one row filled with NAs

    d[NA,]
    # data frame of the shape of d, filled with NAs

which is incoherent with how NAs are treated in column indices (i.e.,
raise an error).  the rewritten version raises an error if any element
of any index is an NA.
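
the asymmetry in the original method can be seen with a small sketch like this one (base r only;  the variable names are mine):

```r
d <- data.frame(matrix(1:9, 3, 3))

# NA as a row index is accepted
by_row <- d[NA, ]

# NA as a column index raises an error; keep its message for inspection
by_col <- tryCatch(d[, NA],
    error = function(e) conditionMessage(e))

dim(by_row)
all(is.na(by_row))
by_col
```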

these minor differences are easily modifiable should compliance with the
original 'design' be desirable.

interestingly, there is a reduction in code by some 40 lines (~30%) wrt.
the original, even though the new code is quite redundant (but so was
the original).  with a little effort, it can be compressed further,
but i felt it would become more convoluted and less readable, and also
less efficient.  procedural abstraction could help, but would also
negatively impact performance.  (presumably, an implementation in c
would run faster.)

incidentally (here's the best part!), my version seems to perform much
better than the original, at least in a limited set of naive
benchmarks.  here are some results, which you can (hopefully) reproduce
using the code in the attached test.r.  the data is a dummy df with 1k
rows and 1k columns, filled with rnorm;  each indexing was repeated 1000
times for both the original and the modified version:

   original patched ratio   test                                    
1  0.002    0.001      2.00 d[]                                     
2  0.027    0.001     27.00 d[drop = FALSE]                         
3  0.025    0.002     12.50 d[drop = TRUE]                          
4  0.026    0.002     13.00 d[, drop = FALSE]                       
5  0.026    0.003      8.67 d[, drop = TRUE]                        
6  1.274    0.002    637.00 d[, ]                                   
7  1.255    0.001   1255.00 d[, , ]                                 
8  1.183    0.001   1183.00 d[, , drop = FALSE]                     
9  1.183    0.003    394.33 d[, , drop = TRUE]                      
10 0.013    0.011      1.18 d[r]                                    
11 0.040    0.034      1.18 d[r, drop = TRUE]                       
12 0.037    0.010      3.70 d[r, drop = FALSE]                      
13 0.012    0.011      1.09 d[i = r]                                
14 0.036    0.034      1.06 d[i = r, drop = TRUE]                   
15 0.037    0.011      3.36 d[i = r, drop = FALSE]                  
16 0.222    0.163      1.36 d[rr]                                   
17 0.247    0.112      2.21 d[rr, drop = FALSE]                     
18 0.204    0.144      1.42 d[rr, drop = TRUE]                      
19 0.174    0.120      1.45 d[i = rr]                               
20 0.201    0.125      1.61 d[i = rr, drop = FALSE]                 
21 0.215    0.147      1.46 d[i = rr, drop = TRUE]                  
22 2.266    1.159      1.96 d[rr, ]                                 
23 2.236    1.164      1.92 d[rr, , drop = FALSE]                   
24 2.275    1.171      1.94 d[rr, , drop = TRUE]                    
25 2.269    1.165      1.95 d[i = rr, ]                             
26 2.264    1.155      1.96 d[i = rr, , drop = FALSE]               
27 2.290    1.189      1.93 d[i = rr, , drop = TRUE]                
28 2.301    1.198      1.92 d[, i = rr]                             
29 2.239    1.158      1.93 d[, i = rr, drop = FALSE]               
30 2.310    1.161      1.99 d[, i = rr, drop = TRUE]                
31 0.002    0.003      0.67 d[j = c]                                
32 0.026    0.011      2.36 d[j = c, drop = FALSE]                  
33 0.026    0.003      8.67 d[j = c, drop = TRUE]                   
34 0.001    0.111      0.01 d[j = cc]                               
35 0.025    0.110      0.23 d[j = cc, drop = FALSE]                 
36 0.025    0.111      0.23 d[j = cc, drop = TRUE]                  
37 0.243    0.051      4.76 d[rr, cc]                               
38 0.243    0.051      4.76 d[rr, cc, drop = FALSE]                 
39 0.244    0.050      4.88 d[rr, cc, drop = TRUE]                  
40 0.244    0.051      4.78 d[i = rr, cc]                           
41 0.243    0.050      4.86 d[i = rr, cc, drop = FALSE]             
42 0.244    0.051      4.78 d[i = rr, cc, drop = TRUE]              
43 0.243    0.052      4.67 d[cc, i = rr]                           
44 0.244    0.050      4.88 d[cc, i = rr, drop = FALSE]             
45 0.247    0.052      4.75 d[cc, i = rr, drop = TRUE]              
46 0.244    0.050      4.88 d[i = rr, j = cc]                       
47 0.244    0.051      4.78 d[i = rr, j = cc, drop = FALSE]         
48 0.244    0.051      4.78 d[i = rr, j = cc, drop = TRUE]          
49 0.244    0.051      4.78 d[j = cc, i = rr]                       
50 0.243    0.051      4.76 d[j = cc, i = rr, drop = FALSE]         
51 0.245    0.051      4.80 d[j = cc, i = rr, drop = TRUE]          
52 0.002    0.155      0.01 d[j = cn]                               
53 0.429    0.139      3.09 d[i = rn, j = cn]                       
54 1.791    0.690      2.60 d[i = c(TRUE, FALSE), j = c(FALSE, TRUE)]

(note:  the benchmark relies on a feature of rbenchmark that i have just
added, so you may need to download/update the package before trying.)
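
for those without the attachments, the shape of such a timing run can be sketched with base r alone.  this is not test.r:  it uses system.time instead of rbenchmark, it times only the original '[.data.frame', the indices rr and cc below are hypothetical stand-ins for the ones defined in test.r, and time_it is a helper of my own naming.  the number of repetitions is kept low to make it quick.

```r
set.seed(1)
n <- 1000                        # 1k rows and 1k columns, as above
d <- as.data.frame(matrix(rnorm(n * n), n, n))

# hypothetical indices; the ones in test.r may be different
rr <- sample(n, 100)             # row numbers
cc <- sample(n, 100)             # column numbers

# elapsed seconds for reps evaluations of an unevaluated expression
time_it <- function(expr, reps = 50) {
    env <- parent.frame()
    e <- substitute(expr)
    system.time(for (k in seq_len(reps)) eval(e, env))[['elapsed']]
}

timings <- c(
    'd[, ]'     = time_it(d[, ]),
    'd[rr, ]'   = time_it(d[rr, ]),
    'd[rr, cc]' = time_it(d[rr, cc])
)
print(timings)
```

to compare two implementations, the same expressions would be timed once per implementation and the ratio taken, as in the table.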

in some tests, the difference is two orders of magnitude; in some it's a
factor of 2-5;  in some there's no significant difference.  in only a
few cases, the original is way faster (e.g., tests 34 and 52), but this
is because the original is wrong there (it simply ignores the index, so
no wonder).

all the expressions above used in benchmarking were also used to test
the equivalence of output from the original and the new version (see
test.r again), and all of them were negative (no difference) -- except
for the cases where the original was wrong.

i'd consider making a patch for src/library/base/R/dataframe.R, but
there's a catch here:  it seems that some code relies on some part of the
'design' that differs between the rewrite and the original, and r does not
build with the new code (dataframe.R does, but then other sources fail).
anyway, sourcing the attached dataframe.R suffices for testing.

i will be happy to learn where my implementation, benchmarking, and/or
result checking are naive or wrong in any way, as they surely are.

vQ

[1] http://cran.r-project.org/doc/manuals/R-lang.pdf
[2] http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R