[Rd] speeding up perception

ivo welch ivo.welch at gmail.com
Mon Jul 4 07:19:53 CEST 2011


thank you, simon.  this was very interesting indeed.  I also now
understand how far out of my depth I am here.

fortunately, as an end user, obviously, *I* now know how to avoid the
problem.  I particularly like the as.list() transformation and back to
as.data.frame() to speed things up without loss of (much)
functionality.


more broadly, I view the avoidance of individual access through the
use of apply and vector operations as a mixed "IQ test" and "knowledge
test" (which I often fail).  However, even for the most clever, there
are also situations where the KISS programming principle makes
explicit loops still preferable.  Personally, I would have preferred
it if R had, in its standard "statistical data set" data structure,
foregone the row names feature in exchange for retaining fast direct
access.  R could have reserved its current implementation "with row
names but slow access" for a less common (possibly pseudo-inheriting)
data structure.


If end users commonly do iterations over a data frame, which I would
guess to be the case, then the impression of R by (novice) end users
could be greatly enhanced if the extreme penalties could be eliminated
or at least flagged.  For example, I wonder if modest special internal
code could store data frames internally and transparently as lists of
vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
but specific warning message could be issued with a suggestion if
there is an individual read/write into a data frame ("Warning: data
frames are much slower than lists of vectors for individual element
access").


I would also suggest changing the "Introduction to R" 6.3  from "A
data frame may for many purposes be regarded as a matrix with columns
possibly of differing modes and attributes. It may be displayed in
matrix form, and its rows and columns extracted using matrix indexing
conventions." to "A data frame may for many purposes be regarded as a
matrix with columns possibly of differing modes and attributes. It may
be displayed in matrix form, and its rows and columns extracted using
matrix indexing conventions.  However, data frames can be much slower
than matrices or even lists of vectors (which, like data frames, can
contain different types of columns) when individual elements need to
be accessed."  Reading about it immediately upon introduction could
flag the problem in a more visible manner.


regards,

/iaw



More information about the R-devel mailing list