[Rd] Some R questions

Thu Nov 2 04:07:19 CET 2006

On 11/2/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> The slowness manifests itself for vectorized code as well. I believe it is due
> to the code mucking about with row.names attribute which introduces a penalty
> on any [,] operation - penalty that grows linearly with the number of rows.
>
> Thus for large data frames   A[,1] is slower than A[[1]]. For example, for the
> data frame I mentioned above E<-A[[1]] took 0.46 seconds (way too much in my
> opinion), but E<-A[,1] took 62.45 seconds - more than a minute and more than
> twice the time it took to load the entire thing into memory. Silly, isn't
> it ?
>
> Also, there are good reasons to want to address individual cells. And there is
> no reason why such access cannot be constant time.
Yeah, it should be O(1), because a data frame is just a list of vectors
and everything is in memory: index the column in the list, then the
row in the vector. For non-vectorized code, the problem is more the
loop overhead (maintaining loop variables), which is handled in R instead
of in C.
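
A minimal sketch of the "column in the list, then row in the vector"
access path, on a small made-up data frame (the names A, i, via_list and
via_frame are illustrative only; no timings are claimed here, since they
depend on the R version and the size of the data):

```r
# Small example data frame; the one discussed in the thread is far larger.
A <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"),
                stringsAsFactors = FALSE)
i <- 2L

# A data frame is a list of column vectors, so a single cell can be read
# by picking the column from the list and then the element from the vector.
via_list <- A[[1]][i]

# The same cell through the data-frame `[` method, which does extra work.
via_frame <- A[i, 1]

# Both routes return the same value.
stopifnot(identical(via_list, via_frame))
```

To compare the two routes on a large data frame, each line can be wrapped
in system.time().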

> > <pimp-my-project>
> > Or, you may just use (and pour your effort on improving) SQLiteDF
> > http://cran.r-project.org/src/contrib/Descriptions/SQLiteDF.html
> > </pimp-my-project>
>
> Very nice ! The documentation mentioned something about assignment operator
> not working - is this still true ? Or, maybe, I misunderstood something ?
Yes, unfortunately, there is still no [<- operator. A data frame can be
mutated in as many ways as it can be indexed (or subscripted), so there
is a lot to cover. There are many other things more "fun" to code than
that (graphics!, extending the sqlite syntax, R expression evaluation),
but I'll work on it on the weekend.

> Also, I wonder whether it would be possible to extend [[ operator so one can
> run queries: SQLDF[["SELECT * FROM a WHERE.."]]
That has been suggested before, but in retrospect this can be achieved
more "poetically" as

sdf[sdf$a > 3 & sdf$b == "i", ]    # where a > 3 and b == 'i'

although not as efficient. I have been thinking of adding a method like

select(sdf, select=<select_clause>, where=<where_clause>, order_by=<order_by_clause>)

so that sum(sdf$a) can be done with just select(sdf, "sum(a)") instead
of going through .Call("..."). It could also optimize things: for
example, with(sdf, a+b) could be done with select(sdf, "a+b").
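
To make the idea concrete, here is a rough sketch of what such a call
could look like. It is NOT SQLiteDF code: select_sketch is a hypothetical
helper that works on an ordinary in-memory data.frame and evaluates R
expressions rather than SQL, so it shows only the calling convention,
not the pushed-down evaluation that would make it efficient.

```r
# Hypothetical helper, not part of SQLiteDF. It mimics the proposed
# select() on a plain data.frame; clauses are R expressions given as text.
select_sketch <- function(df, select, where = NULL) {
  if (!is.null(where)) {
    # Evaluate the where clause with the columns visible as variables.
    keep <- eval(parse(text = where), envir = df, enclos = parent.frame())
    df <- df[!is.na(keep) & keep, , drop = FALSE]
  }
  # Evaluate the select clause against the (possibly filtered) columns.
  eval(parse(text = select), envir = df, enclos = parent.frame())
}

df <- data.frame(a = 1:4, b = c(10, 20, 30, 40))

total    <- select_sketch(df, "sum(a)")                  # like sum(df$a)
row_sums <- select_sketch(df, "a + b")                   # like with(df, a + b)
filtered <- select_sketch(df, "sum(b)", where = "a > 2") # b summed where a > 2

stopifnot(identical(total, sum(df$a)))
stopifnot(identical(row_sums, with(df, a + b)))
```

A real version would hand the clauses to sqlite as SQL instead of
evaluating them in R, which is where the optimization would come from.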

M. Manese