[Rd] data frame subscription operator

Vladimir Dergachev vdergachev at rcgardis.com
Wed Nov 8 17:13:54 CET 2006


On Wednesday 08 November 2006 3:21 am, Prof Brian Ripley wrote:
>
> > So far I was not able to figure out why this is necessary -
> > could anyone help ?
>
> You need to remove the class to avoid recursion: a few lines later x[i]
> needs to be a call to the primitive and not the data frame method.

I see. Is there a way to get at the primitive directly, i.e. something like
`[.list`(x, i) ?
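
To make the question concrete, here is a minimal sketch assuming a small hypothetical data frame `x`. It puts the approach described above (dropping the class before indexing) next to `.subset()`, which base R documents as behaving like `[` without method dispatch. Whether `.subset()` is a suitable replacement inside `[.data.frame` is an assumption, not something established in this thread.

```r
# Hypothetical two-column data frame, for illustration only
x <- data.frame(a = 1:3, b = c(4.5, 5.5, 6.5))

# Approach discussed above: remove the class so `[` reaches the primitive
cols_unclass <- unclass(x)[2]

# Possible alternative: .subset() indexes without dispatching on the class
cols_subset <- .subset(x, 2)

# Both results are lists holding column b
print(cols_unclass[["b"]])
print(cols_subset[["b"]])
```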

>
> > The reason I am looking at it is that changing attributes forces
> > duplication of the data frame and this is the largest cause of slowness
> > of data.frames in general.
>
> Do you have evidence of that?  R has facilities to profile its code, and I
> have never seen  [.data.frame taking a significant proportion of the total
> time.  If it does for your application, consider if a data frame is an
> appropriate way to store your data.  I am not sure we would accept that
> data frames do have 'slowness in general', but their generality does make
> them slower than alternatives where the generality is not needed.

Evidence:

	# this can be copy'n'pasted directly into an R session
	# small N - both system.time() calls report small, but comparable, running times
	N<-100000
	A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
	system.time(B<-A[,1])
	system.time(B<-A[1,1])


	#larger N - both times are larger and still comparable
	N<-1000000
	A<-data.frame(X=1:N, Y=rnorm(N), Z=as.character(rnorm(N)))
	system.time(B<-A[,1])
	system.time(B<-A[1,1])
        
The running times also grow with the number of columns. I have also
modified R 2.4.0 to print out large allocations, and I get the
impression that the data frame is being duplicated. The same happens for
`[<-.data.frame`, but that function has much more complex code and I have not
looked through it yet.

Of course, getting a small portion (e.g. A[1:5,]) also takes a lot of time -
but the examples shown above should be O(1).
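
One way to probe the O(1) expectation is to time the data frame method against direct extraction of the same column. The sketch below assumes `.subset2()` (documented in base R as `[[` without method dispatch) as the direct route; the sizes and names are illustrative, and the timings will vary by machine and R version.

```r
# Illustrative size; increase N to see how each timing scales
N <- 100000
A <- data.frame(X = 1:N, Y = rnorm(N))

# Column extraction through the data frame method
t_method <- system.time(B1 <- A[, 1])

# Column extraction without dispatch (assumed here as the O(1) baseline)
t_direct <- system.time(B2 <- .subset2(A, 1))

print(t_method)
print(t_direct)
```

If the two timings diverge as N grows, that would point at overhead in the method rather than in the extraction itself.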

My data is the result of a database query - it naturally has columns of different
types and the columns are named (no row.names though) - which is why I used
data.frames. What would you suggest ?

                    thank you very much !

                             Vladimir Dergachev
