[Rd] [datatable-help] speeding up perception

Tue Jul 12 12:24:03 CEST 2011

> Matthew,
>
> I was hoping I misunderstood your first proposal, but I suspect I did not
> ;).
>
> Personally, I find  DT[1,V1 <- 3] highly disturbing - I would expect it to
> evaluate to
> { V1 <- 3; DT[1, V1] }
> thus returning the first element of the third column.

Please see FAQ 1.1. The points further below also seem to come down to
expectations about 'with' syntax.

>
> That said, I don't think it works, either. Taking your example and
> data.table from R-Forge:
[ snip ]
> as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try the latest R later. If I can't reproduce
the failure, I'll need some more information about your environment, please.

> Also I suspect there is something quite amiss because even trivial things
> don't work:
>
>> DF[1:4,1:4]
>   V1 V2 V3 V4
> 1  3  1  1  1
> 2  1  1  1  1
> 3  1  1  1  1
> 4  1  1  1  1
>> DT[1:4,1:4]
> [1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
and 1.10.
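
A rough base-R analogy for those FAQ entries, offered as a sketch rather
than as data.table's actual implementation: if j is treated as an
expression evaluated with the table's columns in scope, an expression
that names no column simply evaluates to itself. The names j_const and
j_col are made up for illustration, and other data.table versions may
behave differently.

```r
# Base-R analogy only; this is not how data.table is implemented.
DF <- data.frame(V1 = c(3, 1, 1, 1), V2 = rep(1, 4))

# An expression that mentions no column evaluates to itself ...
j_const <- eval(quote(1:4), DF[1:4, ], parent.frame())

# ... while an expression that names a column sees that column.
j_col <- eval(quote(V1), DF[1:4, ], parent.frame())

print(j_const)  # [1] 1 2 3 4
print(j_col)    # [1] 3 1 1 1
```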

>
> When I first saw your proposal, I thought you rather had something like
> within(DT, V1[1] <- 3)
> in mind which looks innocent enough but performs terribly (note that I had
> to scale down the loop by a factor of 100!!!):
>
>> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
>    user  system elapsed
>   2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of
building 'within' in, too. I'll take a look at within(). Might as well
provide as many options as possible to the user to use as they wish.

> With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3)
> performs reasonably:
>
>> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
>    user  system elapsed
>   0.392   0.613   1.003
>
> (Note: system.time() can be misleading when within() is involved, because
> the expression is evaluated in a different environment so within() won't
> actually change the object in the  global environment - it also interacts
> with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the
original issue Ivo raised, then?  If so, job done.
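
To make Simon's note concrete, here is a small base-R sketch (not from
the thread; DF_small and res are made-up names): within() on a
data.frame returns a modified copy, so the original only changes if the
result is assigned back.

```r
DF_small <- data.frame(V1 = rep(1, 5), V2 = rep(1, 5))

# within() evaluates the expression against a copy and returns that copy.
res <- within(DF_small, V1[1] <- 3)

before <- DF_small$V1[1]   # the original is untouched
after  <- res$V1[1]        # the returned copy carries the change

# To keep the change, assign the result back.
DF_small <- within(DF_small, V1[1] <- 3)

print(c(before = before, after = after, reassigned = DF_small$V1[1]))
```

This is why timing within() without the assignment measures the work but
leaves the global object as it was.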

>
> Cheers,
> Simon
>
> On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
>
>> Thanks for the replies and info. An attempt at fast
>> assign is now committed to data.table v1.6.3 on
>> R-Forge. From NEWS :
>>
>> o   Fast update is now implemented, FR#200.
>>    DT[i,j]<-value is now handled by data.table in C rather
>>    than falling through to data.frame methods.
>>
>>    Thanks to Ivo Welch for raising speed issues on r-devel,
>>    to Simon Urbanek for the suggestion, and Luke Tierney and
>>    Simon for information on R internals.
>>
>>    [<- syntax still incurs one working copy of the whole
>>    table (as of R 2.13.0) due to R's [<- dispatch mechanism
>>    copying to `*tmp*`, so, for ultimate speed and brevity,
>>    'within' syntax is now available as follows.
>>
>> o   A new 'within' argument has been added to [.data.table,
>>    by default TRUE. It is very similar to the within()
>>    function in base R. If an assignment appears in j, it
>>    assigns to the column of DT, by reference; e.g.,
>>
>>    DT[i,colname<-value]
>>
>>    This syntax makes no copies of any part of memory at all.
>>
>>> m = matrix(1,nrow=100000,ncol=100)
>>> DF = as.data.frame(m)
>>> DT = as.data.table(m)
>>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>>       user  system elapsed
>>    287.730 323.196 613.453
>>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>>       user  system elapsed
>>      1.152   0.004   1.161         # 528 times faster
>>
>> Please note :
>>
>>    *******************************************************
>>    **  Within syntax is presently highly experimental.  **
>>    *******************************************************
>>
>> http://datatable.r-forge.r-project.org/
>>
>>
>> On Wed, 2011-07-06 at 09:08 -0500, luke-tierney at uiowa.edu wrote:
>>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>>
>>>> Interesting, and I stand corrected:
>>>>
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> .Internal(inspect(x))
>>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>> @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>> @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>>> x[1,1]=42L
>>>>> .Internal(inspect(x))
>>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>> @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>> @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>>> x[[1]][1]=42L
>>>>> .Internal(inspect(x))
>>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>> @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>> @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>>> x[[1]][1]=42L
>>>>> .Internal(inspect(x))
>>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>> @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>> @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>>
>>>> I have R to release ;) so I won't be looking into this right now, but
>>>> it's something worth investigating ... Since all the inner contents
>>>> have NAMED=0 I would not expect any duplication to be needed, but
>>>> it apparently becomes so at some point ...
>>>
>>>
>>> The internals assume in various places that deep copies are made (one
>>> of the reasons NAMED settings are not propagated to sub-structure).
>>> The main issues are avoiding cycles and that there is no easy way to
>>> check for sharing.  There may be some circumstances in which a shallow
>>> copy would be OK but making sure it would be in all cases is probably
>>> more trouble than it is worth at this point. (I've tried this in the
>>> past in a few cases and always had to back off.)
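
At the R level, the guarantee that these copies protect can be sketched
as follows. This is a base-R illustration of the visible semantics only,
not of the internals, and x and y are made-up names: a modification made
through one binding must never be visible through another.

```r
x <- data.frame(a = 1:5, b = 1:5)
y <- x            # y starts out with the same value as x

y$a[1] <- 42L     # modifying y must not alter x

print(x$a[1])     # [1] 1
print(y$a[1])     # [1] 42
```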
>>>
>>>
>>> Best,
>>>
>>> luke
>>>
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>>
>>>>>
>>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>>> No subassignment function satisfies that condition, because you can
>>>>>> always call them directly. However, that doesn't stop the default
>>>>>> method from making that assumption, so I'm not sure it's an issue.
>>>>>>
>>>>>> David, Just to clarify - the data frame content is not copied, we
>>>>>> are talking about the vector holding columns.
>>>>>
>>>>> If it is just the vector holding the columns that is copied (and not
>>>>> the columns themselves), why does n make a difference in this test
>>>>> (on R 2.13.0)?
>>>>>
>>>>>> n = 1000
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>  user  system elapsed
>>>>> 0.628   0.000   0.628
>>>>>> n = 100000
>>>>>> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>  user  system elapsed
>>>>> 20.145   1.232  21.455
>>>>>>
>>>>>
>>>>> With $<- :
>>>>>
>>>>>> n = 1000
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>  user  system elapsed
>>>>> 0.304   0.000   0.307
>>>>>> n = 100000
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>  user  system elapsed
>>>>> 37.586   0.388  38.161
>>>>>>
>>>>>
>>>>> If it's because the 1st column needs to be copied (only) because
>>>>> that's the one being assigned to (in this test), that magnitude of
>>>>> slowdown doesn't seem consistent with the time of a vector copy of
>>>>> the 1st column :
>>>>>
>>>>>> n=100000
>>>>>> v = 1:n
>>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>>  user  system elapsed
>>>>> 0.016   0.000   0.017
>>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>>  user  system elapsed
>>>>> 1.816   1.076   2.900
>>>>>
>>>>> Finally, increasing the number of columns, again only the 1st is
>>>>> assigned to :
>>>>>
>>>>>> n=100000
>>>>>> x = data.frame(rep(list(1:n),100))
>>>>>> dim(x)
>>>>> [1] 100000    100
>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>  user  system elapsed
>>>>> 167.974  50.903 219.711
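
The measurements above can be repeated at a smaller scale with a helper
along these lines. This is a sketch: time_assign is a made-up name, the
table is built from a matrix rather than with rep(list(...)), and
timings depend heavily on the R version and machine, so no expected
numbers are given.

```r
# Time `reps` single-cell assignments into a data.frame with `ncol`
# integer columns of length `n`.
time_assign <- function(n, ncol = 2, reps = 100) {
  x <- as.data.frame(matrix(1L, nrow = n, ncol = ncol))
  timing <- system.time(for (i in seq_len(reps)) x[1, 1] <- 42L)
  unname(timing["elapsed"])
}

sizes <- c(1000, 100000)
elapsed <- vapply(sizes, time_assign, numeric(1))
print(data.frame(n = sizes, elapsed = elapsed))
```

Varying ncol as well as n separates the cost of column length from the
cost of column count.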
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Simon
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsemius at comcast.net>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tierney at uiowa.edu>
>>>>>>> <luke-tierney at uiowa.edu> wrote:
>>>>>>>
>>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>>
>>>>>>>>>> Simon (and all),
>>>>>>>>>>
>>>>>>>>>> I've tried to make assignment as fast as calling
>>>>>>>>>> `[<-.data.table` directly, for user convenience. Profiling shows
>>>>>>>>>> (IIUC) that it isn't dispatch, but x being copied. Is there a way
>>>>>>>>>> to prevent '[<-' from copying x?
>>>>>>>>>
>>>>>>>>> Good point, and conceptually, no. It's a subassignment after all
>>>>>>>>> - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>>
>>>>>>>>> `*tmp*` <- x
>>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>>> rm(`*tmp*`)
>>>>>>>>>
>>>>>>>>> so there is always a copy involved.
>>>>>>>>>
>>>>>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries
>>>>>>>>> to keep the pass-by-value illusion while passing references in
>>>>>>>>> cases where it knows that modifications cannot occur and/or they
>>>>>>>>> are safe. The default subassign method uses that feature which
>>>>>>>>> means it can afford to not duplicate if there is only one
>>>>>>>>> reference -- then it's safe to not duplicate as we are replacing
>>>>>>>>> that only existing reference. And in the case of a matrix, that
>>>>>>>>> will be true at the latest from the second subassignment on.
>>>>>>>>>
>>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>>> reference in the dispatch chain so there will always be two
>>>>>>>>> references so duplication is necessary. Since we have only 0 / 1
>>>>>>>>> / 2+ information on the references, we can't distinguish whether
>>>>>>>>> the second reference is due to the dispatch or due to the passed
>>>>>>>>> object having more than one reference, so we have to duplicate in
>>>>>>>>> any case. That is unfortunate, and I don't see a way around
>>>>>>>>> (unless we handle subassignment methods in some special way).
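
The R-lang equivalence quoted above can be checked directly in base R.
This is only a sketch with made-up names (x, y), and it shows the value
semantics, not whether a physical copy is made.

```r
x <- c(10, 20, 30)
y <- x

# Ordinary subassignment syntax.
x[2] <- 99

# The same update written out as in R-lang 3.4.4.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 2, value = 99)
rm(`*tmp*`)

print(identical(x, y))  # [1] TRUE
```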
>>>>>>>>
>>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>>> seems to confirm this though I don't guarantee I did that right).
>>>>>>>> The issue is that a replacement function implemented as a closure,
>>>>>>>> which is the only option for a package, will always see NAMED on
>>>>>>>> the object to be modified as 2 (because the value is obtained by
>>>>>>>> forcing the argument promise) and so any R level assignments will
>>>>>>>> duplicate. This also isn't really an issue of imprecise reference
>>>>>>>> counting -- there really are (at least) two legitimate references
>>>>>>>> -- one through the argument and one through the caller's
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>>> able to define replacement functions that do not duplicate in
>>>>>>>> cases where we really don't want them to, but this would require
>>>>>>>> coming up with some sort of protocol, minimally involving an
>>>>>>>> efficient way to detect whether a replacement function is being
>>>>>>>> called in a replacement context or directly.
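
For readers unfamiliar with the two calling styles mentioned above, a
minimal sketch follows. `second<-` is a made-up replacement function,
not part of any package.

```r
# A replacement function written as an ordinary closure.
`second<-` <- function(x, value) {
  x[2] <- value   # modifies the function's local x
  x
}

a <- c(1, 2, 3)
b <- c(1, 2, 3)

# Replacement context: R rebinds `a` to the function's result.
second(a) <- 10

# Direct call: `b` itself must stay unchanged; only the result differs.
b_new <- `second<-`(b, 10)

print(a)      # [1]  1 10  3
print(b)      # [1] 1 2 3
print(b_new)  # [1]  1 10  3
```

Because the direct call must leave b untouched, a closure cannot blindly
modify its argument in place, which is the difficulty described above.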
>>>>>>>
>>>>>>> Would "$<-" always satisfy that condition? It would be a big help to
>>>>>>> me if it could be designed to avoid duplicating the rest of the
>>>>>>> data.frame.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>>
>>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>>> them.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> luke
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Simon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Luke Tierney
>>>>>>>> Statistics and Actuarial Science
>>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>>> University of Iowa                  Phone:             319-335-3386
>>>>>>>> Department of Statistics and        Fax:               319-335-3017
>>>>>>>>    Actuarial Science
>>>>>>>> 241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
>>>>>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>>>>> ______________________________________________
>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>
>>>>>>> David Winsemius, MD
>>>>>>> West Hartford, CT
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>