[R] [External] Re: Strange behavior when sampling rows of a data frame

luke-tierney using uiowa.edu
Fri Jun 19 20:40:32 CEST 2020


The behavior has been there much longer than that in R and it's been a
known issue with complex assignment for a long time (not the only
one). You're in a better position than I to know how Splus handles this.

The complex assignment expression

     df[<index>, ]$treated <- TRUE

is basically evaluated as

     tmp <- df[<index>, ]
     tmp$treated <- TRUE
     df[<index>, ] <- tmp

So the <index> argument is evaluated twice. This is always a little
inefficient, and it is probably not what you want if the index argument
has side effects. So the main take-away is:

     Don't use index arguments with side effects in complex assignments.
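
A minimal sketch of that double evaluation and of the safe pattern. The
helper pick_rows() and the counter calls are made up for illustration;
the helper returns fixed rows, so the data frame itself stays consistent
and only the call count reveals the second evaluation.

```r
# Hypothetical helper: returns fixed row numbers and counts its own calls.
calls <- 0
pick_rows <- function() {
  calls <<- calls + 1        # side effect: record each evaluation
  c(2L, 5L, 7L)
}

df <- data.frame(unit = 1:10, treated = FALSE)

# Complex assignment: the index expression is evaluated once to extract
# the rows and once more to assign them back.
df[pick_rows(), ]$treated <- TRUE
calls_complex <- calls       # number of times pick_rows() ran

# Safer: evaluate a side-effecting index once, then reuse the stored value.
df2 <- data.frame(unit = 1:10, treated = FALSE)
set.seed(2020)
s <- sample(nrow(df2), 3)
df2[s, ]$treated <- TRUE
```

With sample() as the index, the two evaluations draw different rows,
which is what scrambles the units in the original report.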

It is in principle possible, when standard evaluation is in use, to
capture the value of <index> from the first evaluation and re-use for
the second. But, for better or worse, assignment methods can and do
use non-standard evaluation for the index arguments, and it would be
very hard for authors of such methods to avoid this. So changing to
avoid multiple index evaluation would always have to come with an
asterisk.
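
To illustrate the kind of method I mean, here is a rough sketch. The
class "noisy" and the attribute "last_index" are invented for this
example; it is not taken from any real package. The method looks at the
unevaluated index expression, so it would behave differently if R handed
it a cached value instead of the original expression.

```r
# Hypothetical S3 class whose `[<-` method captures the index expression.
`[<-.noisy` <- function(x, i, value) {
  idx_expr <- substitute(i)    # non-standard evaluation: the expression, not its value
  y <- unclass(x)
  y[i] <- value                # standard evaluation of the index happens here
  structure(y, class = "noisy", last_index = deparse(idx_expr))
}

v <- structure(1:5, class = "noisy")
v[2 + 1] <- 10L

# The method recorded the index as written by the caller.
recorded <- attr(v, "last_index")
```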

There are other issues with complex assignment as implemented
currently that have higher priority but are also quite tricky to
address. Possibly this one can be addressed at the same time.

Best,

luke

On Fri, 19 Jun 2020, William Dunlap via R-help wrote:

> It is a bug that has been present in R since at least R-2.14.0 (the oldest
> that I have installed on my laptop).
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
> On Fri, Jun 19, 2020 at 10:37 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
>> Hello,
>>
>>
>> Thanks, I hadn't thought of that.
>>
>> But, why? Is it evaluated once before assignment and a second time when
>> the assignment occurs?
>>
>> Tracing both sample and `[<-` shows 2 calls to sample.
>>
>>
>> trace(sample)
>> trace(`[<-`)
>> df[sample(nrow(df), 3),]$treated <- TRUE
>> trace: sample(nrow(df), 3)
>> trace: `[<-`(`*tmp*`, sample(nrow(df), 3), , value = list(unit = c(7L,
>> 6L, 8L), treated = c(TRUE, TRUE, TRUE)))
>> trace: sample(nrow(df), 3)
>>
>>
>> Regards,
>>
>> Rui Barradas
>>
>>
>> At 17:20 on 19/06/2020, William Dunlap wrote:
>>> The first subscript argument is getting evaluated twice.
>>>> trace(sample)
>>>> set.seed(2020); df[i<-sample(10,3), ]$Treated <- TRUE
>>> trace: sample(10, 3)
>>> trace: sample(10, 3)
>>>> i
>>> [1]  1 10  4
>>>> set.seed(2020); sample(10,3)
>>> trace: sample(10, 3)
>>> [1] 7 6 8
>>>> sample(10,3)
>>> trace: sample(10, 3)
>>> [1]  1 10  4
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>>
>>> On Fri, Jun 19, 2020 at 8:46 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>>>
>>>     Hello,
>>>
>>>     I don't have an answer as to why this happens, but it seems like
>>>     a bug. Where, though? In `[<-.data.frame` or in `[<-.default`?
>>>
>>>     A solution is to subset and assign the vector:
>>>
>>>
>>>     set.seed(2020)
>>>     df2 <- data.frame(unit = 1:10)
>>>     df2$treated <- FALSE
>>>
>>>     df2$treated[sample(nrow(df2), 3)] <- TRUE
>>>     df2
>>>     #  unit treated
>>>     #1     1   FALSE
>>>     #2     2   FALSE
>>>     #3     3   FALSE
>>>     #4     4   FALSE
>>>     #5     5   FALSE
>>>     #6     6    TRUE
>>>     #7     7    TRUE
>>>     #8     8    TRUE
>>>     #9     9   FALSE
>>>     #10   10   FALSE
>>>
>>>
>>>     Or
>>>
>>>
>>>     set.seed(2020)
>>>     df3 <- data.frame(unit = 1:10)
>>>     df3$treated <- FALSE
>>>
>>>     df3[sample(nrow(df3), 3), "treated"] <- TRUE
>>>     df3
>>>     # result as expected
>>>
>>>
>>>     Hope this helps,
>>>
>>>     Rui  Barradas
>>>
>>>
>>>
>>>     At 13:49 on 19/06/2020, Sébastien Lahaie wrote:
>>>    > I ran into some strange behavior in R when trying to assign a
>>>    > treatment to rows in a data frame. I'm wondering whether any R
>>>    > experts can explain what's going on.
>>>    >
>>>    > First, let's assign a treatment to 3 out of 10 rows as follows.
>>>    >
>>>    >> df <- data.frame(unit = 1:10)
>>>    >> df$treated <- FALSE
>>>    >> s <- sample(nrow(df), 3)
>>>    >> df[s,]$treated <- TRUE
>>>    >> df
>>>    >    unit treated
>>>    > 1     1   FALSE
>>>    > 2     2    TRUE
>>>    > 3     3   FALSE
>>>    > 4     4   FALSE
>>>    > 5     5    TRUE
>>>    > 6     6   FALSE
>>>    > 7     7    TRUE
>>>    > 8     8   FALSE
>>>    > 9     9   FALSE
>>>    > 10   10   FALSE
>>>    >
>>>    > This is as expected. Now we'll just skip the intermediate step of
>>>    > saving the sampled indices, and apply the treatment directly as
>>>    > follows.
>>>    >
>>>    >> df <- data.frame(unit = 1:10)
>>>    >> df$treated <- FALSE
>>>    >> df[sample(nrow(df), 3),]$treated <- TRUE
>>>    >> df
>>>    >    unit treated
>>>    > 1     6    TRUE
>>>    > 2     2   FALSE
>>>    > 3     3   FALSE
>>>    > 4     9    TRUE
>>>    > 5     5   FALSE
>>>    > 6     6   FALSE
>>>    > 7     7   FALSE
>>>    > 8     5    TRUE
>>>    > 9     9   FALSE
>>>    > 10   10   FALSE
>>>    >
>>>    > Now the data frame still has 10 rows with 3 assigned to the
>>>    > treatment. But the units are garbled. Units 1 and 4 have
>>>    > disappeared, for instance, and there are duplicates for 6 and 9,
>>>    > one assigned to treatment and the other to control. Why would
>>>    > this happen?
>>>    >
>>>    > Thanks,
>>>    > Sebastien
>>>    >
>>>    >       [[alternative HTML version deleted]]
>>>    >
>>>    > ______________________________________________
>>>    > R-help using r-project.org <mailto:R-help using r-project.org> mailing list
>>>     -- To UNSUBSCRIBE and more, see
>>>    > https://stat.ethz.ch/mailman/listinfo/r-help
>>>    > PLEASE do read the posting guide
>>>     http://www.R-project.org/posting-guide.html
>>>    > and provide commented, minimal, self-contained, reproducible code.
>>>
>>>     --
>>>     This e-mail has been checked for viruses by Avast antivirus
>>>     software.
>>>     https://www.avast.com/antivirus
>>>
>>>
>>
>>
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

