[R] Strange behavior when sampling rows of a data frame

William Dunlap wdunlap at tibco.com
Fri Jun 19 19:42:08 CEST 2020


It is a bug that has been present in R since at least R-2.14.0 (the oldest
that I have installed on my laptop).
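
One way to see the double evaluation without trace() is to wrap the index in a function that counts its own calls. This is only a sketch: `counted_index` and `n_calls` are made-up names for illustration, and a fixed index stands in for sample() so that the result is deterministic. On the R versions discussed in this thread the counter is expected to end at 2, but the code prints it rather than relying on that.

```r
df <- data.frame(unit = 1:10)
df$treated <- FALSE

# Hypothetical helper: returns a fixed index and counts how often it is called.
n_calls <- 0L
counted_index <- function() {
  n_calls <<- n_calls + 1L
  c(2L, 5L, 7L)
}

# Same shape as the problematic statement, with the index written inline.
df[counted_index(), ]$treated <- TRUE

# How many times was the index expression evaluated?
print(n_calls)

# The index is the same on every call here, so the data frame is not garbled.
print(df)
```

With sample() in place of the fixed index, each evaluation would draw a different set of rows, which is why saving the index in a variable first (as in the original post) avoids the problem.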

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Fri, Jun 19, 2020 at 10:37 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:

> Hello,
>
>
> Thanks, I hadn't thought of that.
>
> But, why? Is it evaluated once before assignment and a second time when
> the assignment occurs?
>
> Tracing both sample and `[<-` shows 2 calls to sample.
>
>
> trace(sample)
> trace(`[<-`)
> df[sample(nrow(df), 3),]$treated <- TRUE
> trace: sample(nrow(df), 3)
> trace: `[<-`(`*tmp*`, sample(nrow(df), 3), , value = list(unit = c(7L,
> 6L, 8L), treated = c(TRUE, TRUE, TRUE)))
> trace: sample(nrow(df), 3)
>
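
The two calls are consistent with how R rewrites a nested replacement such as `df[i, ]$treated <- TRUE`; the R Language Definition describes this kind of rewriting for complex assignments. A rough hand-written equivalent, assuming that rewriting, is sketched below. Here `tmp` stands in for R's internal `*tmp*`, and `idx_expr` and `draws` are hypothetical names: `idx_expr()` plays the role of the `sample(nrow(df), 3)` expression and returns a different index on each call, using the two draws shown in the traces in this thread.

```r
df <- data.frame(unit = 1:10)
df$treated <- FALSE

# Hypothetical stand-in for sample(nrow(df), 3): one result per call.
draws <- list(c(7L, 6L, 8L), c(1L, 10L, 4L))
k <- 0L
idx_expr <- function() {
  k <<- k + 1L
  draws[[k]]
}

tmp <- df

# Inner step: extract rows using the FIRST evaluation of the index,
# then set treated on that 3-row copy.
inner <- `$<-`(tmp[idx_expr(), ], "treated", value = TRUE)

# Outer step: write the copy back using the SECOND evaluation of the index.
df <- `[<-`(tmp, idx_expr(), , value = inner)

# Rows 1, 10 and 4 now hold units 7, 6 and 8, all marked as treated.
print(df)
```

Because the rows are read from one set of positions and written to another, whole rows are copied over unrelated rows, which matches the garbled units in the original report.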
>
> Regards,
>
> Rui Barradas
>
>
> At 17:20 on 19/06/2020, William Dunlap wrote:
> > The first subscript argument is getting evaluated twice.
> > > trace(sample)
> > > set.seed(2020); df[i<-sample(10,3), ]$Treated <- TRUE
> > trace: sample(10, 3)
> > trace: sample(10, 3)
> > > i
> > [1]  1 10  4
> > > set.seed(2020); sample(10,3)
> > trace: sample(10, 3)
> > [1] 7 6 8
> > > sample(10,3)
> > trace: sample(10, 3)
> > [1]  1 10  4
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com <http://tibco.com>
> >
> >
> > On Fri, Jun 19, 2020 at 8:46 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> >     Hello,
> >
> >     I don't know why this happens, but it looks like a bug. The question
> >     is where: in `[<-.data.frame` or in `[<-.default`?
> >
> >     A solution is to subset and assign the vector:
> >
> >
> >     set.seed(2020)
> >     df2 <- data.frame(unit = 1:10)
> >     df2$treated <- FALSE
> >
> >     df2$treated[sample(nrow(df2), 3)] <- TRUE
> >     df2
> >     #  unit treated
> >     #1     1   FALSE
> >     #2     2   FALSE
> >     #3     3   FALSE
> >     #4     4   FALSE
> >     #5     5   FALSE
> >     #6     6    TRUE
> >     #7     7    TRUE
> >     #8     8    TRUE
> >     #9     9   FALSE
> >     #10   10   FALSE
> >
> >
> >     Or
> >
> >
> >     set.seed(2020)
> >     df3 <- data.frame(unit = 1:10)
> >     df3$treated <- FALSE
> >
> >     df3[sample(nrow(df3), 3), "treated"] <- TRUE
> >     df3
> >     # result as expected
> >
> >
> >     Hope this helps,
> >
> >     Rui  Barradas
> >
> >
> >
> >     At 13:49 on 19/06/2020, Sébastien Lahaie wrote:
> >     > I ran into some strange behavior in R when trying to assign a treatment to
> >     > rows in a data frame. I'm wondering whether any R experts can explain
> >     > what's going on.
> >     >
> >     > First, let's assign a treatment to 3 out of 10 rows as follows.
> >     >
> >     >> df <- data.frame(unit = 1:10)
> >     >> df$treated <- FALSE
> >     >> s <- sample(nrow(df), 3)
> >     >> df[s,]$treated <- TRUE
> >     >> df
> >     >    unit treated
> >     > 1     1   FALSE
> >     > 2     2    TRUE
> >     > 3     3   FALSE
> >     > 4     4   FALSE
> >     > 5     5    TRUE
> >     > 6     6   FALSE
> >     > 7     7    TRUE
> >     > 8     8   FALSE
> >     > 9     9   FALSE
> >     > 10   10   FALSE
> >     >
> >     > This is as expected. Now we'll just skip the intermediate step of saving
> >     > the sampled indices, and apply the treatment directly as follows.
> >     >
> >     >> df <- data.frame(unit = 1:10)
> >     >> df$treated <- FALSE
> >     >> df[sample(nrow(df), 3),]$treated <- TRUE
> >     >> df
> >     >    unit treated
> >     > 1     6    TRUE
> >     > 2     2   FALSE
> >     > 3     3   FALSE
> >     > 4     9    TRUE
> >     > 5     5   FALSE
> >     > 6     6   FALSE
> >     > 7     7   FALSE
> >     > 8     5    TRUE
> >     > 9     9   FALSE
> >     > 10   10   FALSE
> >     >
> >     > Now the data frame still has 10 rows with 3 assigned to the treatment. But
> >     > the units are garbled. Units 1 and 4 have disappeared, for instance, and
> >     > there are duplicates for 6 and 9, one assigned to treatment and the other
> >     > to control. Why would this happen?
> >     >
> >     > Thanks,
> >     > Sebastien
> >     >
> >     >       [[alternative HTML version deleted]]
> >     >
> >     > ______________________________________________
> >     > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >     > https://stat.ethz.ch/mailman/listinfo/r-help
> >     > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >     > and provide commented, minimal, self-contained, reproducible code.
> >
> >     --
> >     This e-mail has been checked for viruses by Avast antivirus software.
> >     https://www.avast.com/antivirus
> >
>
