[R] which() vs. just logical selection in df

Sun Oct 11 01:24:40 CEST 2020

Hi R-helpers,

Does anyone know why adding which() makes the select call more
efficient than just using logical selection in a dataframe? Doesn't
which() technically add another conversion/function call on top of the
logical selection? Here is a reproducible example with a slight
difference in timing.

# Surrogate data - the timing here isn't interesting
urltext <- paste("https://drive.google.com/",
                 "uc?id=1AZ-s1EgZXs4M_XF3YYEaKjjMMvRQ7",
                 "-h8&export=download", sep="")
download.file(url=urltext, destfile="tempfile.csv") # download file first
dat <- read.csv("tempfile.csv", stringsAsFactors = FALSE, header=TRUE,
                  nrows=2.5e6) # read the file; 'nrows' is a slight
                                         # overestimate
dat <- dat[,1:3] # select just the first 3 columns
head(dat, 10) # print the first 10 rows

# Select using which() as the final step ~ 90ms total time on my macbook air
system.time(
  head(
    dat[which(dat$gender2=="other"),],),
  gcFirst=TRUE)

# Select skipping which() ~130ms total time
system.time(
  head(
    dat[dat$gender2=="other", ]),
  gcFirst=TRUE)

Now I would think that the second one without which() would be more
efficient. However, every time I run these, the first version, with
which() is more efficient by about 20ms of system time and 20ms of
user time. Does anyone know why this is?

Cheers!
Keith