[R] which() vs. just logical selection in df

1/k^c kchamberln using gmail.com
Thu Oct 15 00:23:09 CEST 2020


Hi Dr. Snow, & R-helpers,

Thank you for your reply! I hadn't heard of the {microbenchmark}
package and was excited to try it. Thank you for the suggestion! Before
posting, I did check the source of which(), which includes a step that
drops NAs, and my data contain no missing values:

sum(is.na(dat$gender2))
sum(is.na(dat$gender))
sum(is.na(dat$y))

[1] 0
[1] 0
[1] 0
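
A small sketch of why the NA check matters, on a made-up data frame (`toy` and its values are hypothetical and are not the data from this thread). When the condition contains an NA, logical selection returns an extra all-NA row, while which() keeps only the positions that are TRUE:

```r
# Hypothetical toy data, only to illustrate the NA behaviour
toy <- data.frame(gender2 = c("other", NA, "female", "other"),
                  y = c(1.5, 2.5, 3.5, 4.5),
                  stringsAsFactors = FALSE)

cond <- toy$gender2 == "other"       # TRUE NA FALSE TRUE

by_logical <- toy[cond, ]            # the NA in the index yields an all-NA row
by_which   <- toy[which(cond), ]     # which() keeps only the TRUE positions

nrow(by_logical)   # 3
nrow(by_which)     # 2
```

With no NAs in the column, as in the checks above, this particular difference cannot explain a timing gap.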

I still see a difference of roughly 10 ms in the timings reported by
microbenchmark between the following two methods, one with and one
without which(). The direction of the difference is the opposite of
what I expected, since which() is an extra step.

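library(microbenchmark)

# With which(): rows are selected by integer position
microbenchmark(
  head(
    dat[which(dat$gender2=="other"),],), times=100L)
# Without which(): rows are selected by the logical vector directly
microbenchmark(
  head(
    dat[dat$gender2=="other",],), times=100L)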

expr                                               min        lq         mean
head(dat[which(dat$gender2 == "other"), ], )       62.93803   74.25939    88.4704
head(dat[dat$gender2 == "other", ], )              71.8914    87.95844   103.7231
Is which() invoking C-level code by chance, making it slightly faster
on average? A difference like this could add up on much larger data
(terabytes, say). The addition of which() still seems superfluous to
me, and I'd like to know whether it's considered best practice to keep
it. What is R invoking when which() isn't called explicitly? Is R
invoking which() eventually anyway?
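
One part of this can be checked without knowing the internals: which() hands `[` a short integer vector of row positions, whereas the bare comparison hands it a logical vector as long as the data frame. A minimal sketch on made-up data (`sim2` and its size are assumptions for illustration) shows that the two selections agree when the condition has no NAs, and times the subsetting step separately from the comparison:

```r
# Hypothetical data: one row in four matches the condition
n <- 1e6
sim2 <- data.frame(
  gender2 = rep(c("female", "male", "other", "male"), length.out = n),
  y       = seq_len(n),
  stringsAsFactors = FALSE
)

idx_lgl <- sim2$gender2 == "other"   # logical, one element per row
idx_int <- which(idx_lgl)            # integer positions of the TRUE values

length(idx_lgl)   # n
length(idx_int)   # n / 4 here, by construction

# With no NAs in the condition, both forms select the same rows
same_rows <- identical(sim2[idx_lgl, ], sim2[idx_int, ])
same_rows

# Time only the subsetting, with the index already built
system.time(for (i in 1:20) sim2[idx_lgl, ])
system.time(for (i in 1:20) sim2[idx_int, ])
```

This does not show what R does internally; it only separates the cost of building the index from the cost of using it, which may help narrow down where the gap comes from.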

Cheers!
Keith

> Message: 2
> Date: Mon, 12 Oct 2020 13:01:36 -0600
> From: Greg Snow <538280 using gmail.com>
> To: "1/k^c" <kchamberln using gmail.com>
> Cc: r-help <r-help using r-project.org>
> Subject: Re: [R] which() vs. just logical selection in df
> Message-ID:
>         <CAFEqCdyUuHh5TZ7t5NJ8cs_4xB61mNeUgasncekD485eBNRK6Q using mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> I would suggest using the microbenchmark package to do the time
> comparison.  This will run each a bunch of times for a more meaningful
> comparison.
>
> One possible reason for the difference is the number of missing values
> in your data (along with the number of columns).  Consider the
> difference in the following results:
>
> > x <- c(1,2,NA)
> > x[x==1]
> [1]  1 NA
> > x[which(x==1)]
> [1] 1
>
>
>
> On Sat, Oct 10, 2020 at 5:25 PM 1/k^c <kchamberln using gmail.com> wrote:
> >
> > Hi R-helpers,
> >
> > Does anyone know why adding which() makes the select call more
> > efficient than just using logical selection in a dataframe? Doesn't
> > which() technically add another conversion/function call on top of the
> > logical selection? Here is a reproducible example with a slight
> > difference in timing.
> >
> > # Surrogate data - the timing here isn't interesting
> > urltext <- paste("https://drive.google.com/",
> >                  "uc?id=1AZ-s1EgZXs4M_XF3YYEaKjjMMvRQ7",
> >                  "-h8&export=download", sep="")
> > download.file(url=urltext, destfile="tempfile.csv") # download file first
> > dat <- read.csv("tempfile.csv", stringsAsFactors = FALSE, header=TRUE,
> >                   nrows=2.5e6) # read the file; 'nrows' is a slight
> >                                          # overestimate
> > dat <- dat[,1:3] # select just the first 3 columns
> > head(dat, 10) # print the first 10 rows
> >
> > # Select using which() as the final step ~ 90ms total time on my macbook air
> > system.time(
> >   head(
> >     dat[which(dat$gender2=="other"),],),
> >   gcFirst=TRUE)
> >
> > # Select skipping which() ~130ms total time
> > system.time(
> >   head(
> >     dat[dat$gender2=="other", ]),
> >   gcFirst=TRUE)
> >
> > Now I would think that the second one without which() would be more
> > efficient. However, every time I run these, the first version, with
> > which() is more efficient by about 20ms of system time and 20ms of
> > user time. Does anyone know why this is?
> >
> > Cheers!
> > Keith
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> 538280 using gmail.com
>


