[R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?
matthias-gondan
m@tth|@@-gond@n @end|ng |rom gmx@de
Wed Aug 4 14:08:05 CEST 2021
Response to message 1:

You need the log version, e.g., in maximum likelihood estimation; otherwise the product of the densities and probabilities can become so small that it underflows to zero.
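A minimal sketch of the point above, in R. The seed, sample size, variable names, and the choice of -40 as a far-tail value are illustrative assumptions, not taken from the thread. It contrasts a likelihood computed as a product of densities with a log-likelihood computed as a sum of log densities, and shows a tail probability that is only representable on the log scale.

```r
# Illustrative data: 2000 draws from a standard normal
# (seed and sample size are arbitrary assumptions).
set.seed(1)
x <- rnorm(2000)

# Likelihood as a plain product of densities: underflows to exactly 0.
lik <- prod(dnorm(x))

# Log-likelihood as a sum of log densities: stays finite.
loglik <- sum(dnorm(x, log = TRUE))

# A far-tail probability underflows to 0 on the natural scale ...
p_tail <- pnorm(-40)

# ... but is representable on the log scale via log.p = TRUE.
logp_tail <- pnorm(-40, log.p = TRUE)

# qnorm() accepts the log probability directly and should return
# a quantile close to the original -40.
q_back <- qnorm(logp_tail, log.p = TRUE)

print(c(lik = lik, loglik = loglik, p_tail = p_tail,
        logp_tail = logp_tail, q_back = q_back))
```

An optimizer given `lik` would see a flat surface of zeros, whereas `loglik` still varies with the parameters, which is why the log versions matter even when the individual p-values are not of practical interest.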
-------- Original message --------
From: r-help-request using r-project.org
Date: 04.08.21 12:01 (GMT+01:00)
To: r-help using r-project.org
Subject: R-help Digest, Vol 222, Issue 4

Send R-help mailing list submissions to
    r-help using r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://stat.ethz.ch/mailman/listinfo/r-help
or, via email, send a message with subject or body 'help' to
    r-help-request using r-project.org

You can reach the person managing the list at
    r-help-owner using r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of R-help digest..."

Today's Topics:

   1. What are the pros and cons of the log.p parameter in (p|q)norm and similar? (Michael Dewey)
   2. Help with package EasyPubmed (bharat rawlley)
   3. Re: Help with package EasyPubmed (bharat rawlley)
   4. Re: What are the pros and cons of the log.p parameter in (p|q)norm and similar? (Duncan Murdoch)
   5. Re: What are the pros and cons of the log.p parameter in (p|q)norm and similar? (Bill Dunlap)
   6. Creating a log-transformed histogram of multiclass data (Tom Woolman)
   7. Re: Creating a log-transformed histogram of multiclass data (Tom Woolman)

----------------------------------------------------------------------

Message: 1
Date: Tue, 3 Aug 2021 17:20:12 +0100
From: Michael Dewey <lists using dewey.myzen.co.uk>
To: "r-help using r-project.org" <r-help using r-project.org>
Subject: [R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?
Message-ID: <e17bdaaa-7945-4f37-ee69-941eb8270f16 using dewey.myzen.co.uk>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Short version

Apart from the ability to work with values of p too small to be of much practical use what are the advantages and disadvantages of setting this to TRUE?

Longer version

I am contemplating upgrading various functions in one of my packages to use this and as far as I can see it would only have the advantage of allowing people to use very small p-values but before I go ahead have I missed anything? I am most concerned with negatives but if there is any other advantage I would mention that in the vignette. I am not concerned about speed or the extra effort in coding and expanding the documentation.

-- 
Michael
http://www.dewey.myzen.co.uk/home.html

------------------------------

Message: 2
Date: Tue, 3 Aug 2021 18:20:52 +0000 (UTC)
From: bharat rawlley <bharat_m_all using yahoo.co.in>
To: R-help Mailing List <r-help using r-project.org>
Subject: [R] Help with package EasyPubmed
Message-ID: <1046636584.2205366.1628014852065 using mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Hello,

When I try to run the following code using the package Easypubmed, I get a null result -

   > batch_pubmed_download(query_7)
   NULL

   # query_7 <- "Cardiology AND randomizedcontrolledtrial[Filter] AND 2011[PDAT]"

However, the exact same search string yields 668 results on Pubmed. I am unable to figure out why this is happening. If I use the search string "Cardiology AND 2011[PDAT]" then it works just fine.

Any help would be greatly appreciated.

Thank you!

------------------------------

Message: 3
Date: Tue, 3 Aug 2021 18:26:40 +0000 (UTC)
From: bharat rawlley <bharat_m_all using yahoo.co.in>
To: R-help Mailing List <r-help using r-project.org>
Subject: Re: [R] Help with package EasyPubmed
Message-ID: <712126143.2207911.1628015200446 using mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Okay, the following search string resolved my issue -

   "Cardiology AND randomized controlled trial[Publication type] AND 2011[PDAT]"

Thank you!

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

------------------------------

Message: 4
Date: Tue, 3 Aug 2021 14:53:28 -0400
From: Duncan Murdoch <murdoch.duncan using gmail.com>
To: Michael Dewey <lists using dewey.myzen.co.uk>, "r-help using r-project.org" <r-help using r-project.org>
Subject: Re: [R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?
Message-ID: <c15f610b-7a16-9d84-884c-54cc170bbad8 using gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

These are often needed in likelihood problems. In just about any problem where the normal density shows up in the likelihood, you're better off working with the log likelihood and setting log = TRUE in dnorm, because sometimes you want to evaluate the likelihood very far from its mode.

The same sort of thing happens with pnorm for similar reasons. Some likelihoods involve normal integrals and will need it.

I can't think of an example for qnorm off the top of my head, but I imagine there are some: maybe involving simulation way out in the tails.

The main negative about using logs is that they aren't always needed.

Duncan Murdoch

------------------------------

Message: 5
Date: Tue, 3 Aug 2021 13:24:08 -0700
From: Bill Dunlap <williamwdunlap using gmail.com>
To: Duncan Murdoch <murdoch.duncan using gmail.com>
Cc: Michael Dewey <lists using dewey.myzen.co.uk>, "r-help using r-project.org" <r-help using r-project.org>
Subject: Re: [R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?
Message-ID: <CAHqSRuSBQyuyJ5a9YrHk3BHXPn5UmbxQ54bKhAU3G6yroCnG4A using mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

In maximum likelihood problems, even when the individual density values are fairly far from zero, their product may underflow to zero. Optimizers have problems when there is a large flat area.

   > q <- runif(n=1000, -0.1, +0.1)
   > prod(dnorm(q))
   [1] 0
   > sum(dnorm(q, log=TRUE))
   [1] -920.6556

A more minor advantage for some probability-related functions is speed. E.g., dnorm(log=TRUE,...) does not need to evaluate exp().

   > q <- runif(1e6, -10, 10)
   > system.time(for(i in 1:100)dnorm(q, log=FALSE))
      user  system elapsed
      9.13    0.11    9.23
   > system.time(for(i in 1:100)dnorm(q, log=TRUE))
      user  system elapsed
      4.60    0.19    4.78

-Bill

------------------------------

Message: 6
Date: Tue, 03 Aug 2021 18:56:08 -0400
From: Tom Woolman <twoolman using ontargettek.com>
To: r-help using r-project.org
Subject: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <2bc87c25f161bac1d8e5101e20bf2237 using ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

# Resending this message since the original email was held in queue by the listserv software because of a "suspicious" subject line, and/or because of attached .png histogram chart attachments. I'm guessing that the listserv software doesn't like multiple image file attachments.

Hi everyone. I'm working on a research model now that is calculating anomaly scores (RMSE values) for three distinct groups within a large dataset. The anomaly scores are a continuous data type and are quite small, ranging from approximately 1e-04 to 1e-07 across a population of approximately 1 million observations.

I have all of the summary and descriptive statistics for each of the anomaly score distributions across each group label in the dataset, and I am able to create some useful histograms showing how each of the three groups is uniquely distributed across the range of scores. However, because of the large variance within the frequency of score values and the high density peaks within much of the anomaly scores, I need to use a log transformation within the histogram to show both the log frequency count of each binned observation range (y-axis) and a log transformation of the binned score values (x-axis) to be able to appropriately illustrate the distributions within the data and make it more readily understandable.

Fortunately, ggplot2 is really useful for creating some really attractive dual-axis log transformed histograms.

However, I cannot figure out a way to create the log transformed histograms to show each of my three groups by color within the same histogram. I would want it to look like this, BUT use a log transformation for each axis. This plot below shows the 3 groups in one histogram but uses the default normal values.

For log transformed axis values, the best I can do so far is produce three separate histograms, one for each group.

Below is sample R code to illustrate my problem with a randomly-generated example dataset and the ggplot2 approaches that I have taken so far:

# Sample R code below:

library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple random sample data to produce an example dataset.
# This produces an example dataframe called d, which contains a class label IV of either A, B or C for each observation. The target variable is the anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.

DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE,
    prob=c(0.65, 0.30, 0.05) ), anomaly_score = DV_score_generator)

# First, I use ggplot to create the normal distribution histogram that shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it doesn't appear to be necessary to use an x and y-axis log transformation to show the distribution patterns, but it does become an issue with my vastly larger and more complex score values in the DV of the actual data.

p <- d %>%
ggplot( aes(x=anomaly_score, fill=label)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
theme_ipsum() +
labs(fill="")

p

# Produces a normal multiclass histogram.

# Now produce a series of x and y-axis log-transformed histograms, producing one histogram for each distinct label class in the dataset:

# Group A, log transformed

ggplot(group_a, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
    colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group A Only")

# Group A transformed histogram is produced here.

# Group B, log transformed

ggplot(group_b, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
    colour = "green", fill = "darkgreen") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group B Only")

# Group B transformed histogram is produced here.

# Group C, log transformed

ggplot(group_c, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
    colour = "red", fill = "darkred") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group C Only")

# Group C transformed histogram is produced here.

# End.

Thanks in advance, everyone!

- Tom

Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS, MS
On Target Technologies, Inc.
Virginia, USA

------------------------------

Message: 7
Date: Tue, 03 Aug 2021 19:04:29 -0400
From: Tom Woolman <twoolman using ontargettek.com>
To: r-help using r-project.org
Subject: Re: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <ba170db0581b2b7f5c79448355685e92 using ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

Apologies, I left out 3 critical lines of code after the randomized sample dataframe is created:

group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[ which(d$label =='C'), ]

------------------------------

End of R-help Digest, Vol 222, Issue 4
**************************************