[R] Problem with data distribution

Ebert,Timothy Aaron tebert using ufl.edu
Thu Feb 17 23:43:45 CET 2022


Maybe what you want is to recode your data into two data sets.
The first records bug versus no bug. It answers: what is the probability that a class has one or more bugs?
The second keeps only the classes with bugs. It answers: given that a class has bugs, how many does it have?
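
A minimal sketch of that recoding in base R, assuming the counts from the table(bug) output quoted further down in this thread. The vector is rebuilt with rep(), so its order differs from the original file, and the object names has_bug, bug_pos, presence, p_bug and count_dist are made up for illustration:

```r
# Rebuild the bug counts from the frequency table quoted in this thread.
# Only the distribution matters here, not the original row order.
bug <- rep(c(0, 1, 2, 3, 4, 5, 7), times = c(162, 40, 9, 7, 2, 1, 1))

# Data set 1: bug versus no bug (presence/absence).
has_bug  <- bug > 0
presence <- table(has_bug)   # FALSE = no bug, TRUE = one or more bugs
p_bug    <- mean(has_bug)    # estimated probability of one or more bugs

# Data set 2: bugs only. Given at least one bug, how many?
bug_pos    <- bug[bug > 0]   # keep the filtered result in its own object
count_dist <- table(bug_pos)

print(presence)
print(round(p_bug, 3))
print(count_dist)
print(summary(bug_pos))
```

With these counts, 60 of the 222 classes have at least one bug, so p_bug is 60/222, about 0.27. Saving the filtered vector in its own object also avoids the problem in the original post, where the filter() result was printed but never assigned.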

Tim

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Neha gupta
Sent: Thursday, February 17, 2022 4:54 PM
To: Bert Gunter <bgunter.4567 using gmail.com>
Cc: r-help mailing list <r-help using r-project.org>
Subject: Re: [R] Problem with data distribution

:) :)

On Thu, Feb 17, 2022 at 10:37 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> imo, with such simple data, a plot is mere chartjunk. A simple table
> (= the distribution) would suffice and be more informative:
>
> > table(bug) ## bug is a vector. No data frame is needed
>
>   0   1   2   3   4   5   7   ## bug count
> 162  40   9   7   2   1   1   ## number of cases with the given count
>
> You or others may disagree, of course.
>
> Bert Gunter
>
>
>
> On Thu, Feb 17, 2022 at 11:56 AM Neha gupta <neha.bologna90 using gmail.com> wrote:
> >
> > Ebert and Rui, thank you for providing the tips (in fact, for
> > providing the answer I needed).
> >
> > Yes, you are right that a boxplot of all-zero values will not make sense.
> > Maybe a histogram will work.
> >
> > I am providing a few details of my data here and the context of the
> > question I asked.
> >
> > My data is about bugs/defects in different classes of a large
> > software system. I have to predict which classes will contain bugs and
> > which will be free of bugs (bug = 0). I trained ML models to make the
> > predictions, but my advisor asked me to first provide the data
> > distribution of bugs, e.g. how many classes have bugs (bug > 0) and
> > how many are free of bugs (bug = 0).
> >
> > That is why I need to provide the data distribution of both types of
> > values (i.e. bug = 0 and bug > 0).
> >
> > Thank you again.
> >
> > On Thu, Feb 17, 2022 at 8:28 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> > > Hello,
> > >
> > > In your original post you read the same file "synapse.arff" twice,
> > > apparently to filter each copy by its own criterion. You don't
> > > need to do that; read it once and filter that one object by different criteria.
> > >
> > > As for the data as posted, I have read it in with the following code:
> > >
> > >
> > > x <- "
> > > 0 1 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 1 0 0 0 
> > > 0 0 0
> > > 4 1 0
> > > 0 1 0 0 0 0 0 0 1 0 3 2 0 0 0 0 3 0 0 0 0 2 0 0 0 1 0 0 0 0 1 1 1 
> > > 0 0 0
> > > 0 0 0
> > > 1 1 2 1 0 1 0 0 0 2 2 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 5 0 0 0 0 
> > > 0 0 7
> > > 0 0 1
> > > 0 1 1 0 2 0 3 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 3 2 1 1 0 0 0 0 
> > > 0 0 0
> > > 1 0 0
> > > 0 0 0 0 0 0 0 0 0 1 0 1 0 0 3 0 0 1 0 1 3 0 0 0 0 0 0 0 0 1 0 4 1 
> > > 1 0 0
> > > 0 0 1
> > > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0 0 0 "
> > > bug <- scan(text = x)
> > > data <- data.frame(bug)
> > >
> > >
> > > This is not the right way to post data; the posting guide asks you to
> > > post the output of
> > >
> > >
> > > dput(data)
> > > structure(list(bug = c(0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 
> > > 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
> > > 4, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 0, 0, 0, 0, 3, 0, 0, 
> > > 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 
> > > 2, 1, 0, 1, 0, 0, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 
> > > 0, 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 7, 0, 0, 1, 0, 1, 1, 0, 2, 0, 3, 
> > > 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 3, 2, 1, 1, 
> > > 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 
> > > 0, 0, 3, 0, 0, 1, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 4, 1, 1, 
> > > 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> > > 0, 0, 3, 0, 1, 0, 0, 0, 0, 0)), class = "data.frame", row.names = 
> > > c(NA, -222L))
> > >
> > >
> > >
> > > This can be copied into an R session and the data set recreated 
> > > with
> > >
> > > data <- structure(etc)
> > >
> > >
> > > Now the boxplots.
> > >
> > > (Why would you want to plot a vector of all zeros, btw?)
> > >
> > >
> > >
> > > library(dplyr)
> > >
> > > boxplot(filter(data, bug == 0))    # nonsense
> > > boxplot(filter(data, bug > 0), range = 0)
> > >
> > > # Another way
> > > data %>%
> > >    filter(bug > 0) %>%
> > >    boxplot(range = 0)
> > >
> > >
> > > Hope this helps,
> > >
> > > Rui Barradas
> > >
> > >
> > > At 19:03 on 17/02/2022, Neha gupta wrote:
> > > > That is all the code I have. How can I provide reproducible code?
> > > >
> > > > How can I save this result?
> > > >
> > > > On Thu, Feb 17, 2022 at 8:00 PM Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> > > >
> > > >> You pipe the filter but do not save the result. A reproducible example
> > > >> might help.
> > > >> Tim
> > > >>
> > > >> -----Original Message-----
> > > >> From: R-help <r-help-bounces using r-project.org> On Behalf Of Neha 
> > > >> gupta
> > > >> Sent: Thursday, February 17, 2022 1:55 PM
> > > >> To: r-help mailing list <r-help using r-project.org>
> > > >> Subject: [R] Problem with data distribution
> > > >>
> > > >> Hello everyone
> > > >>
> > > > >> I have a dataset with output variable "bug" having the following
> > > > >> values (at the bottom of this email). My advisor asked me to provide
> > > > >> data distribution of bugs with 0 values and bugs with more than 0 values.
> > > >>
> > > >> data = readARFF("synapse.arff")
> > > > >> data2 = readARFF("synapse.arff")
> > > > >> data$bug
> > > >> library(tidyverse)
> > > >> data %>%
> > > >>    filter(bug == 0)
> > > >> data2 %>%
> > > >>    filter(bug >= 1)
> > > >> boxplot(data2$bug, data$bug, range=0)
> > > >>
> > > > >> But both the graphs are exactly the same, how is it possible?
> > > > >> Where am I going wrong?
> > > >>
> > > >>
> > > >> data$bug
> > > >>    [1] 0 1 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 
> > > >> 0 1
> 0 0
> > > 0 0 0
> > > >> 0 4 1 0
> > > >>   [40] 0 1 0 0 0 0 0 0 1 0 3 2 0 0 0 0 3 0 0 0 0 2 0 0 0 1 0 0 
> > > >> 0 0
> 1 1
> > > 1 0 0
> > > >> 0 0 0 0
> > > >>   [79] 1 1 2 1 0 1 0 0 0 2 2 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 
> > > >> 5 0
> 0 0
> > > 0 0 0
> > > >> 7 0 0 1
> > > >> [118] 0 1 1 0 2 0 3 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 3 2 1 1 
> > > >> 0 0
> 0 0
> > > 0 0
> > > >> 0 1 0 0
> > > >> [157] 0 0 0 0 0 0 0 0 0 1 0 1 0 0 3 0 0 1 0 1 3 0 0 0 0 0 0 0 0 
> > > >> 1 0
> 4 1
> > > 1 0
> > > >> 0 0 0 1
> > > >> [196] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0 0 0
> > > >>
> > > >>          [[alternative HTML version deleted]]
> > > >>
> > > >> ______________________________________________
> > > > >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > > > >> PLEASE do read the posting guide
> > > > >> http://www.R-project.org/posting-guide.html
> > > >> and provide commented, minimal, self-contained, reproducible code.
> > > >>
> > > >
> > > >       [[alternative HTML version deleted]]
> > > >
> > > > ______________________________________________
> > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, 
> > > > see 
> > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.c
> > > > h_mailman_listinfo_r-2Dhelp&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=
> > > > 9PEhQh2kVeAsRzsn7AkP-g&m=3hWViXJSTXDpoNVYXho6Boeq6QUtotK37L0ChgM
> > > > CpncRRH1bjKjIUqHjMj8vHCeH&s=53w0MvIpfAklRelSPE5abL_5YG-wyIrrXiFa
> > > > oqbAfLo&e= PLEASE do read the posting guide
> > > https://urldefense.proofpoint.com/v2/url?u=http-3A__www.R-2Dprojec
> > > t.org_posting-2Dguide.html&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PE
> > > hQh2kVeAsRzsn7AkP-g&m=3hWViXJSTXDpoNVYXho6Boeq6QUtotK37L0ChgMCpncR
> > > RH1bjKjIUqHjMj8vHCeH&s=MBVLtPJJyplOC4i8e5ZupFYAXaiICGuK6qsIzxnCEP4
> > > &e=
> > > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
>


