[R] Model Comparison for case control studies in R

Thu Jun 16 06:41:49 CEST 2022

Hi Hana,

ROC (or AUC) is misleading and should not be used to assess model
performance. For details, please see the references in "Spatial Predictive
Modeling with R", which also provides some methods (e.g., gbm, rf, svm and
glmnet) for 1/0 data along with accuracy-based variable selection and
parameter optimisation.
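
As a rough illustration of an accuracy-based comparison, here is a minimal sketch that uses the sample data from the original post and leave-one-out cross-validation. It is not the procedure from the book. The helper names (loocv_accuracy, predict_logit, predict_lda), the 0.5 classification threshold, and the use of MASS::lda as a second method are all assumptions made for illustration. With only 20 observations the estimates will be very noisy.

```r
# Sample data from the original post (1 = case, 0 = control)
y <- c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
age <- c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
smoking <- c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
hypertension <- c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
dat <- data.frame(y, age, smoking, hypertension)

# Leave-one-out cross-validation (hypothetical helper):
# refit on n - 1 rows, predict the held-out row, report the share correct
loocv_accuracy <- function(dat, predict_one) {
  pred <- vapply(seq_len(nrow(dat)), function(i) {
    predict_one(train = dat[-i, ], test = dat[i, , drop = FALSE])
  }, numeric(1))
  mean(pred == dat$y)
}

# Method 1: unconditional logistic regression, classified at an assumed 0.5 cut-off
predict_logit <- function(train, test) {
  fit <- glm(y ~ age + smoking + hypertension, data = train, family = binomial)
  as.numeric(predict(fit, newdata = test, type = "response") > 0.5)
}

# Method 2: linear discriminant analysis, a stand-in not named in this thread
predict_lda <- function(train, test) {
  fit <- MASS::lda(y ~ age + smoking + hypertension, data = train)
  as.numeric(as.character(predict(fit, newdata = test)$class))
}

acc <- c(logistic = loocv_accuracy(dat, predict_logit),
         lda      = loocv_accuracy(dat, predict_lda))
print(acc)
```

Any other learner can be compared in the same way by writing one more function that takes train and test and returns a 0/1 prediction.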

Hope this helps,
Jin

On Thu, Jun 16, 2022 at 6:53 AM Hana Tezera <hanatezera using gmail.com> wrote:

> Dear Tim, thanks a lot. I am looking for different methods. For each
> method, I want to select the best predictors and report some
> measures of accuracy. I will then compare the performance of the
> models by plotting their ROC curves.
> Best,
> Hana
>
> On 6/15/22, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> > The uncorrelated nature of smoking and hypertension is a major medical
> > breakthrough and in contrast to reports like this:
> > https://pubmed.ncbi.nlm.nih.gov/20550499/ and the literature indicates
> > the possibility of a relationship between age and hypertension
> > https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4768730/. Depending on the
> > country, there might be a relationship between smoking and age as
> > government programs against smoking are developed.
> >
> > Are you looking at different models or different methods? I could have
> > y = x + w + z as one model and y = x + z as another model. Alternatively,
> > I could be comparing ordinary least squares regression versus maximum
> > likelihood versus Bayesian linear regression versus nonlinear regression.
> > The former might use something like the Akaike information criterion. I
> > am not sure the latter is useful (or possible). For example, I could
> > approximate an exponential function using a polynomial, but in this
> > context I see no benefit in doing so even if I could compare the models.
> >
> > I do not quite understand why this is being done. It feels like fishing
> > for statistical methods to get the answer that I know is correct.
> > Generally, one should understand the system well enough to select an
> > appropriate model rather than try every possible model in the hope
> > something fits. Of course, one sometimes collects extra data in the hope
> > of not missing an important feature. Then forwards/backwards/stepwise
> > methods are used to identify the "best" model, but this is looking at
> > similar models that differ only in the list of independent variables.
> >
> > However the problem is solved, I would start by trying to determine if
> > any one model was appropriate. Are the model assumptions satisfied? If the
> > answer is no, then try another model until you find one that does satisfy
> > the model assumptions. Alternatively, start with an understanding of the
> > biology and use the best model. Comparing a biologically meaningless
> > statistical model to a biologically meaningful one is an easy choice.
> >
> > Tim
> >
> > -----Original Message-----
> > From: anteneh asmare <hanatezera using gmail.com>
> > Sent: Wednesday, June 15, 2022 1:10 PM
> > To: Ebert,Timothy Aaron <tebert using ufl.edu>
> > Cc: r-help using r-project.org
> > Subject: Re: [R] Model Comparision for case control studies in R
> >
> > Dear Tim, Thanks. The first vector,
> > y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1), is the disease status
> > (1=Case, 0=Control). The covariates age, smoking status and hypertension
> > are independent (uncorrelated). Unconditional logistic regression will be
> > used, but I need to compare other models with logistic regression rather
> > than only fitting logistic regression directly.
> > There is no matching in the data, so conditional logistic regression
> > cannot be used.
> > Best,
> > Hana
> > On 6/15/22, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> >> Disease status is missing from the sample data.
> >> Are age, disease, smoking, and/or hypertension correlated in any way
> >> or are they independent (correlation=0)?
> >> Are the correlations large enough to adversely influence your model?
> >> Tim
> >>
> >> -----Original Message-----
> >> From: R-help <r-help-bounces using r-project.org> On Behalf Of anteneh
> >> asmare
> >> Sent: Wednesday, June 15, 2022 7:29 AM
> >> To: r-help using r-project.org
> >> Subject: [R] Model Comparision for case control studies in R
> >>
> >> y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
> >> age<-c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
> >> smoking<-c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
> >> hypertension<-c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
> >> data<-data.frame(y,age,smoking,hypertension)
> >> data
> >> model<-glm(y~age+factor(smoking)+factor(hypertension), data = data,
> >>            family = binomial(link = "logit"), na.action = na.omit)
> >> summary(model)
> >> From the sample data above, I want to analyse a case-control study of
> >> male individuals. The response variable y is disease status (1=Case,
> >> 0=Control), with covariates age, smoking status (1=Yes, 0=No) and
> >> hypertension (1=Yes, 0=No). My objective is to fit models that predict
> >> the disease status using at least two different methods, and then to
> >> compare the models. I think logistic regression will be the best fit
> >> for this case-control study. Do we have other options in addition to
> >> logistic regression?
> >> Kind regards,
> >> Hana
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>

-- 
Jin
------------------------------------------
Jin Li, PhD
Founder, Data2action, Australia
https://www.researchgate.net/profile/Jin_Li32
https://scholar.google.com/citations?user=Jeot53EAAAAJ&hl=en
