[R] logistic regression with 50 variables
Charles C. Berry
cberry at tajo.ucsd.edu
Tue Jun 15 00:37:27 CEST 2010
On Mon, 14 Jun 2010, array chip wrote:
> Thanks Charles for the reproducible code. I started this question
> because I was asked to take a look at such a dataset, but I have doubts
> about whether it's meaningful to do a LR with 50 variables. I haven't got
> the dataset yet, thus have not tried any code. But thanks again for
> sharing some simulation code.
>
> I have one question about your code. When you calculate the common
> standard error of the coefficients using sd(coef(fit)), should you
> exclude the intercept by doing sd(coef(fit)[-1])? Actually, after
> removing the intercept, the standard error calculated this way is very
> similar to the one reported by colMeans(coef(summary(fit))), for both
> scenarios in your example (coef = 0.14 or 1.0).
Of course! I slipped up on that one!
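For the record, the corrected lines (using 'fit' from the simulation quoted
below) would be something like:

sd(coef(fit)[-1])                    # drop the intercept before taking the SD
colMeans(coef(summary(fit))[-1, ])   # and drop the intercept row here as well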
>
> Another question: the 50 simulated variables have very low
> correlations (ranging from -0.1 to 0.08), which may contribute to the
> stable model. If some (not all) of the 50 variables had considerable
> correlation, like 0.7 or 0.8, how would the LR model behave?
>
Why not try it out and see?
The mvtnorm package has a function for generating correlated random
variates.
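For instance, a minimal sketch of how one might redo the simulation with
correlated covariates (assuming the mvtnorm package is installed; the choice
of a 0.75 correlation among the first 10 covariates is arbitrary, purely for
illustration):

library(mvtnorm)
set.seed(54321)
## exchangeable correlation of 0.75 among the first 10 of the 50 covariates
Sigma <- diag(50)
Sigma[1:10, 1:10] <- 0.75
diag(Sigma) <- 1
dat <- as.data.frame(rmvnorm(1800, sigma = Sigma))
dat$y <- rbinom(1800, 1, plogis(rowSums(dat)/7))
fit <- glm(y ~ ., dat, family = binomial)
sd(coef(fit)[-1])                    # spread of the estimates, intercept excluded
colMeans(coef(summary(fit))[-1, ])   # average asymptotic std err, etc.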
HTH,
Chuck
> Thanks
>
> John
>
>
>
> ----- Original Message ----
> From: Charles C. Berry <cberry at tajo.ucsd.edu>
> To: Joris Meys <jorismeys at gmail.com>
> Cc: r-help at r-project.org; Marc Schwartz <marc_schwartz at me.com>
> Sent: Mon, June 14, 2010 8:32:02 AM
> Subject: Re: [R] logistic regression with 50 variables
>
> On Mon, 14 Jun 2010, Joris Meys wrote:
>
>> Hi,
>>
>> Marc's explanation is valid to a certain extent, but I don't agree with
>> his conclusion. I'd like to point out "the curse of
>> dimensionality" (the Hughes effect), which starts to play a role rather quickly.
>>
>
> Ahem!
>
> ... minimal, self-contained, reproducible code ...
>
> The OP's situation may not be an impossible one:
>
>> set.seed(54321)
>> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
>> dat$y <- rbinom(1800,1,plogis(rowSums(dat)/7))
>> fit <- glm(y~., dat, family=binomial)
>> 1/7 # the true coef
> [1] 0.1428571
>> sd(coef(fit)) # roughly, the common standard error
> [1] 0.05605597
>> colMeans(coef(summary(fit))) # what glm() got
> Estimate Std. Error z value Pr(>|z|)
> 0.14590836 0.05380666 2.71067328 0.06354820
>> # a trickier case:
>> set.seed(54321)
>> dat <- as.data.frame(matrix(rnorm(1800*50),nc=50))
>> dat$y <- rbinom(1800,1,plogis(rowSums(dat))) # try again with coef==1
>> fit <- glm(y~., dat, family=binomial)
>> colMeans(coef(summary(fit)))
> Estimate Std. Error z value Pr(>|z|)
> 0.982944012 0.119063631 8.222138491 0.008458002
>>
>
> Finer examination of the latter fit will show some coefficient estimates that
> differ too far from 1.0 to agree with the asymptotic std err.
>
>> sd(coef(fit)) # rather bigger than 0.119
> [1] 0.1827462
>> range(coef(fit))
> [1] -0.08128586 1.25797057
>
> And near-separability may be playing a role here:
>
>> cbind(
> + table(
> + cut(
> + plogis(abs(predict(fit))),
> + c( 0, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1 ) ) ) )
> [,1]
> (0,0.9] 453
> (0.9,0.99] 427
> (0.99,0.999] 313
> (0.999,0.9999] 251
> (0.9999,0.99999] 173
> (0.99999,1] 183
>>
>
> Recall that the observed information contains a factor of
>
> plogis( predict(fit) )* plogis( -predict(fit))
>
>> hist(plogis( predict(fit) )* plogis( -predict(fit)))
>
> So the effective sample size here was much reduced.
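> One rough way to quantify that (a sketch): sum those per-observation weights
> and compare with the nominal sample size.
>
> w <- plogis(predict(fit)) * plogis(-predict(fit))  # the weights entering the information matrix
> sum(w)       # "effective" sample size, roughly
> nrow(dat)    # nominal n = 1800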
>
> But to the OP's question, whether what you get is reasonable depends on
> what the setup is.
>
> I wouldn't call the first of the above cases 'highly unstable'.
>
> Which is not to say that one cannot generate difficult cases (esp. with
> correlated covariates and/or one or more highly influential covariates),
> nor that the OP's case is not one of them.
>
>
> HTH,
>
> Chuck
>
>> The curse of dimensionality is easily demonstrated by looking at the
>> proximity between your data points. Say we scale the interval in one
>> dimension to be 1 unit. If you have 20 evenly-spaced observations, the
>> distance between the observations is 0.05 units. To have a proximity
>> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
>> a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
>> distance between your observations is important, as a sparse dataset
>> will definitely make your model misbehave.
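>> (A quick check of that arithmetic in R, for what it's worth:)
>>
>> 20^c(1, 2, 10)   # observations needed for the same spacing in 1, 2 and 10 dimensions: 20, 400, ~1e13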
>>
>> Even with about 35 samples per variable, using 50 independent
>> variables will render a highly unstable model, as your data space is
>> about as sparse as it can get. On top of that, interpreting a model
>> with 50 variables is close to impossible, and that's before we even
>> start on interactions. No point in trying, I'd say. If you really need
>> all that information, you might want to take a look at some
>> dimension-reduction methods first.
>>
>> Cheers
>> Joris
>>
>> On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
>>> On Jun 13, 2010, at 10:20 PM, array chip wrote:
>>>
>>>> Hi, this is not an R technical question per se. I know there are many excellent statisticians on this list, so here are my questions: I have a dataset with ~1800 observations and 50 independent variables, so there are about 35 samples per variable. Is it wise to build a stable multiple logistic model with 50 independent variables? Any problems with this approach? Thanks
>>>>
>>>> John
>>>
>>>
>>> The general rule of thumb is to have 10-20 'events' per covariate degree of freedom. Frank has suggested that in some cases that number should be as high as 25.
>>>
>>> The number of events is the count of the less frequent of the two possible outcomes for your binary dependent variable.
>>>
>>> Covariate degrees of freedom refers to the number of columns in the model matrix. Continuous variables are 1, binary factors are 1, K-level factors are K - 1.
>>>
>>> So if out of your 1800 records, you have at least 500 to 1000 events, depending upon how many of your 50 variables are K-level factors and whether or not you need to consider interactions, you may be OK. Better if towards the high end of that range, especially if the model is for prediction versus explanation.
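>>> For example, a sketch of that check in R ('y' here stands in for the binary
>>> outcome in the as-yet-unseen dataset, and 50 covariate df assumes all
>>> continuous or binary predictors):
>>>
>>> events <- min(table(dat$y))   # count of the rarer outcome
>>> events / 50                   # events per covariate degree of freedom; 10-20+ is the rule of thumb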
>>>
>>> Two excellent references would be Frank's book:
>>>
>>> http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/
>>>
>>> and Steyerberg's book:
>>>
>>> http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/
>>>
>>> to assist in providing guidance for model building/validation techniques.
>>>
>>> HTH,
>>>
>>> Marc Schwartz
>>>
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>
>
> Charles C. Berry (858) 534-2098
> Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901