[R] Problem with a regression - Dataset Workinghours

peter dalgaard pdalgd at gmail.com
Sun Jul 29 10:18:26 CEST 2012


On Jul 28, 2012, at 17:37 , Giorgio Monti wrote:

> I'm a student. I'm working on a research project using the statistical program "R
> 2.15.1".
> Here's my problem: how can I do a regression considering only values over a
> certain limit?
> For example, considering the dataset "Workinghours" of the "Ecdat" package,
> is it possible to build a predictive model that expresses the probability that a
> wife works more than 8 hours per day?
> The dataset includes 3382 observations on the number of hours spent working
> by wives per year in the USA.
> 
> hoursday=hours/240
> index<-which(hoursday>=8)
> hoursday[index]
> 
> As you see, I'm able to extract the values in 'hoursday' (which is
> hours/240 working days in one year) that are > 8.0, but obviously I can't do a
> regression because the extracted data are a subset of the entire dataset (955
> observations), while the other variables, like age, occupation, income,
> etc. are still complete (3382).
> 
> So I can't do:
> lm = lm(hoursday[index] ~
> income+age+education+unemp+child5+child13+child17+nonwhite+owned+mortgage+occupation)
> In fact "R" gives me: Error in model.frame.default(formula =
> hoursday[index] ~ income, drop.unused.levels = TRUE) : variable lengths
> differ (found for 'income').
> 
> Can you help me?
> 

Yes: don't do that. You are not going to "build a predictive model that expresses the probability that a wife works more than 8 hours per day" from data where everyone works more than 8 hours a day! The event of interest then has no variation left to model.

You can either fit the linear model to all the data and work out the probability of exceeding 8 hours from the fitted mean and the residual standard deviation, or, if you don't quite believe the normality assumption of linear models, reduce the outcome to 0/1 (more than 8 hours or not) and turn to logit or probit regression.
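
Both routes can be sketched roughly as follows. This is only an illustration under stated assumptions: to stay self-contained it uses a simulated stand-in data frame `wh` with made-up `income`, `age` and `hours` columns rather than the real Ecdat data, and the choice of predictors is arbitrary. With Ecdat installed, data(Workinghours, package = "Ecdat") should supply the real data frame instead.

```r
# Simulated stand-in for the real data (an assumption, not the Ecdat values)
set.seed(1)
n  <- 500
wh <- data.frame(income = rnorm(n, 300, 80),
                 age    = round(runif(n, 20, 60)))
wh$hours    <- pmax(0, 1500 - 1.5 * wh$income + 5 * wh$age + rnorm(n, sd = 600))
wh$hoursday <- wh$hours / 240

# Route 1: linear model on ALL rows, then P(hoursday > 8) under normal errors
fit_lm <- lm(hoursday ~ income + age, data = wh)
mu     <- fitted(fit_lm)            # fitted mean for each woman
s      <- summary(fit_lm)$sigma     # residual standard deviation
p_lm   <- pnorm(8, mean = mu, sd = s, lower.tail = FALSE)

# Route 2: reduce the outcome to 0/1 and fit a logit model on ALL rows
wh$long   <- as.numeric(wh$hoursday >= 8)
fit_logit <- glm(long ~ income + age, family = binomial, data = wh)
p_logit   <- predict(fit_logit, type = "response")

# Compare the two sets of estimated probabilities for the first few rows
head(cbind(p_lm, p_logit))
```

For probit instead of logit, family = binomial(link = "probit") can be used in the glm() call. Note that in both routes every row of the data stays in the fit, so response and predictors always have the same length.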

It is not technically hard to fit a model to a subset of the data (lm() has a subset argument for that), but it is a big no-no to subset on the dependent variable. Well, you can, and people actually do subsample on the response variable, but then the standard methods of analysis do not apply.
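
A small simulation may make the point concrete. It is a sketch on invented data (the true slope of 2 and the cutoff y > 2 are arbitrary choices, not taken from the Workinghours data): the same model is fitted once to all rows and once to the rows selected on the response.

```r
# Invented data with a known slope of 2 (assumption for illustration only)
set.seed(42)
n <- 5000
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

# Fit to all the data: the slope estimate lands close to 2
fit_all <- lm(y ~ x, data = d)

# Fit only to rows with a large response; the subset argument keeps
# response and predictors aligned, so there is no "variable lengths differ"
fit_sub <- lm(y ~ x, data = d, subset = y > 2)

coef(fit_all)["x"]   # close to the true slope
coef(fit_sub)["x"]   # pulled well below the true slope by the selection
```

The second fit runs without any error, which is exactly the danger: the software is happy, but the coefficient no longer estimates the relationship in the full population.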


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


