[R] How many samples ACTUALLY used in regression?

Mon Mar 18 16:18:16 CET 2013

On 18 Mar 2013, at 15:07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

> On 18/03/2013 14:51, Cade, Brian wrote:
>> Perhaps a crude but reliable way is to check the number of residuals, e.g.,
>> length(my.model$resid).
> 
> Not very reliable (what about zero weights, for example?), and the component is usually 'residuals'.
> 
> No one has so far mentioned nobs(), which seems to me to be the closest.

Given a my.data in which 3 out of 100 rows will be discarded due to NAs:

test = lm(formula = y ~ x + w, data = my.data, model = TRUE)
nobs(test)
[1] 97 # as expected
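
For anyone who wants to reproduce the lm() case above, here is a minimal sketch. The simulated my.data, the seed and the positions of the NAs are assumptions made up for illustration, not my real data.

```r
# Simulated stand-in for my.data (hypothetical values, for illustration only)
set.seed(1)
my.data <- data.frame(y = rnorm(100), x = rnorm(100), w = rnorm(100))
my.data$x[c(5, 50)] <- NA   # two rows with a missing x
my.data$w[75] <- NA         # one row with a missing w

test <- lm(y ~ x + w, data = my.data, model = TRUE)

nobs(test)                # rows actually used in the fit
nrow(model.frame(test))   # rows of the model frame retained in the fit
length(na.action(test))   # rows dropped by the default na.action (na.omit)
```

With the default na.action the first two calls should agree (97 here) and the last one should give the 3 dropped rows.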

But if I substitute 1 NA in one of the rows of the housing data:

library(MASS) # for polr() and the housing data
house.plr = polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq)
nobs(house.plr)
[1] 1661

because of the weights (which would not be taken into account in a glm() fit).

Since I only care about the raw number of observations, is there a (hopefully) trivial way of getting nobs(polr.fit) to behave like nobs(vlm.fit)?
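
To make clear what I mean by the raw number, here is a minimal sketch of the kind of thing I could hack together. It assumes the fit either kept its model frame or can rebuild it through model.frame(). The name n_rows_used is made up for illustration, not an existing function, and the example uses the unmodified housing data.

```r
library(MASS)  # polr() and the housing data

# Hypothetical helper: count the rows of the model frame, ignoring weights.
# Rows with zero weight (if any) would still be counted.
n_rows_used <- function(fit) nrow(model.frame(fit))

house.plr <- polr(Sat ~ Infl + Type + Cont, data = housing, weights = Freq)

nobs(house.plr)         # what nobs() reports for the polr fit
sum(housing$Freq)       # total of the frequency weights, for comparison
n_rows_used(house.plr)  # rows of the model frame
nrow(housing)           # rows in the data frame passed in
```

This only helps for model classes that have a working model.frame() method, so I doubt it is as general as I would like.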

BW

Federico

> 
>> Brian
>> 
>> Brian S. Cade, PhD
>> 
>> U. S. Geological Survey
>> Fort Collins Science Center
>> 2150 Centre Ave., Bldg. C
>> Fort Collins, CO  80526-8818
>> 
>> email:  cadeb at usgs.gov <brian_cade at usgs.gov>
>> tel:  970 226-9326
>> 
>> 
>> 
>> On Mon, Mar 18, 2013 at 8:39 AM, Marc Schwartz <marc_schwartz at me.com> wrote:
>> 
>>> 
>>> On Mar 18, 2013, at 7:36 AM, Federico Calboli <f.calboli at imperial.ac.uk>
>>> wrote:
>>> 
>>>> Dear All,
>>>> 
>>>> is there a simple way that covers all regression models to extract the
>>>> number of samples from a data frame/matrix actually used in a regression
>>>> model?
>>>> 
>>>> For instance I might have a data set of 100 rows and 4 columns (1 response +
>>>> 3 explanatory variables).  If 3 samples have one or more NAs in the
>>>> explanatory variable columns these samples will be dropped in any model:
>>>> 
>>>> my.model = lm(y ~ x + w + z, my.data)
>>>> my.model = glm(y ~ x + w + z, my.data, family = binomial)
>>>> my.model = polr(y ~ x + w + z, my.data)
>>>> …
>>>> 
>>>> I don't seem to be able to find one single method that works in the
>>>> exact same way -- irrespective of the model type -- to interrogate my.model
>>>> to see how many samples of my.data were actually used.  Is there such a
>>>> function or do I need to hack something together?
>>>> 
>>>> Best wishes
>>>> 
>>>> Federico
>>> 
>>> 
>>> I don't know that this would be universal to all possible R model
>>> implementations, but should work for those that at least abide by "certain
>>> standards"[1] relative to the internal use of ?model.frame.
>>> 
>>> In the case where model functions use 'model = TRUE' as the default in
>>> their call (e.g. lm(), glm() and MASS::polr()), the returned model object
>>> will have a component called 'model', such that:
>>> 
>>>   nrow(my.model$model)
>>> 
>>> returns the number of rows in the internally created data frame.
>>> 
>>> Note that 'model = TRUE' is not the default for many functions, for
>>> example Terry's coxph() in survival or Frank's lrm() in rms.
>>> 
>>> Note also that the value of 'na.action' in the modeling function call may
>>> have a potential effect on whether the number of rows in the retained
>>> 'model' data frame is really the correct value.
>>> 
>>> You can also use model.frame(), independently matching arguments passed to
>>> the model function, to replicate what takes place internally in many
>>> modeling functions. The result of model.frame() will be a data frame,
>>> again, subject to similar limitations as above.
>>> 
>>> Regards,
>>> 
>>> Marc Schwartz
>>> 
>>> [1]: http://developer.r-project.org/model-fitting-functions.txt
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> 
> 
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595