[Rd] Discourage the weights= option of lm with summarized data

Tue Nov 28 09:29:29 CET 2017

Since the three posters agree (only) that there is a bug, I propose to
file it as a bug, which is the least we can do now.

There is more to it: the only other case of a change in the Reference
Manual which I know of, is also about the weights option! This is in
coxph. The Reference Manual version 3.0.0 (2013) says about coxph:

   " ... If weights is a vector of integers, then the estimated
coefficients are equivalent to estimating the model from data with the
individual cases replicated as many times as indicated by weights."

This is not true, as can be seen from the following code, which uses
the data from the first example in the Reference Manual of coxph:

   library(survival)
   print(df1 <- as.data.frame(list(
     time=c(4,3,1,1,2,2,3),
   status=c(1,1,1,0,1,1,0),
        x=c(0,2,1,1,1,0,0),
      sex=c(0,0,0,0,1,1,1)
   )))
   print(w <- rep(2,7))
   print(coxph(Surv(time,status) ~ x + strata(sex),data=df1,weights=w))
      # manually doubling the data:
   print(df2 <- rbind(df1,df1))
   print(coxph(Surv(time,status) ~ x + strata(sex), data=df2))

This should not come as a surprise, since with coxph the computation
of the likelihood (given the parameters) for a single observation uses
also the other observations.

This bug has been repaired. The present Reference Manual of coxph says
that the weights option specifies a vector of case weights, to which
is added only: "For a thorough discussion of these see the book by
Therneau and Grambsch."

Let us repair the other bug also.

    Arie

On Thu, Oct 12, 2017 at 1:48 PM, Arie ten Cate <arietencate at gmail.com> wrote:
>  OK. We have now three suggestions to repair the text:
>  - remove the text
>  - add "not" at the beginning of the text
>  - add at the end of the text a warning; something like:
>
>   "Note that in this case the standard estimates of the parameters are
> in general not correct, and hence also the t values and the p value.
> Also the number of degrees of freedom is not correct. (The parameter
> values are correct.)"
>
> A remark about the glm example: the Reference manual says: "For a
> binomial GLM prior weights are used to give the number of trials when
> the response is the proportion of successes ....".  Hence in the
> binomial case the weights are frequencies.
> With y <- 0.51 and w <- 100 you get the same result.
>
>    Arie
>
> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>> AFAIR, it is a little more subtle than that.
>>
>> If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
>>
>> It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like
>>
>> y <- c(0,1)
>> w <- c(49,51)
>> glm(y~1, weights=w, family=binomial)
>>
>> -pd
>>
>>> On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate at gmail.com> wrote:
>>>
>>> Yes.  Thank you; I should have quoted it.
>>> I suggest to remove this text or to add the word "not" at the beginning.
>>>
>>>   Arie
>>>
>>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>>> <wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
>>>> Ah, I think you are referring to this part from ?lm:
>>>>
>>>> "(including the case that there are w_i observations equal to y_i and the data have been summarized)"
>>>>
>>>> I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.
>>>>
>>>> Best,
>>>> Wolfgang
>>>>
>>>> -----Original Message-----
>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>> Sent: Sunday, 08 October, 2017 14:55
>>>> To: r-devel at r-project.org
>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>
>>>> Indeed: Using 'weights' is not meant to indicate that the same
>>>> observation is repeated 'n' times.  As I showed, this gives erroneous
>>>> results. Hence I suggested that it is discouraged rather than
>>>> encouraged in the Details section of lm in the Reference manual.
>>>>
>>>>   Arie
>>>>
>>>> ---Original Message-----
>>>> On Sat, 7 Oct 2017, wolfgang.viechtbauer at maastrichtuniversity.nl wrote:
>>>>
>>>> Using 'weights' is not meant to indicate that the same observation is
>>>> repeated 'n' times. It is meant to indicate different variances (or to
>>>> be precise, that the variance of the last observation in 'x' is
>>>> sigma^2 / n, while the first three observations have variance
>>>> sigma^2).
>>>>
>>>> Best,
>>>> Wolfgang
>>>>
>>>> -----Original Message-----
>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>> Sent: Saturday, 07 October, 2017 9:36
>>>> To: r-devel at r-project.org
>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>
>>>> In the Details section of lm (linear models) in the Reference manual,
>>>> it is suggested to use the weights= option for summarized data. This
>>>> must be discouraged rather than encouraged. The motivation for this is
>>>> as follows.
>>>>
>>>> With summarized data the standard errors get smaller with increasing
>>>> numbers of observations. However, the standard errors in lm do not get
>>>> smaller when for instance all weights are multiplied with the same
>>>> constant larger than one, since the inverse weights are merely
>>>> proportional to the error variances.
>>>>
>>>> Here is an example of the estimated standard errors being too large
>>>> with the weights= option. The p value and the number of degrees of
>>>> freedom are also wrong. The parameter estimates are correct.
>>>>
>>>>  n <- 10
>>>>  x <- c(1,2,3,4)
>>>>  y <- c(1,2,5,4)
>>>>  w <- c(1,1,1,n)
>>>>  xb <- c(x,rep(x[4],n-1))  # restore the original data
>>>>  yb <- c(y,rep(y[4],n-1))
>>>>  print(summary(lm(yb ~ xb)))
>>>>  print(summary(lm(y ~ x, weights=w)))
>>>>
>>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>>> FREQ statement (for summarized data).
>>>>
>>>>    Arie
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Office: A 4.23
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>
>>