[Rd] Discourage the weights= option of lm with summarized data

Sun Dec 3 18:19:57 CET 2017

> On 3 Dec 2017, at 16:31 , Arie ten Cate <arietencate at gmail.com> wrote:
> 
> Peter,
> 
> This is a highly structured text. Just for the discussion, I separate
> the building blocks, where (D) and (E) and (F) are new:
> 
> BEGIN OF TEXT --------------------
> 
> (A)
> 
> Non-‘NULL’ ‘weights’ can be used to indicate that different
> observations have different variances (with the values in ‘weights’
> being inversely proportional to the variances);
> 
> (B)
> 
> or equivalently, when the elements of ‘weights’ are positive integers
> w_i, that each response y_i is the mean of w_i unit-weight
> observations
> 
> (C)
> 
> (including the case that there are w_i observations equal to y_i and
> the data have been summarized).
> 
> (D)
> 
> However, in the latter case, notice that within-group variation is not
> used. Therefore, the sigma estimate and residual degrees of freedom
> may be suboptimal;
> 
> (E)
> 
> in the case of replication weights, even wrong.
> 
> (F)
> 
> Hence, standard errors and analysis of variance tables should be
> treated with care.
> 
> END OF TEXT --------------------
> 
> I don't understand (D), partly because it is unclear to me whether (D)
> refers to (C) or to (B)+(C):

B, including C, is "the latter case". 

>    If (D) refers only to (C), as the reader might automatically think
> with the repetition of the word "case", then it is unclear to me to
> what block does (E) refer.

Not so. If it did, it should go inside the parentheses.

>    If, on the other hand, (D) refers to (B)+(C) then (E) probably
> refers to (C) and then I suggest to make this more clear by replacing
> "in the case of replication weights" in (E) by "in the case of
> summarized data".
> 

That would be wrong. Data can be summarized by group means (and SDs, which are unused, hence the suboptimality), _including_ the case where all elements within a group are identical. 
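
To make the distinction concrete, here is a minimal sketch in R. The numbers are made up for illustration and appear nowhere in this thread. It fits the same straight line three ways: to raw grouped data, to the group means with the group sizes as weights, and to "replicated" data in which every observation in a group equals the group mean. The coefficients agree in all three fits. The summarized fit has far fewer residual degrees of freedom than the raw-data fit, and in the replication case it divides the same residual sum of squares by a different number of degrees of freedom.

```r
# Hypothetical data: three groups of sizes 2, 3 and 4 (values invented).
g <- rep(c(1, 2, 3), times = c(2, 3, 4))
y <- c(1.0, 1.4, 2.1, 1.9, 2.6, 3.2, 2.8, 3.5, 3.1)

# (1) Fit to the raw data: within-group variation contributes to sigma.
fit_full <- lm(y ~ g)

# (2) Summarize to group means; use the group sizes as weights.
ybar <- as.vector(tapply(y, g, mean))
x    <- sort(unique(g))
w    <- as.vector(table(g))
fit_sum <- lm(ybar ~ x, weights = w)

# (3) Replication case: every observation equals its group mean.
y_rep <- rep(ybar, times = w)
x_rep <- rep(x, times = w)
fit_rep <- lm(y_rep ~ x_rep)

# Point estimates coincide across all three fits.
print(rbind(full = coef(fit_full), summarized = coef(fit_sum),
            replicated = coef(fit_rep)))

# Residual degrees of freedom differ: 7 for the raw and replicated
# fits (9 observations), 1 for the summarized fit (3 group means).
print(c(full = df.residual(fit_full), summarized = df.residual(fit_sum),
        replicated = df.residual(fit_rep)))

# In the replication case the (weighted) residual sum of squares is the
# same, so only the divisor changes: sigma from the summarized fit is
# larger than sigma from the replicated fit.
print(c(rss_summarized = deviance(fit_sum), rss_replicated = deviance(fit_rep)))
print(c(sigma_summarized = summary(fit_sum)$sigma,
        sigma_replicated = summary(fit_rep)$sigma))
```

With group means of genuinely varying data, fit (2) gives a valid sigma estimate, but on 1 degree of freedom instead of 7. With replication weights, fit (2) is not the fit one would get from the expanded data (3), which is the "even wrong" case.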

> I suggest to change "even wrong" in (E) into the more down-to-earth "wrong".

That would seem to be a matter of taste. 

However, "equivalently" in (B) does not look right.

> 
> (For the record: I prefer something like my original explanation of
> the problem with (C), instead of (D)+(E)+(F):
>    "With summarized data the standard errors get smaller with
> increasing numbers of observations w_i. However, when for instance all
> w_i are multiplied with the same constant larger than one, the
> reported standard errors do not get smaller since the w_i are defined
> apart from an arbitrary positive multiplicative constant. Hence the
> reported standard errors tend to be too large and the reported t
> values and the reported number of significance stars too small.
> Obviously, also the reported number of observations and the reported
> number of degrees of freedom are too small."
>    Note that with heteroskedasticity, _the_ residual standard error
> has no meaning.)
> 
> Finally, about the original text: (B) and (C) mention only y_i, not
> x_i, while this is about entire observations. Maybe this can be remedied
> also?
> 
>  Arie
> 
> On Tue, Nov 28, 2017 at 1:01 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>> My local R-devel version now has (in ?lm)
>> 
>>     Non-‘NULL’ ‘weights’ can be used to indicate that different
>>     observations have different variances (with the values in
>>     ‘weights’ being inversely proportional to the variances); or
>>     equivalently, when the elements of ‘weights’ are positive integers
>>     w_i, that each response y_i is the mean of w_i unit-weight
>>     observations (including the case that there are w_i observations
>>     equal to y_i and the data have been summarized). However, in the
>>     latter case, notice that within-group variation is not used.
>>     Therefore, the sigma estimate and residual degrees of freedom may
>>     be suboptimal; in the case of replication weights, even wrong.
>>     Hence, standard errors and analysis of variance tables should be
>>     treated with care.
>> 
>> OK?
>> 
>> 
>> -pd
>> 
>> 
>>> On 12 Oct 2017, at 13:48 , Arie ten Cate <arietencate at gmail.com> wrote:
>>> 
>>> OK. We have now three suggestions to repair the text:
>>> - remove the text
>>> - add "not" at the beginning of the text
>>> - add at the end of the text a warning; something like:
>>> 
>>> "Note that in this case the standard errors of the parameter estimates
>>> are in general not correct, and hence also the t values and the p values.
>>> Also the number of degrees of freedom is not correct. (The parameter
>>> values are correct.)"
>>> 
>>> A remark about the glm example: the Reference manual says: "For a
>>> binomial GLM prior weights are used to give the number of trials when
>>> the response is the proportion of successes ....".  Hence in the
>>> binomial case the weights are frequencies.
>>> With y <- 0.51 and w <- 100 you get the same result.
>>> 
>>>  Arie
>>> 
>>> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>>>> AFAIR, it is a little more subtle than that.
>>>> 
>>>> If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
>>>> 
>>>> It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like
>>>> 
>>>> y <- c(0,1)
>>>> w <- c(49,51)
>>>> glm(y~1, weights=w, family=binomial)
>>>> 
>>>> -pd
>>>> 
>>>>> On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate at gmail.com> wrote:
>>>>> 
>>>>> Yes.  Thank you; I should have quoted it.
>>>>> I suggest to remove this text or to add the word "not" at the beginning.
>>>>> 
>>>>> Arie
>>>>> 
>>>>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>>>>> <wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
>>>>>> Ah, I think you are referring to this part from ?lm:
>>>>>> 
>>>>>> "(including the case that there are w_i observations equal to y_i and the data have been summarized)"
>>>>>> 
>>>>>> I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.
>>>>>> 
>>>>>> Best,
>>>>>> Wolfgang
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>>>> Sent: Sunday, 08 October, 2017 14:55
>>>>>> To: r-devel at r-project.org
>>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>>> 
>>>>>> Indeed: Using 'weights' is not meant to indicate that the same
>>>>>> observation is repeated 'n' times.  As I showed, this gives erroneous
>>>>>> results. Hence I suggested that it is discouraged rather than
>>>>>> encouraged in the Details section of lm in the Reference manual.
>>>>>> 
>>>>>> Arie
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> On Sat, 7 Oct 2017, wolfgang.viechtbauer at maastrichtuniversity.nl wrote:
>>>>>> 
>>>>>> Using 'weights' is not meant to indicate that the same observation is
>>>>>> repeated 'n' times. It is meant to indicate different variances (or to
>>>>>> be precise, that the variance of the last observation in 'x' is
>>>>>> sigma^2 / n, while the first three observations have variance
>>>>>> sigma^2).
>>>>>> 
>>>>>> Best,
>>>>>> Wolfgang
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>>>> Sent: Saturday, 07 October, 2017 9:36
>>>>>> To: r-devel at r-project.org
>>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>>> 
>>>>>> In the Details section of lm (linear models) in the Reference manual,
>>>>>> it is suggested to use the weights= option for summarized data. This
>>>>>> must be discouraged rather than encouraged. The motivation for this is
>>>>>> as follows.
>>>>>> 
>>>>>> With summarized data the standard errors get smaller with increasing
>>>>>> numbers of observations. However, the standard errors in lm do not get
>>>>>> smaller when for instance all weights are multiplied with the same
>>>>>> constant larger than one, since the inverse weights are merely
>>>>>> proportional to the error variances.
>>>>>> 
>>>>>> Here is an example of the estimated standard errors being too large
>>>>>> with the weights= option. The p value and the number of degrees of
>>>>>> freedom are also wrong. The parameter estimates are correct.
>>>>>> 
>>>>>> n <- 10
>>>>>> x <- c(1,2,3,4)
>>>>>> y <- c(1,2,5,4)
>>>>>> w <- c(1,1,1,n)
>>>>>> xb <- c(x,rep(x[4],n-1))  # restore the original data
>>>>>> yb <- c(y,rep(y[4],n-1))
>>>>>> print(summary(lm(yb ~ xb)))
>>>>>> print(summary(lm(y ~ x, weights=w)))
>>>>>> 
>>>>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>>>>> FREQ statement (for summarized data).
>>>>>> 
>>>>>>  Arie
>>>>>> 
>>>>>> ______________________________________________
>>>>>> R-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com