[R] Unexpected scores from weighted PCA with svyprcomp()
Leonardo Ferreira Fontenelle
leonardof at leonardof.med.br
Wed May 4 02:43:36 CEST 2016
Thanks for remembering me to cc him!
Thomas Lumley is the package maintainer, and he frequently answers
questions in this list, but it is obviously hard for anyone to keep up
with so many emails.
Att,
Leonardo Ferreira Fontenelle
http://lattes.cnpq.br/9234772336296638
Em Ter 3 mai. 2016, às 21:16, Jeff Newmiller escreveu:
> Your question is a mixture of statistical and implementation (package) issues. This isn't really the right forum for "what is the statistically-correct answer" questions, and as to whether the package is correct or you are using it right would require someone familiar with that particular CONTRIBUTED package to be reading this list. (While this is probably one of the more widely used contributed packages, there are over 8000 contributed packages so far and I for one don't use it...)
>
> You could ask on stats.stackexchange.com where theory is more on topic, or you could try to get the maintainer to chime in (use the maintainer() function to find out who to cc), or you could just be patient.
> --
> Sent from my phone. Please excuse my brevity.
>
> On May 3, 2016 3:34:15 PM PDT, Leonardo Ferreira Fontenelle <leonardof em leonardof.med.br> wrote:
>> Is there something I could do to improve my chances of getting an
>> answer?
>>
>> Leonardo Ferreira Fontenelle
>> http://lattes.cnpq.br/9234772336296638
>>
>> Em Sex 29 abr. 2016, às 23:40, Leonardo Ferreira Fontenelle escreveu:
>>> Hello!
>>>
>>> I'd like to create an assets-based economic indicator using data from a
>>> national household survey. The economic indicator is to be the first
>>> principal component from a principal components analysis, which (given
>>> the source of the data) I believe should take in consideration the
>>> sampling weights of the observations. After running the PCA with
>>> svyprcomp(), from the survey package, I wanted to list the loading (with
>>> regard to the first principal component) and the scale of the variables,
>>> so
that I can tell people how to "reconstitute" the economic indicator
>>> from the variables without any knowledge of PCA. This reconstituted
>>> indicator wouldn't be centered, but that's OK because the important
>>> thing for the application is the relative position of the observations.
>>> The unexpected (at least for me) behavior was that the principal
>>> component returned by svyprcomp() was very different from from the
>>> reconstituted indicator as well as from the indicator returned by
>>> predict(). "Different" here means weak correlation and different
>>> distributions.
>>>
>>> I hope the following code illustrates what I mean:
>>>
>>> =====
>>>
>>> svycor <- function(formula, design) {
>>> # https://stat.ethz.ch/pipermail/r-help/2003-July/036645.html
>>> stopifnot(require(survey))
>>> covariance.matrix <- svyvar(formula, design)
>>> variables <-
diag(covariance.matrix)
>>> correlation.matrix <- covariance.matrix / sqrt(variables %*%
>>> t(variables))
>>> return(correlation.matrix)
>>> }
>>>
>>> library(survey)
>>> data(api)
>>> dclus2 <- svydesign(ids = ~ dnum + snum, fpc = ~ fpc1 + fpc2, data =
>>> apiclus2)
>>> pc <- svyprcomp( ~ api99 + api00, design = dclus2, scale = TRUE, scores
>>> = TRUE)
>>> dclus2$variables$pc1 <- pc$x[, "PC1"]
>>> dclus2$variables$pc2 <- predict(pc, apiclus2)[, "PC1"]
>>> mycoef <- pc$rotation[, "PC1"] / pc$scale
>>> dclus2$variables$pc3 <- with(apiclus2, api99 * mycoef["api99"] + api00 *
>>> mycoef["api00"])
>>> svycor(~ pc1 + pc2 + pc3, dclus2)[, ]
>>> # pc1 pc2 pc3
>>> # pc1 1.0000000 0.2078789 0.2078789
>>> # pc2 0.2078789 1.0000000 1.0000000
>>> # pc3 0.2078789 1.0000000 1.0000000
>>> plot(svysmooth(~ pc1, dclus2), xlim = c(-2.5, 5), ylim = 0:1)
>>> lines(svysmooth(~ pc2, dclus2), col = 2)
>>>
lines(svysmooth(~ pc3, dclus2), col = 3)
>>> legend("topright", legend = c('pc$x[, "PC1"]', 'predict(pc, apiclus2)[,
>>> "PC1"]', 'Reconstituted indicator'), col = 1:3, lty = 1)
>>>
>>> sessionInfo()
>>> # R version 3.2.4 Revised (2016-03-16 r70336)
>>> # Platform: x86_64-pc-linux-gnu (64-bit)
>>> # Running under: Arch Linux
>>> #
>>> # locale:
>>> # [1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C
>>> # [3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=pt_BR.UTF-8
>>> # [5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=pt_BR.UTF-8
>>> # [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
>>> # [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> # [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
>>> #
>>> # attached base packages:
>>> # [1] grid stats graphics utils datasets grDevices
>>> # [7] methods base
>>> #
>>> # other attached packages:
>>> # [1] KernSmooth_2.23-15
survey_3.30-3
>>> #
>>> # loaded via a namespace (and not attached):
>>> # [1] tools_3.2.4
>>>
>>> =====
>>>
>>> This lack of correlation doesn't happen if the survey design object has
>>> uniform sampling weights or if the the data is analyzed with prcomp().
>>>
>>> Why does the returned principal component is so different from the
>>> predicted and the reconstituted ones? Are predict() and my
>>> "reconstitution" missing something? Are the three methods equally valid
>>> but with different interpretations? Is there a bug in svyprcomp() ??
>>>
>>> Thanks in advance,
>>>
>>> Leonardo Ferreira Fontenelle
>>> http://lattes.cnpq.br/9234772336296638
>>>
>>>
>>> R-help em r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html[http://www.r-project.org/posting-guide.html]
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> R-help em r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html[http://www.r-project.org/posting-guide.html]
>> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list