---
title: "Generalized Cross Entropy framework"
author: "Jorge Cabral"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
link-citations: yes
bibliography: references.bib
csl: american-medical-association-brackets.csl
description: |
  Working with prior information on probabilities.
vignette: >
  %\VignetteIndexEntry{Generalized Cross Entropy framework}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction
Although the most common situation is the absence of prior information on
$\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})$, in some particular
cases pre-sample information exists in the form of a distribution
$\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})$. This
distribution can be used as an initial hypothesis to be incorporated into the
consistency relations of the maximum entropy formalism.
Kullback and Leibler [@Kullback1951] defined the cross-entropy (CE) between $\mathbf{p}$ and $\mathbf{q}$ as
\begin{align}
I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p_k} \ln \left(\mathbf{p_k}/\mathbf{q_k}\right).
\end{align}
$I(\mathbf{p},\mathbf{q})$ measures the discrepancy between the $\mathbf{p}$ and
$\mathbf{q}$ distributions. It is non-negative, and when $\mathbf{p}=\mathbf{q}$ one gets $I(\mathbf{p},\mathbf{q})=0$.
Thus, according to the principle of minimum cross-entropy [@Kullback1959;@Good1963],
one should choose the probabilities that are as close as possible to the prior
probabilities.
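As a quick numerical illustration, the definition can be computed directly in
base R (the helper `cross_entropy` below is illustrative and not part of
`GCEstim`):

```{r,echo=TRUE,eval=TRUE}
# Cross-entropy I(p, q); terms with p_k = 0 contribute zero by convention
cross_entropy <- function(p, q) {
  sum(ifelse(p > 0, p * log(p / q), 0))
}

p <- c(0.1, 0.2, 0.7)
q <- rep(1 / 3, 3)
cross_entropy(p, q) # positive: p deviates from the uniform prior q
cross_entropy(p, p) # zero: p coincides with the prior
```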

## Generalized Cross Entropy estimator
Given the above, and for the reparameterized linear regression model,
\begin{equation}
\mathbf{y}=\mathbf{XZp} + \mathbf{Vw},
\end{equation}
the Generalized Cross Entropy (GCE) estimator is given by
\begin{equation}
\hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) = \underset{\mathbf{p},\mathbf{w}}{\operatorname{argmin}}
\left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\},
\end{equation}
subject to the same model constraints as the GME estimator (see ["Generalized Maximum Entropy framework"](V2_GME_framework.html#GMEestimator)).
Using summation notation, the minimization problem can be rewritten as follows:
\begin{align}
&\text{minimize} & I(\mathbf{p,q,w,u}) &=\sum_{k=0}^{K}\sum_{m=1}^M p_{km}\ln(p_{km}/q_{km}) +\sum_{n=1}^N\sum_{j=1}^J w_{nj}\ln(w_{nj}/u_{nj}) \\
&\text{subject to} & y_n &= \sum_{k=0}^{K}\sum_{m=1}^M x_{nk}z_{km}p_{km} + \sum_{j=1}^J v_{nj}w_{nj}, \forall n \\
& & \sum_{m=1}^M p_{km} &= 1, \forall k\\
& & \sum_{j=1}^J w_{nj} &= 1, \forall n.
\end{align}
The Lagrangian equation
\begin{equation}
\mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) + \boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} - \mathbf{Vw} \right) + \boldsymbol{\theta}'\left( \mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p} \right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N \otimes \mathbf{1}'_J)\mathbf{w}\right)
\end{equation}
can be used to find the interior solution, where $\boldsymbol{\lambda}$,
$\boldsymbol{\theta}$, and $\boldsymbol{\tau}$ are the associated
$(N\times 1)$, $((K+1)\times 1)$, and $(N\times 1)$ vectors of Lagrange
multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order conditions
yields the solutions for $\mathbf{\hat p}$ and $\mathbf{\hat w}$
\begin{equation}
\hat p_{km} = \frac{q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}{\sum_{m=1}^M q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}
\end{equation}
and
\begin{equation}
\hat w_{nj} = \frac{u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}{\sum_{j=1}^J u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}.
\end{equation}
Note that when the prior distributions $\mathbf{q}$ and $\mathbf{u}$ are uniform,
the priors cancel out in the expressions above, so maximum entropy and minimum
cross-entropy produce the same results.
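To see these expressions at work, here is a minimal sketch in base R (the
helper `p_hat` is illustrative and not part of `GCEstim`), assuming a fixed
value for $\sum_{n=1}^N \hat\lambda_n x_{nk}$:

```{r,echo=TRUE,eval=TRUE}
# GCE solution for one k: p_hat proportional to q * exp(-z * s),
# where s stands for sum_n lambda_n x_nk
p_hat <- function(z, q, s) {
  num <- q * exp(-z * s)
  num / sum(num)
}

z <- seq(-100, 100, length.out = 5)      # signal support points
s <- 0.001                               # arbitrary fixed value of the lambda term
p_hat(z, rep(1 / 5, 5), s)               # uniform prior: reduces to the GME solution
p_hat(z, c(0.1, 0.1, 0.6, 0.1, 0.1), s)  # informative prior shifts mass towards zero
```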

## Examples {#Examples}
Consider `dataGCE`
(see ["Generalized Maximum Entropy framework"](V2_GME_framework.html#Examples)).
Again, under a "no *a priori* information" scenario for the parameters, one can
assume that $z_k^{upper}=100$, $k\in\left\lbrace 0,\dots,5\right\rbrace$, is a
"wide upper bound" for the signal support space. Using `lmgce`, a model can be
fitted under either the GME or the GCE framework. If `support.signal.points` is
an integer, a constant vector, or a constant matrix, one is assuming a uniform
distribution for $\mathbf{q}$ and therefore considering the GME framework.
```{r,echo=FALSE,eval=TRUE}
coef.dataGCE <- c(1, 0, 0, 3, 6, 9)
```
```{r,echo=TRUE,eval=TRUE}
library(GCEstim)
```
```{r,echo=TRUE,eval=TRUE}
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )
```
The estimated GME coefficients are $\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=$ `r paste0("(", paste(round(coef(res.lmgce.100.GME), 3), collapse = ", "), ")")`.
```{r,echo=TRUE,eval=TRUE}
(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
```
But if there is some information, for instance on $\beta_1$ and $\beta_2$, it
can be reflected in `support.signal.points`. Suppose one suspects that
$\beta_1=\beta_2=0$. Since the support spaces are centered at zero, one can assign
a higher probability to the support point at or around the center, for instance by
setting $\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1, 0.1)'$. `support.signal.points` accepts information on the distribution of probabilities in the form of a $(K+1)\times M$ matrix: the first row corresponds to $\mathbf{q_0}$, the second to $\mathbf{q_1}$, and so on.
```{r,echo=TRUE,eval=TRUE}
(support.signal.points.matrix <-
   matrix(
     c(rep(1 / 5, 5),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       rep(1 / 5, 5),
       rep(1 / 5, 5),
       rep(1 / 5, 5)
     ),
     ncol = 5,
     byrow = TRUE
   ))
```
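Each row of this matrix is a prior distribution over the $M=5$ support points,
so all rows must sum to one, which can be checked with:

```{r,echo=TRUE,eval=TRUE}
rowSums(support.signal.points.matrix)
```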
```{r,echo=TRUE,eval=TRUE}
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
```
The estimated GCE coefficients are $\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=$ `r paste0("(", paste(round(coef(res.lmgce.100.GCE), 3), collapse = ", "), ")")`.
```{r,echo=TRUE,eval=TRUE}
(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
```
The prediction errors are approximately equal
($RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx$
`r round(GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE"), 3)`
and $RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx$
`r round(GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE"), 3)`),
as are the prediction cross-validation errors
($CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx$
`r round(res.lmgce.100.GME$error.measure.cv.mean, 3)`
and $CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx$
`r round(res.lmgce.100.GCE$error.measure.cv.mean, 3)`).
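These in-sample prediction errors can be reproduced with `accmeasure` (the
object names below are merely illustrative):

```{r,echo=TRUE,eval=TRUE}
(RMSE_y.lmgce.100.GME <-
   GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE"))
(RMSE_y.lmgce.100.GCE <-
   GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE"))
```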
The precision error, on the other hand, is lower for the GCE approach: $RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}} \approx$
`r round(GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"), 3)`
and $RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}} \approx$
`r round(GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"), 3)`.
```{r,echo=TRUE,eval=TRUE}
(RMSE_beta.lmgce.100.GME <-
   GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
(RMSE_beta.lmgce.100.GCE <-
   GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
```
If there were some information on the distribution of $\mathbf{w}$, a similar
analysis could be done for `noise.signal.points`.
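As a purely hypothetical sketch (not evaluated, since the exact interface
should be checked in `?lmgce`; `noise.points.matrix` is a placeholder for an
$N \times J$ matrix of prior noise probabilities), such a call could look like:

```{r,echo=TRUE,eval=FALSE}
# Hypothetical call (not run): the argument name follows the text above
res.lmgce.100.GCE.noise <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    noise.signal.points = noise.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
```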

## Conclusion

The minimum cross-entropy formalism makes it possible to incorporate prior
information on the probability distributions as non-uniform weights, which can
improve the precision of the estimates.

## References
::: {#refs}
:::

## Acknowledgements

This work was supported by Fundação para a Ciência e a Tecnologia (FCT)
through CIDMA and associated projects.