---
title: "Generalized Cross Entropy framework"
author: "Jorge Cabral"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
link-citations: yes
bibliography: references.bib
csl: american-medical-association-brackets.csl
description: |
  Working with prior information on probabilities.
vignette: >
  %\VignetteIndexEntry{Generalized Cross Entropy framework}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction
Although the most common situation is the absence of prior information on
$\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})$, in some particular
cases pre-sample information exists in the form of a distribution
$\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})$. This
distribution can be used as an initial hypothesis to be incorporated into the
consistency relations of the maximum entropy formalism.
Kullback and Leibler [@Kullback1951] defined the cross-entropy (CE) between $\mathbf{p}$ and $\mathbf{q}$ as
\begin{align}
I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p_k} \ln \left(\mathbf{p_k}/\mathbf{q_k}\right).
\end{align}
$I(\mathbf{p},\mathbf{q})$ measures the discrepancy between the $\mathbf{p}$ and
$\mathbf{q}$ distributions. It is non-negative, and when $\mathbf{p}=\mathbf{q}$ one gets $I(\mathbf{p},\mathbf{q})=0$.
Thus, according to the principle of minimum cross-entropy [@Kullback1959;@Good1963],
one should choose the probabilities that are as close as possible to the prior
probabilities.
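As a quick numerical illustration, the definition can be computed directly in
base R (the helper `cross_entropy` below is illustrative and not part of
`GCEstim`):

```{r,echo=TRUE,eval=TRUE}
# Cross-entropy I(p, q); terms with p_k = 0 contribute zero by convention
cross_entropy <- function(p, q) {
  sum(ifelse(p > 0, p * log(p / q), 0))
}

p <- c(0.1, 0.2, 0.7)
q <- rep(1 / 3, 3)
cross_entropy(p, q) # positive: p deviates from the uniform prior q
cross_entropy(p, p) # zero: p coincides with the prior
```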

## Generalized Cross Entropy estimator
Given the above, and for the reparameterized linear regression model,
\begin{equation}
\mathbf{y}=\mathbf{XZp} + \mathbf{Vw},
\end{equation}
the Generalized Cross Entropy (GCE) estimator is given by
\begin{equation}
\hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) = \underset{\mathbf{p},\mathbf{w}}{\operatorname{argmin}}
\left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\},
\end{equation}
subject to the same model constraints as the GME estimator (see ["Generalized Maximum Entropy framework"](V2_GME_framework.html#GMEestimator)).
Using summation notation, the minimization problem can be rewritten as follows:
\begin{align}
&\text{minimize} & I(\mathbf{p,q,w,u}) &=\sum_{k=0}^{K}\sum_{m=1}^M p_{km}\ln(p_{km}/q_{km}) +\sum_{n=1}^N\sum_{j=1}^J w_{nj}\ln(w_{nj}/u_{nj}) \\
&\text{subject to} & y_n &= \sum_{k=0}^{K}\sum_{m=1}^M x_{nk}z_{km}p_{km} + \sum_{j=1}^J v_{nj}w_{nj}, \forall n \\
& & \sum_{m=1}^M p_{km} &= 1, \forall k\\
& & \sum_{j=1}^J w_{nj} &= 1, \forall n.
\end{align}
The Lagrangian equation
\begin{equation}
\mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) + \boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} - \mathbf{Vw} \right) + \boldsymbol{\theta}'\left( \mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p} \right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N \otimes \mathbf{1}'_J)\mathbf{w}\right)
\end{equation}
can be used to find the interior solution, where $\boldsymbol{\lambda}$,
$\boldsymbol{\theta}$, and $\boldsymbol{\tau}$ are the associated
$(N\times 1)$, $((K+1)\times 1)$, and $(N\times 1)$ vectors of Lagrange
multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order conditions
yields the solutions for $\mathbf{\hat p}$ and $\mathbf{\hat w}$
\begin{equation}
\hat p_{km} = \frac{q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}{\sum_{m=1}^M q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}
\end{equation}
and
\begin{equation}
\hat w_{nj} = \frac{u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}{\sum_{j=1}^J u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}.
\end{equation}
Note that when the prior distributions $\mathbf{q}$ and $\mathbf{u}$ are uniform,
the priors cancel out in the expressions above, so maximum entropy and minimum
cross-entropy produce the same results.
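To see these expressions at work, here is a minimal sketch in base R (the
helper `p_hat` is illustrative and not part of `GCEstim`), assuming a fixed
value for $\sum_{n=1}^N \hat\lambda_n x_{nk}$:

```{r,echo=TRUE,eval=TRUE}
# GCE solution for one k: p_hat proportional to q * exp(-z * s),
# where s stands for sum_n lambda_n x_nk
p_hat <- function(z, q, s) {
  num <- q * exp(-z * s)
  num / sum(num)
}

z <- seq(-100, 100, length.out = 5)      # signal support points
s <- 0.001                               # arbitrary fixed value of the lambda term
p_hat(z, rep(1 / 5, 5), s)               # uniform prior: reduces to the GME solution
p_hat(z, c(0.1, 0.1, 0.6, 0.1, 0.1), s)  # informative prior shifts mass towards zero
```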

## Examples {#Examples}
Consider `dataGCE`
(see ["Generalized Maximum Entropy framework"](V2_GME_framework.html#Examples)).
Again, under a "no *a priori* information" scenario for the parameters, one can
assume that $z_k^{upper}=100$, $k\in\left\lbrace 0,\dots,5\right\rbrace$, is a
"wide upper bound" for the signal support space. Using `lmgce`, a model can be
fitted under either the GME or the GCE framework. If `support.signal.points` is
an integer, a constant vector, or a constant matrix, one is assuming a uniform
distribution for $\mathbf{q}$ and therefore considering the GME framework.
```{r,echo=FALSE,eval=TRUE}
coef.dataGCE <- c(1, 0, 0, 3, 6, 9)
```
```{r,echo=TRUE,eval=TRUE}
library(GCEstim)
```
```{r,echo=TRUE,eval=TRUE}
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )
```
The estimated GME coefficients are $\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=$ `r paste0("(", paste(round(coef(res.lmgce.100.GME), 3), collapse = ", "), ")")`.
```{r,echo=TRUE,eval=TRUE}
(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
```
But if there is some information, for instance on $\beta_1$ and $\beta_2$, it
can be reflected in `support.signal.points`. Suppose one suspects that
$\beta_1=\beta_2=0$. Since the support spaces are centered at zero, one can assign
a higher probability to the support point at or around the center, for instance by
setting $\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1, 0.1)'$. `support.signal.points` accepts information on the distribution of probabilities in the form of a $(K+1)\times M$ matrix: the first row corresponds to $\mathbf{q_0}$, the second to $\mathbf{q_1}$, and so on.
```{r,echo=TRUE,eval=TRUE}
(support.signal.points.matrix <-
   matrix(
     c(rep(1 / 5, 5),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       rep(1 / 5, 5),
       rep(1 / 5, 5),
       rep(1 / 5, 5)
     ),
     ncol = 5,
     byrow = TRUE
   ))
```
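Each row of this matrix is a prior distribution over the $M=5$ support points,
so all rows must sum to one, which can be checked with:

```{r,echo=TRUE,eval=TRUE}
rowSums(support.signal.points.matrix)
```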
```{r,echo=TRUE,eval=TRUE}
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
```
The estimated GCE coefficients are $\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=$ `r paste0("(", paste(round(coef(res.lmgce.100.GCE), 3), collapse = ", "), ")")`.
```{r,echo=TRUE,eval=TRUE}
(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
```
The prediction errors are approximately equal
($RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx$
`r round(GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE"), 3)`
and $RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx$
`r round(GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE"), 3)`),
as are the prediction cross-validation errors
($CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx$
`r round(res.lmgce.100.GME$error.measure.cv.mean, 3)`
and $CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx$
`r round(res.lmgce.100.GCE$error.measure.cv.mean, 3)`).
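These in-sample prediction errors can be reproduced with `accmeasure` (the
object names below are merely illustrative):

```{r,echo=TRUE,eval=TRUE}
(RMSE_y.lmgce.100.GME <-
   GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE"))
(RMSE_y.lmgce.100.GCE <-
   GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE"))
```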
The precision error, on the other hand, is lower for the GCE approach: $RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}} \approx$
`r round(GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"), 3)`
and $RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}} \approx$
`r round(GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"), 3)`.
```{r,echo=TRUE,eval=TRUE}
(RMSE_beta.lmgce.100.GME <-
   GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
(RMSE_beta.lmgce.100.GCE <-
   GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
```
If there were some information on the distribution of $\mathbf{w}$, a similar
analysis could be done for `noise.signal.points`.
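As a purely hypothetical sketch (not evaluated, since the exact interface
should be checked in `?lmgce`; `noise.points.matrix` is a placeholder for an
$N \times J$ matrix of prior noise probabilities), such a call could look like:

```{r,echo=TRUE,eval=FALSE}
# Hypothetical call (not run): the argument name follows the text above
res.lmgce.100.GCE.noise <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    noise.signal.points = noise.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
```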

## Conclusion

The minimum cross-entropy formalism makes it possible to incorporate prior
information on the probability distributions as non-uniform weights, which can
improve the precision of the estimates.

## References
::: {#refs}
:::

## Acknowledgements

This work was supported by Fundação para a Ciência e a Tecnologia (FCT)
through CIDMA and associated projects.