Although prior information on \(\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})\) is usually unavailable, in some cases pre-sample information exists in the form of \(\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})\). This \(\mathbf{q}\) distribution can be used as an initial hypothesis to be incorporated in the consistency relations of the maximum entropy formalism. Kullback and Leibler [1] defined cross-entropy (CE) between \(\mathbf{p}\) and \(\mathbf{q}\) as
\[\begin{align} I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p_k} \ln \left(\mathbf{p_k}/\mathbf{q_k}\right). \end{align}\]
\(I(\mathbf{p},\mathbf{q})\) measures the discrepancy between the \(\mathbf{p}\) and \(\mathbf{q}\) distributions. It is non-negative, and \(I(\mathbf{p},\mathbf{q})=0\) when \(\mathbf{p}=\mathbf{q}\). According to the principle of minimum cross-entropy [2,3], one should therefore choose the probabilities that are as close as possible to the prior probabilities while remaining consistent with the data.
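The two properties just stated can be checked numerically. The following is a minimal base R sketch, not part of GCEstim; the helper name `cross_entropy` and the example vectors are illustrative assumptions.

```r
# Kullback-Leibler cross-entropy I(p, q) = sum_k p_k * ln(p_k / q_k).
# Terms with p_k = 0 contribute 0 (usual convention 0 * ln 0 = 0).
cross_entropy <- function(p, q) {
  stopifnot(length(p) == length(q),
            abs(sum(p) - 1) < 1e-8,
            abs(sum(q) - 1) < 1e-8,
            all(q[p > 0] > 0))
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

q      <- c(0.1, 0.1, 0.6, 0.1, 0.1)       # illustrative prior
p_near <- c(0.12, 0.12, 0.52, 0.12, 0.12)  # candidate close to the prior
p_far  <- rep(1 / 5, 5)                    # uniform candidate, further away

ce_same <- cross_entropy(q, q)       # identical distributions
ce_near <- cross_entropy(p_near, q)
ce_far  <- cross_entropy(p_far, q)
c(same = ce_same, near = ce_near, far = ce_far)
```

The discrepancy is zero for identical distributions and grows as the candidate moves away from the prior.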
Given the above, for the reparameterized linear regression
model, \[\begin{equation}
    \mathbf{y}=\mathbf{XZp} + \mathbf{Vw},
\end{equation}\]
the Generalized Cross Entropy (GCE) estimator is given by
\[\begin{equation}
   \hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) =
\underset{\mathbf{p},\mathbf{w}}{\operatorname{argmin}}
   \left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\},
\end{equation}\]
subject to the same model constraints as the GME estimator (see “Generalized Maximum Entropy
framework”).
Using set notation, the minimization problem can be rewritten as follows: \[\begin{align} &\text{minimize} & I(\mathbf{p,q,w,u}) &=\sum_{m=1}^M\sum_{k=0}^{K} p_{km}\ln(p_{km}/q_{km}) +\sum_{j=1}^J\sum_{n=1}^N w_{nj}\ln(w_{nj}/u_{nj}) \\ &\text{subject to} & y_n &= \sum_{m=1}^M\sum_{k=0}^{K} x_{nk}z_{km}p_{km} + \sum_{j=1}^J v_{nj}w_{nj}, \forall n\\ & & \sum_{m=1}^M p_{km} &= 1, \forall k\\ & & \sum_{j=1}^J w_{nj} &= 1, \forall n. \end{align}\]
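The reparameterization and the adding-up constraints can be illustrated with a small numerical sketch. It assumes, for illustration only, three coefficients sharing the same five equally spaced support points, and uses the standard relation \(\boldsymbol{\beta}=\mathbf{Zp}\); the objects `P`, `Z`, and `adding_up` are hypothetical names, not GCEstim objects.

```r
K1 <- 3                                  # K + 1 coefficients (illustrative)
z  <- seq(-100, 100, length.out = 5)     # common support, M = 5 (assumption)
M  <- length(z)

# One row of probabilities per coefficient; each row sums to one.
P <- rbind(c(0.1, 0.2, 0.4, 0.2, 0.1),
           c(0.0, 0.1, 0.2, 0.3, 0.4),
           rep(1 / M, M))
p <- as.vector(t(P))                     # stacked vector p = (p_0', p_1', p_2')'

# Block-diagonal support matrix Z = I_{K+1} (x) z', of size (K+1) x (K+1)M.
Z    <- kronecker(diag(K1), t(z))
beta <- as.vector(Z %*% p)               # beta_k = sum_m z_km * p_km

# Adding-up constraints in matrix form: (I_{K+1} (x) 1_M') p = 1_{K+1}.
adding_up <- as.vector(kronecker(diag(K1), t(rep(1, M))) %*% p)

beta
adding_up
```

Each coefficient is a probability-weighted average of its support points, so a symmetric probability vector on a support centered at zero yields a coefficient of zero.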
The Lagrangian function \[\begin{equation}
    \mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right)  +
\boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} -
\mathbf{Vw}  \right) + \boldsymbol{\theta}'\left(
\mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p}
\right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N
\otimes \mathbf{1}'_J)\mathbf{w}\right)
\end{equation}\]
can be used to find the interior solution, where \(\boldsymbol{\lambda}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\tau}\) are the \((N\times 1)\), \(((K+1)\times 1)\), and \((N\times 1)\) vectors of
Lagrange multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order
conditions yields the solutions for \(\mathbf{\hat p}\) and \(\mathbf{\hat w}\), in which the prior probabilities \(q_{km}\) and \(u_{nj}\) act as weights,
\[\begin{equation} \hat p_{km} = \frac{q_{km}\exp(z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk})}{\sum_{m=1}^M q_{km}\exp(z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk})} \end{equation}\] and \[\begin{equation} \hat w_{nj} = \frac{u_{nj}\exp(\hat\lambda_n v_{nj})}{\sum_{j=1}^J u_{nj}\exp(\hat\lambda_n v_{nj})}. \end{equation}\]
Note that when the prior distribution is uniform, maximum entropy and minimum cross entropy produce the same results.
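The equivalence can be seen directly from the objective function. As a short worked step, assume every signal prior is uniform over its \(M\) support points, \(q_{km}=1/M\) (the same argument applies to the noise term with \(u_{nj}=1/J\)):
\[\begin{align} \sum_{k=0}^{K}\sum_{m=1}^M p_{km}\ln\left(\frac{p_{km}}{1/M}\right) &= \sum_{k=0}^{K}\sum_{m=1}^M p_{km}\ln p_{km} + \ln M\sum_{k=0}^{K}\sum_{m=1}^M p_{km} \\ &= -H(\mathbf{p}) + (K+1)\ln M, \end{align}\]
where \(H(\mathbf{p})=-\sum_{k}\sum_{m}p_{km}\ln p_{km}\) is notation introduced here for the entropy of the signal probabilities, and the last step uses \(\sum_{m}p_{km}=1\) for each \(k\). Since \((K+1)\ln M\) is a constant, minimizing the cross-entropy is the same as maximizing the entropy.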
Consider dataGCE (see “Generalized Maximum Entropy
framework”).
Again under a “no a priori information” scenario for the
parameters, one can assume that \(z_k^{upper}=100\), \(k\in\left\lbrace 0,\dots,5\right\rbrace\),
is a “wide upper bound” for the signal support space. Using
lmgce, a model can be fitted under the GME or GCE framework.
If support.signal.points is an integer, a constant vector,
or a constant matrix, one is assuming a uniform distribution on \(\mathbf{q}\) and is therefore working within the
GME framework.
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GME coefficients are \(\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=\) (1.026, -0.155, 1.822, 3.319, 8.393, 11.467).
(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
#> (Intercept)        X001        X002        X003        X004        X005 
#>   1.0255630  -0.1552375   1.8221235   3.3194530   8.3932055  11.4670530
However, if some information is available, for instance on \(\beta_1\) and \(\beta_2\), it can be reflected in
support.signal.points. Let us suppose that one suspects that
\(\beta_1=\beta_2=0\). Since the
support spaces are centered at zero, one can assign a higher probability
to the support point at the center. One can set \(\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1,
0.1)'\), for instance. support.signal.points
accepts information on the distribution of probabilities in the form of
a \((K+1)\times M\) matrix. The first
row corresponds to \(\mathbf{q_0}\),
the second to \(\mathbf{q_1}\), and so
on.
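To see what such a prior implies for a coefficient, one can compare the mean and variance of the support under a uniform and a peaked \(\mathbf{q}\). This is a minimal base R sketch that assumes five equally spaced support points on \([-100, 100]\); the helpers `prior_mean` and `prior_var` are illustrative names and not part of GCEstim.

```r
# Assumption: M = 5 equally spaced support points on [-100, 100].
z <- seq(-100, 100, length.out = 5)

q_uniform <- rep(1 / 5, 5)
q_peaked  <- c(0.1, 0.1, 0.6, 0.1, 0.1)

# Mean and variance of the discrete distribution that puts mass q on z.
prior_mean <- function(z, q) sum(z * q)
prior_var  <- function(z, q) sum(q * (z - prior_mean(z, q))^2)

rbind(uniform = c(mean = prior_mean(z, q_uniform), var = prior_var(z, q_uniform)),
      peaked  = c(mean = prior_mean(z, q_peaked),  var = prior_var(z, q_peaked)))
```

Both priors are centered at zero, but the peaked one has half the variance of the uniform one (2500 against 5000), which expresses a stronger belief that the coefficient is near zero.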
(support.signal.points.matrix <- 
  matrix(
    c(rep(1/5, 5),
      c(0.1, 0.1, 0.6, 0.1, 0.1),
      c(0.1, 0.1, 0.6, 0.1, 0.1),
      rep(1/5, 5),
      rep(1/5, 5),
      rep(1/5, 5)
      ),
    ncol = 5,
    byrow = TRUE))
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  0.2  0.2  0.2  0.2  0.2
#> [2,]  0.1  0.1  0.6  0.1  0.1
#> [3,]  0.1  0.1  0.6  0.1  0.1
#> [4,]  0.2  0.2  0.2  0.2  0.2
#> [5,]  0.2  0.2  0.2  0.2  0.2
#> [6,]  0.2  0.2  0.2  0.2  0.2
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=\) (1.026, -0.143, 1.655, 3.228, 8.189, 11.269).
(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
#> (Intercept)        X001        X002        X003        X004        X005 
#>    1.026345   -0.143421    1.654828    3.227839    8.189040   11.269391
The prediction errors are approximately equal (\(RMSE_{\mathbf{\hat y}}^{GME_{(100)}}
\approx\) 0.407 and \(RMSE_{\mathbf{\hat y}}^{GCE_{(100)}}
\approx\) 0.407), as are the cross-validation prediction
errors (\(CV\text{-}RMSE_{\mathbf{\hat
y}}^{GME_{(100)}} \approx\) 0.428 and \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}}
\approx\) 0.427).
The precision error is lower for the GCE approach: \(RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}}
\approx\) 1.595 and \(RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}}
\approx\) 1.458.
(RMSE_beta.lmgce.100.GME <-
   GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
#> [1] 1.594821
(RMSE_beta.lmgce.100.GCE <-
    GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
#> [1] 1.457947
If there were some information on the distribution of \(\mathbf{w}\), a similar analysis could be
done for noise.signal.points.
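The effect of the peaked prior on the signal coefficients can also be read off the two printed coefficient vectors. The sketch below copies those values by hand rather than calling GCEstim; the object names `coef_gme`, `coef_gce`, and `shrinkage` are illustrative.

```r
# Coefficients copied from the two printed outputs above.
coef_gme <- c(`(Intercept)` = 1.0255630, X001 = -0.1552375, X002 = 1.8221235,
              X003 = 3.3194530, X004 = 8.3932055, X005 = 11.4670530)
coef_gce <- c(`(Intercept)` = 1.026345, X001 = -0.143421, X002 = 1.654828,
              X003 = 3.227839, X004 = 8.189040, X005 = 11.269391)

# Change in absolute size; negative values mean the GCE estimate is closer to zero.
shrinkage <- abs(coef_gce) - abs(coef_gme)
round(shrinkage, 3)

# The two coefficients that received the peaked prior.
shrinkage[c("X001", "X002")]
```

Both X001 and X002 move toward zero, as the prior suggests. In this run the remaining slope estimates also decrease in absolute value, so the change is not confined to the two coefficients that were given a peaked prior.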
The minimum cross-entropy formalism allows prior weights on the support points to be specified, and using them can improve the precision of the estimates.
This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.