\documentclass[nojss]{jss}
\usepackage{enumitem}
\usepackage{amsmath}
\usepackage{float}
\clubpenalty = 10000
\widowpenalty = 10000
\displaywidowpenalty = 10000
\author{Manuel J. A. Eugster\\{\small Ludwig-Maximilians-Universit{\"a}t M{\"u}nchen} \And Friedrich Leisch\\{\small Ludwig-Maximilians-Universit{\"a}t M{\"u}nchen}}
\title{From Spider-Man to Hero --\\ Archetypal Analysis in \proglang{R}}
\Plainauthor{Manuel J. A. Eugster, Friedrich Leisch}
\Plaintitle{From Spider-Man to Hero -- Archetypal Analysis in R}
\Shorttitle{Archetypal Analysis in \proglang{R}}
\Keywords{archetypal analysis, convex hull, \proglang{R}}
\Plainkeywords{archetypal analysis, convex hull, R}
\Abstract{
This introduction to the \proglang{R} package \pkg{archetypes} is a (slightly) modified version of \citet{Eugster+Leisch@2009}, published in the {\it Journal of Statistical Software}.
\bigskip
The aim of archetypal analysis is to represent observations in a multivariate data set as convex combinations of extremal points. This approach was introduced by \citet{Cutler+Breiman@1994}; they defined the concrete problem, laid out the theoretical foundations and presented an algorithm written in \proglang{Fortran}. In this paper we present the \proglang{R} package \pkg{archetypes} which is available on the Comprehensive \proglang{R} Archive Network. The package provides an implementation of the archetypal analysis algorithm within \proglang{R} and different exploratory tools to analyze the algorithm during its execution and its final result. The application of the package is demonstrated on two examples.
}
\Address{
Manuel J. A.
Eugster\\
Department of Statistics\\
Ludwig-Maximilians-Universit{\"a}t M{\"u}nchen\\
80539 Munich, Germany\\
E-mail: \email{Manuel.Eugster@stat.uni-muenchen.de}\\
URL: \url{http://www.statistik.lmu.de/~eugster/}
\medskip
Friedrich Leisch\\
Department of Statistics\\
Ludwig-Maximilians-Universit{\"a}t M{\"u}nchen\\
80539 Munich, Germany\\
E-mail: \email{Friedrich.Leisch@R-project.org}\\
URL: \url{http://www.statistik.lmu.de/~leisch/}
}
%%\usepackage{Sweave} %% already provided by jss.cls
%%\VignetteIndexEntry{From Spider-Man to Hero -- Archetypal Analysis in R}
%%\VignetteDepends{archetypes}
%%\VignetteKeywords{archetypal analysis, convex hull, R}
%%\VignettePackage{archetypes}
\SweaveOpts{eps=FALSE, keep.source=TRUE}
<>=
options(width=80, prompt='R> ')
@
\begin{document}
\section[Introduction]{Introduction\label{intro}}
The \citet{merriam-webster:archetype} defines an archetype as {\it the original pattern or model of which all things of the same type are representations or copies}. The aim of archetypal analysis is to find ``pure types'', the archetypes, within a set defined in a specific context. The concept of archetypes is used in many different areas; the set can be defined in terms of literature, philosophy, psychology and also statistics. Here, the concrete problem is to find a few, not necessarily observed, points (archetypes) in a set of multivariate observations such that all the data can be well represented as convex combinations of the archetypes. The title of this article illustrates the concept of archetypes on the basis of archetypes in literature: the {\it Spider-Man} personality belongs to the generic {\it Hero} archetype, and archetypal analysis tries to find this correspondence. In statistics archetypal analysis was first introduced by \citet{Cutler+Breiman@1994}. In their paper they laid out the theoretical foundations, defined the concrete problem as a nonlinear least squares problem and presented an alternating minimizing algorithm to solve it.
It has found applications in different areas, with recently growing popularity in economics, e.g., \citet{Li+Wang+Louviere+Carson@2003} and \citet{Porzio+Ragozini+Vistocco@2008}. In spite of the rising interest in this computer-intensive but numerically sensitive method, no ``easy-to-use'' and freely available software package has been developed yet. In this paper we present the software package \pkg{archetypes} within the \proglang{R} statistical environment \citep{R} which provides an implementation of the archetypal analysis algorithm. Additionally, the package provides exploratory tools to visualize the algorithm during the minimization steps and its final result. The newest released version of \pkg{archetypes} is always available from the Comprehensive R Archive Network at \url{http://CRAN.R-project.org/package=archetypes}.

The paper is organized as follows: In Section~\ref{algorithm} we outline archetypal analysis with its different conceptual parts. We present the theoretical background as far as we need it for a sound introduction of our implementation; for a complete explanation we refer to the original paper. Section~\ref{pkg} demonstrates how to use \pkg{archetypes} based on a simple artificial data set, with details about numerical problems and the behavior of the algorithm. Section~\ref{comp} presents a simulation study to show how the implementation scales with the number of observations, attributes and archetypes. In Section~\ref{body} we show a real world example -- the archetypes of human skeletal diameter measurements. Section~\ref{outlook} concludes the article with an outlook on future investigations.

\section[Archetypal analysis]{Archetypal analysis\label{algorithm}}

Consider an $n \times m$ matrix $X$ representing a multivariate data set with $n$ observations and $m$ attributes.
For a given $k$ the archetypal analysis finds the matrix $Z$ of $k$ $m$-dimensional archetypes according to the two fundamentals:
\begin{enumerate}[label=(\arabic{enumi})]
\item The data are best approximated by convex combinations of the archetypes, i.e., they minimize
\[
\mbox{RSS} = \|X - \alpha Z^{\top}\|_2
\]
with $\alpha$, the coefficients of the archetypes, an $n \times k$ matrix; the elements are required to be greater than or equal to $0$ and their sum must be $1$, i.e., $\sum_{j=1}^{k} \alpha_{ij} = 1$ with $\alpha_{ij} \geq 0$ for $i = 1, \ldots, n$. $\|\cdot\|_2$ denotes an appropriate matrix norm.
\item The archetypes are convex combinations of the data points:
\[
Z = X^{\top} \beta
\]
with $\beta$, the coefficients of the data set, an $n \times k$ matrix where the elements are required to be greater than or equal to $0$ and their sum must be $1$, i.e., $\sum_{i=1}^{n} \beta_{ji} = 1$ with $\beta_{ji} \geq 0$ for $j = 1, \ldots, k$.
\end{enumerate}
These two fundamentals also define the basic principles of the estimation algorithm: it alternates between finding the best $\alpha$ for given archetypes $Z$ and finding the best archetypes $Z$ for given $\alpha$; at each step several convex least squares problems are solved, and the overall $\mbox{RSS}$ is reduced successively. With a view to the implementation, the algorithm consists of the following steps:
\begin{enumerate}[label=\arabic{enumi}.,ref=\arabic{enumi},itemsep=6pt]
\item[] Given the number of archetypes $k$:
\item\label{alg:pre} Data preparation and initialization: scale data, add a dummy row (see below) and initialize $\beta$ in a way that the constraints are fulfilled to calculate the starting archetypes $Z$.
\item\label{alg:loop} Loop until the $\mbox{RSS}$ reduction is sufficiently small or the maximum number of iterations is reached:
\begin{enumerate}[label=\arabic{enumi}.\arabic{enumii}., ref=\arabic{enumi}.\arabic{enumii}, itemsep=6pt]
\item\label{alg:loop-alpha} Find best $\alpha$ for the given set of archetypes $Z$: solve $n$ convex least squares problems ($i = 1, \ldots, n$)
\[
\min_{\alpha_i} \frac{1}{2} \|X_i - Z \alpha_i\|_2 \mbox{ subject to} \; \alpha_i \geq 0 \; \mbox{ and } \; \sum_{j=1}^{k} \alpha_{ij} = 1\mbox{.}
\]
\item\label{alg:loop-zt} Recalculate archetypes $\tilde{Z}$: solve the system of linear equations $X = \alpha \tilde{Z}^{\top}$.
\item\label{alg:loop-beta} Find best $\beta$ for the given set of archetypes $\tilde{Z}$: solve $k$ convex least squares problems ($j = 1, \ldots, k$)
\[
\min_{\beta_j} \frac{1}{2} \|\tilde{Z}_j - X \beta_j\|_2 \mbox{ subject to} \; \beta_j \ge 0 \; \mbox{ and } \; \sum_{i=1}^{n} \beta_{ji} = 1\mbox{.}
\]
\item\label{alg:loop-z} Recalculate archetypes $Z$: $Z = X \beta$.
\item\label{alg:loop-rss} Calculate the residual sum of squares $\mbox{RSS}$.
\end{enumerate}
\item\label{alg:post} Post-processing: remove the dummy row and rescale the archetypes.
\end{enumerate}
The algorithm has to deal with several numerical problems, i.e., systems of linear equations and convex least squares problems. In the following we explain each step in detail.

\paragraph{Solving the convex least squares problems:}
In Steps \ref{alg:loop-alpha} and \ref{alg:loop-beta} several convex least squares problems have to be solved. \citet{Cutler+Breiman@1994} use a penalized version of the non-negative least squares algorithm by \citet{Lawson+Hanson@1974} \citep[as general reference see, e.g.,][]{Luenberger@1984}. In detail, the problems to solve are of the form $\|u - Tw\|_2$ with $u, w$ vectors and $T$ a matrix, all of appropriate dimensions, and the non-negativity and equality constraints.
The penalized version adds an extra element $M$ to $u$ and to each observation of $T$; then
\[
\|u - Tw\|_2 + M^2 \|1 - w\|_2
\]
is minimized under non-negativity restrictions. For large $M$, the second term dominates and forces the equality constraint to be approximately satisfied while maintaining the non-negativity constraint. The magnitude of $M$ varies from problem to problem and thus can be seen as a hyperparameter of the algorithm. The default value in the package is $200$.

\paragraph{Solving the system of linear equations:}
In Step \ref{alg:loop-zt} the system of linear equations
\[
\tilde{Z} = \alpha^{-1} X
\]
has to be solved. Many methods exist; one approach is the Moore-Penrose pseudoinverse, which provides an approximate unique solution by a least squares approach: given the pseudoinverse $\alpha^{+}$ of $\alpha$,
\[
\tilde{Z} = \alpha^{+} X\mbox{,}
\]
is computed. Another approach uses the $QR$ decomposition: $\alpha = QR$, where $Q$ is an orthogonal and $R$ an upper triangular matrix; then
\[
\tilde{Z} = Q^{\top} X R^{-1}\mbox{,}
\]
is computed. The default approach in the package is the $QR$ decomposition using the \code{solve()} function.

\paragraph{Calculating the residual sum of squares:}
In Step \ref{alg:loop-rss} the $\mbox{RSS}$ is calculated. It uses the spectral norm \citep[see, e.g.,][]{Golub+Loan@1996}. The spectral norm of a matrix $X$ is the largest singular value of $X$ or the square root of the largest eigenvalue of $X^{*} X$,
\[
\|X\|_2 = \sqrt{\lambda_{max}(X^{*} X)}\mbox{,}
\]
where $X^{*}$ is the conjugate transpose of $X$.

\paragraph{Avoiding local minima:}
\citet{Cutler+Breiman@1994} show that the algorithm converges in all cases, but not necessarily to a global minimum. Hence, the algorithm should be started several times with different initial archetypes. It is important that these are not too close together, as this can cause slow convergence or convergence to a local minimum.
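The penalized non-negative least squares step described above can be sketched directly with the \pkg{nnls} package, which \pkg{archetypes} uses for the convex least squares problems. The following is a minimal sketch with made-up data ($T$, $u$) to show how the appended dummy row enforces the equality constraint; it is an illustration of the numerical trick, not the package's internal code:

```r
library("nnls")  # solver for min ||b - A x|| subject to x >= 0

set.seed(1)
T <- matrix(rnorm(20), nrow = 10, ncol = 2)  # made-up coefficient matrix
u <- rnorm(10)                               # made-up right-hand side
M <- 200                                     # penalty; the package default

## Appending M to u and a row of M's to T adds the term M^2 (1 - sum(w))^2
## to the objective, so sum(w) = 1 holds approximately for large M.
w <- coef(nnls(rbind(T, rep(M, ncol(T))), c(u, M)))

all(w >= 0)  # non-negativity holds exactly
sum(w)       # close to 1
```

Within the package this mechanism is hidden behind the family concept (\code{family = archetypesFamily()}), so users normally never call the solver directly.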
\paragraph{Choosing the correct number of archetypes:}
As in many similar problems, there is no general rule for the correct number of archetypes $k$. A simple method to determine the value of $k$ is to run the algorithm for different values of $k$ and use the ``elbow criterion'' on the $\mbox{RSS}$, where a ``flattening'' of the curve indicates the correct value of $k$.

\paragraph{Approximation of the convex hull:}
Through the definition of the problem, archetypes lie on the boundary of the convex hull of the data. Let $N$ be the number of data points which define the boundary of the convex hull; then \citet{Cutler+Breiman@1994} showed: if $1 < k < N$, there are $k$ archetypes on the boundary which minimize $\mbox{RSS}$; if $k = N$, exactly the data points which define the convex hull are the archetypes with $\mbox{RSS} = 0$; and if $k = 1$, the sample mean minimizes the $\mbox{RSS}$. In practice, these theoretical results cannot always be achieved, as we will see in the following two sections.

\section[Using archetypes]{Using package \pkg{archetypes}\label{pkg}}

The package is loaded into \proglang{R} using the \code{library()} or \code{require()} command:
<>=
library(archetypes)
@
\begin{Schunk}
\begin{Sinput}
R> library("archetypes")
\end{Sinput}
\begin{Soutput}
Loading required package: nnls
\end{Soutput}
\end{Schunk}
It requires the package \pkg{nnls} \citep{nnls} for solving the convex least squares problems. We use a simple artificial two-dimensional data set to explain the usage of the implementation and the general behavior of archetypal analysis. The advantage is that we can imagine the results and easily visualize them; Section \ref{body} then shows a more realistic example. Note that in the following the plot commands do not produce exactly the shown graphics with respect to the coloring, width, etc. of the graphic primitives; to increase readability we have reduced the presented commands to the significant arguments.
E.g., if we use a different plotting symbol in a scatter plot, the corresponding argument (\code{pch = \ldots}) will be omitted.
<>=
data("toy")
plot(toy)
@
\begin{figure}[h!t]
\centering
\setkeys{Gin}{width=3in}
<>=
data("toy")
par(mar = c(2,2,0,0) + 0.1, ps = 9)
plot(toy, xlab = "", ylab = "", xlim = c(0,20), ylim = c(0,20), pch = 19, col = gray(0.7), cex = 0.6)
@
\caption{Data set {\tt toy}.}
\label{fig:toy-data}
\end{figure}
Data set \code{toy} consists of the two attributes $x$ and $y$, and $\Sexpr{nrow(toy)}$ observations. According to the shape of the data, it seems to be a good idea to apply archetypal analysis with $k = 3$ archetypes.
<<>>=
suppressWarnings(RNGversion("3.5.0"))
set.seed(1986)
a <- archetypes(toy, 3, verbose = TRUE)
@
During the fit, the function reports its improvement and stops after a maximum number of iterations (default is \code{maxIterations = 100}) or if the improvement is less than a defined value (default is \verb|minImprovement = sqrt(.Machine$double.eps)|). %$
As a basis for our further research, the implementation is a flexible framework where the problem-solving mechanisms of the individual steps can be exchanged. The default values are the ``original ones'' described in the previous section (\code{family = archetypesFamily()}). The result is an \proglang{S}3 \code{archetypes} object,
<<>>=
a
@
containing the three final archetypes:
<<>>=
parameters(a)
@
The \code{plot()} function visualizes archetypes for two-dimensional data sets; for higher-dimensional data sets parallel coordinates are used.
<>=
xyplot(a, toy, chull = chull(toy))
xyplot(a, toy, adata.show = TRUE)
@
\begin{figure}[h!t]
\centering
\setkeys{Gin}{width=6in}
<>=
par(mfrow = c(1,2), mar = c(2,2,0,0)+0.1, ps = 9)
xyplot(a, toy, chull = chull(toy), xlab = "", ylab = "", xlim = c(0, 20), ylim = c(0, 20), cex = 0.6)
xyplot(a, toy, adata.show = TRUE, xlab = "", ylab = "", xlim = c(0, 20), ylim = c(0, 20), cex = 0.6)
@
\caption{Visualization of three archetypes.}
\label{fig:toy-a}
\end{figure}
The left plot of Figure~\ref{fig:toy-a} shows the archetypes, their approximation of the convex hull (red dots and lines) and the convex hull (black dots and lines) of the data. The right plot of Figure~\ref{fig:toy-a} additionally shows the approximation of the data through the archetypes and the corresponding $\alpha$ values (green symbols, and grey connection lines); as we can see, all data points outside the approximated convex hull are mapped onto its boundary. This plot is based on an idea and \proglang{MATLAB} source code of \citet{pc:Pailthorpe}. With \code{saveHistory = TRUE} (which is set by default) each step of the execution is saved and we can examine the archetypes in each iteration using the \code{a\$history} memento object; the initial archetypes, for example, are \code{a\$history\$get(0)}. This can be used to create an ``evolution movie'' of the archetypes,
<>=
movieplot(a, toy)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
par(mfrow = c(2, 4), mar = c(0, 0, 0, 0) + 0.1, ps = 9)
movieplot(a, toy, xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, axes = FALSE,
          postfn = function(iter) {
  box()
  text(1, 19, paste(iter + 1, ".", sep = ""), cex = 1)
})
@
\caption{Visualization of the algorithm iterations.}
\label{fig:toy-movieplot}
\end{figure}
The figure shows the plots of the seven steps (the random initialization and the six iterations) from top to bottom and left to right. In each step the three archetypes move closer to the three corners of the data set.
A movie of the approximated data is shown when setting the parameter \code{show = "adata"}\footnote{Real animations are available as Flash movies along with this paper from \url{http://www.jstatsoft.org/v30/i08/}.}.

In the previous section we mentioned that the algorithm should be started several times to avoid local minima. This is done using the \code{stepArchetypes()} function; it passes all arguments to the \code{archetypes()} function and additionally has the argument \code{nrep} which specifies the number of repetitions.
<<>>=
set.seed(1986)
a4 <- stepArchetypes(data = toy, k = 3, verbose = FALSE, nrep = 4)
@
The result is an \proglang{S}3 \code{stepArchetypes} object,
<<>>=
a4
@
where \code{summary()} provides an overview of each repetition by showing the final residual sum of squares and the number of iterations:
<<>>=
summary(a4)
@
There are no huge differences in the residual sum of squares; thus if there are different local minima, then they are all equally good. But the following plot shows that all repetitions found nearly the same final archetypes (and thus the same local minimum),
<>=
xyplot(a4, toy)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=3in}
<>=
par(mar = c(2, 2, 0, 0) + 0.1, ps = 9)
xyplot(a4, toy, cex = 0.6, xlim = c(0, 20), ylim = c(0, 20), xlab = "", ylab = "")
@
\caption{Visualization of different repetitions.}
\label{fig:toy-a4}
\end{figure}
However, the model of repetition $3$ has the lowest residual sum of squares and is the best model:
<>=
bestModel(a4)
@
<>=
print(bestModel(a4), full = FALSE)
@
At the beginning of the example we decided by looking at the data that three archetypes may be a good choice. Visual inspection does not necessarily lead to the best choice, and is not an option for higher-dimensional data sets.
As already mentioned in the previous section, one simple way to choose a good number of archetypes is to run the algorithm for different values of $k$ and use the ``elbow criterion'' on the residual sum of squares. The \code{stepArchetypes()} function allows a vector as value of argument \code{k} and executes for each $k_i$ the \code{archetypes()} function \code{nrep} times.
<>=
file <- "as.RData"
if ( file.exists(file) ) {
  load(file = file)
} else {
  set.seed(1986)
  as <- stepArchetypes(data = toy, k = 1:10, verbose = FALSE, nrep = 4)
  save(as, file = file)
}
@
\begin{Schunk}
\begin{Sinput}
R> set.seed(1986)
R> as <- stepArchetypes(data = toy, k = 1:10, verbose = FALSE, nrep = 4)
\end{Sinput}
\begin{Soutput}
There were 23 warnings (use warnings() to see)
\end{Soutput}
\end{Schunk}
The warnings indicate that errors occurred during the execution; in this case, singular matrices occurred when solving the linear equation system in Step \ref{alg:loop-zt} from $k = 4$ on:
\begin{Schunk}
\begin{Sinput}
R> warnings()
\end{Sinput}
\begin{Soutput}
Warnings:
1: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=4: Error in qr.solve(alphas %*% t(alphas)): singular matrix 'a' in solve
2: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=5: Error in qr.solve(alphas %*% t(alphas)): singular matrix 'a' in solve
3: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=5: Error in qr.solve(alphas %*% t(alphas)): singular matrix 'a' in solve
[...]
\end{Soutput}
\end{Schunk}
In these cases the residual sum of squares is \code{NA}:
<<>>=
rss(as)
@
And all errors occurred during the first iteration,
<<>>=
t(sapply(as, function(a) sapply(a, '[[', 'iters')))
@
which indicates a problematic random initialization. If warnings occur within the repetitions for $k_i$ archetypes, the seemingly meaningful solutions (according to the $\mbox{RSS}$) have to be examined carefully; Section~\ref{ginv} illustrates a seemingly meaningful solution where the plot then uncovers problems.
But up to $k = 5$ there is always at least one start with a meaningful result, and the residual sum of squares curve of the best models shows that by the ``elbow criterion'' three or maybe seven is the best number of archetypes:
<>=
screeplot(as)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=4in}
<>=
par(mar = c(3, 4, 0.1, 0) + 0.1, ps = 9)
screeplot(as, cex = 0.6, ylim = c(0, 0.08), axes = FALSE)
mtext("Archetypes", side = 1, line = 2)
axis(2, las = 2)
box()
@
\caption{Screeplot of the residual sum of squares.}
\label{fig:toy-rssplot}
\end{figure}
We have already seen the three archetypes in detail; the seven archetypes of the best repetition and their approximation of the data are:
<>=
a7 <- bestModel(as[[7]])
xyplot(a7, toy, chull = chull(toy))
xyplot(a7, toy, adata.show = TRUE)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
a7 <- bestModel(as[[7]])
par(mfrow = c(1, 2), mar = c(2, 2, 0, 0) + 0.1, ps = 9)
xyplot(a7, toy, chull = chull(toy), xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, xlab = "", ylab = "")
xyplot(a7, toy, adata.show = TRUE, xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, xlab = "", ylab = "")
@
\caption{Visualization of seven archetypes.}
\label{fig:toy-a7}
\end{figure}
The approximation of the convex hull is now clearly visible.

\subsection[Alternative numerical methods]{Alternative numerical methods\label{ginv}}

As we mentioned in Section \ref{algorithm}, there are many ways to solve linear equation systems.
One other possibility is the Moore-Penrose pseudoinverse:
<>=
file <- "gas.RData"
if ( file.exists(file) ) {
  load(file = file)
} else {
  set.seed(1986)
  gas <- stepArchetypes(data = toy, k = 1:10,
                        family = archetypesFamily("original", zalphasfn = archetypes:::ginv.zalphasfn),
                        verbose = FALSE, nrep = 4)
  save(gas, file = file)
}
@
\begin{Schunk}
\begin{Sinput}
R> set.seed(1986)
R> gas <- stepArchetypes(data = toy, k = 1:10,
+    family = archetypesFamily("original",
+      zalphasfn = archetypes:::ginv.zalphasfn),
+    verbose = FALSE, nrep = 4)
\end{Sinput}
\begin{Soutput}
Loading required package: MASS
There were 23 warnings (use warnings() to see)
\end{Soutput}
\end{Schunk}
We use the \code{ginv()} function from the \pkg{MASS} package to calculate the pseudoinverse. The function ignores ill-conditioned matrices and ``just solves the linear equation system'', but the \code{archetypes()} function warns about ill-conditioned matrices if the matrix condition number $\kappa$ is larger than an upper bound (default is \code{maxKappa = 1000}):
\begin{Schunk}
\begin{Sinput}
R> warnings()
\end{Sinput}
\begin{Soutput}
Warnings:
1: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=4: alphas > maxKappa
2: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=5: alphas > maxKappa
3: In archetypes(..., k = k[i], verbose = verbose) ... :
   k=5: alphas > maxKappa
[...]
\end{Soutput}
\end{Schunk}
In comparison with the $QR$ decomposition, the warnings occurred for the same numbers of archetypes $k_i$ during the same repetitions.
In most of these cases the residual sum of squares is about $12$,
<<>>=
rss(gas)
@
and the randomly chosen initial archetypes ``collapse'' to the center of the data, as we see, for example, for $k = 9$, $r = 3$:
<>=
movieplot(gas[[9]][[3]], toy)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
par(mfrow = c(1, 4), mar = c(0, 0, 0, 0) + 0.1, ps = 9)
movieplot(gas[[9]][[3]], toy, xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, axes = FALSE,
          postfn = function(iter) {
  box()
  text(1, 19, paste(iter + 1, ".", sep = ""), cex = 1)
})
@
\caption{Visualization of algorithm iterations until ``collapsing''.}
\label{fig:toy-movieplot-gas}
\end{figure}
The figure shows the four steps (from left to right): the random initialization and the three iterations until all archetypes are in the center of the data. All other residual sums of squares are nearly equal to the ones calculated with the $QR$ decomposition. Further investigations would show that three or maybe seven is the best number of archetypes, and in case of $k = 3$ nearly the same three points are the best archetypes. An interesting exception is the case $k = 7, r = 2$; the residual sum of squares is exactly the same, but not the archetypes. The plots of the archetypes and their approximation of the data:
<>=
ga7 <- bestModel(gas[[7]])
xyplot(ga7, toy, chull = chull(toy))
xyplot(ga7, toy, adata.show = TRUE)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
ga7 <- bestModel(gas[[7]])
par(mfrow = c(1, 2), mar = c(2, 2, 0, 0) + 0.1, ps = 9)
xyplot(ga7, toy, chull = chull(toy), xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, xlab = "", ylab = "")
xyplot(ga7, toy, adata.show = TRUE, xlim = c(0, 20), ylim = c(0, 20), cex = 0.6, xlab = "", ylab = "")
@
\caption{Visualization of seven archetypes; one archetype ``collapsed''.}
\label{fig:toy-a7-gas}
\end{figure}
Of particular interest is the archetype in the center of the data set, and especially the approximation of the data points to its right.
As the data are approximated by convex combinations of the archetypes with non-negative $\alpha$, the only possibility for this kind of approximation is when $\alpha$ for this archetype is always zero:
<<>>=
apply(coef(ga7, 'alphas'), 2, range)
@
As we can see, $\alpha$ of archetype $1$ (column one) is $0$ for all data points. Theoretically, this is not possible, but ill-conditioned matrices during the fit process lead to such results in practice. The warnings (\code{k=7: alphas > maxKappa}) indicate that solving the convex least squares problems led to ill-conditioned matrices. Our simulations showed that this behavior mostly appears when requesting a relatively large number of archetypes in relation to the size of the data set.

\section[Computational complexity]{Computational complexity\label{comp}}

In Section~\ref{intro} we stated that the archetypal analysis algorithm is computer-intensive; in this section we now present a simulation to show how the implementation scales with the number of observations, attributes and archetypes. In general, the speed of the algorithm is determined by the efficiency of the convex least squares method. The penalized non-negative least squares method is quite slow, but is appealing because it can be used when the number of attributes is larger than the number of observations \citep[according to][]{Cutler+Breiman@1994}. The simulation setup is the following \citep[cf.~the {\it simulation problem} by][]{Hothorn+Leisch+Zeileis+Hornik@2005}: we consider the multivariate standard normal distribution as data generating process (spherical data). A data set is generated for each combination of $m = 5, 10, 25, 50$ attributes and $n = 100, 1000, 10000, 100000$ observations, with $20$ replications in each case. On each data set $k = 1, \ldots, 10$ archetypes are fitted; stop criteria are $100$ iterations or an improvement less than \verb|sqrt(.Machine$double.eps)|.
%$
Each $m, n, k$ configuration is fitted with randomly chosen initial archetypes. For each of the $m \times n \times k \times 20$ fits the computation time and the number of iterations until convergence are measured. We present two main results of the simulation: (1) the computation time per iteration, i.e., the focus is on the efficiency of the convex least squares methods; (2) the computation time per fit, where the convergence behavior matters. The results are presented in Figure~\ref{fig:rt-iteration} and Figure~\ref{fig:rt-algorithm}, respectively. Both show four panels, one for each possible number of observations, and each panel contains four lines, one for each possible number of attributes. The $x$-axis is the number of archetypes and the $y$-axis the measured value; notice that the scale of the $y$-axis varies between the panels.

Figure~\ref{fig:rt-iteration} shows the computation time per iteration in seconds, i.e., the computation time per fit divided by the number of iterations until convergence. In general the behavior is as expected: in case of fixed $n$ and $m$ (each individual line), an increasing number of archetypes implies a linear increase of the computation time. Similarly, in case of fixed $k$ and $m$ (dots of same color and $x$-axis position, different panels), a tenfold increase in the number of observations implies approximately $10$ times more computation time. And in the case of fixed $k$ and $n$ (dots of different colors, same $x$-axis position and panel), an increasing number of attributes indicates a polynomial increase of the computation time. It is noticeable that for large problems ($n = 10000, 100000$ and $m = 50$) two, three and four archetypes need more computation time than the remaining numbers of archetypes.
\begin{figure}[h!t]
\centering
\setkeys{Gin}{width=5in}
\includegraphics{rt-002}
\caption{Computation time per iteration in seconds.}
\label{fig:rt-iteration}
\end{figure}
Figure~\ref{fig:rt-algorithm} shows the (median) computation time per fit in seconds.
The figure shows that an increasing number of observations approximately implies a linear increase of the computation time. By contrast, for fixed $n$ the computation time is approximately constant in $k$. Except for $n = 100000$, the peak at two archetypes is strongly pronounced. Currently we have no reasonable explanation, and more detailed simulations need to be done.
\begin{figure}[h!t]
\centering
\setkeys{Gin}{width=5in}
\includegraphics{rt-003}
\caption{Computation time per fit in seconds.}
\label{fig:rt-algorithm}
\end{figure}
\clearpage
\section[Example]{Example: Skeletal archetypes\label{body}}

In this section we apply archetypal analysis to an interesting real-world example: in \citet{Heinz+Peterson+Johnson+Kerk@2003} the authors took body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, on $247$ men and $260$ women in their twenties and early thirties, with a scattering of older men and women, all physically active. The full data are available within the package as \code{data("body")}, but we are only interested in a subset, the skeletal measurements and the height (all measured in centimeters),
<<>>=
data("skel")
skel2 <- subset(skel, select = -Gender)
@
The skeletal measurements consist of nine diameter measurements: biacromial (\code{Biac}), shoulder diameter; biiliac (\code{Biil}), pelvis diameter; bitrochanteric (\code{Bitro}), hip diameter; chest depth between spine and sternum at nipple level, mid-expiration (\code{ChestDp}); chest diameter at nipple level, mid-expiration (\code{ChestDiam}); elbow diameter, sum of two elbows (\code{ElbowDiam}); wrist diameter, sum of two wrists (\code{WristDiam}); knee diameter, sum of two knees (\code{KneeDiam}); ankle diameter, sum of two ankles (\code{AnkleDiam}). See the original publication for a full anatomical explanation of the skeletal measurements and the process of measuring.
We use basic elements of {\it Human Modeling and Animation} to model the skeleton and create a schematic representation of an individual, e.g., \code{skeletonplot(skel2[1, ])} for observation number one. The function \code{jd()} (for ``John Doe'') uses this plot and shows a generic individual with explanations of the measurements:
<>=
jd()
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
par(mar = c(1, 4, 0, 0) + 0.1, ps = 9)
jd()
@
\caption{Generic skeleton with explanations of the measurements.}
\label{fig:body-jd}
\end{figure}
For visualizing the full data set, parallel coordinates with axes arranged according to the ``natural order'' are used,
<>=
pcplot(skel2)
@
\begin{figure}[H]
\centering
\setkeys{Gin}{width=6in}
<>=
datacol <- rgb(178, 178, 178, maxColorValue = 255, alpha = round(255 * 0.2))
@
<>=
par(mar = c(5, 0.4, 0, 0.4) + 0.1, ps = 9)
pcplot(skel2, las = 2, col = datacol)
@
<>=
png(filename = "body-pcplot-raw.png", units = "px", width = 590, height = 430, pointsize = 12)
par(mar = c(5.5, 0.4, 0, 0.4) + 0.1)
pcplot(skel2, las = 2, col = datacol)
graphics.off()
cat("\\includegraphics{body-pcplot-raw.png}\n")
@
\caption{Parallel coordinates of the {\tt skel2} data set.}
\label{fig:body-data}
\end{figure}
At first view no patterns are visible and it is difficult to guess a meaningful number of archetypes. Therefore, we calculate the archetypes for $k = 1, \ldots, 15$ with three repetitions each time,
<>=
file <- "bas.RData"
if ( file.exists(file) ) {
  load(file = file)
} else {
  set.seed(1981)
  as <- stepArchetypes(skel2, k = 1:15, verbose = FALSE)
  save(as, file = file)
}
@
\begin{Schunk}
\begin{Sinput}
R> set.seed(1981)
R> as <- stepArchetypes(skel2, k = 1:15, verbose = FALSE, nrep = 3)
\end{Sinput}
\begin{Soutput}
There were 12 warnings (use warnings() to see)
\end{Soutput}
\end{Schunk}
The warnings indicate ill-conditioned matrices from $k = 11$ on, but, unlike in the previous section, the residual sum of squares contains no \code{NA} values.
The corresponding curve of the best model in each case is:
<>=
screeplot(as)
@
\begin{figure}[H]
  \centering
  \setkeys{Gin}{width=5in}
<>=
par(mar = c(3, 4, 0.4, 0) + 0.1, ps = 9)
screeplot(as, cex = 0.6, axes = FALSE)
mtext("Archetypes", side = 1, line = 2)
axis(2, las = 2)
box()
@
  \caption{Screeplot of the residual sum of squares.}
  \label{fig:body-rss}
\end{figure}
According to the ``elbow criterion'', $k = 3$ or maybe $k = 7$ is the best number of archetypes. Following Occam's razor we proceed with three archetypes,
<<>>=
a3 <- bestModel(as[[3]])
@
The three archetypes are (transposed for better readability):
<<>>=
t(parameters(a3))
@
Or as a barplot in relation to the original data:
<>=
barplot(a3, skel2, percentiles = TRUE)
@
\begin{figure}[H]
  \centering
  \setkeys{Gin}{width=5in}
<>=
par(mar = c(5, 4, 0.4, 0) + 0.1, ps = 9)
barplot(a3, skel2, percentiles = TRUE,
        below.compressed.height = 0.4, below.compressed.srt = 90)
@
  \caption{Barplot visualizing the three archetypes.}
  \label{fig:body-barplot}
\end{figure}
Archetype 2 (gray) represents individuals who are ``huge'' in all measurements; archetype 3 (lightgray), on the other hand, represents individuals who are ``small''.
Archetype 1 (darkgray) represents individuals with average measurements except for the bitrochanteric and biiliac diameters -- the meaning of this is best visible when looking at the data with gender information (men in blue, women in green, with alpha transparency) and the archetypes (red),
<>=
pcplot(a3, skel2, data.col = as.numeric(skel$Gender))
@
\begin{figure}[H]
  \centering
  \setkeys{Gin}{width=6in}
<>=
datacol <- c(rgb(0, 205, 0, maxColorValue = 255, alpha = round(255 * 0.2)),
             rgb(0, 0, 255, maxColorValue = 255, alpha = round(255 * 0.2)))
@
<>=
par(mar = c(5, 0.4, 0, 0.4) + 0.1, ps = 9)
pcplot(a3, skel2, las = 2, data.col = datacol[skel$Gender])
@
<>=
png(filename = "body-pcplot-gender.png", units = "px", width = 590,
    height = 430, pointsize = 12)
par(mar = c(5.5, 0.4, 0, 0.4) + 0.1)
pcplot(a3, skel2, las = 2, data.col = datacol[skel$Gender])
graphics.off()
cat("\\includegraphics{body-pcplot-gender.png}\n")
@
  \caption{Parallel coordinates with colors related to the gender and the three archetypes.}
  \label{fig:body-a3}
\end{figure}
Archetype 2 reflects the primary difference between men and women in body structure -- the comparatively wider hip and pelvis of women. This interpretation can be verified by looking at the coefficients $\alpha$ to see how much each archetype contributes to the approximation of each individual observation.
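The role of the coefficients can be illustrated with a small self-contained sketch (toy numbers in base \proglang{R}, not values from the fitted model): each row of $\alpha$ is non-negative and sums to one, and multiplying $\alpha$ with the matrix of archetypes yields the convex-combination approximation of the observations.

```r
## Toy sketch of the convex-combination property (illustrative values,
## not taken from the skel2 fit): rows of `alphas` are the coefficients.
archs  <- rbind(c(160, 36), c(190, 43))        # 2 archetypes, 2 variables
alphas <- rbind(c(0.25, 0.75), c(0.9, 0.1))    # each row sums to 1
stopifnot(all(abs(rowSums(alphas) - 1) < 1e-12))
approx <- alphas %*% archs                     # approximated observations
approx[1, ]   # 0.25 * archs[1, ] + 0.75 * archs[2, ]
```

In the fitted model, \code{coef(a3, 'alphas')} plays the role of \code{alphas} and \code{parameters(a3)} the role of \code{archs}.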
For three archetypes, a ternary plot is a suitable graphical representation \citep[e.g., package \pkg{vcd} by][]{vcd}:
<>=
ternaryplot(coef(a3, 'alphas'), col = as.numeric(skel$Gender))
@
\begin{figure}[H]
  \centering
  \setkeys{Gin}{width=2in}
<>=
library("vcd")
ternaryplot(coef(a3, 'alphas'), dimnames = 1:3, cex = 0.3,
            col = datacol[skel$Gender], main = NULL, labels = "none",
            grid = FALSE)
grid.text("1", x = 3, y = 3)
@
  \caption{Ternary plot visualizing the $\alpha$, colored according to the gender.}
  \label{fig:body-alpha-ternary}
\end{figure}
Clearly, men cluster close to archetype $2$, whereas women mainly mix the first and the third archetype. Therefore, men are primarily approximated by archetype $2$, women by linear combinations of archetypes $1$ and $3$. For more than three archetypes, parallel coordinates with an axis for each archetype projecting the corresponding coefficients (in the range $[0, 1]$) can be used to investigate the coefficients $\alpha$. Finally, \code{skeletonplot()} visualizes the three skeleton archetypes:
<>=
skeletonplot(parameters(a3))
@
\begin{figure}[H]
  \centering
  \setkeys{Gin}{width=6in}
<>=
par(mar = c(1, 4, 0, 0) + 0.1, ps = 9)
skeletonplot(parameters(a3), skel.height = 190)
@
  \caption{Skeleton plot visualizing the three archetypes.}
  \label{fig:body-a3-real}
\end{figure}
The left skeleton visualizes archetype $2$ with the wider hip and pelvis; the middle skeleton visualizes archetype $1$, which is ``huge''; the right skeleton visualizes archetype $3$, which is ``small'' in all measurements. Assume that we want to automatically fabricate blue worker overalls from given templates (e.g., paper patterns). Combinations of archetypes $1$ and $3$ give overalls in different sizes which each have similar width at shoulder and waist. If we add archetype $2$, we get overalls with a wider waist (compared to the shoulder).
\section[Summary and outlook]{Summary and outlook\label{outlook}}

In Section~\ref{algorithm} we explained the different steps of the algorithm according to the original paper. However, for each problem a number of methods exist, and the choice of the right method often depends on the concrete data set. In Section~\ref{pkg} we have already used the \code{archetypesFamily()} function to use different linear equation solvers, but the function has to be extended to allow arbitrary problem-solving blocks for each step of the algorithm. Of primary interest is the use of different optimization methods to solve the optimization problem. Additionally, the two fundamentals defined in Section~\ref{algorithm} can be generalized to allow, for example, arbitrary loss functions, arbitrary conditions on the coefficients $\alpha$, or arbitrary matrix norms. As the algorithm strategy is similar to the least squares formulation of the $k$-means algorithm, this leads to a general framework for this class of $k$-means-like algorithms \citep[$k$-means, $k$-median, fuzzy $k$-means, etc.; see, e.g.,][]{Steinley@2006}. Altogether, the short-term goal for the package \pkg{archetypes} was the implementation of a general framework with interchangeable blocks for the concrete problem-solving mechanisms of each algorithm step. Further work is the design of a clean archetypes family object, especially with a view to our long-term goal, the generalization towards the class of $k$-means-like algorithms.

\appendix

\section*{Computational details}

<>=
sessionInfo()
@
The newest released version of \pkg{archetypes} is always available from the Comprehensive R Archive Network at \url{http://CRAN.R-Project.org/package=archetypes}. The development version is hosted on R-Forge as project \pkg{Archetypal Analysis (archetypes)} and available from \url{http://r-forge.r-project.org/projects/archetypes}. The source code is documented using the Roxygen documentation system for \proglang{R} \citep{roxygen}.
An up-to-date version of this manuscript is contained in the package as a vignette.

\section*{Acknowledgments}

The authors thank Bernard Pailthorpe, University of Queensland, for providing his \proglang{MATLAB} code for comparison with our code.

\bibliography{archetypes}

\end{document}