--- title: "Introduction to glmnetUtils" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo=FALSE} library(glmnetUtils) library(MASS) ``` The [glmnetUtils package](https://github.com/hongooi73/glmnetUtils) provides a collection of tools to streamline the process of fitting elastic net models with [glmnet](https://cran.r-project.org/package=glmnet). I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also features a function `cva.glmnet` to do crossvalidation for both $\alpha$ and $\lambda$, as well as some utility functions. ## The formula interface The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here are some simple examples for different types of data: ```{r} # least squares regression (mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars)) # multinomial logistic regression with specified elastic net alpha parameter (irisMod <- glmnet(Species ~ ., data=iris, family="multinomial", alpha=0.5)) # Poisson regression with an offset (InsMod <- glmnet(Claims ~ District + Group + Age, data=MASS::Insurance, family="poisson", offset=log(Holders))) ``` Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. A simple `print` method is also provided, to show the main model details at a glance. I'll describe shortly what the "sparse model matrix" and "use model.frame" options do. Predicting from a model works as you'd expect: just pass a data frame containing the new observations to the `predict` method. You can also specify any arguments that `predict.glmnet` accepts. ```{r, eval=FALSE} # least squares regression: get predictions for lambda=1 predict(mtcarsMod, newdata=mtcars, s=1) # multinomial logistic regression: get predicted class predict(irisMod, newdata=iris, type="class") # Poisson regression: need to specify offset predict(InsMod, newdata=MASS::Insurance, offset=log(Holders)) ``` If you want, you can still use the original model matrix-plus-response syntax: ```{r} mtcarsX <- as.matrix(mtcars[c("cyl", "disp", "hp")]) mtcarsY <- mtcars$mpg mtcarsMod2 <- glmnet(mtcarsX, mtcarsY) summary(as.numeric(predict(mtcarsMod, mtcars) - predict(mtcarsMod2, mtcarsX))) ``` This shows that the resulting models are identical, in terms of the predictions they make and the regularisation parameters used. ## Generating the model matrix There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising `model.frame` and `model.matrix`; and the second is to build the matrix one variable at a time. ### Using `model.frame` This is the simpler option, and the one that is most compatible with other R modelling functions. The `model.frame` function takes a formula and data frame and returns a _model frame_: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like `exp(x)` or `I(x^2)` on the RHS, `model.frame` will evaluate these expressions and include them in the output. The major disadvantage of using `model.frame` is that it generates a `terms` object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a $p \times p$ square matrix where $p$ is the number of main effects in the model. For wide datasets with $p > 10000$, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can be very slow. Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an $N$-level factor into an indicator matrix with $N-1$ columns, with one column being dropped. This is necessary for unregularised models as fit with `lm` and `glm`, since the full set of $N$ columns is linearly dependent. With the usual [treatment contrasts](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/contrasts.html), the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline. This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level. ### Manually building the model matrix To deal with the problems above, glmnetUtils by default will avoid using `model.frame`, instead building up the model matrix term-by-term. This avoids the memory cost of creating a `terms` object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for _all_ levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero _is_ meaningful (usually). This works in an additive fashion, ie the formula `~ a + b:c + d*e` is treated as consisting of three terms, `a`, `b:c` and `d*e` each of which is processed independently of the others. A dot in the formula includes all main effect terms, ie `~ . + a:b + f(x)` expands to `~ a + b + x + a:b + f(x)` (assuming a, b and x are the only columns in the data). Note that a formula like `~ (a + b) + (c + d)` will be treated as two terms, `a + b` and `c + d`. ### Speed comparisons To examine the speed impact of using `model.frame`, let's do some simple comparisons of run times. We'll generate sample datasets with 100, 1,000 and 10,000 predictors, and then run `glmnet` with both options for generating the model matrix. ```{r, eval=FALSE} # generate sample (uncorrelated) data of a given size makeSampleData <- function(N, P) { X <- matrix(rnorm(N*P), nrow=N) data.frame(y=rnorm(N), X) } # test for three sizes: 100/1000/10000 predictors df1 <- makeSampleData(N=1000, P=100) df2 <- makeSampleData(N=1000, P=1000) df3 <- makeSampleData(N=1000, P=10000) library(microbenchmark) res <- microbenchmark( glmnet(y ~ ., df1, use.model.frame=TRUE), glmnet(y ~ ., df1, use.model.frame=FALSE), glmnet(y ~ ., df2, use.model.frame=TRUE), glmnet(y ~ ., df2, use.model.frame=FALSE), glmnet(y ~ ., df3, use.model.frame=TRUE), glmnet(y ~ ., df3, use.model.frame=FALSE), times=10 ) print(res, unit="s", digits=2) ``` ``` ## Unit: seconds ## expr min lq mean median uq max neval ## glmnet(y ~ ., df1, use.model.frame = TRUE) 0.024 0.024 0.027 0.025 0.029 0.032 10 ## glmnet(y ~ ., df1, use.model.frame = FALSE) 0.023 0.026 0.029 0.026 0.028 0.051 10 ## glmnet(y ~ ., df2, use.model.frame = TRUE) 3.703 3.916 4.258 4.272 4.428 5.153 10 ## glmnet(y ~ ., df2, use.model.frame = FALSE) 3.756 3.874 4.291 4.352 4.561 5.073 10 ## glmnet(y ~ ., df3, use.model.frame = TRUE) 11.973 12.353 13.262 13.350 13.864 14.622 10 ## glmnet(y ~ ., df3, use.model.frame = FALSE) 4.295 4.639 4.992 4.822 5.060 6.111 10 ``` From this, we can see that for datasets with up to 1,000 predictors, both methods are about as fast as each other. However, for 10,000 predictors (not uncommon these days), the `model.frame` method takes three times as long as building the model matrix term by term. What happens if we take it up to 100,000 predictors? As seen below, the standard approach of using `model.frame` fails when R runs out of memory: the data frame itself is about 800MB in size, but trying to allocate the `terms` object requires more then 67GB. However, building the model matrix by term still works, and (on this machine) finishes in about two minutes. ```{r, eval=FALSE} df4 <- makeSampleData(N=1000, P=100000) glmnet(y ~ ., df4, use.model.frame=TRUE) ``` ``` ## Warning in terms.formula(formula, data = data): Reached total allocation of ## 32666Mb: see help(memory.size) ## Error: cannot allocate vector of size 37.3 Gb ``` ```{r, eval=FALSE} glmnet(y ~ ., df4, use.model.frame=FALSE) ``` ``` ## Call: ## glmnet.formula(formula = y ~ ., data = df4, use.model.frame = FALSE) ## ## Model fitting options: ## Sparse model matrix: FALSE ## Use model.frame: FALSE ## Alpha: 1 ## Lambda summary: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.002393 0.006506 0.017680 0.032470 0.048080 0.130700 ``` ### Sparse model matrix As an option, glmnetUtils can also generate a _sparse_ model matrix, using the `sparse.model.matrix` function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels. This can be combined with both of the previously mentioned options for generating model matrices. ## Crossvalidation for $\alpha$ One piece missing from the standard glmnet package is a way of choosing $\alpha$, the elastic net mixing parameter, similar to how `cv.glmnet` chooses $\lambda$, the shrinkage parameter. To fix this, glmnetUtils provides the `cva.glmnet` function, which uses crossvalidation to examine the impact on the model of changing $\alpha$ and $\lambda$. The interface is the same as for the other functions: ```r # Leukemia dataset from Trevor Hastie's website: # https://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData leuk <- do.call(data.frame, Leukemia) leukMod <- cva.glmnet(y ~ ., data=leuk, family="binomial") leukMod ``` ``` ## Call: ## cva.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial") ## ## Model fitting options: ## Sparse model matrix: FALSE ## Use model.frame: FALSE ## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1 ## Number of crossvalidation folds for lambda: 10 ``` `cva.glmnet` uses the algorithm described in the help for `cv.glmnet`, which is to fix the distribution of observations across folds and then call `cv.glmnet` in a loop with different values of $\alpha$. Optionally, you can parallelise this outer loop, by setting the `outerParallel` argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation: - Via `parLapply` in the parallel package. To use this, set `outerParallel` to a valid cluster object created by `makeCluster`. - Via `rxExec` as supplied by the (now retired) RevoScaleR package from Microsoft R Server. To use this, set `outerParallel` to a valid compute context created by `RxComputeContext`, or a character string specifying such a context. If the outer loop is run in parallel, `cva.glmnet` can check if the inner loop (over $\lambda$) is also set to run in parallel, and disable this if it would lead to contention for cores. Because crossvalidation is often a statistically noisy procedure, it doesn't try to automatically choose $\alpha$ and $\lambda$ for you. Instead you can plot the output, to see how the results depend on the values of these parameters. Using this information, you can choose appropriate values for your data. ```r plot(leukMod) ``` ![](figures/leukModCVA.png) In this case, we see that values of $\alpha$ close to $1$ tend to lead to better accuracy. The curves don't have a well-defined minimum, but they do flatten out for lower values of $\lambda$. As the `cv.glmnet` documentation recommends though, it's a good idea to run `cva.glmnet` multiple times to reduce the impact of noise. A `cva.glmnet` object contains a list of individual `cv.glmnet` objects, corresponding to the different $\alpha$ values tried. This lets you plot the crossvalidation results easily for a given $\alpha$: ```r plot(leukMod$modlist[[10]]) # alpha = 0.729 ``` ![](figures/leukModList.png) ## Conclusion The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it's always under development; you can get the latest version from my [GitHub repo](https://github.com/hongooi73/glmnetUtils). If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at [hongooi73@gmail.com](mailto:hongooi73@gmail.com).