[Rd] informal conventions/checklist for new predictive modeling packages

Wed Jan 4 15:19:11 CET 2012

Working on the caret package has exposed me to the wide variety of
approaches that different authors have taken to creating predictive
modeling functions (aka machine learning, aka pattern recognition).

I suspect that many package authors are neophyte R users and are
stumbling through the process of writing their first R package (or R
code). As such, they may not have been exposed to some of the informal
conventions that have evolved over time. Also, their package may be
intended to demonstrate their research and not for "production"
modeling. In any case, it might be a good idea to print up a few
points for consideration when creating a predictive modeling package.
I don't propose changes to existing code.

Some of this is obvious and not limited to this class of modeling
packages. Many of these points are arguable, so please do so.

If this seems useful, perhaps we could repost the final list to R-Help
to use as a checklist.

Those of you who have used my code will probably realize that I am not
a grand architect of R packages =] I'd love to get feedback from those
of you with a broader perspective and better software engineering
skills than I (a low bar to step over).

I have marked a few of these items with an OCD tag since I might be
taking it a bit too far.

The list:

(1) Extend the work of others. There is an amazing amount of unneeded
redundancy. There are plenty of times that users implement their own
version of a function because there is a missing feature, but a lot
of time is spent re-creating duplicate functions. For example, kernlab
has an excellent set of kernel functions that are really efficient and
have useful ancillary functions. People may not be aware of these
functions, but they are one RSiteSearch away. (Perhaps we could
nominate a few packages like kernlab that implement a specific tool
well.)

(2) When modeling a categorical outcome, use a factor as input (as
opposed to 0/1 indicators or integers). Factors are exactly the kind
of feature that separates R from other languages (I'm looking at you
SAS) and is a natural structure for this data type.

corollary (2a): save the factor levels in the model object somewhere

corollary (2b): return predicted classes as factors with the same
levels (and ordering of levels).
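
As a rough sketch of (2a) and (2b): the classifier here, majorityFit(),
is made up for illustration and comes from no package. It only predicts
the most frequent class, but it shows the levels being stored at fit
time and reused at prediction time.

```r
# Hypothetical toy classifier: always predicts the most frequent class.
majorityFit <- function(x, y) {
  stopifnot(is.factor(y))
  counts <- table(y)
  structure(
    list(levels   = levels(y),                        # (2a) keep the levels
         majority = names(counts)[which.max(counts)]),
    class = "majorityFit"
  )
}

predict.majorityFit <- function(object, newdata, ...) {
  # (2b) same levels in the same order, even for classes never predicted
  factor(rep(object$majority, nrow(newdata)), levels = object$levels)
}

y    <- factor(c("b", "a", "b", "c"), levels = c("c", "b", "a"))
fit  <- majorityFit(data.frame(x = 1:4), y)
pred <- predict(fit, data.frame(x = 5:6))
```

Because the levels travel with the model object, downstream code such
as a confusion matrix sees the same classes in the same order.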

(3) Implement a separate prediction function. Some packages only make
predictions when the model is built, so effectively the model cannot
be used at any point in the future.

corollary (3a): use object-orientation (eg. predict.{class}) and not
some made-up function name "modelPredict()" for predicting new
samples.
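
A minimal sketch of (3) and (3a), assuming a made-up model class
"meanModel" that just stores the mean of the outcome. The point is only
that the fitted object can be saved, restored later, and handed to the
standard predict() generic.

```r
# Hypothetical model: stores the outcome mean, nothing else.
meanModel <- function(x, y) {
  structure(list(mean = mean(y)), class = "meanModel")
}

# S3 method, so users call predict() rather than a custom function name.
predict.meanModel <- function(object, newdata, ...) {
  rep(object$mean, nrow(newdata))
}

fit <- meanModel(data.frame(x = 1:3), y = c(1, 2, 3))

# The model outlives the fitting call: save it now, predict later.
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)
p <- predict(restored, newdata = data.frame(x = 10:11))
```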

(4) If the method only accepts a specific type of input (eg. matrix or
data frame), please do the conversion whenever appropriate.
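
One way (4) might look, assuming a hypothetical function centerFit()
that needs a numeric matrix internally. The conversion and the error
message are illustrative choices, not a standard.

```r
# Hypothetical fitting function that works on a numeric matrix internally.
centerFit <- function(x) {
  # Accept a data frame and convert it, rather than failing on the user.
  if (is.data.frame(x)) x <- as.matrix(x)
  if (!is.matrix(x) || !is.numeric(x)) {
    stop("'x' must be a numeric matrix or a data frame of numeric columns")
  }
  structure(list(center = colMeans(x)), class = "centerFit")
}

a <- centerFit(data.frame(u = c(1, 3), v = c(2, 6)))  # data frame input
b <- centerFit(cbind(u = c(1, 3), v = c(2, 6)))       # matrix input
```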

(5) Provide a formula interface (eg. foo(y~x, data = dat)) and
non-formula interface (foo(x, y)) to the function. Formula methods are
really inefficient at this time for large dimensional data but are
fantastically convenient. There are some good reasons to not use
formulas, such as functions that do not use a design matrix (eg.
cforest()) or need factors to be handled in a non-standard way (eg.
cubist()).
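
A sketch of one common way to offer both interfaces in (5): an S3
generic with a default (x, y) method that does the work and a formula
method that builds the design matrix and delegates. The least-squares
fit inside foo.default() is a stand-in for whatever the real model
does.

```r
foo <- function(x, ...) UseMethod("foo")

# Non-formula interface: does the actual fitting (here, least squares).
foo.default <- function(x, y, ...) {
  x <- as.matrix(x)
  coefs <- qr.solve(cbind(1, x), y)   # intercept plus slopes
  structure(list(coefficients = unname(coefs)), class = "foo")
}

# Formula interface: build the design matrix, then delegate.
foo.formula <- function(x, data, ...) {
  mf <- model.frame(x, data = data)
  mm <- model.matrix(attr(mf, "terms"), data = mf)
  mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
  foo.default(mm, model.response(mf), ...)
}

dat <- data.frame(x1 = c(1, 2, 3, 4), y = c(3, 5, 7, 9))
f1  <- foo(y ~ x1, data = dat)   # formula interface
f2  <- foo(dat["x1"], dat$y)     # non-formula interface
```

Keeping the fitting code in the default method means users with very
wide data can skip the formula machinery entirely.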

(6) Don't require a test set when model building.

(7) Control all written output during model-building time with a
verbose option. Resampling can make a mess out of things if
output/logging is always exposed.
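
For (7), a minimal sketch assuming a hypothetical iterative fitting
function slowFit(). Progress is reported with message() and only when
verbose = TRUE, so a resampling loop that calls it many times stays
quiet by default.

```r
# Hypothetical iterative fit; the loop body stands in for real work.
slowFit <- function(x, y, iter = 3, verbose = FALSE) {
  for (i in seq_len(iter)) {
    if (verbose) message("iteration ", i, " of ", iter)
  }
  structure(list(iter = iter), class = "slowFit")
}

quiet <- slowFit(1:3, 1:3)                   # prints nothing
loud  <- slowFit(1:3, 1:3, verbose = TRUE)   # three progress messages
```

Using message() rather than cat() or print() also lets callers silence
the output with suppressMessages() if they need to.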

(8) Please use RSiteSearch to avoid name collisions between packages
(eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.

(9) Allow the predict function to generate results from many different
sub-models simultaneously. For example, pls() can return predictions
across many values of ncomp. enet(), cubist(), blackboost() are other
examples.

corollary (9a): [OCD] ensure the same object type for predictions.
There are occasions where predict() will return a vector or a matrix
depending on the context. I would argue that this is not optimal.
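
A sketch of (9) and (9a) together, assuming a made-up model
"shrunkMean" with a tuning parameter lambda. predict() accepts a vector
of lambda values and always returns a matrix with one column per
sub-model, including when there is a single row or a single lambda.

```r
# Hypothetical model: the prediction is mean(y) shrunk by 1 / (1 + lambda).
shrunkMean <- function(y) {
  structure(list(mean = mean(y)), class = "shrunkMean")
}

predict.shrunkMean <- function(object, newdata, lambda = 0, ...) {
  n <- nrow(newdata)
  # Build the matrix explicitly so a 1 x 1 result is never dropped to a vector.
  matrix(rep(object$mean / (1 + lambda), each = n),
         nrow = n, ncol = length(lambda),
         dimnames = list(NULL, paste0("lambda=", lambda)))
}

fit  <- shrunkMean(c(2, 4, 6))
many <- predict(fit, data.frame(x = 1:3), lambda = c(0, 1, 3))  # 3 x 3 matrix
one  <- predict(fit, data.frame(x = 1),   lambda = 1)           # 1 x 1 matrix
```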

(10) Use a limited vocabulary for options. For example, some predict()
functions have a "type" option to switch between predicted classes
and class probabilities. Values of "type" pertaining to class
probabilities range from "prob", "probability", "posterior", "raw",
"response", etc. I'll make a suggestion of "prob" as a possible
standard for this situation.

(11) Make sure that class probabilities sum to one. Seriously.
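
A sketch of (10) and (11), assuming a hypothetical classifier
"priorFit" that predicts the observed class frequencies. The "type"
argument is restricted to "class" and "prob" via match.arg(), and the
probabilities are checked to sum to one before anything is returned.

```r
# Hypothetical classifier: predicts the class frequencies seen in training.
priorFit <- function(y) {
  stopifnot(is.factor(y))
  counts <- table(y)
  structure(list(levels = levels(y),
                 prior  = as.vector(counts) / sum(counts)),
            class = "priorFit")
}

predict.priorFit <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)   # anything other than "class"/"prob" is an error
  n <- nrow(newdata)
  prob <- matrix(object$prior, nrow = n, ncol = length(object$levels),
                 byrow = TRUE, dimnames = list(NULL, object$levels))
  # (11) fail loudly if the rows do not sum to one
  stopifnot(isTRUE(all.equal(rowSums(prob), rep(1, n))))
  if (type == "prob") return(as.data.frame(prob))
  factor(object$levels[max.col(prob, ties.method = "first")],
         levels = object$levels)
}

fit <- priorFit(factor(c("a", "a", "b", "a")))
p   <- predict(fit, data.frame(x = 1:2), type = "prob")
cls <- predict(fit, data.frame(x = 1:2))
```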

(12) If the model implicitly conducts feature selection, do not
require un-used predictors to be present in future data sets for
prediction. This may be a problem when the formula interface to models
is used, but it looks like many functions reference columns by
position and not name.
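
For (12), a sketch assuming a hypothetical sparse linear model object
of class "sparseLin" that records the names of the predictors it
actually used. Prediction selects columns by name, so unused predictors
can be absent and column order does not matter.

```r
# Hypothetical model object: only x2 survived feature selection.
fit <- structure(list(used = "x2", coef = c(x2 = 2), intercept = 1),
                 class = "sparseLin")

predict.sparseLin <- function(object, newdata, ...) {
  absent <- setdiff(object$used, colnames(newdata))
  if (length(absent) > 0) {
    stop("missing required predictors: ", paste(absent, collapse = ", "))
  }
  # Select by name, never by position.
  x <- as.matrix(newdata[, object$used, drop = FALSE])
  as.vector(object$intercept + x %*% object$coef[object$used])
}

p1 <- predict(fit, data.frame(x2 = c(1, 2)))                    # x1, x3 absent
p2 <- predict(fit, data.frame(x3 = c(0, 0), x2 = c(1, 2)))      # reordered
```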

(13) Packages that have their own cross-validation functions should
allow the users to pass in the specific folds/resampling indicators to
maintain consistency across similar functions in other packages.
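
A sketch of (13), assuming a hypothetical cross-validation helper
cvRMSE() whose "folds" argument is a list of held-out row indices
supplied by the caller. The least-squares fit via lm.fit() is only a
placeholder model.

```r
# Hypothetical CV helper: 'folds' is a list of held-out row indices.
cvRMSE <- function(x, y, folds) {
  vapply(folds, function(holdout) {
    fit  <- lm.fit(cbind(1, x[-holdout, , drop = FALSE]), y[-holdout])
    pred <- cbind(1, x[holdout, , drop = FALSE]) %*% fit$coefficients
    sqrt(mean((y[holdout] - pred)^2))
  }, numeric(1))
}

x <- matrix(as.numeric(1:6), ncol = 1)
y <- 2 * x[, 1] + 1

# The caller builds the folds once and can reuse them for other models.
folds <- split(seq_along(y), rep(1:3, length.out = length(y)))
rmse  <- cvRMSE(x, y, folds)
```

Since the same list can be passed to any function that accepts it,
every model is compared on identical held-out samples.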

(14) [OCD] For binary classification models, model the probability of
the first level of a factor as the event of interest (again, for
consistency). Note that glm() does not do this (it treats the first
level as failure and models the other level) but most others use the
first level.
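
The glm() behaviour in (14) can be checked with an intercept-only
logistic regression on a small made-up outcome. With a factor response,
binomial glm() treats the first level as failure, so the fitted
probability refers to the second level.

```r
# Made-up outcome: three "event" and one "none"; "event" is the first level.
y <- factor(c("event", "event", "event", "none"),
            levels = c("event", "none"))

fit <- glm(y ~ 1, family = binomial)

# glm() models P(y == "none"), i.e. the second level.
p_second <- unname(predict(fit, type = "response")[1])

# A wrapper that wants the first level as the event has to flip it.
p_first <- 1 - p_second
```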

Thanks,

Max