A modern implementation of the Super Learner algorithm for ensemble learning and model stacking

Authors: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Oleg Sofrygin


What’s sl3?

sl3 is a modern implementation of the Super Learner algorithm of @vdl2007super. The Super Learner algorithm performs ensemble learning in one of two fashions:

  1. The “discrete” Super Learner selects the best prediction algorithm from a supplied library of learning algorithms (“learners” in the sl3 nomenclature), that is, the algorithm that minimizes the cross-validated risk with respect to an appropriate loss function.
  2. The “ensemble” Super Learner assigns weights to the learning algorithms in a user-supplied library in order to create a combination of these learners that minimizes the cross-validated risk with respect to an appropriate loss function. This notion of weighted combinations has also been called stacked regression [@breiman1996stacked].
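
To make the distinction between the two fashions concrete, here is a minimal base-R sketch that does not use sl3 at all. It is an illustration under stated assumptions, not a description of how sl3 computes anything internally: the simulated data, the three-learner library (`library_fns`), the fold count, and the squared-error loss are all hypothetical choices made for this example. The ensemble weights are found with a softmax parameterization and `optim`, which is a stand-in technique chosen because it needs no extra packages.

```r
# Base-R sketch of the discrete vs. ensemble Super Learner on simulated data.
# Everything here (data, learners, V, loss) is an illustrative assumption.
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + 0.5 * x + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Hypothetical learner library: each function is fit on `train` and
# returns predictions for the rows of `test`.
library_fns <- list(
  mean   = function(train, test) rep(mean(train$y), nrow(test)),
  linear = function(train, test) predict(lm(y ~ x, data = train), newdata = test),
  cubic  = function(train, test) predict(lm(y ~ poly(x, 3), data = train), newdata = test)
)

# V-fold cross-validation: collect out-of-fold predictions per learner.
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = n))
cv_preds <- matrix(NA_real_, nrow = n, ncol = length(library_fns),
                   dimnames = list(NULL, names(library_fns)))
for (v in seq_len(V)) {
  test_idx <- which(fold_id == v)
  train <- dat[-test_idx, ]
  test <- dat[test_idx, ]
  for (k in names(library_fns)) {
    cv_preds[test_idx, k] <- library_fns[[k]](train, test)
  }
}

# Cross-validated risk of each learner under squared-error loss.
cv_risk <- colMeans((cv_preds - dat$y)^2)

# Discrete Super Learner: pick the single learner with the smallest CV risk.
discrete_choice <- names(which.min(cv_risk))

# Ensemble Super Learner: find non-negative weights summing to one that
# minimize the CV risk of the weighted combination. The softmax map keeps
# the weights on the simplex while optim() searches an unconstrained space.
softmax <- function(theta) {
  e <- exp(theta - max(theta))
  e / sum(e)
}
ensemble_risk <- function(theta) {
  mean((dat$y - cv_preds %*% softmax(theta))^2)
}
fit <- optim(rep(0, ncol(cv_preds)), ensemble_risk, method = "BFGS")
weights <- setNames(softmax(fit$par), colnames(cv_preds))
ensemble_cv_risk <- fit$value

print(round(cv_risk, 4))
print(discrete_choice)
print(round(weights, 4))
```

Both variants start from the same matrix of out-of-fold predictions; the discrete version takes an argmin over its columns, while the ensemble version optimizes a convex combination of them. In a full analysis the chosen learner, or every learner in the weighted combination, would then be refit on all of the data before predicting.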

Installation

Install the most recent stable release from GitHub via devtools:

devtools::install_github("jeremyrcoyle/sl3")

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.


Documentation

The best place to start is the package vignettes.


Examples

sl3 makes it straightforward to apply screening algorithms and learning algorithms, combine both types of algorithms into a stacked regression model, and cross-validate the whole process. The best way to understand this is to see the sl3 package in action:

set.seed(49753)

# packages we'll be using
library(data.table)
library(SuperLearner)
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
library(origami)
library(sl3)

# load example data set
data(cpp_imputed)

# here are the covariates we are interested in and, of course, the outcome
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"

task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = outcome, outcome_type = "continuous")

# set up screeners and learners via built-in functions and pipelines
slscreener <- make_learner(Lrnr_pkg_SuperLearner_screener, "screen.glmnet")
glm_learner <- make_learner(Lrnr_glm)
screen_and_glm <- make_learner(Pipeline, slscreener, glm_learner)
lrnr_glmnet <- make_learner(Lrnr_glmnet)

# stack learners into a model (including screeners and pipelines)
learner_stack <- make_learner(Stack, lrnr_glmnet, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-13
preds <- stack_fit$predict()
head(preds)
#>    Lrnr_glmnet_NULL_deviance_10_1_100   Lrnr_glm
#> 1:                         0.35345519 0.36298498
#> 2:                         0.35345519 0.36298498
#> 3:                         0.24554305 0.25993072
#> 4:                         0.24554305 0.25993072
#> 5:                         0.24554305 0.25993072
#> 6:                         0.02953193 0.05680264
#>    Lrnr_pkg_SuperLearner_screener_screen.glmnet___Lrnr_glm
#> 1:                                              0.36228209
#> 2:                                              0.36228209
#> 3:                                              0.25870995
#> 4:                                              0.25870995
#> 5:                                              0.25870995
#> 6:                                              0.05600958

Contributions

It is our hope that sl3 will grow to be widely used for creating stacked regression models and cross-validating the pipelines that make up such models, as well as for the other applications in which the Super Learner algorithm plays a role. To that end, contributions are very welcome; we ask that interested contributors consult our contribution guidelines before submitting a pull request.


After using the sl3 R package, please cite the following:

    @misc{coyle2017sl3,
      author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
        Sofrygin, Oleg},
      title = {{sl3}: Modern Pipelines for Machine Learning and {Super
        Learning}},
      year  = {2017},
      howpublished = {\url{https://github.com/jeremyrcoyle/sl3}},
      url = {http://dx.doi.org/DOI_HERE},
      doi = {DOI_HERE}
    }

License

© 2017 Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Oleg Sofrygin

The contents of this repository are distributed under the GPL-3 license. See file LICENSE for details.


References