Title: | Model Consistent Lasso Estimation Through the Bootstrap |
---|---|
Description: | Implements the bolasso algorithm for consistent variable selection and estimation accuracy. Includes support for many parallel backends via the future package. For details see: Bach (2008), 'Bolasso: model consistent Lasso estimation through the bootstrap', <doi:10.48550/arXiv.0804.1302>. |
Authors: | Daniel Molitor [aut, cre] |
Maintainer: | Daniel Molitor |
License: | MIT + file LICENSE |
Version: | 0.3.0 |
Built: | 2025-01-20 05:57:13 UTC |
Source: | https://github.com/dmolitor/bolasso |
This function implements model-consistent Lasso estimation through the bootstrap. It supports parallel processing by way of the future package, allowing the user to flexibly specify many parallelization methods. This method was developed as a variable-selection algorithm, but this package also supports making ensemble predictions on new data using the bagged Lasso models.
bolasso( formula, data, n.boot = 100, progress = TRUE, implement = c("glmnet", "gamlr"), x = NULL, y = NULL, fast = FALSE, ... )
formula | An optional object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. Can be omitted when |
data | An optional object of class data.frame that contains the modeling variables referenced in |
n.boot | An integer specifying the number of bootstrap replicates. |
progress | A boolean indicating whether to display progress across bootstrap folds. |
implement | A character; either 'glmnet' or 'gamlr', specifying which Lasso implementation to utilize. For specific modeling details, see |
x | An optional predictor matrix in lieu of |
y | An optional response vector in lieu of |
fast | A boolean. Whether or not to fit a "fast" bootstrap procedure. If |
... | Additional parameters to pass to either |
An object of class bolasso. This object is a list of length n.boot of cv.glmnet or cv.gamlr objects.
See glmnet::cv.glmnet and gamlr::cv.gamlr for full details on the respective implementations and arguments that can be passed to `...`.
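The procedure described above can be sketched directly with glmnet. This is a minimal illustration of the idea in Bach (2008), not the package's internal code: the simulated data, the object names (`fits`, `support`, `inclusion`, `bagged`), and the 0.9 cutoff are all assumptions chosen for the example.

```r
# Illustrative Bolasso sketch; assumes the glmnet package is installed.
library(glmnet)

set.seed(42)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)  # only x1 and x2 carry signal

# Fit one cross-validated Lasso per bootstrap resample of the rows.
n_boot <- 20
fits <- lapply(seq_len(n_boot), function(b) {
  idx <- sample(n, replace = TRUE)
  cv.glmnet(x[idx, ], y[idx], nfolds = 5)
})

# Record which coefficients are non-zero in each replicate (intercept dropped).
support <- sapply(fits, function(fit) {
  beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
  beta != 0
})

# Share of replicates in which each covariate is selected.
inclusion <- rowMeans(support)
selected <- names(inclusion)[inclusion >= 0.9]

# Bagged prediction: average the per-replicate predictions.
bagged <- rowMeans(sapply(fits, function(fit) {
  predict(fit, newx = x, s = "lambda.min")
}))
```

Requiring a covariate to appear in (nearly) every replicate is what filters out the noise variables that a single Lasso fit tends to pick up.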
mtcars[, c(2, 10:11)] <- lapply(mtcars[, c(2, 10:11)], as.factor)
idx <- sample(nrow(mtcars), 22)
mtcars_train <- mtcars[idx, ]
mtcars_test <- mtcars[-idx, ]

## Formula Interface

# Train model
set.seed(123)
bolasso_form <- bolasso(
  formula = mpg ~ .,
  data = mtcars_train,
  n.boot = 20,
  nfolds = 5
)

# Retrieve a tidy tibble of bootstrap coefficients for each covariate
tidy(bolasso_form)

# Extract selected variables
selected_variables(bolasso_form, threshold = 0.9, select = "lambda.min")

# Bagged ensemble prediction on test data
predict(bolasso_form, new.data = mtcars_test, select = "lambda.min")

## Alternate Matrix Interface

# Train model
set.seed(123)
bolasso_mat <- bolasso(
  x = model.matrix(mpg ~ . - 1, mtcars_train),
  y = mtcars_train[, 1],
  data = mtcars_train,
  n.boot = 20,
  nfolds = 5
)

# Bagged ensemble prediction on test data
predict(
  bolasso_mat,
  new.data = model.matrix(mpg ~ . - 1, mtcars_test),
  select = "lambda.min"
)
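Because parallelism is delegated to the future package, a backend is typically chosen with `future::plan()` before calling `bolasso()`. The sketch below assumes the bolasso and future packages are installed; the `multisession` backend and the worker count of 2 are arbitrary choices for illustration, not recommendations from the package.

```r
# Assumes the bolasso and future packages are installed.
library(bolasso)

# Run the bootstrap replicates on two background R sessions.
future::plan(future::multisession, workers = 2)

set.seed(123)
fit <- bolasso(
  formula = mpg ~ .,
  data = mtcars,
  n.boot = 20,
  progress = FALSE,
  nfolds = 5
)

# Return to sequential processing when done.
future::plan(future::sequential)
```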
bolasso object.
The method plots coefficient distributions for the selected covariates in the bolasso model. If there are more than 30 selected covariates, this will plot the 30 selected covariates with the largest absolute mean coefficient. The user can also plot coefficient distributions for a specified subset of selected covariates.
plot_selected_variables( x, covariates = NULL, threshold = 0.95, method = c("vip", "qnt"), ... )
x | An object of class bolasso or |
covariates | A subset of the selected covariates to plot. This should be a vector of covariate names either as strings or bare. E.g. |
threshold | A numeric between 0 and 1, specifying the variable selection threshold to use. |
method | The variable selection method to use. The two valid options are |
... | Additional arguments to pass to |
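The 30-covariate cap described for this plot can be pictured with a small base R sketch. It assumes a replicate-by-covariate coefficient matrix and reads "largest absolute mean coefficient" as the absolute value of each column mean; the matrix `coefs` and the names `max_vars` and `keep` are hypothetical, and the package's own ranking code may differ.

```r
# Simulated bootstrap coefficients: 50 replicates (rows) by 40 covariates
# (columns); column j is centred on j / 10. Values are made up.
set.seed(7)
n_boot <- 50
n_cov <- 40
coefs <- matrix(
  rnorm(n_boot * n_cov, mean = rep(seq_len(n_cov), each = n_boot) / 10),
  nrow = n_boot,
  ncol = n_cov,
  dimnames = list(NULL, paste0("x", seq_len(n_cov)))
)

# Rank covariates by the absolute value of their mean coefficient and keep
# at most `max_vars` of them for plotting.
max_vars <- 30
abs_mean <- abs(colMeans(coefs))
keep <- names(sort(abs_mean, decreasing = TRUE))[seq_len(min(max_vars, n_cov))]
```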
Plot the results of the selection_thresholds function.
plot_selection_thresholds(object = NULL, data = NULL, ...)
object | An object of class bolasso or |
data | A dataframe containing the selection thresholds. E.g. obtained via |
... | Additional arguments to pass directly to selection_thresholds. |
A ggplot object.
bolasso object
The method plots coefficient distributions for the covariates included in the bolasso model. If there are more than 30 covariates included in the full model, this will plot the 30 covariates with the largest absolute mean coefficient. The user can also plot coefficient distributions for a specified subset of covariates.
## S3 method for class 'bolasso'
plot(x, covariates = NULL, ...)
x | An object of class bolasso or |
covariates | A subset of the covariates to plot. This should be a vector of covariate names either as strings or bare. E.g. |
... | Additional arguments to pass directly to |
Identifies covariates that are selected by the Bolasso algorithm at the user-defined threshold. There are two variable selection criteria to choose between: Variable Inclusion Probability ("vip"), introduced in the original Bolasso paper (Bach, 2008) and further developed by Bunea et al. (2011), and the Quantile ("qnt") approach proposed by Abram et al. (2016). The desired threshold value is 1 - alpha, where alpha is some (typically small) significance level.
selected_variables( object, threshold = 0.95, method = c("vip", "qnt"), var_names_only = FALSE, ... )
object | An object of class bolasso. |
threshold | A numeric between 0 and 1, specifying the variable selection threshold to use. |
method | The variable selection method to use. The two valid options are |
var_names_only | A boolean value. When |
... | Additional arguments to pass to |
This function returns either a tibble::tibble of selected covariates and their corresponding coefficients across all bootstrap replicates, or a vector of selected covariate names.
A tibble with each selected variable and its respective coefficient for each bootstrap replicate OR a vector of the names of all selected variables.
See glmnet::coef.glmnet() and gamlr:::coef.gamlr for details on additional arguments to pass to `...`.
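The difference between the two criteria is easiest to see on a toy example. The base R sketch below uses a made-up matrix of bootstrap coefficients and one plausible reading of each rule: VIP as the share of replicates with a non-zero coefficient, and QNT as a central (1 - alpha) quantile interval that excludes zero. The QNT form in particular is an assumption here; the package's implementation may differ in detail.

```r
# Toy bootstrap coefficients: one row per replicate, one column per
# covariate. The values are made up for illustration.
coefs <- cbind(
  strong = seq(1, 2, length.out = 100),                  # always positive
  sparse = c(rep(0, 40), seq(0.1, 1, length.out = 60)),  # zero in 40% of replicates
  flip   = c(rep(-1, 10), rep(1, 90))                    # never zero, sign changes
)

threshold <- 0.95
alpha <- 1 - threshold

# VIP: keep covariates whose coefficient is non-zero in at least
# `threshold` of the replicates.
vip <- colMeans(coefs != 0)
vip_selected <- names(vip)[vip >= threshold]

# QNT (assumed form): keep covariates whose central (1 - alpha) quantile
# interval of bootstrap coefficients excludes zero.
bounds <- apply(coefs, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2))
qnt_selected <- colnames(coefs)[bounds[1, ] > 0 | bounds[2, ] < 0]
```

In this toy data `flip` passes the VIP rule because it is never exactly zero, but fails the QNT rule because its interval spans zero.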
There are two methods of variable selection for covariates. The first is the Variable Inclusion Probability (VIP) introduced by Bach (2008) and generalized by Bunea et al. (2011). The second is the Quantile confidence interval (QNT) proposed by Abram et al. (2016). For a given level of significance alpha, each method selects covariates for the given threshold = 1 - alpha. The higher the threshold (lower alpha), the more stringent the variable selection criterion.
selection_thresholds(object, grid = seq(0, 1, by = 0.01), ...)
object | An object of class bolasso or |
grid | A vector of numbers between 0 and 1 (inclusive) specifying the grid of threshold values at which to evaluate the variable inclusion criteria. Defaults to |
... | Additional parameters to pass to |
This function returns a tibble that reports, for each covariate, the largest threshold (equivalently, the smallest alpha) at which it would be selected under each of the VIP and QNT methods. Consequently, the number of rows in the returned tibble is 2*p, where p is the number of covariates included in the model.
A tibble with dimension (2*p)x5 where p is the number of covariates.
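As a rough illustration of the grid search described above, the base R sketch below finds, for each covariate and each rule, the largest grid value at which the covariate would still be selected. The coefficient matrix is made up, the QNT rule is assumed to be a central quantile interval excluding zero, and the helper names (`max_threshold`, `vip_rule`, `qnt_rule`, `largest`) are hypothetical. The result is a simplified three-column data frame, not the five-column tibble the function returns.

```r
# Toy bootstrap coefficients (rows = replicates, columns = covariates).
coefs <- cbind(
  strong = seq(1, 2, length.out = 100),
  sparse = c(rep(0, 40), seq(0.1, 1, length.out = 60)),
  flip   = c(rep(-1, 10), rep(1, 90))
)
grid <- seq(0, 1, by = 0.01)

# Largest grid value at which `is_selected(threshold)` is TRUE, else NA.
max_threshold <- function(is_selected) {
  ok <- vapply(grid, is_selected, logical(1))
  if (any(ok)) max(grid[ok]) else NA_real_
}

# VIP rule: share of non-zero coefficients reaches the threshold
# (the small tolerance guards against floating-point grid values).
vip_rule <- function(x) function(t) mean(x != 0) + 1e-9 >= t

# QNT rule (assumed form): central quantile interval excludes zero.
qnt_rule <- function(x) function(t) {
  a <- 1 - t
  q <- quantile(x, probs = c(a / 2, 1 - a / 2), names = FALSE)
  q[1] > 0 || q[2] < 0
}

# Apply one rule to every covariate.
largest <- function(rule) {
  vapply(colnames(coefs), function(v) max_threshold(rule(coefs[, v])), numeric(1))
}

# One row per covariate per method, i.e. 2 * p rows.
thresholds <- data.frame(
  covariate = rep(colnames(coefs), times = 2),
  method = rep(c("vip", "qnt"), each = ncol(coefs)),
  threshold = unname(c(largest(vip_rule), largest(qnt_rule)))
)
```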
Tidy a bolasso object
## S3 method for class 'bolasso'
tidy(x, select = c("lambda.min", "lambda.1se", "min", "1se"), ...)
x | A |
select | One of "min", "1se", "lambda.min", "lambda.1se". "min" and "lambda.min" are equivalent and refer to the lambda value that minimizes the cross-validated MSE. Similarly, "1se" and "lambda.1se" are equivalent and refer to the lambda that achieves the most regularization while remaining within one standard error of the minimal cross-validated MSE. |
... | Additional arguments to pass directly to |
A tidy tibble::tibble() summarizing bootstrap-level coefficients for each covariate.
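The two lambda choices named by `select` correspond to fields that glmnet stores on every cross-validated fit. The sketch below shows them on a single `cv.glmnet` model with simulated data; it assumes the glmnet package is installed and says nothing about how the gamlr implementation stores its equivalents.

```r
# Assumes the glmnet package is installed; data are simulated.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 8), 100, 8)
y <- 2 * x[, 1] + rnorm(100)

fit <- cv.glmnet(x, y, nfolds = 5)

fit$lambda.min  # lambda with the smallest cross-validated error
fit$lambda.1se  # largest lambda within one standard error of that minimum

# Number of non-zero coefficients (intercept dropped) under each choice.
n_min <- sum(as.matrix(coef(fit, s = "lambda.min"))[-1, 1] != 0)
n_1se <- sum(as.matrix(coef(fit, s = "lambda.1se"))[-1, 1] != 0)
```

Because `lambda.1se` is never smaller than `lambda.min`, it applies at least as much shrinkage and usually yields a sparser model.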
Predict whether customers will make a specific transaction based on a rich set of user features.
transactions
Dataframe with columns
An integer indicating whether a customer engaged in a transaction.
200 numeric features of various customer characteristics.