Model descriptions

This package implements a number of models.

Ridge

Let $X \in \mathbb{R}^{n \times p}$ be a feature matrix with $n$ samples and $p$ features, $y \in \mathbb{R}^n$ a target vector, and $\alpha > 0$ a fixed regularization hyperparameter. Ridge regression [1] defines the weight vector $b \in \mathbb{R}^p$ as:

$$b = \arg\min_b \|Xb - y\|_2^2 + \alpha \|b\|_2^2.$$

This problem has a closed-form solution $b = My$, where $M = (X^\top X + \alpha I_p)^{-1} X^\top \in \mathbb{R}^{p \times n}$.

This model is implemented in Ridge.
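As an illustration of the closed-form solution (not the package's actual implementation), here is a minimal NumPy sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 100, 20, 1.0                  # hypothetical sizes and regularization
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed-form ridge solution: b = (X^T X + alpha * I_p)^{-1} X^T y
b = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
```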

KernelRidge

By the Woodbury matrix identity, $b$ can be written as $b = X^\top (XX^\top + \alpha I_n)^{-1} y$, i.e. $b = X^\top w$ for some $w \in \mathbb{R}^n$. Denoting the linear kernel $K = XX^\top \in \mathbb{R}^{n \times n}$, this leads to the equivalent formulation:

$$w = \arg\min_w \|Kw - y\|_2^2 + \alpha\, w^\top K w.$$

This model can be extended to arbitrary positive semidefinite kernels $K$, leading to the more general kernel ridge regression [2].

This model is implemented in KernelRidge.
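A minimal NumPy sketch (illustrative only, with hypothetical data) checking that the dual solution recovers the same weights as the primal one when the kernel is linear:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 50, 200, 1.0                  # more features than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal solution: solve a p x p system
b_primal = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Dual solution: solve an n x n system on the linear kernel K = X X^T
K = X @ X.T
w = np.linalg.solve(K + alpha * np.eye(n), y)
b_dual = X.T @ w

print(np.allclose(b_primal, b_dual))        # True: both forms give the same weights
```

With $p \gg n$, the dual form only requires solving an $n \times n$ system, which is why the kernel formulation is preferred when there are more features than samples.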

RidgeCV and KernelRidgeCV

In practice, because the ridge regression and kernel ridge regression hyperparameter $\alpha$ is unknown, it is typically selected through a grid search with cross-validation. In cross-validation, we split the data set into a training set $(X_{\text{train}}, y_{\text{train}})$ and a validation set $(X_{\text{val}}, y_{\text{val}})$. Then, we train the model on the training set, and evaluate its generalization performance on the validation set. We repeat this process for multiple hyperparameter candidates $\alpha$, typically defined over a grid of log-spaced values. Finally, we keep the candidate leading to the best generalization performance, as measured by the validation loss averaged over all cross-validation splits.

These models are implemented in RidgeCV and KernelRidgeCV.
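A simplified, single-split version of this hyperparameter search can be sketched with NumPy alone (real cross-validation averages over several splits; this sketch is illustrative and does not use the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Single train/validation split (cross-validation would average over several splits)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

alphas = np.logspace(-3, 3, 13)             # grid of log-spaced candidates
val_losses = []
for alpha in alphas:
    b = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(p), X_train.T @ y_train)
    val_losses.append(np.mean((X_val @ b - y_val) ** 2))

best_alpha = alphas[np.argmin(val_losses)]
```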

GroupRidgeCV / BandedRidgeCV

In some applications, features are naturally partitioned into groups (or feature spaces). To adapt the regularization level to each feature space, ridge regression can be extended to group-regularized ridge regression (also known as banded ridge regression [3]). In this model, a separate hyperparameter $\alpha_i$ is optimized for each of the $m$ feature spaces $X_i$:

$$b = \arg\min_b \left\| \sum_{i=1}^m X_i b_i - y \right\|_2^2 + \sum_{i=1}^m \alpha_i \|b_i\|_2^2.$$

This is equivalent to solving a ridge regression:

$$b = \arg\min_b \|Zb - y\|_2^2 + \|b\|_2^2,$$

where each feature space $X_i$ is scaled by a group scaling $Z_i = e^{\delta_i} X_i$. The hyperparameters are then learned over cross-validation through the log-scalings $\delta_i$ (with $\alpha_i = e^{-2\delta_i}$).

This model is implemented in GroupRidgeCV, also available as BandedRidgeCV.
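To make the feature-scaling equivalence concrete, here is an illustrative NumPy check (not the package's solver), assuming two hypothetical feature spaces and fixed log-scalings $\delta_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 100, 10, 30
X1 = rng.standard_normal((n, p1))           # hypothetical feature space 1
X2 = rng.standard_normal((n, p2))           # hypothetical feature space 2
y = rng.standard_normal(n)

deltas = np.array([0.5, -1.0])              # fixed log-scalings (illustration only)
alphas = np.exp(-2 * deltas)                # corresponding per-space regularizations

# Direct group-regularized solution: (X^T X + blockdiag(alpha_i * I))^{-1} X^T y
X = np.hstack([X1, X2])
D = np.diag(np.concatenate([np.full(p1, alphas[0]), np.full(p2, alphas[1])]))
b_direct = np.linalg.solve(X.T @ X + D, X.T @ y)

# Equivalent standard ridge on scaled features Z_i = exp(delta_i) * X_i
Z = np.hstack([np.exp(deltas[0]) * X1, np.exp(deltas[1]) * X2])
c = np.linalg.solve(Z.T @ Z + np.eye(p1 + p2), Z.T @ y)
scales = np.concatenate([np.full(p1, np.exp(deltas[0])), np.full(p2, np.exp(deltas[1]))])
b_scaled = scales * c                       # map back to the unscaled feature space

print(np.allclose(b_direct, b_scaled))      # True
```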

See also multiple-kernel ridge regression, which is equivalent to group-regularized ridge regression when using one linear kernel per group of features (see MultipleKernelRidgeCV below).

Note

“Group ridge regression” is also sometimes called “Banded ridge regression”.

WeightedKernelRidge

To extend kernel ridge regression to group regularization, we can compute the kernel as a weighted sum of multiple kernels, $K = \sum_{i=1}^m e^{\delta_i} K_i$. Then, we can use $K_i = X_i X_i^\top$ for different groups of features $X_i$. The model becomes:

$$w = \arg\min_w \left\| \sum_{i=1}^m e^{\delta_i} K_i w - y \right\|_2^2 + \alpha \sum_{i=1}^m e^{\delta_i}\, w^\top K_i w.$$

This model is called weighted kernel ridge regression. Here, the log-kernel weights $\delta_i$ are fixed. When all targets use the same log-kernel weights, a single weighted kernel can be precomputed and used in a kernel ridge regression. However, when the log-kernel weights differ across targets, the kernel sum cannot be precomputed, and the model requires specific algorithms to be fit.
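For the precomputed-kernel case (same log-kernel weights for all targets), a minimal NumPy sketch with hypothetical data; for a full-rank weighted kernel, the objective above is minimized by the usual kernel ridge solution computed on the weighted kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 80, 1.0
X1 = rng.standard_normal((n, 15))           # hypothetical feature spaces
X2 = rng.standard_normal((n, 40))
y = rng.standard_normal(n)
deltas = np.array([0.0, -2.0])              # fixed log-kernel weights

# Weighted sum of linear kernels: K = sum_i exp(delta_i) * X_i X_i^T
K = np.exp(deltas[0]) * (X1 @ X1.T) + np.exp(deltas[1]) * (X2 @ X2.T)

# Kernel ridge solution on the precomputed weighted kernel
w = np.linalg.solve(K + alpha * np.eye(n), y)
```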

MultipleKernelRidgeCV

In weighted kernel ridge regression, when the log-kernel weights $\delta_i$ are unknown, we can learn them over cross-validation. This model is called multiple-kernel ridge regression. When the kernels are defined as $K_i = X_i X_i^\top$ for different groups of features $X_i$, multiple-kernel ridge regression is equivalent to group-regularized ridge regression (aka banded ridge regression).

This model is implemented in MultipleKernelRidgeCV.
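As with $\alpha$ above, a simplified single-split grid search over candidate log-kernel weights illustrates the idea (the package relies on dedicated, more efficient solvers; everything below is a hypothetical sketch):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, alpha = 120, 1.0
X1 = rng.standard_normal((n, 10))
X2 = rng.standard_normal((n, 30))
y = X1 @ rng.standard_normal(10) + 0.1 * rng.standard_normal(n)

train, val = slice(0, 90), slice(90, None)
candidates = np.linspace(-3.0, 3.0, 7)      # grid of candidate log-kernel weights

best_loss, best_deltas = np.inf, None
for d1, d2 in product(candidates, candidates):
    # Weighted kernels between training samples, and between validation and training samples
    K_train = np.exp(d1) * X1[train] @ X1[train].T + np.exp(d2) * X2[train] @ X2[train].T
    K_val = np.exp(d1) * X1[val] @ X1[train].T + np.exp(d2) * X2[val] @ X2[train].T
    w = np.linalg.solve(K_train + alpha * np.eye(90), y[train])
    loss = np.mean((K_val @ w - y[val]) ** 2)
    if loss < best_loss:
        best_loss, best_deltas = loss, (d1, d2)
```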

Model flowchart

The following flowchart can be used as a guide to select the right estimator.

[Flowchart omitted. It first asks "How many feature spaces?" (one or multiple), then "Data size?" (more samples or more features), then "Hyperparameters?" (known or unknown), and points to one of Ridge, RidgeCV, KernelRidge, KernelRidgeCV, WeightedKernelRidge, MultipleKernelRidgeCV, or BandedRidgeCV.]

References

[1] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

[2] Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables.

[3] Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). Voxelwise encoding models with non-spherical multivariate normal priors. NeuroImage, 197, 482-492.