Model descriptions

This package implements the following regularized regression models.

Ridge

Let \(X\in \mathbb{R}^{n\times p}\) be a feature matrix with \(n\) samples and \(p\) features, \(y\in \mathbb{R}^n\) a target vector, and \(\alpha > 0\) a fixed regularization hyperparameter. Ridge regression [1] defines the weight vector \(b^*\in \mathbb{R}^p\) as:

\[b^* = \arg\min_b \|Xb - y\|_2^2 + \alpha \|b\|_2^2.\]

This minimization problem has a closed-form solution \(b^* = M y\), where \(M = (X^\top X + \alpha I_p)^{-1}X^\top \in \mathbb{R}^{p \times n}\).
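For illustration, here is a minimal NumPy sketch of this closed-form solution on toy data (the sizes and variable names are arbitrary; himalaya itself uses dedicated solvers):

```python
import numpy as np

# toy data: n samples, p features (arbitrary sizes for illustration)
rng = np.random.default_rng(0)
n, p, alpha = 50, 10, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# closed-form solution: M = (X^T X + alpha I_p)^{-1} X^T, then b* = M y
M = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
b = M @ y
```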

This model is implemented in Ridge.

KernelRidge

By the Woodbury matrix identity, \(b^*\) can be written as \(b^* = X^\top(XX^\top + \alpha I_n)^{-1}y\), that is, \(b^* = X^\top w^*\) for some \(w^*\in \mathbb{R}^n\). Writing the linear kernel as \(K = X X^\top \in \mathbb{R}^{n\times n}\), this leads to the equivalent formulation:

\[w^* = \arg\min_w \|Kw - y\|_2^2 + \alpha w^\top Kw.\]

This model can be extended to arbitrary positive semidefinite kernels \(K\), leading to the more general kernel ridge regression [2].
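The primal/dual equivalence can be checked numerically. The following NumPy sketch compares the two expressions for \(b^*\) on toy data (arbitrary sizes, linear kernel only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 50, 10, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# primal solution: b* = (X^T X + alpha I_p)^{-1} X^T y
b_primal = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# dual solution: w* = (K + alpha I_n)^{-1} y with K = X X^T, then b* = X^T w*
K = X @ X.T
w = np.linalg.solve(K + alpha * np.eye(n), y)
b_dual = X.T @ w

assert np.allclose(b_primal, b_dual)
```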

This model is implemented in KernelRidge.

RidgeCV and KernelRidgeCV

In practice, because the ridge regression and kernel ridge regression hyperparameter \(\alpha\) is unknown, it is typically selected through a grid-search with cross-validation. In cross-validation, we split the data set into a training set \((X_{train}, y_{train})\) and a validation set \((X_{val}, y_{val})\). Then, we train the model on the training set, and evaluate the generalization performance on the validation set. We perform this process for multiple hyperparameter candidates \(\alpha\), typically defined over a grid of log-spaced values. Finally, we keep the candidate leading to the best generalization performance, as measured by the validation loss, averaged over all cross-validation splits.

These models are implemented in RidgeCV and KernelRidgeCV.
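For example, a grid of log-spaced candidates can be passed to RidgeCV. This is a usage sketch assuming the scikit-learn-compatible `fit`/`predict` API and the `alphas` and `cv` parameters; see the class documentation for the exact signature:

```python
import numpy as np
from himalaya.ridge import RidgeCV

# toy data (arbitrary sizes for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = rng.standard_normal((100, 3))  # himalaya fits multiple targets jointly

# grid of log-spaced regularization candidates
alphas = np.logspace(-5, 5, 11)

model = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross-validation
model.fit(X, Y)
predictions = model.predict(X)
```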

GroupRidgeCV / BandedRidgeCV

In some applications, features are naturally organized into groups (also called feature spaces). To adapt the regularization level to each feature space, ridge regression can be extended to group-regularized ridge regression (also known as banded ridge regression [3]). In this model, a separate hyperparameter is optimized for each feature space:

\[b^* = \arg\min_b \|\sum_{i=1}^m X_i b_i - y\|_2^2 + \sum_{i=1}^m \alpha_i \|b_i\|_2^2.\]

This is equivalent to solving a ridge regression:

\[b^* = \arg\min_b \|Z b - y\|_2^2 + \|b\|_2^2\]

where each feature space \(X_i\) is scaled by a group scaling \(Z_i = e^{\delta_i} X_i\). The hyperparameters \(\delta_i = - \log(\alpha_i)\) are then learned by cross-validation.
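As an illustration of this rescaling, the following NumPy sketch builds \(Z\) by concatenating the scaled feature spaces and solves the resulting ridge problem with a unit penalty. The log-scalings \(\delta_i\) are fixed by hand here, whereas the estimator below learns them by cross-validation:

```python
import numpy as np

# two feature spaces of different sizes (toy data)
rng = np.random.default_rng(0)
n = 50
X_list = [rng.standard_normal((n, 10)), rng.standard_normal((n, 5))]
y = rng.standard_normal(n)

# fixed log-scalings, one per feature space (learned by cross-validation in practice)
deltas = np.array([0.5, -1.0])
Z = np.concatenate([np.exp(d) * Xi for d, Xi in zip(deltas, X_list)], axis=1)

# ridge regression with unit regularization on the rescaled features
b_scaled = np.linalg.solve(Z.T @ Z + np.eye(Z.shape[1]), Z.T @ y)
```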

This model is implemented in GroupRidgeCV, also available as BandedRidgeCV.

See also multiple-kernel ridge regression (described below), which is equivalent to group-regularized ridge regression when using one linear kernel per group of features.

Note

“Group ridge regression” is also sometimes called “Banded ridge regression”.

WeightedKernelRidge

To extend kernel ridge regression to group regularization, we can compute the kernel as a weighted sum of multiple kernels, \(K = \sum_{i=1}^m e^{\delta_i} K_i\). Using \(K_i = X_i X_i^\top\) for the different groups of features \(X_i\), the model becomes:

\[w^* = \arg\min_w \left\|\sum_{i=1}^m e^{\delta_i} K_{i} w - y\right\|_2^2 + \alpha \sum_{i=1}^m e^{\delta_i} w^\top K_{i} w.\]

This model is called weighted kernel ridge regression. Here, the log-kernel-weights \(\delta_i\) are fixed. When all targets use the same log-kernel-weights, a single weighted kernel can be precomputed and used in a kernel ridge regression. However, when the log-kernel-weights differ across targets, the kernel sum cannot be precomputed, and the model requires specific algorithms to be fit.
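As a minimal NumPy sketch, for fixed log-kernel-weights the weighted kernel can be summed explicitly and plugged into the same dual system as in kernel ridge regression (toy data, single target):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 50, 1.0
X_list = [rng.standard_normal((n, 10)), rng.standard_normal((n, 5))]
y = rng.standard_normal(n)

deltas = np.array([0.5, -1.0])              # fixed log-kernel-weights
Ks = [Xi @ Xi.T for Xi in X_list]           # one linear kernel per feature space
K = sum(np.exp(d) * Ki for d, Ki in zip(deltas, Ks))

# dual weights, as in kernel ridge regression with the weighted kernel
w = np.linalg.solve(K + alpha * np.eye(n), y)
```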

MultipleKernelRidgeCV

In weighted kernel ridge regression, when the log-kernel-weights \(\delta_i\) are unknown, we can learn them by cross-validation. This model is called multiple-kernel ridge regression. When the kernels are defined as \(K_i = X_i X_i^\top\) for different groups of features \(X_i\), multiple-kernel ridge regression is equivalent to group-ridge regression (aka banded ridge regression).
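Conceptually, the log-kernel-weights are selected like the hyperparameter grid search described above, but over candidate weight vectors. The following NumPy sketch evaluates a few hand-picked candidates on a single train/validation split; himalaya's actual solvers are far more efficient, so this only illustrates the selection criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 80, 1.0
X_list = [rng.standard_normal((n, 10)), rng.standard_normal((n, 5))]
y = X_list[0] @ rng.standard_normal(10) + 0.1 * rng.standard_normal(n)

train, val = np.arange(60), np.arange(60, n)  # single split for illustration
candidates = [np.zeros(2), np.array([2.0, -2.0]), np.array([-2.0, 2.0])]

best_loss, best_deltas = np.inf, None
for deltas in candidates:
    # weighted kernel restricted to the training samples
    K_train = sum(np.exp(d) * Xi[train] @ Xi[train].T
                  for d, Xi in zip(deltas, X_list))
    w = np.linalg.solve(K_train + alpha * np.eye(len(train)), y[train])
    # cross-kernel between validation and training samples
    K_val = sum(np.exp(d) * Xi[val] @ Xi[train].T
                for d, Xi in zip(deltas, X_list))
    loss = np.mean((K_val @ w - y[val]) ** 2)
    if loss < best_loss:
        best_loss, best_deltas = loss, deltas
```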

This model is implemented in MultipleKernelRidgeCV.

Model flowchart

The following flowchart can be used as a guide to select the right estimator.

graph TD;
    A(How many feature spaces?)
    O(Data size?)
    M(Data size?)
    OR(Hyperparameters?)
    OK(Hyperparameters?)
    MR(Hyperparameters?)
    MK(Hyperparameters?)
    A-- one-->O;
    A--multiple-->M;
    O--more samples-->OR;
    O--more features-->OK;
    M--more samples-->MR;
    M--more features-->MK;
    OK--known-->OKH[KernelRidge];
    OK--unknown-->OKCV[KernelRidgeCV];
    OR--known-->ORH[Ridge];
    OR--unknown-->ORCV[RidgeCV];
    MK--known-->MKH[WeightedKernelRidge];
    MK--unknown-->MKCV[MultipleKernelRidgeCV];
    MR--unknown-->MRCV[BandedRidgeCV];
    MR--known-->MKH;
    classDef fork fill:#FFDC97
    class A,O,M,OR,OK,MR,MK fork;
    classDef leaf fill:#ABBBE1
    class ORH,OKH,MRH,MKH leaf;
    class ORCV,OKCV,MRCV,MKCV leaf;
    click ORH "https://gallantlab.github.io/himalaya/_generated/himalaya.ridge.Ridge.html"
    click ORCV "https://gallantlab.github.io/himalaya/_generated/himalaya.ridge.RidgeCV.html"
    click MRCV "https://gallantlab.github.io/himalaya/_generated/himalaya.ridge.BandedRidgeCV.html"
    click OKH "https://gallantlab.github.io/himalaya/_generated/himalaya.kernel_ridge.KernelRidge.html"
    click OKCV "https://gallantlab.github.io/himalaya/_generated/himalaya.kernel_ridge.KernelRidgeCV.html"
    click MKH "https://gallantlab.github.io/himalaya/_generated/himalaya.kernel_ridge.WeightedKernelRidge.html"
    click MKCV "https://gallantlab.github.io/himalaya/_generated/himalaya.kernel_ridge.MultipleKernelRidgeCV.html"

References

[1] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

[2] Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables.

[3] Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). Voxelwise encoding models with non-spherical multivariate normal priors. NeuroImage, 197, 482-492.