.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "_auto_examples/shortclips/02_plot_ridge_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download__auto_examples_shortclips_02_plot_ridge_regression.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr__auto_examples_shortclips_02_plot_ridge_regression.py:


================================================
Understand ridge regression and cross-validation
================================================

In future examples, we will model the fMRI responses using a regularized
linear regression known as *ridge regression*. This example explains why we
use ridge regression, and how to use cross-validation to select the
appropriate regularization hyperparameter.

Linear regression is a method to model the relation between some input
variables :math:`X \in \mathbb{R}^{(n \times p)}` (the features) and an output
variable :math:`y \in \mathbb{R}^{n}` (the target). Specifically, linear
regression uses a vector of coefficients :math:`w \in \mathbb{R}^{p}` to
predict the output

.. math::
    \hat{y} = Xw

The model is considered accurate if the predictions :math:`\hat{y}` are close
to the true output values :math:`y`. Therefore, a good linear regression model
is given by the vector :math:`w` that minimizes the sum of squared errors:

.. math::
    w = \arg\min_w ||Xw - y||^2

This is the simplest model for linear regression, and it is known as *ordinary
least squares* (OLS).

.. GENERATED FROM PYTHON SOURCE LINES 30-30

.. code-block:: Python
    :dedent: 1

.. GENERATED FROM PYTHON SOURCE LINES 32-42

Ordinary least squares (OLS)
----------------------------

To illustrate OLS, let's use a toy dataset with a single feature ``X[:, 0]``.
On the plot below (left panel), each dot is a sample ``(X[i, 0], y[i])``, and
the linear regression model is the line ``y = X[:, 0] * w[0]``. For each
sample, the error between the prediction and the true value is shown by a gray
line. By summing the squared errors over all samples, we get the squared loss.
Plotting the squared loss for every value of ``w`` leads to a parabola (right
panel).

.. GENERATED FROM PYTHON SOURCE LINES 42-50

.. code-block:: Python

    import numpy as np

    from voxelwise_tutorials.regression_toy import create_regression_toy
    from voxelwise_tutorials.regression_toy import plot_1d

    X, y = create_regression_toy(n_features=1)

    plot_1d(X, y, w=[0])

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_001.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 51-53

By varying the linear coefficient ``w``, we can change the prediction accuracy
of the model, and thus the squared loss.

.. GENERATED FROM PYTHON SOURCE LINES 53-54

.. code-block:: Python

    plot_1d(X, y, w=[0.2])

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_002.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 55-57

.. code-block:: Python

    plot_1d(X, y, w=[0.7])

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_003.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_003.png
   :class: sphx-glr-single-img

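
As a quick numerical check, we can also print the squared loss for a few
values of ``w``, including the ones plotted above. This small snippet is not
part of the original example; it only reuses the ``X`` and ``y`` created
above.

.. code-block:: Python

    # Squared loss ||Xw - y||^2 for a few candidate values of w, reusing the
    # single-feature toy data (X, y) created above.
    for w in [0.0, 0.2, 0.7, 1.0]:
        residuals = X[:, 0] * w - np.ravel(y)
        print(f"w = {w:.1f} -> squared loss = {np.sum(residuals ** 2):.2f}")

Since the squared loss is a parabola in ``w``, the candidate with the smallest
printed loss is the one closest to the minimum of the right panel.
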

.. GENERATED FROM PYTHON SOURCE LINES 58-65

The linear coefficient leading to the minimum squared loss can be found
analytically with the formula:

.. math::
    w = (X^\top X)^{-1} X^\top y

This is the OLS solution.

.. GENERATED FROM PYTHON SOURCE LINES 65-70

.. code-block:: Python

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    plot_1d(X, y, w=w_ols)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_004.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 71-76

Linear regression can also be used on more than one feature. On the next toy
dataset, we will use two features ``X[:, 0]`` and ``X[:, 1]``. The linear
regression model is now a plane. Here again, summing the squared errors over
all samples gives the squared loss. Plotting the squared loss for every value
of ``w[0]`` and ``w[1]`` leads to a 2D parabola (right panel).

.. GENERATED FROM PYTHON SOURCE LINES 76-82

.. code-block:: Python

    from voxelwise_tutorials.regression_toy import create_regression_toy
    from voxelwise_tutorials.regression_toy import plot_2d

    X, y = create_regression_toy(n_features=2)

    plot_2d(X, y, w=[0, 0], show_noiseless=False)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_005.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 83-84

.. code-block:: Python

    plot_2d(X, y, w=[0.4, 0], show_noiseless=False)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_006.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_006.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 85-87

.. code-block:: Python

    plot_2d(X, y, w=[0, 0.3], show_noiseless=False)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_007.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_007.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 88-94

Here again, the OLS solution can be found analytically with the same formula.
Note that the OLS solution is not equal to the ground-truth coefficients used
to generate the toy dataset (black cross), because we added some noise to the
target values ``y``. We want the solution we find to be as close as possible
to the ground-truth coefficients, because it will allow the regression to
generalize correctly to new data.

.. GENERATED FROM PYTHON SOURCE LINES 94-97

.. code-block:: Python

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    plot_2d(X, y, w=w_ols)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_008.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_008.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 98-107

The situation becomes more interesting when the features in ``X`` are
correlated. Here, we add a correlation between the first feature ``X[:, 0]``
and the second feature ``X[:, 1]``. With this correlation, the squared loss
function is no longer isotropic, so the lines of equal loss are now ellipses
instead of circles.
Thus, when starting from the OLS solution, moving ``w`` toward the top left
leads to a small change in the loss, whereas moving it toward the top right
leads to a large change in the loss. This anisotropy makes the OLS solution
less robust to noise in some particular directions (deviating more from the
ground-truth coefficients).

.. GENERATED FROM PYTHON SOURCE LINES 107-113

.. code-block:: Python

    X, y = create_regression_toy(n_features=2, correlation=0.9)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    plot_2d(X, y, w=w_ols)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_009.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_009.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 114-123

This difference in robustness to noise can be understood mathematically from
the fact that the OLS solution requires inverting the matrix
:math:`(X^\top X)`. The matrix inversion amounts to inverting the eigenvalues
:math:`\lambda_k` of the matrix. When the features are highly correlated, some
eigenvalues :math:`\lambda_k` are close to zero, and a small change in the
features can have a large effect on the inverse. Thus, having small
eigenvalues reduces the stability of the inversion. If the correlation is even
higher, the smallest eigenvalues get closer to zero, and the OLS solution
becomes even less stable.

.. GENERATED FROM PYTHON SOURCE LINES 123-129

.. code-block:: Python

    X, y = create_regression_toy(n_features=2, correlation=0.999)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    plot_2d(X, y, w=w_ols)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_010.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_010.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 130-132

The instability can become even more pronounced with a larger number of
features, or with a smaller number of samples.

.. GENERATED FROM PYTHON SOURCE LINES 132-138

.. code-block:: Python

    X, y = create_regression_toy(n_samples=10, n_features=2, correlation=0.999)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    plot_2d(X, y, w=w_ols)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_011.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_011.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 139-143

When the number of features is larger than the number of samples, the linear
system becomes under-determined, which means that the OLS problem has an
infinite number of solutions, most of which do not generalize well to new
data.

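
To make the instability concrete, the following snippet (not part of the
original example, it only reuses ``create_regression_toy`` as above) prints
the eigenvalues of :math:`X^\top X` for the two correlation levels used in
the plots: the stronger the correlation, the closer the smallest eigenvalue
gets to zero.

.. code-block:: Python

    # Eigenvalues of X.T @ X for increasing feature correlation. The smallest
    # eigenvalue approaches zero, which makes the inversion in the OLS
    # formula increasingly unstable.
    for correlation in [0.9, 0.999]:
        X, y = create_regression_toy(n_features=2, correlation=correlation)
        eigenvalues = np.linalg.eigvalsh(X.T @ X)
        print(f"correlation = {correlation} -> eigenvalues = {eigenvalues}")
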

.. GENERATED FROM PYTHON SOURCE LINES 145-170

Ridge regression
----------------

To solve the instability and under-determinacy issues of OLS, it can be
extended to *ridge regression*. Ridge regression considers a different
optimization problem:

.. math::
    w = \arg\min_w ||Xw - y||^2 + \alpha ||w||^2

This optimization problem contains two terms: (i) a *data-fitting term*
:math:`||Xw - y||^2`, which ensures the regression correctly fits the training
data; and (ii) a regularization term :math:`\alpha ||w||^2`, which forces the
coefficients :math:`w` to be close to zero. The regularization term increases
the stability of the solution, at the cost of a bias toward zero.

In the regularization term, ``alpha`` is a positive hyperparameter that
controls the regularization strength. With a smaller ``alpha``, the solution
will be closer to the OLS solution, and with a larger ``alpha``, the solution
will be further from the OLS solution and closer to the origin.

To illustrate this effect, the following plot shows the ridge solution for a
particular value of ``alpha``. The black circle corresponds to the line of
equal regularization, whereas the blue ellipses are the lines of equal squared
loss.

.. GENERATED FROM PYTHON SOURCE LINES 170-177

.. code-block:: Python

    X, y = create_regression_toy(n_features=2, correlation=0.9)

    alpha = 23
    w_ridge = np.linalg.solve(X.T @ X + np.eye(X.shape[1]) * alpha, X.T @ y)
    plot_2d(X, y, w_ridge, alpha=alpha)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_012.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_012.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 178-196

To understand why the regularization term makes the solution more robust to
noise, let's consider the ridge solution. The ridge solution can be found
analytically with the formula:

.. math::
    w = (X^\top X + \alpha I)^{-1} X^\top y

where :math:`I` is the identity matrix. In this formula, we can see that the
inverted matrix is now :math:`(X^\top X + \alpha I)`. Compared to OLS, the
additional term :math:`\alpha I` adds a positive value :math:`\alpha` to all
eigenvalues :math:`\lambda_k` of :math:`(X^\top X)` before the matrix
inversion. Inverting :math:`(\lambda_k + \alpha)` instead of :math:`\lambda_k`
reduces the instability caused by small eigenvalues. This explains why the
ridge solution is more robust to noise than the OLS solution.

In the following plots, we can see that even with a stronger correlation, the
ridge solution is still reasonably close to the noiseless ground truth, while
the OLS solution would be far off.

.. GENERATED FROM PYTHON SOURCE LINES 196-203

.. code-block:: Python

    X, y = create_regression_toy(n_features=2, correlation=0.999)

    alpha = 23
    w_ridge = np.linalg.solve(X.T @ X + np.eye(X.shape[1]) * alpha, X.T @ y)
    plot_2d(X, y, w_ridge, alpha=alpha)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_013.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_013.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 204-206

Changing the regularization hyperparameter :math:`\alpha` leads to another
ridge solution.

.. GENERATED FROM PYTHON SOURCE LINES 206-213

.. code-block:: Python

    X, y = create_regression_toy(n_features=2, correlation=0.999)

    alpha = 200
    w_ridge = np.linalg.solve(X.T @ X + np.eye(X.shape[1]) * alpha, X.T @ y)
    plot_2d(X, y, w_ridge, alpha=alpha)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_014.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_014.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 214-218

Side note: for every :math:`\alpha`, at the corresponding ridge solution, the
line of equal regularization and the line of equal loss are tangent. If the
two lines crossed, one could improve the ridge solution by moving along one of
them, improving one term while keeping the other term constant.

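
As a cross-check (this snippet is not part of the original example and
assumes scikit-learn is installed), the closed-form solution used above
matches the coefficients returned by scikit-learn's ``Ridge`` estimator,
provided ``fit_intercept=False`` so that both solve exactly the same problem.

.. code-block:: Python

    # Compare the closed-form ridge solution with scikit-learn's Ridge.
    from sklearn.linear_model import Ridge

    X, y = create_regression_toy(n_features=2, correlation=0.9)
    y = np.ravel(y)  # make sure the target is one-dimensional
    alpha = 23

    w_closed_form = np.linalg.solve(X.T @ X + np.eye(X.shape[1]) * alpha,
                                    X.T @ y)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

    print(np.allclose(w_closed_form, w_sklearn))

This should print ``True``: the formula above is the standard ridge estimator
implemented by common libraries.
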

.. GENERATED FROM PYTHON SOURCE LINES 220-232

Hyperparameter selection
------------------------

One issue with ridge regression is that the hyperparameter :math:`\alpha` is
arbitrary. Different choices of hyperparameter lead to different models. To
compare these models, we cannot use their ability to fit the training data,
because the best model would always be OLS (:math:`\alpha = 0`). Instead, we
want to compare the ability of each model to generalize to new data. To
estimate a model's ability to generalize, we can compute its prediction
accuracy on a separate dataset that was not used during the model fitting
(i.e. not used to find the coefficients :math:`w`). To illustrate this idea,
we can split the data into two subsets.

.. GENERATED FROM PYTHON SOURCE LINES 232-240

.. code-block:: Python

    from voxelwise_tutorials.regression_toy import create_regression_toy
    from voxelwise_tutorials.regression_toy import plot_kfold2

    X, y = create_regression_toy(n_features=1)

    plot_kfold2(X, y, fit=False)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_015.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_015.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 241-242

Then, we can fit a model on each subset.

.. GENERATED FROM PYTHON SOURCE LINES 242-246

.. code-block:: Python

    alpha = 0.1
    plot_kfold2(X, y, alpha, fit=True)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_016.png
   :alt: model 1, model 2
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_016.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 247-248

And compute the prediction accuracy of each model on the other subset.

.. GENERATED FROM PYTHON SOURCE LINES 248-251

.. code-block:: Python

    plot_kfold2(X, y, alpha, fit=True, flip=True)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_017.png
   :alt: model 1, model 2
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_017.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 252-256

In this way, we can evaluate the ridge regression (fit with a specific
:math:`\alpha`) on its ability to generalize to new data. If we do that for
different hyperparameter candidates :math:`\alpha`, we can select the model
leading to the best out-of-set prediction accuracy.

.. GENERATED FROM PYTHON SOURCE LINES 256-263

.. code-block:: Python

    from voxelwise_tutorials.regression_toy import plot_cv_path

    noise = 0.1
    X, y = create_regression_toy(n_features=2, noise=noise)
    plot_cv_path(X, y)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_018.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_018.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 264-269

In the example above, the noise level is low, so the best hyperparameter
:math:`\alpha` is close to zero, and ridge regression is not much better than
OLS. However, if the dataset has more noise, fewer samples, or more correlated
features, the best hyperparameter can be higher. In this case, ridge
regression is better than OLS.

.. GENERATED FROM PYTHON SOURCE LINES 269-274

.. code-block:: Python

    noise = 3
    X, y = create_regression_toy(n_features=2, noise=noise)
    plot_cv_path(X, y)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_019.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_019.png
   :class: sphx-glr-single-img

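
The kind of search illustrated by ``plot_cv_path`` can also be written
explicitly. The following sketch is not part of the original example: it
reuses ``create_regression_toy``, assumes scikit-learn is installed, and uses
an arbitrary grid of candidates, fitting one ridge model per candidate
:math:`\alpha` on one half of the data and scoring it on the other half.

.. code-block:: Python

    # Explicit hyperparameter search with a single train/validation split.
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X, y = create_regression_toy(n_features=2, noise=3)
    X_train, X_val, y_train, y_val = train_test_split(
        X, np.ravel(y), test_size=0.5, random_state=0)

    for alpha in [0.1, 1, 10, 100, 1000]:
        model = Ridge(alpha=alpha, fit_intercept=False).fit(X_train, y_train)
        # score returns the R^2 of the predictions on the validation set
        print(f"alpha = {alpha:g} -> validation R^2 = "
              f"{model.score(X_val, y_val):.3f}")

In practice, scikit-learn also provides estimators such as ``RidgeCV`` and
``GridSearchCV`` that automate this loop, typically over several splits.
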

.. GENERATED FROM PYTHON SOURCE LINES 275-280

When the noise level is too high, the best hyperparameter can be the largest
on the grid. This either means that the grid is too small, or that the
regression does not find a predictive link between the features and the
target. In this case, the model with the lowest generalization error always
predicts zero (:math:`w = 0`).

.. GENERATED FROM PYTHON SOURCE LINES 280-285

.. code-block:: Python

    noise = 100
    X, y = create_regression_toy(n_features=2, noise=noise)
    plot_cv_path(X, y)

.. image-sg:: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_020.png
   :alt: 02 plot ridge regression
   :srcset: /_auto_examples/shortclips/images/sphx_glr_02_plot_ridge_regression_020.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 286-308

To summarize, to select the best hyperparameter :math:`\alpha`, the standard
method is to perform a grid search:

- Split the training set into two subsets: one subset used to fit the models,
  and one subset used to estimate the prediction accuracy (*validation set*).
- Define a number of hyperparameter candidates, for example [0.1, 1, 10, 100].
- Fit a separate ridge model with each hyperparameter candidate
  :math:`\alpha`.
- Compute the prediction accuracy on the validation set.
- Select the hyperparameter candidate leading to the best validation accuracy.

To make the grid search less sensitive to the choice of how the training data
was split, the process can be repeated for multiple splits. The prediction
accuracies are then averaged over splits before selecting the best
hyperparameter. This procedure is called *cross-validation*.

Learn more about hyperparameter selection and cross-validation in the
`scikit-learn documentation <https://scikit-learn.org/stable/modules/cross_validation.html>`_.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.038 seconds)


.. _sphx_glr_download__auto_examples_shortclips_02_plot_ridge_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_plot_ridge_regression.ipynb <02_plot_ridge_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_plot_ridge_regression.py <02_plot_ridge_regression.py>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_