Fit a banded ridge model with both wordnet and motion energy features
In this example, we model the fMRI responses with banded ridge regression, using two different feature spaces: motion energy and wordnet categories.
Banded ridge regression: Since the relative scaling of both feature spaces is unknown, we use two regularization hyperparameters (one per feature space) in a model called banded ridge regression [1]. Just like with ridge regression, we optimize the hyperparameters over cross-validation. An efficient implementation of this model is available in the himalaya package.
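Concretely (the notation below is ours, not from the example code), banded ridge regression with two feature spaces solves, for each voxel, a problem of the form
\[
\min_{b_1,\, b_2}\ \|y - X_1 b_1 - X_2 b_2\|_2^2 + \alpha_1 \|b_1\|_2^2 + \alpha_2 \|b_2\|_2^2,
\]
where \(X_1\) and \(X_2\) are the wordnet and motion-energy feature matrices, \(b_1\) and \(b_2\) their regression weights, and \(\alpha_1\), \(\alpha_2\) the two regularization hyperparameters optimized over cross-validation.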
Running time: This example is more computationally intensive than the previous examples. With a GPU backend, model fitting takes around 6 minutes. With a CPU backend, it can take around 10 times longer.
Path of the data directory
from voxelwise_tutorials.io import get_data_home
directory = get_data_home(dataset="shortclips")
print(directory)
/home/jlg/mvdoc/voxelwise_tutorials_data/shortclips
# modify to use another subject
subject = "S01"
Load the data
As in the previous examples, we first load the fMRI responses, which are our regression targets.
import os
import numpy as np
from voxelwise_tutorials.io import load_hdf5_array
file_name = os.path.join(directory, "responses", f"{subject}_responses.hdf")
Y_train = load_hdf5_array(file_name, key="Y_train")
Y_test = load_hdf5_array(file_name, key="Y_test")
print("(n_samples_train, n_voxels) =", Y_train.shape)
print("(n_repeats, n_samples_test, n_voxels) =", Y_test.shape)
(n_samples_train, n_voxels) = (3600, 84038)
(n_repeats, n_samples_test, n_voxels) = (10, 270, 84038)
We also compute the explainable variance, in order to exclude voxels with low explainable variance from the fit and speed up model fitting.
from voxelwise_tutorials.utils import explainable_variance
ev = explainable_variance(Y_test)
print("(n_voxels,) =", ev.shape)
mask = ev > 0.1
print("(n_voxels_mask,) =", ev[mask].shape)
(n_voxels,) = (84038,)
(n_voxels_mask,) = (6849,)
We average the test repeats, to remove the non-repeatable part of fMRI responses.
Y_test = Y_test.mean(0)
print("(n_samples_test, n_voxels) =", Y_test.shape)
(n_samples_test, n_voxels) = (270, 84038)
We fill potential NaN (not-a-number) values with zeros.
Y_train = np.nan_to_num(Y_train)
Y_test = np.nan_to_num(Y_test)
And we make sure the targets are centered.
Y_train -= Y_train.mean(0)
Y_test -= Y_test.mean(0)
Then we load both feature spaces, which will be used in the linear regression model.
feature_names = ["wordnet", "motion_energy"]
Xs_train = []
Xs_test = []
n_features_list = []
for feature_space in feature_names:
file_name = os.path.join(directory, "features", f"{feature_space}.hdf")
Xi_train = load_hdf5_array(file_name, key="X_train")
Xi_test = load_hdf5_array(file_name, key="X_test")
Xs_train.append(Xi_train.astype(dtype="float32"))
Xs_test.append(Xi_test.astype(dtype="float32"))
n_features_list.append(Xi_train.shape[1])
# concatenate the feature spaces
X_train = np.concatenate(Xs_train, 1)
X_test = np.concatenate(Xs_test, 1)
print("(n_samples_train, n_features_total) =", X_train.shape)
print("(n_samples_test, n_features_total) =", X_test.shape)
print("[n_features_wordnet, n_features_motion_energy] =", n_features_list)
(n_samples_train, n_features_total) = (3600, 8260)
(n_samples_test, n_features_total) = (270, 8260)
[n_features_wordnet, n_features_motion_energy] = [1705, 6555]
Define the cross-validation scheme
We define again a leave-one-run-out cross-validation split scheme.
from sklearn.model_selection import check_cv
from voxelwise_tutorials.utils import generate_leave_one_run_out
# index of the first sample of each run
run_onsets = load_hdf5_array(file_name, key="run_onsets")
print(run_onsets)
[ 0 300 600 900 1200 1500 1800 2100 2400 2700 3000 3300]
We define a cross-validation splitter, compatible with the scikit-learn API.
n_samples_train = X_train.shape[0]
cv = generate_leave_one_run_out(n_samples_train, run_onsets)
cv = check_cv(cv) # copy the cross-validation splitter into a reusable list
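As an optional sanity check (not part of the original example), we can inspect the first split returned by the splitter:
# Optional sanity check: number of training and validation samples in the
# first cross-validation split.
train_indices, val_indices = list(cv.split(np.arange(n_samples_train)))[0]
print("n_train, n_val =", len(train_indices), len(val_indices))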
Define the model
The model pipeline contains steps similar to the pipeline from previous examples. We remove the mean of each feature with a StandardScaler, and add delays with a Delayer.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from voxelwise_tutorials.delayer import Delayer
from himalaya.backend import set_backend
backend = set_backend("torch_cuda", on_error="warn")
To fit the banded ridge model, we use himalaya's MultipleKernelRidgeCV model, with a separate linear kernel per feature space. Similarly to KernelRidgeCV, the model optimizes its hyperparameters over cross-validation. However, while KernelRidgeCV has to optimize only one hyperparameter (alpha), MultipleKernelRidgeCV has to optimize m hyperparameters, where m is the number of feature spaces (here m = 2). To do so, the model implements two different solvers, one using hyperparameter random search, and one using hyperparameter gradient descent. For a large number of targets, we recommend using the random-search solver.
The class takes a number of common parameters during initialization, such as kernels or solver. Since the solver parameters vary depending on the solver used, they are passed as a solver_params dictionary.
from himalaya.kernel_ridge import MultipleKernelRidgeCV
# Here we will use the "random_search" solver.
solver = "random_search"
# We can check its specific parameters in the function docstring:
solver_function = MultipleKernelRidgeCV.ALL_SOLVERS[solver]
print("Docstring of the function %s:" % solver_function.__name__)
print(solver_function.__doc__)
Docstring of the function solve_multiple_kernel_ridge_random_search:
Solve multiple kernel ridge regression using random search.
Parameters
----------
Ks : array of shape (n_kernels, n_samples, n_samples)
Input kernels.
Y : array of shape (n_samples, n_targets)
Target data.
n_iter : int, or array of shape (n_iter, n_kernels)
Number of kernel weights combination to search.
If an array is given, the solver uses it as the list of kernel weights
to try, instead of sampling from a Dirichlet distribution. Examples:
- `n_iter=np.eye(n_kernels)` implement a winner-take-all strategy
over kernels.
- `n_iter=np.ones((1, n_kernels))/n_kernels` solves a (standard)
kernel ridge regression.
concentration : float, or list of float
Concentration parameters of the Dirichlet distribution.
If a list, iteratively cycle through the list.
Not used if n_iter is an array.
alphas : float or array of shape (n_alphas, )
Range of ridge regularization parameter.
score_func : callable
Function used to compute the score of predictions versus Y.
cv : int or scikit-learn splitter
Cross-validation splitter. If an int, KFold is used.
fit_intercept : boolean
Whether to fit an intercept. If False, Ks should be centered
(see KernelCenterer), and Y must be zero-mean over samples.
Only available if return_weights == 'dual'.
return_weights : None, 'primal', or 'dual'
Whether to refit on the entire dataset and return the weights.
Xs : array of shape (n_kernels, n_samples, n_features) or None
Necessary if return_weights == 'primal'.
local_alpha : bool
If True, alphas are selected per target, else shared over all targets.
jitter_alphas : bool
If True, alphas range is slightly jittered for each gamma.
random_state : int, or None
Random generator seed. Use an int for deterministic search.
n_targets_batch : int or None
Size of the batch for over targets during cross-validation.
Used for memory reasons. If None, uses all n_targets at once.
n_targets_batch_refit : int or None
Size of the batch for over targets during refit.
Used for memory reasons. If None, uses all n_targets at once.
n_alphas_batch : int or None
Size of the batch for over alphas. Used for memory reasons.
If None, uses all n_alphas at once.
progress_bar : bool
If True, display a progress bar over gammas.
Ks_in_cpu : bool
If True, keep Ks in CPU memory to limit GPU memory (slower).
This feature is not available through the scikit-learn API.
conservative : bool
If True, when selecting the hyperparameter alpha, take the largest one
that is less than one standard deviation away from the best.
If False, take the best.
Y_in_cpu : bool
If True, keep the target values ``Y`` in CPU memory (slower).
diagonalize_method : str in {"eigh", "svd"}
Method used to diagonalize the kernel.
return_alphas : bool
If True, return the best alpha value for each target.
Returns
-------
deltas : array of shape (n_kernels, n_targets)
Best log kernel weights for each target.
refit_weights : array or None
Refit regression weights on the entire dataset, using selected best
hyperparameters. Refit weights are always stored on CPU memory.
If return_weights == 'primal', shape is (n_features, n_targets),
if return_weights == 'dual', shape is (n_samples, n_targets),
else, None.
cv_scores : array of shape (n_iter, n_targets)
Cross-validation scores per iteration, averaged over splits, for the
best alpha. Cross-validation scores will always be on CPU memory.
best_alphas : array of shape (n_targets, )
Best alpha value per target. Only returned if return_alphas is True.
intercept : array of shape (n_targets,)
Intercept. Only returned when fit_intercept is True.
The hyperparameter random-search solver separates the hyperparameters into a shared regularization alpha and a vector of positive kernel weights that sums to one. This separation of hyperparameters makes it possible to efficiently explore a large grid of alpha values for each sampled vector of kernel weights.
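For intuition, here is a small illustration (not part of the model fitting) of how candidate kernel-weight vectors can be sampled from a Dirichlet distribution, as mentioned in the docstring above: each candidate is a vector of positive weights summing to one.
# Illustration only: draw 5 candidate kernel-weight vectors for 2 kernels.
rng = np.random.RandomState(0)
candidate_weights = rng.dirichlet(alpha=np.ones(2), size=5)
print(candidate_weights)         # each row is positive
print(candidate_weights.sum(1))  # and sums to one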
We use 20 random-search iterations to have a reasonably fast example. To get better results, especially for a larger number of feature spaces, one might need more iterations. (Note that there is currently no stopping criterion in the random-search method.)
n_iter = 20
alphas = np.logspace(1, 20, 20)
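Note that, as described in the docstring above, n_iter can also be an explicit array of kernel weights instead of an integer. For example (a variant not used in this tutorial), an identity matrix would restrict the search to winner-take-all models that use a single feature space at a time:
# Hypothetical alternative (not used here): evaluate only winner-take-all
# kernel-weight combinations, one per feature space.
n_iter_winner_take_all = np.eye(len(feature_names))
print(n_iter_winner_take_all)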
We also set the batch parameters, which are used to reduce the required GPU memory. Larger values are a bit faster, but the solver might crash if it runs out of memory. Optimal values depend on the size of your dataset.
n_targets_batch = 200
n_alphas_batch = 5
n_targets_batch_refit = 200
We put all these parameters in a dictionary solver_params, and define the main estimator MultipleKernelRidgeCV.
solver_params = dict(n_iter=n_iter, alphas=alphas,
n_targets_batch=n_targets_batch,
n_alphas_batch=n_alphas_batch,
n_targets_batch_refit=n_targets_batch_refit)
mkr_model = MultipleKernelRidgeCV(kernels="precomputed", solver=solver,
solver_params=solver_params, cv=cv)
We need a bit more work than in previous examples before defining the full pipeline, since the banded ridge model requires multiple precomputed kernels, one for each feature space. To compute them, we use the ColumnKernelizer, which can create multiple kernels from different columns of your features array. ColumnKernelizer works similarly to scikit-learn's ColumnTransformer, but instead of returning a concatenation of transformed features, it returns a stack of kernels, as required by MultipleKernelRidgeCV(kernels="precomputed").
First, we create a different Kernelizer for each feature space. Here we use a linear kernel for all feature spaces, but ColumnKernelizer accepts any Kernelizer, or any scikit-learn Pipeline ending with a Kernelizer.
from himalaya.kernel_ridge import Kernelizer
from sklearn import set_config
set_config(display='diagram') # requires scikit-learn 0.23
preprocess_pipeline = make_pipeline(
StandardScaler(with_mean=True, with_std=False),
Delayer(delays=[1, 2, 3, 4]),
Kernelizer(kernel="linear"),
)
preprocess_pipeline
The column kernelizer applies a different pipeline on each selection of features, here defined with slices.
from himalaya.kernel_ridge import ColumnKernelizer
# Find the start and end of each feature space in the concatenated ``X_train``.
start_and_end = np.concatenate([[0], np.cumsum(n_features_list)])
slices = [
slice(start, end)
for start, end in zip(start_and_end[:-1], start_and_end[1:])
]
slices
[slice(0, 1705, None), slice(1705, 8260, None)]
kernelizers_tuples = [(name, preprocess_pipeline, slice_)
for name, slice_ in zip(feature_names, slices)]
column_kernelizer = ColumnKernelizer(kernelizers_tuples)
column_kernelizer
# (Note that ``ColumnKernelizer`` has a parameter ``n_jobs`` to parallelize
# each ``Kernelizer``, yet such parallelism does not work with GPU arrays.)
Then we can define the model pipeline.
pipeline = make_pipeline(
column_kernelizer,
mkr_model,
)
pipeline
Fit the model
We fit on the train set, and score on the test set.
To speed up the fit and to limit the memory peak in Colab, we only fit on voxels with explainable variance above 0.1.
With a GPU backend, fitting this model takes around 6 minutes. With a CPU backend, it can take around 10 times longer.
pipeline.fit(X_train, Y_train[:, mask])
scores_mask = pipeline.score(X_test, Y_test[:, mask])
scores_mask = backend.to_numpy(scores_mask)
print("(n_voxels_mask,) =", scores_mask.shape)
# Then we extend the scores to all voxels, giving a score of zero to unfitted
# voxels.
n_voxels = Y_train.shape[1]
scores = np.zeros(n_voxels)
scores[mask] = scores_mask
print("(n_voxels,) =", scores.shape)
[........................................] 100% | 90.25 sec | 20 random sampling with cv |
(n_voxels_mask,) = (6849,)
(n_voxels,) = (84038,)
Compare with a ridge model
We can compare with a baseline model, which does not use one hyperparameter per feature space, but instead shares the same hyperparameter for all feature spaces.
from himalaya.kernel_ridge import KernelRidgeCV
pipeline_baseline = make_pipeline(
StandardScaler(with_mean=True, with_std=False),
Delayer(delays=[1, 2, 3, 4]),
KernelRidgeCV(
alphas=alphas, cv=cv,
solver_params=dict(n_targets_batch=n_targets_batch,
n_alphas_batch=n_alphas_batch,
n_targets_batch_refit=n_targets_batch_refit)),
)
pipeline_baseline
pipeline_baseline.fit(X_train, Y_train[:, mask])
scores_baseline_mask = pipeline_baseline.score(X_test, Y_test[:, mask])
scores_baseline_mask = backend.to_numpy(scores_baseline_mask)
# extend to unfitted voxels
n_voxels = Y_train.shape[1]
scores_baseline = np.zeros(n_voxels)
scores_baseline[mask] = scores_baseline_mask
Here we plot the comparison of model prediction accuracies with a 2D histogram. All ~84k voxels are represented in this histogram, where the diagonal corresponds to identical prediction accuracy for both models. A distribution deviating from the diagonal means that one model has better predictive performance than the other.
import matplotlib.pyplot as plt
from voxelwise_tutorials.viz import plot_hist2d
ax = plot_hist2d(scores_baseline, scores)
ax.set(title='Generalization R2 scores', xlabel='KernelRidgeCV',
ylabel='MultipleKernelRidgeCV')
plt.show()
We see that the banded ridge model (MultipleKernelRidgeCV) outperforms the ridge model (KernelRidgeCV). Indeed, banded ridge regression is able to find the optimal scaling of each feature space independently for each voxel. Banded ridge regression is thus able to perform a soft selection between the available feature spaces, based on the cross-validation performance.
Plot the banded ridge split
On top of better prediction accuracy, banded ridge regression also gives a way to disentangle the contribution of the two feature spaces. To do so, we take the kernel weights and the ridge (dual) weights corresponding to each feature space, and use them to compute the prediction from each feature space separately.
Then, we use these split predictions to compute split \(\tilde{R}^2_i\) scores. These scores are corrected so that their sum is equal to the \(R^2\) score of the full prediction \(\hat{y}\).
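Concretely, if \(\hat{y}_i\) denotes the prediction from feature space \(i\) and \(\hat{y} = \sum_i \hat{y}_i\) the full prediction, one such correction (sketched here with our notation, assuming zero-mean \(y\); see the himalaya documentation of r2_score_split for the exact definition) is
\[
\tilde{R}^2_i = \frac{2\, y^\top \hat{y}_i - \hat{y}^\top \hat{y}_i}{\|y\|^2},
\qquad
\sum_i \tilde{R}^2_i = 1 - \frac{\|y - \hat{y}\|^2}{\|y\|^2} = R^2.
\]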
from himalaya.scoring import r2_score_split
Y_test_pred_split = pipeline.predict(X_test, split=True)
split_scores_mask = r2_score_split(Y_test[:, mask], Y_test_pred_split)
print("(n_kernels, n_samples_test, n_voxels_mask) =", Y_test_pred_split.shape)
print("(n_kernels, n_voxels_mask) =", split_scores_mask.shape)
# extend to unfitted voxels
n_kernels = split_scores_mask.shape[0]
n_voxels = Y_train.shape[1]
split_scores = np.zeros((n_kernels, n_voxels))
split_scores[:, mask] = backend.to_numpy(split_scores_mask)
print("(n_kernels, n_voxels) =", split_scores.shape)
(n_kernels, n_samples_test, n_voxels_mask) = torch.Size([2, 270, 6849])
(n_kernels, n_voxels_mask) = torch.Size([2, 6849])
(n_kernels, n_voxels) = (2, 84038)
We can then plot the split scores on a flatmap with a 2D colormap.
from voxelwise_tutorials.viz import plot_2d_flatmap_from_mapper
mapper_file = os.path.join(directory, "mappers", f"{subject}_mappers.hdf")
ax = plot_2d_flatmap_from_mapper(split_scores[0], split_scores[1],
mapper_file, vmin=0, vmax=0.25, vmin2=0,
vmax2=0.5, label_1=feature_names[0],
label_2=feature_names[1])
plt.show()
The blue regions are better predicted by the motion-energy features, the orange regions are better predicted by the wordnet features, and the white regions are well predicted by both feature spaces.
Compared to the last figure of the previous example, we see that most white regions have been replaced by either blue or orange regions. The banded ridge regression disentangled the two feature spaces in voxels where both feature spaces had good prediction accuracy (see previous example). For example, motion-energy features predict brain activity in early visual cortex, while wordnet features predict brain activity in semantic visual areas. For more discussion of these results, we refer the reader to the original publication [1].
References
[1] Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). Voxelwise encoding models with non-spherical multivariate normal priors. NeuroImage, 197, 482-492.
Total running time of the script: 1 minute 58.143 seconds