Permutation_test (1/4): add comments and docstring of the functions #111

Open · wants to merge 27 commits into base: main

Changes from 23 commits (27 commits total)
3 changes: 2 additions & 1 deletion doc_conf/api.rst
@@ -26,7 +26,8 @@ Functions
knockoff_aggregation
model_x_knockoff
multivariate_1D_simulation
permutation_test_cv
permutation_test
permutation_test_pval
reid
standardized_svr
zscore_from_pval
21 changes: 20 additions & 1 deletion doc_conf/references.bib
@@ -177,4 +177,23 @@ @article{liuFastPowerfulConditional2021
archiveprefix = {arxiv},
keywords = {Statistics - Methodology},
file = {/home/ahmad/Zotero/storage/8HRQZX3H/Liu et al. - 2021 - Fast and Powerful Conditional Randomization Testin.pdf;/home/ahmad/Zotero/storage/YFNDKN2B/2006.html}
}
}

@book{westfall1993resampling,
title={Resampling-based multiple testing: Examples and methods for p-value adjustment},
author={Westfall, Peter H and Young, S Stanley},
volume={279},
year={1993},
publisher={John Wiley \& Sons}
}

@article{hirschhorn2005genome,
title={Genome-wide association studies for common diseases and complex traits},
author={Hirschhorn, Joel N and Daly, Mark J},
journal={Nature reviews genetics},
volume={6},
number={2},
pages={95--108},
year={2005},
publisher={Nature Publishing Group UK London}
}
14 changes: 10 additions & 4 deletions examples/plot_fmri_data_example.py
@@ -52,12 +52,13 @@
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction import image
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR
from sklearn.utils import Bunch

from hidimstat.adaptive_permutation_threshold import ada_svr
from hidimstat.clustered_inference import clustered_inference
from hidimstat.ensemble_clustered_inference import ensemble_clustered_inference
from hidimstat.permutation_test import permutation_test, permutation_test_cv
from hidimstat.permutation_test import permutation_test, permutation_test_pval
from hidimstat.standardized_svr import standardized_svr
from hidimstat.stat_tools import pval_from_scale, zscore_from_pval

@@ -152,18 +153,23 @@ def preprocess_haxby(subject=2, memory=None):
SVR_permutation_test_inference = False
if SVR_permutation_test_inference:
# We computed the regularization parameter by CV (C = 0.1)
pval_corr_svr_perm_test, one_minus_pval_corr_svr_perm_test = permutation_test_cv(
X, y, n_permutations=50, C=0.1
estimator = LinearSVR(C=0.1)
Collaborator
C is the regularization parameter; it is not optimized via CV here. Or am I missing something?
I suggest using something like a randomized search.

Suggested change
estimator = LinearSVR(C=0.1)
estimator = RandomizedSearchCV(
LinearSVR(random_state=42),
param_distributions={ "C": np.logspace(-3, 3, 10), },
n_iter=10,
n_jobs=5,
random_state=42,
)

Collaborator Author
I didn't add CV optimisation because the original example didn't use it.
Nevertheless, we need to be careful with optimisation because it will increase the running time of the examples.

Contributor
Do you have an estimate of the compute time?

Collaborator Author
On my computer, the example takes 5m39s without CV and 7m16s with CV, so the CV adds roughly 2 minutes of computation.

One solution could be to store the best value of the parameter and avoid refitting it each time.
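The caching idea above can be sketched as follows. This is a minimal, hypothetical example on synthetic sklearn data, not the actual hidimstat example: run the hyper-parameter search once, store the selected C, and reuse it for later fits instead of re-running the search inside each permutation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=100, n_features=20, random_state=0)

# Run the hyper-parameter search once...
search = RandomizedSearchCV(
    LinearSVR(max_iter=10000, random_state=42),
    param_distributions={"C": np.logspace(-3, 3, 10)},
    n_iter=10,
    random_state=42,
)
search.fit(X, y)
best_C = search.best_params_["C"]

# ...then reuse the stored value for every subsequent fit,
# avoiding the cost of refitting the search each time.
estimator = LinearSVR(C=best_C, max_iter=10000, random_state=42)
estimator.fit(X, y)
```

The search cost is paid once up front; each permutation then only pays for a single LinearSVR fit.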

Collaborator
Also, L153 sets SVR_permutation_test_inference = False, so is this even running in the CI?
If not, I suggest keeping the CV in the example to show this possibility to the user without increasing CI time.

Collaborator Author
I set SVR_permutation_test_inference = True for my test.

Including the SVR_permutation_test_inference option in my computation times, I get:
SVR_permutation_test_inference = True and CV: 7m16s.
SVR_permutation_test_inference = True and no CV: 5m39s.
SVR_permutation_test_inference = False and CV: 1m57s.
SVR_permutation_test_inference = False and no CV: 1m48s.

Contributor
My feeling is that this is too much time for a method that does not enjoy any theoretical guarantee.
Out of curiosity, what do you get if you replace the SVR with a RidgeRegression?

Contributor
For the time being, we should not change this if there is no explicit reason to (e.g., a significant reduction in documentation generation time).

Collaborator Author
My feeling is that this is too much time for a method that does not enjoy any theoretical guarantee. Out of curiosity, what do you get if you replace the SVR with a RidgeRegression?

This is already done in the example.

weight_svr, weight_svr_distribution = permutation_test(X, y, n_permutations=50)
pval_corr_svr_perm_test, one_minus_pval_corr_svr_perm_test = permutation_test_pval(
weight_svr, weight_svr_distribution
)

# Another method is to compute the p-values by permutation test from the
# Ridge decoder. The solution provided by this method should be very close to
# the previous one and the computation time is much shorter: around 20 seconds.

estimator = Ridge()
lionelkusch marked this conversation as resolved.
pval_corr_ridge_perm_test, one_minus_pval_corr_ridge_perm_test = permutation_test(
weight_ridge, weight_ridge_distribution = permutation_test(
X, y, estimator=estimator, n_permutations=200
)
pval_corr_ridge_perm_test, one_minus_pval_corr_ridge_perm_test = permutation_test_pval(
weight_ridge, weight_ridge_distribution
)

#############################################################################
# Now, let us run the algorithm introduced by Gaonkar et al. (c.f. References).
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -80,4 +80,8 @@ where = ["src"]


[tool.hatch.version]
source = "vcs"
source = "vcs"

#pyproject.toml
[tool.pytest.ini_options]
addopts = "--ignore=src" # ignore src directory
5 changes: 3 additions & 2 deletions src/hidimstat/__init__.py
@@ -7,7 +7,7 @@
from .knockoffs import model_x_knockoff
from .multi_sample_split import aggregate_quantiles
from .noise_std import group_reid, reid
from .permutation_test import permutation_test_cv
from .permutation_test import permutation_test, permutation_test_pval
from .scenario import multivariate_1D_simulation
from .standardized_svr import standardized_svr
from .stat_tools import zscore_from_pval
@@ -34,7 +34,8 @@
"knockoff_aggregation",
"model_x_knockoff",
"multivariate_1D_simulation",
"permutation_test_cv",
"permutation_test",
"permutation_test_pval",
"reid",
"standardized_svr",
"zscore_from_pval",
221 changes: 79 additions & 142 deletions src/hidimstat/permutation_test.py
@@ -1,25 +1,18 @@
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVR
from sklearn.utils import _safe_indexing

from hidimstat.stat_tools import pval_from_two_sided_pval_and_sign


def permutation_test_cv(
X,
y,
n_permutations=1000,
C=None,
Cs=np.logspace(-7, 1, 9),
lionelkusch marked this conversation as resolved.
seed=0,
n_jobs=1,
verbose=1,

from hidimstat.stat_tools import pval_from_two_sided_pval_and_sign, step_down_max_t


def permutation_test(
X, y, estimator, n_permutations=1000, seed=0, n_jobs=None, verbose=0
):
"""Cross-validated permutation test shuffling the target
"""
Permutation test

This function compute the distribution of the weights of a linear model
by shuffling the target :footcite:t:`hirschhorn2005genome`.

Parameters
----------
@@ -29,16 +29,8 @@ def permutation_test_cv(
y : ndarray, shape (n_samples,)
Target.

C : float or None, optional (default=None)
If None, the linear SVR regularization parameter is set by cross-val
running a grid search on the list of hyper-parameters contained in Cs.
Otherwise, the regularization parameter is equal to C.
The strength of the regularization is inversely proportional to C.

Cs : ndarray, optional (default=np.logspace(-7, 1, 9))
If C is None, the linear SVR regularization parameter is set by
cross-val running a grid search on the list of hyper-parameters
contained in Cs.
estimator : object LinearModel
The linear model used to fit the data.

n_permutations : int, optional (default=1000)
Number of permutations used to compute the survival function
@@ -47,77 +47,58 @@
seed : int, optional (default=0)
Determines the permutations used for shuffling the target

n_jobs : int or None, optional (default=1)
n_jobs : int or None, optional (default=None)
Number of CPUs to use during the cross validation.

verbose: int, optional (default=1)
The verbosity level: if non zero, progress messages are printed
when computing the permutation stats in parralel.
The frequency of the messages increases with the verbosity level.
verbose : int, optional (default=0)
The verbosity level of the joblib.Parallel.

Returns
-------
pval_corr : ndarray, shape (n_features,)
p-value corrected for multiple testing, with numerically accurate
values for positive effects (ie., for p-value close to zero).
weights : ndarray, shape (n_features,)
The weights of the original model.

one_minus_pval_corr : ndarray, shape (n_features,)
One minus the corrected p-value, with numerically accurate
values for negative effects (ie., for p-value close to one).
"""
weights_distribution : ndarray, shape (n_permutations, n_features)
The distribution of the weights of the model obtained by shuffling
the target n_permutations times.

if C is None:
References
----------
.. footbibliography::

steps = [("SVR", LinearSVR())]
pipeline = Pipeline(steps)
parameters = {"SVR__C": Cs}
grid = GridSearchCV(pipeline, param_grid=parameters, n_jobs=n_jobs)
grid.fit(X, y)
C = grid.best_params_["SVR__C"]
estimator = LinearSVR(C=C)
"""

else:
rng = np.random.default_rng(seed)

estimator = LinearSVR(C=C)
# Get the weights of the original model
if not hasattr(estimator, "coef_"):
weights = _fit_and_weights(estimator, X, y)
else:
weights = estimator.coef_

pval_corr, one_minus_pval_corr = permutation_test(
X,
y,
estimator,
n_permutations=n_permutations,
seed=seed,
n_jobs=n_jobs,
verbose=verbose,
# Get the distribution of the weights by shuffling the target
weights_distribution = Parallel(n_jobs=n_jobs, verbose=verbose)(
delayed(_fit_and_weights)(clone(estimator), X, _shuffle(y, rng))
for _ in range(n_permutations)
)

return pval_corr, one_minus_pval_corr
# Convert the list of weights into an array
weights_distribution = np.array(weights_distribution)

return weights, weights_distribution

def permutation_test(X, y, estimator, n_permutations=1000, seed=0, n_jobs=1, verbose=1):
"""Permutation test shuffling the target

def permutation_test_pval(weights, weights_distribution):
"""
Compute p-value from permutation test

Parameters
----------
X : ndarray, shape (n_samples, n_features)
Data.

y : ndarray, shape (n_samples,)
Target.

n_permutations : int, optional (default=1000)
Number of permutations used to compute the survival function
and cumulative distribution function scores.

seed : int, optional (default=0)
Determines the permutations used for shuffling the target
weights : ndarray, shape (n_features,)
The weights of the original model.

n_jobs : int or None, optional (default=1)
Number of CPUs to use during the cross validation.

verbose: int, optional (default=1)
The verbosity level: if non zero, progress messages are printed
when computing the permutation stats in parralel.
The frequency of the messages increases with the verbosity level.
weights_distribution : ndarray, shape (n_permutations, n_features)
The distribution of the weights of the model obtained by shuffling

Returns
-------
@@ -129,20 +129,9 @@ def permutation_test(X, y, estimator, n_permutations=1000, seed=0, n_jobs=1, ver
One minus the corrected p-value, with numerically accurate
values for negative effects (ie., for p-value close to one).
"""
two_sided_pval_corr = step_down_max_t(weights, weights_distribution)

rng = np.random.default_rng(seed)

stat = _permutation_test_stat(clone(estimator), X, y)

permutation_stats = Parallel(n_jobs=n_jobs, verbose=verbose)(
delayed(_permutation_test_stat)(clone(estimator), X, _shuffle(y, rng))
for _ in range(n_permutations)
)

permutation_stats = np.array(permutation_stats)
two_sided_pval_corr = step_down_max_T(stat, permutation_stats)

stat_sign = np.sign(stat)
stat_sign = np.sign(weights)

pval_corr, _, one_minus_pval_corr, _ = pval_from_two_sided_pval_and_sign(
two_sided_pval_corr, stat_sign
@@ -151,65 +151,47 @@ def permutation_test(X, y, estimator, n_permutations=1000, seed=0, n_jobs=1, ver
return pval_corr, one_minus_pval_corr


def _permutation_test_stat(estimator, X, y):
"""Fit estimator and get coef"""
stat = estimator.fit(X, y).coef_
return stat


def _shuffle(y, rng):
"""Shuffle vector"""
indices = rng.permutation(len(y))
return _safe_indexing(y, indices)


def step_down_max_T(stat, permutation_stats):
"""Step-down maxT algorithm for computing adjusted p-values
def _fit_and_weights(estimator, X, y):
"""
Fit the estimator and return the weights

Parameters
----------
stat : ndarray, shape (n_features,)
Statistic computed on the original (unpermutted) problem.
estimator : object
The estimator to fit.

permutation_stats : ndarray, shape (n_permutations, n_features)
Statistics computed on permutted problems.
X : ndarray, shape (n_samples, n_features)
Data.

y : ndarray, shape (n_samples,)
Target.

Returns
-------
two_sided_pval_corr : ndarray, shape (n_features,)
Two-sided p-values corrected for multiple testing.

References
----------
.. [1] Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple
testing: Examples and methods for p-value adjustment (Vol. 279).
John Wiley & Sons.
weights : ndarray, shape (n_features,)
The weights of the estimator.
"""
weights = estimator.fit(X, y).coef_
return weights

n_permutations, n_features = np.shape(permutation_stats)

index_ordered = np.argsort(np.abs(stat))
stat_ranked = np.empty(n_features)
stat_ranked[index_ordered] = np.arange(n_features)
stat_ranked = stat_ranked.astype(int)
stat_sorted = np.copy(np.abs(stat)[index_ordered])
permutation_stats_ordered = np.copy(np.abs(permutation_stats)[:, index_ordered])

for i in range(1, n_features):
permutation_stats_ordered[:, i] = np.maximum(
permutation_stats_ordered[:, i - 1], permutation_stats_ordered[:, i]
)

two_sided_pval_corr = (
np.sum(np.less_equal(stat_sorted, permutation_stats_ordered), axis=0)
/ n_permutations
)
def _shuffle(y, rng):
"""
Shuffle the target

for i in range(n_features - 1)[::-1]:
two_sided_pval_corr[i] = np.maximum(
two_sided_pval_corr[i], two_sided_pval_corr[i + 1]
)
Parameters
----------
y : ndarray, shape (n_samples,)
Target.

two_sided_pval_corr = np.copy(two_sided_pval_corr[stat_ranked])
rng : numpy.random.Generator
Random number generator.

return two_sided_pval_corr
Returns
-------
y_shuffled : ndarray, shape (n_samples,)
Shuffled target.
"""
y_copy = np.copy(y)
rng.shuffle(y_copy)
return y_copy
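The two-step flow this PR introduces (build a null distribution of weights by shuffling the target, then derive corrected p-values from it) can be sketched with a minimal self-contained re-implementation. The function names mirror the PR, but this is not the actual hidimstat code: in particular, a plain single-step maximum-statistic correction stands in for the step_down_max_t helper imported from hidimstat.stat_tools.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def permutation_test(X, y, estimator, n_permutations=100, seed=0):
    # Fit once on the true target to get the reference weights...
    rng = np.random.default_rng(seed)
    weights = clone(estimator).fit(X, y).coef_
    # ...then refit on shuffled targets to build the null distribution.
    weights_distribution = np.array(
        [clone(estimator).fit(X, rng.permutation(y)).coef_
         for _ in range(n_permutations)]
    )
    return weights, weights_distribution


def permutation_test_pval(weights, weights_distribution):
    # Single-step maxT correction: compare each |weight| with the maximum
    # |weight| across features in each permutation (simpler than the
    # step-down variant used in the real code).
    max_null = np.abs(weights_distribution).max(axis=1)
    pval_corr = (np.abs(weights)[None, :] <= max_null[:, None]).mean(axis=0)
    return pval_corr


X, y = make_regression(n_samples=80, n_features=10, noise=5.0, random_state=0)
weights, weights_distribution = permutation_test(X, y, Ridge(), n_permutations=100)
pval_corr = permutation_test_pval(weights, weights_distribution)
```

Splitting the API this way lets the expensive permutation loop run once while different correction rules can be applied to the stored distribution afterwards.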