ENH add max_samples parameters in permutation_importances #20431

o1iv3r · 2021-06-30T14:41:55Z

Reference Issues/PRs

Implements a max_samples option for permutation feature importance as discussed in #20245

What does this implement?

If max_samples is a float below 1.0 or an integer smaller than the number of rows of X, feature importance is computed on a subset of the data. The subset is redrawn at random in each iteration to increase the overall coverage of the data. This speeds up the computation on very large data sets and provides more honest uncertainty estimates than subsampling only once before the data is passed to permutation_importance().

Other comments:

I've added this option to all tests where it made sense to me and also added an error-handling test.
This is my first sklearn PR, so please review it thoroughly.
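As a rough sketch of the kind of check such an error-handling test exercises: the helper below turns max_samples into an absolute row count and rejects out-of-range values. The name resolve_max_samples and the exact bounds are assumptions made for this example; it is not a copy of the validation code in scikit-learn.

```python
import numbers


def resolve_max_samples(max_samples, n_samples):
    """Hypothetical helper: convert max_samples into a number of rows.

    Assumed rules for this sketch (not taken from scikit-learn):
    - an int must lie in [1, n_samples] and is used as-is;
    - a float must lie in (0.0, 1.0] and is read as a fraction of the rows.
    """
    if isinstance(max_samples, numbers.Integral):
        if not 0 < max_samples <= n_samples:
            raise ValueError(
                f"max_samples must be in [1, {n_samples}], got {max_samples}"
            )
        return int(max_samples)
    if isinstance(max_samples, numbers.Real):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError(
                f"max_samples must be in (0.0, 1.0], got {max_samples}"
            )
        # Keep at least one row even for very small fractions.
        return max(1, int(max_samples * n_samples))
    raise TypeError(f"max_samples must be int or float, got {type(max_samples)}")
```

A test in this spirit would then assert that negative, zero, and too-large values raise an error, while valid fractions and counts map to the expected number of rows.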

o1iv3r · 2021-07-13T13:03:16Z

Pylint on the inspection folder works for me locally ("Your code has been rated at 5.89/10").
I don't understand why it fails on CircleCI.

thomasjpfan

Thank you for the PR @o1iv3r .

Is there a reference, paper, or blog post that describes this technique or its use in industry?

sklearn/inspection/_permutation_importance.py

doc/whats_new/v1.0.rst

sklearn/inspection/_permutation_importance.py

sklearn/inspection/tests/test_permutation_importance.py

glemaitre · 2021-07-21T21:26:12Z

You probably want to merge main into your branch as well.

o1iv3r · 2021-08-03T10:43:40Z

I addressed all comments and merged the upstream master into my feature branch.

o1iv3r · 2021-08-03T16:02:52Z

I hope it is now the way it should be.

glemaitre · 2021-08-03T18:37:35Z

It was just failing the black check, so I applied black to fix it.
In the future, you can run black on the file or use pre-commit to do it for you.

glemaitre

LGTM on my side. We will need a second review. I am a bit annoyed by the pandas warning, but I don't think there is any way around it other than using ignore_warnings.

glemaitre · 2021-08-03T19:14:06Z

Maybe one of @ogrisel, @rth, or @thomasjpfan has a bit of time to review this one.

glemaitre · 2021-08-04T20:24:23Z

sklearn/inspection/_permutation_importance.py

    for _ in range(n_repeats):
        random_state.shuffle(shuffling_idx)
        if hasattr(X_permuted, "iloc"):
            col = X_permuted.iloc[shuffling_idx, col_idx]
            col.index = X_permuted.index
-            X_permuted.iloc[:, col_idx] = col
+            with ignore_warnings():


If #20673 goes in, there is no need to ignore the warning anymore.
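For readers following the loop in the diff, here is a dependency-free sketch of the overall idea: in every repeat a fresh row subset is drawn, one column is shuffled within that subset, and the drop in score is recorded. All names (subsampled_permutation_importance, score_fn, max_rows) are hypothetical, and computing the baseline on the subset is a choice of this sketch, not a statement about the scikit-learn implementation.

```python
import random


def subsampled_permutation_importance(
    score_fn, X, y, n_repeats=5, max_rows=None, seed=0
):
    """Sketch: mean score drop per column, with a new row subset per repeat.

    X is a list of rows (lists of floats), y a list of targets, and
    score_fn(X, y) returns a float where higher means better.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    k = n_rows if max_rows is None else min(max_rows, n_rows)

    means = []
    for col in range(n_cols):
        drops = []
        for _ in range(n_repeats):
            # Draw a fresh subset of rows (without replacement) for this repeat.
            rows = rng.sample(range(n_rows), k)
            X_sub = [list(X[i]) for i in rows]  # copy so X is never modified
            y_sub = [y[i] for i in rows]
            baseline = score_fn(X_sub, y_sub)

            # Shuffle only the current column within the subset.
            shuffled = [row[col] for row in X_sub]
            rng.shuffle(shuffled)
            for row, value in zip(X_sub, shuffled):
                row[col] = value

            drops.append(baseline - score_fn(X_sub, y_sub))
        means.append(sum(drops) / n_repeats)
    return means


def neg_mse_of_first_column(X, y):
    """Toy 'model' that predicts y from column 0 only (negated MSE)."""
    return -sum((row[0] - t) ** 2 for row, t in zip(X, y)) / len(y)


# Column 0 determines y exactly; column 1 is unrelated to y.
X_toy = [[float(i), float((7 * i) % 13)] for i in range(50)]
y_toy = [row[0] for row in X_toy]
toy_means = subsampled_permutation_importance(
    neg_mse_of_first_column, X_toy, y_toy, n_repeats=3, max_rows=20, seed=0
)
```

In the toy data, shuffling column 0 hurts the score while shuffling column 1 leaves it unchanged, which is the behaviour permutation importance is meant to expose.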

ogrisel

LGTM, just a few suggestions to better explain the motivation behind this option.

doc/whats_new/v1.0.rst

sklearn/inspection/_permutation_importance.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

sklearn/inspection/_permutation_importance.py

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

sklearn/inspection/_permutation_importance.py

glemaitre · 2021-08-09T10:06:37Z

I will let the CIs pass and then we can merge.

glemaitre · 2021-08-09T16:36:04Z

Thanks @o1iv3r

…rn#20431) Co-authored-by: Oli.P <opfaffel@munichre.com> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Oli.P added 5 commits June 30, 2021 12:03

first code change

17b78d9

tests passing

0b21ca0

added tests

13382d8

black code formatting

fd6b349

full test coverage

f37d0df

github-actions bot added the module:inspection label Jun 30, 2021

o1iv3r changed the title ~~Feat imp max samples~~ Implement max_samples option for permutation feature importance to speed up the computation on large data sets Jun 30, 2021

o1iv3r mentioned this pull request Jun 30, 2021

Add option to compute permutation_importance on a random subset of rows #20245

Closed

Oli.P added 3 commits July 13, 2021 10:46

updated changelog

faf56ab

updated tests

9e1ab88

ran black

5a532e7

Enhanced doc; skip linting [lint skip]

08fe849

o1iv3r changed the title ~~Implement max_samples option for permutation feature importance to speed up the computation on large data sets~~ [MRG] Implement max_samples option for permutation feature importance to speed up the computation on large data sets Jul 14, 2021

thomasjpfan reviewed Jul 18, 2021

sklearn/inspection/_permutation_importance.py (outdated, resolved)

sklearn/inspection/_permutation_importance.py (outdated, resolved)

Addressed reviewers comments

79aa068

glemaitre reviewed Jul 21, 2021

glemaitre changed the title ~~[MRG] Implement max_samples option for permutation feature importance to speed up the computation on large data sets~~ ENH add max_samples parameters in permutation_importances Jul 21, 2021

o1iv3r and others added 9 commits July 28, 2021 16:36

Update doc/whats_new/v1.0.rst

5d0b698

Formatting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/inspection/_permutation_importance.py

4eb1ef6

Description improved Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/inspection/_permutation_importance.py

d894e5a

Improved documentation Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/inspection/_permutation_importance.py

0af28a3

Removed comment Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/inspection/_permutation_importance.py

a28cbc0

Formatting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/inspection/tests/test_permutation_importance.py

4579365

Improved commenting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Merge branch 'main' into feat_imp_max_samples

ef29b78

Simplified max_samples test

264ee5f

Merge branch 'main' into feat_imp_max_samples

294ef1e

glemaitre self-assigned this Aug 3, 2021

o1iv3r and others added 3 commits August 3, 2021 17:12

Update sklearn/inspection/tests/test_permutation_importance.py

d3207a8

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

safe_indexing, corrected test_permutation_importance_max_samples_error to actually refer to error handling

00d9654

_calculate_permutation_scores now more memory efficient and uses safe_indexing; corrected test_permutation_importance_max_samples_error to actually refer to error handling

6e8288f

blackify

acc4d48

glemaitre added 4 commits August 3, 2021 20:47

apply changes in test

130824d

avoid warning from pandas

fc8bd18

cover for negative number as well

a0fae71

doc

4c7df77

glemaitre approved these changes Aug 3, 2021

glemaitre added this to NEED ADDITIONAL +1 in Guillaume's pet Aug 3, 2021

jorisvandenbossche mentioned this pull request Aug 4, 2021

FIX Use take in safe_indexing for pandas to avoid SettingWithCopyWarning #20673

Merged

rth self-requested a review August 4, 2021 11:15

glemaitre reviewed Aug 4, 2021

ogrisel approved these changes Aug 5, 2021

doc/whats_new/v1.0.rst (outdated, resolved)

sklearn/inspection/_permutation_importance.py (resolved)

o1iv3r and others added 2 commits August 5, 2021 15:39

Update doc/whats_new/v1.0.rst

3d41dd0

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

improved documentation of max_samples

5a0a513

glemaitre reviewed Aug 5, 2021

sklearn/inspection/_permutation_importance.py (outdated, resolved)

Update sklearn/inspection/_permutation_importance.py

67f3f68

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre reviewed Aug 9, 2021

sklearn/inspection/_permutation_importance.py (outdated, resolved)

Apply suggestions from code review

3a7c29d

remove import

3378a90

glemaitre merged commit 8c6dc48 into scikit-learn:main Aug 9, 2021

glemaitre moved this from NEED ADDITIONAL +1 to MERGED in Guillaume's pet Aug 9, 2021

eddiebergman mentioned this pull request Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

54 tasks
