[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH add max_samples parameters in permutation_importances #20431

Merged
merged 40 commits into from
Aug 9, 2021

Conversation

o1iv3r
Copy link
Contributor
@o1iv3r o1iv3r commented Jun 30, 2021

Reference Issues/PRs

Implements a max_samples option for permutation feature importance as discussed in #20245

What does this implement?

If max_samples is < 1.0 or an integer smaller than the number of rows of X, feature importance is calculated on a subset of the data, which is drawn randomly in each iteration to increase the overall coverage of the data. This speeds up the computation on very large data sets and provides more honest uncertainty estimates than sampling only once before the data is passed to permutation_importance().

Other comments:

  • I've added this option to all tests where it made sense for me and also added an error handling test.
  • This is my first sklearn PR so please look at it thoroughly

@o1iv3r o1iv3r changed the title Feat imp max samples Implement max_samples option for permutation feature importance to speed up the computation on large data sets Jun 30, 2021
@o1iv3r
Copy link
Contributor Author
o1iv3r commented Jul 13, 2021

Pylint on the inspection folder works for me locally "Your code has been rated at 5.89/10".
Don't understand while it fails on CircleCI

@o1iv3r o1iv3r changed the title Implement max_samples option for permutation feature importance to speed up the computation on large data sets [MRG] Implement max_samples option for permutation feature importance to speed up the computation on large data sets Jul 14, 2021
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @o1iv3r .

Is there a reference or paper of this technique or a blog post of the technique being used in industry?

sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
doc/whats_new/v1.0.rst Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/tests/test_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/tests/test_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/tests/test_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/tests/test_permutation_importance.py Outdated Show resolved Hide resolved
@glemaitre glemaitre changed the title [MRG] Implement max_samples option for permutation feature importance to speed up the computation on large data sets ENH add max_samples parameters in permutation_importances Jul 21, 2021
@glemaitre
Copy link
Member

You probably want to merge main into your branch as well.

o1iv3r and others added 9 commits July 28, 2021 16:36
Formatting

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Description improved

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Improved documentation

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Removed comment

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Formatting

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Improved commenting

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@o1iv3r
Copy link
Contributor Author
o1iv3r commented Aug 3, 2021

I addressed all comments and merged the upstream master into my feature branch

@glemaitre glemaitre self-assigned this Aug 3, 2021
o1iv3r and others added 3 commits August 3, 2021 17:12
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…_indexing; corrected test_permutation_importance_max_samples_error to actually refer to error handling
@o1iv3r
Copy link
Contributor Author
o1iv3r commented Aug 3, 2021

I hope it is now the way it should be

@glemaitre
Copy link
Member

It was just not passing the test for black. I applied black to solve this issue.
In the future, you can apply black on the file or use pre-commit to make it for you.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my side. We will need a second review. I am a bit annoyed with the pandas warning and I don't think that there is any other way than using the ignore_warnings.

@glemaitre
Copy link
Member
glemaitre commented Aug 3, 2021

maybe one of @ogrisel @rth @thomasjpfan has a bit of time to review this one.

for _ in range(n_repeats):
random_state.shuffle(shuffling_idx)
if hasattr(X_permuted, "iloc"):
col = X_permuted.iloc[shuffling_idx, col_idx]
col.index = X_permuted.index
X_permuted.iloc[:, col_idx] = col
with ignore_warnings():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If #20673 go in, then there is no need to ignore warning anymore

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few suggestions to explain better the motivation behind this option.

doc/whats_new/v1.0.rst Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Show resolved Hide resolved
o1iv3r and others added 2 commits August 5, 2021 15:39
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@glemaitre
Copy link
Member

I will let the CIs pass and then we can merge.

@glemaitre glemaitre merged commit 8c6dc48 into scikit-learn:main Aug 9, 2021
@glemaitre
Copy link
Member

Thanks @o1iv3r

@glemaitre glemaitre moved this from NEED ADDITIONAL +1 to MERGED in Guillaume's pet Aug 9, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…rn#20431)

Co-authored-by: Oli.P <opfaffel@munichre.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants