ENH add max_samples parameters in permutation_importances #20431
Conversation
Pylint on the inspection folder works for me locally: "Your code has been rated at 5.89/10".
Thank you for the PR @o1iv3r.
Is there a reference paper for this technique, or a blog post of the technique being used in industry?
You probably want to merge
Formatting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Description improved Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Improved documentation Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Removed comment Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Formatting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Improved commenting Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
I addressed all comments and merged the upstream master into my feature branch.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…_indexing; corrected test_permutation_importance_max_samples_error to actually refer to error handling
I hope it is now the way it should be.
It was just not passing the test for
LGTM on my side. We will need a second review. I am a bit annoyed with the pandas warning, and I don't think that there is any other way than using ignore_warnings.
Maybe one of @ogrisel @rth @thomasjpfan has a bit of time to review this one.
```python
for _ in range(n_repeats):
    random_state.shuffle(shuffling_idx)
    if hasattr(X_permuted, "iloc"):
        col = X_permuted.iloc[shuffling_idx, col_idx]
        col.index = X_permuted.index
        X_permuted.iloc[:, col_idx] = col

with ignore_warnings():
```
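The pandas branch of the hunk above re-indexes the shuffled column before assigning it back: a Series selected with `.iloc` keeps its original index labels, and pandas aligns on labels when assigning, which could otherwise undo the shuffle. A standalone sketch of that detail, using hypothetical toy data:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X_permuted = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
col_idx = 0

# Permutation of the row positions, reshuffled on each repeat.
shuffling_idx = np.arange(X_permuted.shape[0])
rng.shuffle(shuffling_idx)

# The selected Series carries the shuffled index labels; resetting them
# to the frame's index prevents pandas from aligning the values back
# into their original order on assignment.
col = X_permuted.iloc[shuffling_idx, col_idx]
col.index = X_permuted.index
X_permuted.iloc[:, col_idx] = col
```

Only column `a` is permuted; column `b` is left untouched.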
If #20673 goes in, then there is no need to ignore warnings anymore.
LGTM, just a few suggestions to better explain the motivation behind this option.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
I will let the CIs pass and then we can merge.
Thanks @o1iv3r
…rn#20431) Co-authored-by: Oli.P <opfaffel@munichre.com> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Implements a max_samples option for permutation feature importance as discussed in #20245
What does this implement?
If max_samples is < 1.0, or an integer smaller than the number of rows of X, feature importance is calculated on a subset of the data that is drawn randomly in each iteration to increase the overall coverage of the data. This speeds up the computation on very large datasets, and provides more honest uncertainty estimates than sampling only once before the data is passed to permutation_importance().
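The per-repeat subsampling described above can be sketched with plain NumPy. This is an illustrative simplification, not the PR's actual implementation: `permutation_importance_sketch` and `score_fn` are hypothetical names, and `score_fn(X, y) -> float` stands in for a fitted estimator plus a metric (higher is better).

```python
import numpy as np

def permutation_importance_sketch(score_fn, X, y, max_samples=1.0,
                                  n_repeats=5, random_state=0):
    """Sketch of permutation importance with per-repeat subsampling.

    max_samples: float fraction of rows, or an int number of rows,
    drawn afresh on every repeat to increase overall data coverage.
    """
    rng = np.random.RandomState(random_state)
    n_rows, n_features = X.shape
    n_sub = max_samples if isinstance(max_samples, int) \
        else int(n_rows * max_samples)
    importances = np.zeros((n_features, n_repeats))
    for r in range(n_repeats):
        # Fresh random subset on each repeat (the point of max_samples).
        idx = rng.choice(n_rows, size=n_sub, replace=False)
        X_sub, y_sub = X[idx], y[idx]
        baseline = score_fn(X_sub, y_sub)
        for j in range(n_features):
            # Shuffle one column and measure the drop in score.
            X_perm = X_sub.copy()
            rng.shuffle(X_perm[:, j])
            importances[j, r] = baseline - score_fn(X_perm, y_sub)
    return importances.mean(axis=1)
```

For example, with `y` equal to the first column of `X` and a negated squared-error score, the sketch assigns a positive importance to column 0 and exactly zero to an unused column.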
Other comments: