

ENH Optimize runtime for IsolationForest #23149

Merged (6 commits) on Apr 29, 2022

Conversation

@MaxwellLZH (Contributor) commented on Apr 17, 2022

Reference Issues/PRs

Towards #19275

What does this implement/fix? Explain your changes.

As shown in the original issue discussion, the check_input argument is set to False inside the Forest classes, while the bagging estimators leave it at its default value of True, so the data is validated repeatedly.

The proposed fix adds a check_input argument to ensemble._bagging._parallel_build_estimators, which is set to False when the base estimator's fit actually supports that argument.
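A minimal sketch of the mechanism (the helper name is hypothetical; the actual change lives in _parallel_build_estimators): forward check_input=False only when the base estimator's fit() accepts it.

# Illustrative sketch, not the merged patch: bind check_input=False
# only when the base estimator's fit() signature accepts it.
from functools import partial
from inspect import signature

def _make_estimator_fit(estimator, check_input=False):
    if not check_input and "check_input" in signature(estimator.fit).parameters:
        # The bagging meta-estimator already validated X once, so the
        # individual estimators can safely skip re-validation.
        return partial(estimator.fit, check_input=False)
    return estimator.fit

For a tree-based base estimator such as the ExtraTreeRegressor used by IsolationForest, fit does expose check_input, so validation happens only once.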

Performance Impact

Code used for profiling:

from sklearn.datasets import make_classification
from scipy.sparse import csc_matrix
from sklearn.ensemble import IsolationForest

# Sparse benchmark: pre-convert to CSC with sorted indices so fit()
# does not have to convert the input itself.
X, y = make_classification(n_samples=50000, n_features=1000)
X = csc_matrix(X)
X.sort_indices()
IsolationForest(n_estimators=10, max_samples=256, n_jobs=1).fit(X)
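For reference, a minimal way to reproduce the quoted totals (assumption: they are end-to-end wall-clock fit times; the screenshots themselves came from a profiler):

# Simple wall-clock measurement (assumption: the quoted totals are
# end-to-end fit times, not per-function profiler output).
from time import perf_counter

start = perf_counter()
IsolationForest(n_estimators=10, max_samples=256, n_jobs=1).fit(X)
print(f"fit: {perf_counter() - start:.3f}s")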

Before (total time: 6.78s): [profiler screenshot]

After (total time: 5.017s): [profiler screenshot]

@thomasjpfan (Member) left a comment

Thank you for the PR!

This looks reasonable to do. What does the benchmark look like for dense data?

Review thread on sklearn/ensemble/_bagging.py (outdated, resolved)
@MaxwellLZH (Contributor, author) commented:

The benchmark for dense data shows a smaller improvement than for sparse input.
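For reference, the dense benchmark is presumably the same script as above with the CSC conversion dropped:

# Assumed dense variant of the earlier benchmark: no CSC conversion.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=50000, n_features=1000)
IsolationForest(n_estimators=10, max_samples=256, n_jobs=1).fit(X)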

Before (8.29s): [profiler screenshot]

After (7.92s): [profiler screenshot]

@MaxwellLZH (Contributor, author) commented:

After more profiling, it seems the indexing operation in estimator.fit(X[:, features], y) in _bagging.py::_parallel_build_estimators is quite expensive. I tried skipping the indexing when max_features equals the number of features, using the following logic:

# only index arrays when needed
X_ = X if max_features == X.shape[1] else X[:, features]
# estimator_fit is estimator.fit, possibly with check_input=False bound
estimator_fit(X_, y, sample_weight=curr_sample_weight)

and saw improvements for both sparse and dense inputs.

The problem will still exist, however, when max_features is not set to 1.0 for IsolationForest. Since we randomly pick a single feature to split on anyway, I'm wondering whether we can always set max_features to 1.0 for IsolationForest to get a faster runtime.

Sparse input:
Before (4.97s): [profiler screenshot]
After (0.241s): [profiler screenshot]

Dense input:
Before (7.99s): [profiler screenshot]
After (5.19s): [profiler screenshot]

@@ -120,10 +135,12 @@ def _parallel_build_estimators(
     not_indices_mask = ~indices_to_mask(indices, n_samples)
     curr_sample_weight[not_indices_mask] = 0

-    estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)
+    # only index arrays when needed
+    X_ = X if max_features == X.shape[1] else X[:, features]
Member comment:

I would do this in a separate follow-up PR. The check_input change is already an improvement that can be merged by itself.

As for this change, we can only do this optimization if bootstrap_features is False. bootstrap_features is always False for IsolationForest, but not for Bagging*. (If bootstrap_features is True, then the features are sampled with replacement which can result in repeated features.)
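A sketch of the guard being described here (hypothetical helper, not the merged diff):

def _maybe_index_features(X, features, max_features, bootstrap_features):
    # Hypothetical guard: the no-indexing shortcut is only safe when every
    # column is used exactly once, i.e. no feature bootstrapping and
    # max_features equal to the total number of columns.
    if not bootstrap_features and max_features == X.shape[1]:
        return X  # all features sampled without replacement: no copy needed
    return X[:, features]  # subsampled or repeated columns require indexing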

@MaxwellLZH (Contributor, author) commented on Apr 24, 2022:

Got it! I've reverted the changes in this PR and will draft another PR once this one is merged :)

Commit: "revert last commit" (This reverts commit ff4e429.)
@thomasjpfan (Member) left a comment:

Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Efficiency|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
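For example, an entry along these lines (reStructuredText, wording illustrative):

- |Efficiency| :class:`ensemble.IsolationForest` now fits faster by avoiding
  repeated input validation in the underlying bagging machinery.
  :pr:`23149` by :user:`MaxwellLZH`.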

@thomasjpfan (Member) left a comment:

LGTM

@thomasjpfan added the "Quick Review" label on Apr 28, 2022
@ogrisel (Member) commented on Apr 29, 2022:

Merging this one. The real fix for the sparse case should happen in a follow-up PR probably targeting 1.2.

@ogrisel merged commit 767e9ae into scikit-learn:main on Apr 29, 2022
@ogrisel added the "To backport" label on Apr 29, 2022
@ogrisel (Member) commented on Apr 29, 2022:

@jeremiedbb I added the "to backport" label to not forget to include this one in the 1.1.X branch before the final 1.1.0 release.

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request May 10, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
Labels: module:ensemble, Quick Review, To backport