[MRG] Fix euclidean distances (float32) batch management #13910
Conversation
…at each step of the x loop
LGTM but maybe we could expose a `batch_size="auto"` argument in `euclidean_distances` to make it possible for the caller to pass a specific integer value instead of the memory size heuristic, and to make it easier to test the impact of batching.

In particular I would like to parameterize the test to make sure that it works with `batch_size < n_samples`, with `batch_size >= n_samples`, and also when `n_samples % batch_size != 0`.
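A parameterized test along those lines could be sketched as below. This is a self-contained illustration, not the scikit-learn implementation: `euclidean_distances_batched` is a hypothetical helper that takes an explicit `batch_size` (the real function derives its chunk size from `working_memory` instead), and the reference result is computed unbatched with plain NumPy broadcasting.

```python
import numpy as np

def euclidean_distances_batched(X, Y, batch_size):
    # Hypothetical batched implementation: process rows of X in chunks,
    # and restart the iteration over Y chunks for every X chunk (reusing
    # an exhausted Y generator across X chunks is exactly the bug this
    # PR fixes).
    out = np.empty((X.shape[0], Y.shape[0]))
    for i in range(0, X.shape[0], batch_size):
        X_chunk = X[i:i + batch_size]
        for j in range(0, Y.shape[0], batch_size):
            Y_chunk = Y[j:j + batch_size]
            diff = X_chunk[:, None, :] - Y_chunk[None, :, :]
            out[i:i + batch_size, j:j + batch_size] = np.sqrt((diff ** 2).sum(-1))
    return out

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
Y = rng.rand(7, 3)
reference = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

# The three regimes mentioned above: batch_size < n_samples,
# batch_size >= n_samples, and n_samples % batch_size != 0.
for batch_size in (3, 20, 4):
    assert np.allclose(euclidean_distances_batched(X, Y, batch_size), reference)
```

With an explicit `batch_size` argument each of the three regimes can be pinned down directly in the parametrization, which is the explicitness being argued for here.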
Seems fair.
But should we make it public? I'd rather put it as an argument of
At least for the bugfix we can test it on. I would not be opposed to expose it in
Testability without an additional public parameter is one reason for `working_memory` being a global config.
Controlling the batch size with the working memory in the purpose of
Alright but it makes the tests much less explicit. Maybe the code that computes the batch size from the
The test failure is unrelated: #13903.
LGTM as well.
Alright but it makes the tests much less explicit.
Well in general one could do,
def test_something():
    with sklearn.config_context(working_memory=<somevalue>):
        some_function()
that is fairly explicit.
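For context, the rule tying `working_memory` to the number of rows per chunk is roughly the following. This is a simplified sketch of the heuristic (modeled on `sklearn.utils.get_chunk_n_rows`, with `working_memory` in MiB); the exact scikit-learn behavior, including its warning when even one row exceeds the budget, is not reproduced here.

```python
def chunk_n_rows(row_bytes, working_memory_mib=1024, max_n_rows=None):
    # Number of rows whose intermediate results fit in the budget
    # (MiB converted to bytes).
    n = int(working_memory_mib * (2 ** 20) // row_bytes)
    if max_n_rows is not None:
        n = min(n, max_n_rows)
    # Always process at least one row, even if it exceeds the budget.
    return max(1, n)

# e.g. one float64 distance row against 1_000_000 samples costs 8 MB,
# so a 1024 MiB budget allows 134 rows per chunk.
print(chunk_n_rows(row_bytes=8 * 1_000_000))  # -> 134
```

This is what makes the batch size implicit in the tests: a test controls it only indirectly, by picking a `working_memory` value that this formula maps to the chunk size it wants.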
:mod:`sklearn.metrics`
......................

- |Fix| Fixed a bug in :func:`euclidean_distances` where a part of the distance
Maybe mention that it is a regression in 0.20.1, not an earlier bug that was fixed.
It was introduced in 0.21.0 I think.
@gokceneraslan yes, the diff below correctly notes "(regression introduced in 0.21)".
Should we be aiming to release this pronto?
I guess so.
It's not explicit in the sense that we are testing the impact of the batch size. The rule that ties the working memory to the batch size is quite implicit, and there is no easy way to write a test that guarantees that all three cases I mentioned above are covered.
That indeed. Merging, thanks @jeremiedbb !
Fixes #13905
Silly mistake: the Y batch generator was consumed after the first step of the X loop. It needs to be reset at each step.
The tests didn't cover the case of more than one batch. I updated them by taking larger datasets.
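The failure mode described above can be reproduced with a minimal, self-contained sketch (no scikit-learn code involved): a generator created once before the outer loop is exhausted after the first X chunk, so every later X chunk silently pairs with nothing.

```python
def chunks(seq, size):
    # Yield successive slices of `seq` of length `size`.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

X, Y = [1, 2, 3, 4], [10, 20, 30, 40]

# Buggy version: the Y generator is created once, so it is consumed
# entirely during the first X chunk and yields nothing afterwards.
y_gen = chunks(Y, 2)
buggy_pairs = [(xc, yc) for xc in chunks(X, 2) for yc in y_gen]

# Fixed version: recreate the Y generator at each step of the X loop.
fixed_pairs = [(xc, yc) for xc in chunks(X, 2) for yc in chunks(Y, 2)]

print(len(buggy_pairs), len(fixed_pairs))  # -> 2 4
```

The buggy variant produces only 2 of the expected 4 (X chunk, Y chunk) pairs, which is why only the single-batch case ever looked correct before the fix.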