
[MRG] Fix euclidean distances (float32) batch management #13910

Merged: 7 commits, May 20, 2019

Conversation

@jeremiedbb (Member)

Fixes #13905

Silly mistake: the Y batch generator was consumed after the first step of the X loop. It needs to be reset at each step.

The tests did not cover the case of more than one batch. I updated them to use larger datasets.
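
For illustration, a minimal self-contained sketch of the bug pattern (a simplified stand-in, not the actual code of _euclidean_distances_upcast):

import numpy as np
from sklearn.utils import gen_batches

n_samples_X, n_samples_Y, batch_size = 10, 10, 4
seen = np.zeros((n_samples_X, n_samples_Y), dtype=bool)

y_batches = gen_batches(n_samples_Y, batch_size)  # bug: created once
for x_slice in gen_batches(n_samples_X, batch_size):
    # After the first x_slice this generator is exhausted, so the
    # inner loop silently does nothing for the remaining X batches.
    for y_slice in y_batches:
        seen[x_slice, y_slice] = True
print(seen.all())  # False: part of the distance matrix is never filled

# The fix: recreate the Y batch generator at each step of the X loop.
seen[:] = False
for x_slice in gen_batches(n_samples_X, batch_size):
    for y_slice in gen_batches(n_samples_Y, batch_size):
        seen[x_slice, y_slice] = True
print(seen.all())  # True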

doc/whats_new/v0.21.rst (review thread, resolved)
@ogrisel added this to the 0.21.2 milestone on May 20, 2019
@ogrisel (Member) left a comment

LGTM, but maybe we could expose a batch_size="auto" argument in euclidean_distances to make it possible for the caller to pass a specific integer value instead of relying on the memory-size heuristic, and to make it easier to test the impact of batching.

In particular I would like to parameterize the test to make sure that it works with batch_size < n_samples and batch_size >= n_samples.

And also when n_samples % batch_size != 0.
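
A rough sketch of what that could look like (hypothetical; batch_size is not an actual parameter of euclidean_distances, and the real signature has additional keyword arguments):

def euclidean_distances(X, Y=None, squared=False, batch_size="auto"):
    # Hypothetical signature sketch: "auto" would keep today's
    # working-memory heuristic, while an integer would force a fixed
    # number of rows per block, which is easy to test against.
    ...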

@jeremiedbb (Member, Author)

Seems fair.

@jeremiedbb (Member, Author)

But should we make it public? I'd rather add it as an argument of _euclidean_distances_upcast and test that function directly. What do you think?

@ogrisel (Member) commented May 20, 2019

But should we make it public? I'd rather add it as an argument of _euclidean_distances_upcast and test that function directly. What do you think?

At least for the bugfix we can test it on _euclidean_distances_upcast.

I would not be opposed to exposing it in euclidean_distances as well, but I don't want to delay this PR if others do not agree.

@jnothman (Member)

Testability without an additional public parameter is one reason for working_memory being a global config

@jeremiedbb (Member, Author) commented May 20, 2019

Testability without an additional public parameter is one reason for working_memory being a global config

Controlling the batch size through the working memory was designed for pairwise_distances_chunked. I'm not sure how it would interact if we also controlled another batch size with the working memory inside pairwise_distances.
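
For reference, this is how the working memory already drives the chunk size in pairwise_distances_chunked (standard scikit-learn usage; the array shapes here are arbitrary):

import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(1000, 20)

# working_memory is a budget in MiB; a smaller budget yields more,
# smaller chunks of the (1000, 1000) distance matrix.
with sklearn.config_context(working_memory=1):
    n_chunks = sum(1 for _ in pairwise_distances_chunked(X))
print(n_chunks)  # several chunks with such a small budget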

@ogrisel (Member) commented May 20, 2019

Testability without an additional public parameter is one reason for working_memory being a global config

Alright, but it makes the tests much less explicit. Maybe the code that computes the batch size from the working_memory parameter and the shape of the data could be extracted into a private function that is also covered by dedicated unit tests.
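
Something along these lines, as a sketch (the helper name and the exact formula are made up for illustration; sklearn.utils.get_chunk_n_rows plays a similar role for pairwise_distances_chunked):

from sklearn import get_config

def _batch_size_from_working_memory(n_features, itemsize=8):
    # Hypothetical private helper: working_memory is expressed in MiB
    # in scikit-learn's global configuration.
    budget_bytes = get_config()["working_memory"] * 2**20
    # A float64 (upcast) block of shape (batch_size, n_features) has
    # to fit in the budget.
    return max(1, int(budget_bytes // (n_features * itemsize)))

A dedicated unit test could then assert the returned batch size directly for a given working_memory, independently of euclidean_distances.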

@ogrisel (Member) commented May 20, 2019

The test failure is unrelated: #13903.

@rth (Member) left a comment

LGTM as well.

Alright, but it makes the tests much less explicit.

Well, in general one could do:

import sklearn

def test_something():
    with sklearn.config_context(working_memory=<somevalue>):
        some_function()

that is fairly explicit.

:mod:`sklearn.metrics`
......................

- |Fix| Fixed a bug in :func:`euclidean_distances` where a part of the distance […]
(Member) left a comment

Maybe mention that it is a regression in 0.20.1, not an earlier bug that was fixed.

@gokceneraslan replied:

It was introduced in 0.21.0 I think.

(Member) replied:

@gokceneraslan yes, the diff below correctly notes "(regression introduced in 0.21)".

@jnothman (Member) commented May 20, 2019 via email:

Should we be aiming to release this pronto?

@ogrisel (Member) commented May 20, 2019

Should we be aiming to release this pronto?

I guess so.

@ogrisel (Member) commented May 20, 2019

that is fairly explicit.

It's not explicit in the sense that we are testing the impact of the batch size. The rule that ties the working memory to the batch size is quite implicit, and there is no easy way to write a test that guarantees that all three cases I mentioned above are covered: batch_size < n_samples (with n_samples % batch_size != 0 and n_samples % batch_size == 0) and batch_size >= n_samples.
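
A sketch of what such a parametrized test could look like, assuming _euclidean_distances_upcast grows the batch_size argument discussed above (not the released API):

import numpy as np
import pytest

@pytest.mark.parametrize("n_samples, batch_size", [
    (100, 30),   # batch_size < n_samples, n_samples % batch_size != 0
    (100, 25),   # batch_size < n_samples, n_samples % batch_size == 0
    (100, 200),  # batch_size >= n_samples
])
def test_euclidean_distances_upcast(n_samples, batch_size):
    rng = np.random.RandomState(0)
    X = rng.random_sample((n_samples, 10)).astype(np.float32)
    # Compare the batched float32 upcast path against the float64
    # reference here, e.g.:
    # D = _euclidean_distances_upcast(X, batch_size=batch_size)
    # assert_allclose(D, reference_float64_distances, rtol=1e-6)
    ...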

@rth (Member) commented May 20, 2019

It's not explicit in the sense that we are testing the impact of the batch size.

That indeed.

Merging, thanks @jeremiedbb!

@rth rth merged commit 98aefc1 into scikit-learn:master May 20, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request May 21, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Linked issue #13905: Untreated overflow (?) for float32 in euclidean_distances new in sklearn 21.1