
[MRG] Fix euclidean distances (float32) batch management #13910

Merged: 7 commits, May 20, 2019

Conversation

@jeremiedbb (Member)

Fixes #13905

Silly mistake: the Y batch generator was consumed after the first step of the X loop. It needs to be reset at each step.

The tests did not cover the case of more than one batch. I updated them to use larger datasets.
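
For illustration, a minimal self-contained sketch of the bug pattern (a simplified stand-in, not the actual code of _euclidean_distances_upcast):

import numpy as np
from sklearn.utils import gen_batches

n_samples_X, n_samples_Y, batch_size = 10, 10, 4
seen = np.zeros((n_samples_X, n_samples_Y), dtype=bool)

y_batches = gen_batches(n_samples_Y, batch_size)  # bug: created once
for x_slice in gen_batches(n_samples_X, batch_size):
    # After the first x_slice this generator is exhausted, so the
    # inner loop silently does nothing for the remaining X batches.
    for y_slice in y_batches:
        seen[x_slice, y_slice] = True
print(seen.all())  # False: part of the distance matrix is never filled

# The fix: recreate the Y batch generator at each step of the X loop.
seen[:] = False
for x_slice in gen_batches(n_samples_X, batch_size):
    for y_slice in gen_batches(n_samples_Y, batch_size):
        seen[x_slice, y_slice] = True
print(seen.all())  # True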

doc/whats_new/v0.21.rst (review thread, resolved)
@ogrisel added this to the 0.21.2 milestone on May 20, 2019
@ogrisel (Member) left a comment

LGTM, but maybe we could expose a batch_size="auto" argument in euclidean_distances to make it possible for the caller to pass a specific integer value instead of relying on the memory-size heuristic, and to make it easier to test the impact of batching.

In particular I would like to parameterize the test to make sure that it works with batch_size < n_samples and batch_size >= n_samples.

And also when n_samples % batch_size != 0.
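
A rough sketch of what that could look like (hypothetical; batch_size is not an actual parameter of euclidean_distances, and the real signature has additional keyword arguments):

def euclidean_distances(X, Y=None, squared=False, batch_size="auto"):
    # Hypothetical signature sketch: "auto" would keep today's
    # working-memory heuristic, while an integer would force a fixed
    # number of rows per block, which is easy to test against.
    ...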

@jeremiedbb (Member, Author)

Seems fair.

@jeremiedbb (Member, Author)

But should we make it public? I'd rather add it as an argument of _euclidean_distances_upcast and test that function directly. What do you think?

@ogrisel (Member) commented May 20, 2019

But should we make it public? I'd rather add it as an argument of _euclidean_distances_upcast and test that function directly. What do you think?

At least for the bugfix we can test it on _euclidean_distances_upcast.

I would not be opposed to exposing it in euclidean_distances as well, but I don't want to delay this PR if others do not agree.

@jnothman (Member)

Testability without an additional public parameter is one reason for working_memory being a global config

@jeremiedbb (Member, Author) commented May 20, 2019

Testability without an additional public parameter is one reason for working_memory being a global config

Controlling the batch size through the working memory was designed for pairwise_distances_chunked. I'm not sure how it would interact if we also controlled another batch size with the working memory inside pairwise_distances.
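
For reference, this is how the working memory already drives the chunk size in pairwise_distances_chunked (standard scikit-learn usage; the array shapes here are arbitrary):

import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(1000, 20)

# working_memory is a budget in MiB; a smaller budget yields more,
# smaller chunks of the (1000, 1000) distance matrix.
with sklearn.config_context(working_memory=1):
    n_chunks = sum(1 for _ in pairwise_distances_chunked(X))
print(n_chunks)  # several chunks with such a small budget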

@ogrisel (Member) commented May 20, 2019

Testability without an additional public parameter is one reason for working_memory being a global config

Alright, but it makes the tests much less explicit. Maybe the code that computes the batch size from the working_memory parameter and the shape of the data could be extracted into a private function that is also covered by dedicated unit tests.
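
Something along these lines, as a sketch (the helper name and the exact formula are made up for illustration; sklearn.utils.get_chunk_n_rows plays a similar role for pairwise_distances_chunked):

from sklearn import get_config

def _batch_size_from_working_memory(n_features, itemsize=8):
    # Hypothetical private helper: working_memory is expressed in MiB
    # in scikit-learn's global configuration.
    budget_bytes = get_config()["working_memory"] * 2**20
    # A float64 (upcast) block of shape (batch_size, n_features) has
    # to fit in the budget.
    return max(1, int(budget_bytes // (n_features * itemsize)))

A dedicated unit test could then assert the returned batch size directly for a given working_memory, independently of euclidean_distances.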

@ogrisel (Member) commented May 20, 2019

The test failure is unrelated: #13903.

@rth (Member) left a comment

LGTM as well.

Alright, but it makes the tests much less explicit.

Well, in general one could do:

import sklearn

def test_something():
    with sklearn.config_context(working_memory=<somevalue>):
        some_function()

that is fairly explicit.

:mod:`sklearn.metrics`
......................

- |Fix| Fixed a bug in :func:`euclidean_distances` where a part of the distance […]
(Member) left a comment

Maybe mention that it is a regression in 0.20.1, not an earlier bug that was fixed.

@gokceneraslan replied:

It was introduced in 0.21.0 I think.

(Member) replied:

@gokceneraslan yes, the diff below correctly notes "(regression introduced in 0.21)".

@jnothman (Member) commented May 20, 2019 via email:

Should we be aiming to release this pronto?

@ogrisel (Member) commented May 20, 2019

Should we be aiming to release this pronto?

I guess so.

@ogrisel (Member) commented May 20, 2019

that is fairly explicit.

It's not explicit in the sense that we are testing the impact of the batch size. The rule that ties the working memory to the batch size is quite implicit, and there is no easy way to write a test that guarantees that all three cases I mentioned above are covered: batch_size < n_samples (with n_samples % batch_size != 0 and n_samples % batch_size == 0) and batch_size >= n_samples.
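
A sketch of what such a parametrized test could look like, assuming _euclidean_distances_upcast grows the batch_size argument discussed above (not the released API):

import numpy as np
import pytest

@pytest.mark.parametrize("n_samples, batch_size", [
    (100, 30),   # batch_size < n_samples, n_samples % batch_size != 0
    (100, 25),   # batch_size < n_samples, n_samples % batch_size == 0
    (100, 200),  # batch_size >= n_samples
])
def test_euclidean_distances_upcast(n_samples, batch_size):
    rng = np.random.RandomState(0)
    X = rng.random_sample((n_samples, 10)).astype(np.float32)
    # Compare the batched float32 upcast path against the float64
    # reference here, e.g.:
    # D = _euclidean_distances_upcast(X, batch_size=batch_size)
    # assert_allclose(D, reference_float64_distances, rtol=1e-6)
    ...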

@rth (Member) commented May 20, 2019

It's not explicit in the sense that we are testing the impact of the batch size.

That indeed.

Merging, thanks @jeremiedbb!

@rth rth merged commit 98aefc1 into scikit-learn:master May 20, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request May 21, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Linked issue #13905: Untreated overflow (?) for float32 in euclidean_distances new in sklearn 21.1