FIX Raise error when n_neighbours >= n_samples / 2 in manifold.trustworthiness #23033

Micky774 · 2022-04-02T22:27:22Z

Reference Issues/PRs

Resolves #18832 (stalled)
Fixes #18567

What does this implement/fix? Explain your changes.

PR #18832: Added warning to manifold.trustworthiness when n_neighbors > n_features.
This PR: Improved tests and wording and addressed reviewer comments.

Any other comments?

# Conflicts: # sklearn/manifold/tests/test_t_sne.py

Micky774 · 2022-04-02T22:27:45Z

@jjerphan Picking up from PR #18832

thomasjpfan

Thank you for the PR!

In this case, I prefer to error and have this be a bug fix, because the metric seems meaningless for n_neighbors >= n_samples / 2.
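
A minimal sketch of the kind of check being proposed here, assuming a hypothetical helper named `check_n_neighbors`; the helper name and the message wording are illustrative and not taken from the merged diff:

```python
def check_n_neighbors(n_neighbors: int, n_samples: int) -> None:
    """Hypothetical helper: reject neighborhood sizes of n_samples / 2 or more."""
    if n_neighbors >= n_samples / 2:
        # Illustrative message wording, not the library's actual text.
        raise ValueError(
            f"n_neighbors ({n_neighbors}) should be less than "
            f"n_samples / 2 ({n_samples / 2})"
        )


# Passes silently: 2 < 3.5.
check_n_neighbors(n_neighbors=2, n_samples=7)

# Raises: 5 >= 3.5.
try:
    check_n_neighbors(n_neighbors=5, n_samples=7)
except ValueError as exc:
    print(exc)
```

Raising instead of warning means a caller cannot silently receive a score computed outside the range in which the metric is defined.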

jeremiedbb · 2022-04-04T14:37:26Z

According to #18567 (comment), the metric is only valid when 2.0 * n_samples - 3.0 * n_neighbors - 1.0 >= 0. I think this is the valid range of values we should use: see the code of the function, where values outside this range give a result > 1.

jeremiedbb · 2022-04-04T14:37:52Z

> In this case, I prefer to error and have this be a bug fix

I agree, an error seems more appropriate

thomasjpfan · 2022-04-04T19:02:09Z

The referenced paper (page 4 of the PDF) has a footnote that states the bounds for n_neighbors:

> For clarity we have only included the scaling for neighborhoods of size k < N/2

jeremiedbb · 2022-04-04T21:32:11Z

My bad, actually k < n/2 is stricter than 2n - 3k - 1 > 0, so there's no risk of having the result be > 1. I'm ok with the k < n/2 range.
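
The relationship between the two bounds can be checked by brute force. The sketch below uses two hypothetical helper functions and an arbitrary search range; it only compares the two inequalities from the thread and does not call scikit-learn:

```python
def normalizer_is_positive(n_samples: int, n_neighbors: int) -> bool:
    """Looser condition from the thread: 2n - 3k - 1 > 0."""
    return 2 * n_samples - 3 * n_neighbors - 1 > 0


def below_half(n_samples: int, n_neighbors: int) -> bool:
    """Stricter condition from the paper's footnote: k < n / 2."""
    return n_neighbors < n_samples / 2


# Every (n, k) pair with k < n/2 also satisfies 2n - 3k - 1 > 0.
implied = all(
    normalizer_is_positive(n, k)
    for n in range(3, 200)
    for k in range(1, n)
    if below_half(n, k)
)

# The converse fails: these pairs pass the looser check but not k < n/2.
only_looser = [
    (n, k)
    for n in range(3, 12)
    for k in range(1, n)
    if normalizer_is_positive(n, k) and not below_half(n, k)
]

print(implied)
print(only_looser)
```

Algebraically, k < n/2 gives 3k < 3n/2, so 2n - 3k - 1 > n/2 - 1, which is positive for any n > 2.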

thomasjpfan

Minor comments, otherwise LGTM

jeremiedbb

LGTM. Thanks @Micky774

taha-yassine · 2022-04-09T22:35:57Z

Hi @Micky774 @thomasjpfan @jeremiedbb, as I mentioned in my comment, I don't think this is the correct way to handle the issue. There is indeed a bug in the code, and it's an easy one to fix. On the other hand, trustworthiness doesn't require that n_neighbors be < n_samples / 2. I can submit a PR to revert the wrongful (IMO) bugfix and fix the actual bug.

thomasjpfan · 2022-04-10T01:08:16Z

From the original reference paper (footnote on page 4 of the PDF), the author is explicit about the requirement:

> For clarity we have only included the scaling for neighborhoods of size k < N/2

When running the code snippet in #18567 (comment), the inverted index and ranks are:

# inverted index
[[7 4 5 6 2 1 3]
 [2 7 5 4 3 1 6]
 [4 6 7 1 5 2 3]
 [5 6 1 7 3 2 4]
 [2 5 4 6 7 3 1]
 [1 2 3 6 5 7 4]
 [2 6 4 5 1 3 7]]

# ranks
[[ 0 -3 -1 -4 -2]
 [-3  0 -4 -2 -1]
 [-1  0 -3  1 -4]
 [-3 -2 -1 -4  0]
 [-2 -1 -4 -3  1]
 [ 0  1 -2 -4 -1]
 [-4  0 -2 -1 -3]]

These do not overflow when summed. Furthermore, when running on 64-bit Python, inverted_index is already int64:

import numpy as np
inverted_index = np.zeros((10, 2), dtype=int)
print(inverted_index.dtype)
# int64

taha-yassine · 2022-04-11T09:26:36Z

You are actually right @thomasjpfan, k should be < N/2. However, I still have the int32 issue. Running your code snippet returns int32 in my case, even though I am using a 64-bit Python interpreter. It seems to be related to how Windows handles the long int type, as discussed here. How do we take it from here? Should I open a new issue to discuss the details further, or should I send a PR directly to force the use of int64?

thomasjpfan · 2022-04-11T16:27:57Z

> Should I open a new issue to further discuss the details or should I send a PR directly to force the use of int64?

The first step is to open an issue describing the problem with using dtype=int. Since we use dtype=int in other parts of the code base, we may want to change it everywhere so the code behaves the same on different platforms.
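
The platform difference can be illustrated without NumPy. The sketch below assumes the NumPy 1.x behavior in effect at the time of this thread, where `dtype=int` maps to the platform's C `long`; it uses only the standard library to show that a 64-bit interpreter does not guarantee a 64-bit `long`:

```python
import ctypes
import struct

# Pointer width tells us whether the interpreter itself is 32- or 64-bit.
pointer_bits = struct.calcsize("P") * 8

# Width of C long: 64 bits on 64-bit Linux/macOS, 32 bits on 64-bit Windows.
c_long_bits = ctypes.sizeof(ctypes.c_long) * 8

print(f"interpreter pointer width: {pointer_bits} bits")
print(f"C long width: {c_long_bits} bits")

# ctypes performs no overflow checking, so a value one past the int32
# maximum silently wraps around to the int32 minimum.
wrapped = ctypes.c_int32(2**31).value
print(wrapped)
```

Requesting an explicit fixed-width type (for example `np.int64` in NumPy) is one way to get the same behavior on every platform, which is the direction suggested above.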

…rthiness (scikit-learn#23033) Co-authored-by: Shao Yang Hong <hongsy2006@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

hongshaoyang and others added 9 commits November 13, 2020 21:57

Add warning to trustworthiness

cde83a7

Add test

16526ca

Fix lint

95e2ba4

Merge remote-tracking branch 'upstream/main' into 18567-trustworthiness

5877b24

Test warning in separate test

2248d85

Merge remote-tracking branch 'upstream/main' into 18567-trustworthiness

9fab7b5

# Conflicts: # sklearn/manifold/tests/test_t_sne.py

Changelog

5203b2a

Merge branch 'main' into manifold_trust

c4477ef

Added non-regression test and improved wording

d7c4e97

github-actions bot added module:manifold Documentation labels Apr 2, 2022

Micky774 added 2 commits April 3, 2022 10:45

Fixed references formatting in sphinx

49a3013

Added changelog entry

8861f4a

thomasjpfan reviewed Apr 3, 2022

sklearn/manifold/tests/test_t_sne.py

sklearn/manifold/_t_sne.py

sklearn/manifold/_t_sne.py

thomasjpfan changed the title ~~[MRG] DOC: Add warning to manifold.trustworthiness~~ FIX: Add warning to manifold.trustworthiness Apr 3, 2022

Improved testing and promoted from UserWarning to ValueError

27bffdb

Merge branch 'main' into manifold_trust

8351361

thomasjpfan changed the title ~~FIX: Add warning to manifold.trustworthiness~~ FIX Raise error when n_neighbours >= n_samples / 2 in manifold.trustworthiness Apr 5, 2022

thomasjpfan approved these changes Apr 5, 2022

doc/whats_new/v1.1.rst

sklearn/manifold/tests/test_t_sne.py

sklearn/manifold/tests/test_t_sne.py

Micky774 and others added 4 commits April 7, 2022 13:45

Update doc/whats_new/v1.1.rst

655b448

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Update sklearn/manifold/tests/test_t_sne.py

c1f7e6f

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Update sklearn/manifold/tests/test_t_sne.py

224d0db

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Merge branch 'main' into manifold_trust

dbb5e1e

jeremiedbb reviewed Apr 8, 2022

sklearn/manifold/tests/test_t_sne.py

sklearn/manifold/_t_sne.py

Apply suggestions from code review

8a153b5

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

Merge branch 'main' into manifold_trust

7a559b9

jeremiedbb reviewed Apr 9, 2022

sklearn/manifold/tests/test_t_sne.py

Update sklearn/manifold/tests/test_t_sne.py

5ba7655

jeremiedbb approved these changes Apr 9, 2022

jeremiedbb merged commit ade9014 into scikit-learn:main Apr 9, 2022

Micky774 deleted the manifold_trust branch April 10, 2022 03:07
