[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Raise error when n_neighbours >= n_samples / 2 in manifold.trustworthiness #23033

Merged
merged 20 commits into from
Apr 9, 2022

Conversation

Micky774
Copy link
Contributor
@Micky774 Micky774 commented Apr 2, 2022

Reference Issues/PRs

Resolves #18832 (stalled)
Fixes #18567

What does this implement/fix? Explain your changes.

PR #18832: Added warning to manifold.trustworthiness when n_neighbors > n_features.
This PR: Improved tests and wording and addressed reviewer comments.

Any other comments?

@Micky774
Copy link
Contributor Author
Micky774 commented Apr 2, 2022

@jjerphan Picking up from PR #18832

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

In this case, I prefer to error and have this be a bug fix, because the metric seems meaningless for n_neighbors >= n_samples / 2.

sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Show resolved Hide resolved
@thomasjpfan thomasjpfan changed the title [MRG] DOC: Add warning to manifold.trustworthiness FIX: Add warning to manifold.trustworthiness Apr 3, 2022
@jeremiedbb
Copy link
Member

According to #18567 (comment), the metric is only valid when 2.0 * n_samples - 3.0 * n_neighbors - 1.0 >= 0. I think this is the valid range of values we should use (see the code of the function. Outside of the range gives result > 1).

@jeremiedbb
Copy link
Member

In this case, I prefer to error and have this be a bug fix

I agree, an error seems more appropriate

@thomasjpfan
Copy link
Member
thomasjpfan commented Apr 4, 2022

From the referenced paper (Page 4 of the PDF), there is a footnote that states the bounds for n_neighbors:

For clarify we have only included the scaling for neighborhoods of size k < N/2

@jeremiedbb
Copy link
Member

My bad, actually k < n/2 is stricter than 2n - 3k - 1 > 0, so there's no risk of having the result be > 1. I'm ok with the k < n/2 range.

@thomasjpfan thomasjpfan changed the title FIX: Add warning to manifold.trustworthiness FIX Raise error when n_neighbours >= n_samples / 2 in manifold.trustworthiness Apr 5, 2022
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, otherwise LGTM

doc/whats_new/v1.1.rst Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
Micky774 and others added 4 commits April 7, 2022 13:45
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Copy link
Member
@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Micky774

@jeremiedbb jeremiedbb merged commit ade9014 into scikit-learn:main Apr 9, 2022
@taha-yassine
Copy link

Hi @Micky774 @thomasjpfan @jeremiedbb , as I mentionned in my comment, I don't think this is the correct way to handle the issue. There is indeed a bug in the code and it's an easy one to fix. On the other hand, trustworthiness doesn't require that n_neighbors be <n_samples/2. I can submit a PR to revert the wrongful (IMO) bugfix and fix the actual bug.

@thomasjpfan
Copy link
Member
thomasjpfan commented Apr 10, 2022

From the original reference paper (Footnote on Page 4 of the PDF), the author is explicit about the requirement:

For clarify we have only included the scaling for neighborhoods of size k < N/2

When running code snippet in #18567 (comment), the inverted index and ranks are:

# inverted index
[[7 4 5 6 2 1 3]
 [2 7 5 4 3 1 6]
 [4 6 7 1 5 2 3]
 [5 6 1 7 3 2 4]
 [2 5 4 6 7 3 1]
 [1 2 3 6 5 7 4]
 [2 6 4 5 1 3 7]]

# ranks
[[ 0 -3 -1 -4 -2]
 [-3  0 -4 -2 -1]
 [-1  0 -3  1 -4]
 [-3 -2 -1 -4  0]
 [-2 -1 -4 -3  1]
 [ 0  1 -2 -4 -1]
 [-4  0 -2 -1 -3]]

which do not overflow with the sum. Furthermore, if running on 64-bit Python, the inverted_index is already int64:

import numpy as np
inverted_index = np.zeros((10, 2), dtype=int)
print(inverted_index.dtype)
# int64

@Micky774 Micky774 deleted the manifold_trust branch April 10, 2022 03:07
@taha-yassine
Copy link

From the original reference paper (Footnote on Page 4 of the PDF), the author is explicit about the requirement:

For clarify we have only included the scaling for neighborhoods of size k < N/2

When running code snippet in #18567 (comment), the inverted index and ranks are:

# inverted index
[[7 4 5 6 2 1 3]
 [2 7 5 4 3 1 6]
 [4 6 7 1 5 2 3]
 [5 6 1 7 3 2 4]
 [2 5 4 6 7 3 1]
 [1 2 3 6 5 7 4]
 [2 6 4 5 1 3 7]]

# ranks
[[ 0 -3 -1 -4 -2]
 [-3  0 -4 -2 -1]
 [-1  0 -3  1 -4]
 [-3 -2 -1 -4  0]
 [-2 -1 -4 -3  1]
 [ 0  1 -2 -4 -1]
 [-4  0 -2 -1 -3]]

which do not overflow with the sum. Furthermore, if running on 64-bit Python, the inverted_index is already int64:

import numpy as np
inverted_index = np.zeros((10, 2), dtype=int)
print(inverted_index.dtype)
# int64

You are actually right @thomasjpfan, k should be <N/2. However, I still have the int32 issue. Running your code snippet returns int32 in my case although using a 64-bit Python interpreter. It seems like this is an issue related to how windows handles the long int type as discussed here. How do we take it from here? Should I open a new issue to further discuss the details or shoud I send a PR directly to force the use of int64?

@thomasjpfan
Copy link
Member

Should I open a new issue to further discuss the details or shoud I send a PR directly to force the use of int64?

The first step is to open an issue describing the issue with using dtype=int. Since we use dtype=int in other parts of the code base, we may want to change it everywhere so the code behaviors the same on different platforms.

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…rthiness (scikit-learn#23033)

Co-authored-by: Shao Yang Hong <hongsy2006@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: sklearn.manifold.turstworthiness outputs values larger than 1
5 participants