-
-
Notifications
You must be signed in to change notification settings - Fork 25.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX Raise error when n_neighbours >= n_samples / 2 in manifold.trustworthiness #23033
Conversation
# Conflicts: # sklearn/manifold/tests/test_t_sne.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
In this case, I prefer to error and have this be a bug fix, because the metric seems meaningless for n_neighbors >= n_samples / 2
.
manifold.trustworthiness
manifold.trustworthiness
According to #18567 (comment), the metric is only valid when |
I agree, an error seems more appropriate |
From the referenced paper (Page 4 of the PDF), there is a footnote that states the bounds for
|
My bad, actually k < n/2 is stricter than 2n - 3k - 1 > 0, so there's no risk of having the result be > 1. I'm ok with the k < n/2 range. |
manifold.trustworthiness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @Micky774
Hi @Micky774 @thomasjpfan @jeremiedbb , as I mentionned in my comment, I don't think this is the correct way to handle the issue. There is indeed a bug in the code and it's an easy one to fix. On the other hand, trustworthiness doesn't require that |
From the original reference paper (Footnote on Page 4 of the PDF), the author is explicit about the requirement:
When running code snippet in #18567 (comment), the inverted index and ranks are: # inverted index
[[7 4 5 6 2 1 3]
[2 7 5 4 3 1 6]
[4 6 7 1 5 2 3]
[5 6 1 7 3 2 4]
[2 5 4 6 7 3 1]
[1 2 3 6 5 7 4]
[2 6 4 5 1 3 7]]
# ranks
[[ 0 -3 -1 -4 -2]
[-3 0 -4 -2 -1]
[-1 0 -3 1 -4]
[-3 -2 -1 -4 0]
[-2 -1 -4 -3 1]
[ 0 1 -2 -4 -1]
[-4 0 -2 -1 -3]] which do not overflow with the sum. Furthermore, if running on 64-bit Python, the import numpy as np
inverted_index = np.zeros((10, 2), dtype=int)
print(inverted_index.dtype)
# int64 |
You are actually right @thomasjpfan, k should be <N/2. However, I still have the int32 issue. Running your code snippet returns |
The first step is to open an issue describing the issue with using |
…rthiness (scikit-learn#23033) Co-authored-by: Shao Yang Hong <hongsy2006@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Reference Issues/PRs
Resolves #18832 (stalled)
Fixes #18567
What does this implement/fix? Explain your changes.
PR #18832: Added warning to
manifold.trustworthiness
whenn_neighbors > n_features
.This PR: Improved tests and wording and addressed reviewer comments.
Any other comments?