[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH add new function sort_graph_by_row_values #23139

Merged
merged 25 commits into from
May 11, 2022

Conversation

TomDLT
Copy link
Member
@TomDLT TomDLT commented Apr 14, 2022

Fixes #22880

This PR add a new public function called sort_by_row_values, to allow sorting precomputed sparse distance graphs before using them in estimators such as DBSCAN, TSNE, Isomap, etc.

Currently, if a precomputed sparse distance graph is not sorted by row values, the nearest neighbors code sorts the graph and raises an EfficiencyWarning. Making a public function allows users to sort the graph beforehand.

Happy to change to another name if needed. (sort_by_data, sort_graph_by_row_values, sort_csr_by_data, sort_csr_by_row_values, ... ?)

@TomDLT
Copy link
Member Author
TomDLT commented Apr 14, 2022

I also renamed _is_sorted_by_data into _is_sorted_by_row_values for consistency with the new function sort_by_row_values. Do you think it would make sense to also make the function _is_sorted_by_row_values public?

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

I think we need an entry in doc/modules/classes.rst for the new function.

sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
@thomasjpfan thomasjpfan changed the title ENH add new function sort_by_row_values ENH add new function sort_graph_by_row_values Apr 27, 2022
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last concern, otherwise this looks ready to go.

sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
@TomDLT
Copy link
Member Author
TomDLT commented Apr 29, 2022

Thanks for the comments!
Any opinion on making _is_sorted_by_row_values public as well?

@thomasjpfan
Copy link
Member
thomasjpfan commented Apr 29, 2022

Any opinion on making _is_sorted_by_row_values public as well?

I think the most common use case would be how we use it in _check_precomputed. If it is public, we would kind of force everyone to work on csr matrices because is_graph_sorted_by_row_values only works on CSRs.

graph = check_array(graph, accept_sparse="csr")
if not is_graph_sorted_by_row_values(graph):
    graph = sort_graph_by_row_values(graph)

In this case, we can not stop users from starting off with a graph that is bsr, dok or dia.

What if we put the is_graph_sorted_by_row_values check inside of sort_graph_by_row_values?

def sort_graph_by_row_values(graph, copy=False, warn_when_not_sorted=True):
    # check for accepted graph formats
    if is_graph_sorted_by_row_values(graph):
         return graph

    # Maybe always warn?
    if warn_when_not_sorted:
        warnings.warn(...)

    if graph.format != "csr":
        graph = graph.asformat("csr")
    elif copy:
        graph = graph.copy()
    
    # sort graph by rows
    return graph

This is similar to what you had before, but with the extra is_graph_sorted_by_row_values inside sort_graph_by_row_values.

@TomDLT
Copy link
Member Author
TomDLT commented May 2, 2022

What if we put the is_graph_sorted_by_row_values check inside of sort_graph_by_row_values?

I like it, see my latest commit.

sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be cool to add an entry in the dev utilities:

https://scikit-learn.org/dev/developers/utilities.html#developers-utils

doc/whats_new/v1.2.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.2.rst Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
sklearn/neighbors/_base.py Outdated Show resolved Hide resolved
Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks. Otherwise LGTM

sklearn/neighbors/tests/test_neighbors.py Outdated Show resolved Hide resolved
@glemaitre glemaitre merged commit b94bc5e into scikit-learn:main May 11, 2022
@glemaitre
Copy link
Member

Thanks @TomDLT all good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encapsulate nearest neighbor code for sort_sparse_row_values in a function
3 participants