KMeans processing n_init sequentially!! #23366

Closed
ShivKJ opened this issue May 13, 2022 · 1 comment
Labels: Needs Triage (issue requires triage)

ShivKJ commented May 13, 2022

Hi,

I was looking into the KMeans code and found that the following loop could be parallelized: each iteration of the for loop is an independent run, so the n_init runs could be processed concurrently. I expect this to reduce the runtime. Please check.

for i in range(self._n_init):
    # Initialize centers
    centers_init = self._init_centroids(
        X, x_squared_norms=x_squared_norms, init=init, random_state=random_state
    )
    if self.verbose:
        print("Initialization complete")
    # run a k-means once
    labels, inertia, centers, n_iter_ = kmeans_single(
        X,
        sample_weight,
        centers_init,
        max_iter=self.max_iter,
        verbose=self.verbose,
        tol=self._tol,
        x_squared_norms=x_squared_norms,
        n_threads=self._n_threads,
    )
    # determine if these results are the best so far
    # we chose a new run if it has a better inertia and the clustering is
    # different from the best so far (it's possible that the inertia is
    # slightly better even if the clustering is the same with potentially
    # permuted labels, due to rounding errors)
    if best_inertia is None or (
        inertia < best_inertia
        and not _is_same_clustering(labels, best_labels, self.n_clusters)
    ):
        best_labels = labels
        best_centers = centers
        best_inertia = inertia
        best_n_iter = n_iter_

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 13, 2022
glemaitre (Member) commented

Have a look at the following blog post, which explains the current optimization and shows where parallelization is currently done in the code: https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23/

The associated PR is #11950.

Optimal parallelization is sometimes not as easy as it looks.
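For readers who want to see what restart-level parallelism would look like, here is a minimal pure-Python sketch of the idea proposed in the issue: run each initialization independently and keep the run with the lowest inertia. It is not scikit-learn's code. The names `kmeans_single_run`, `best_of_restarts`, and `map_fn` are hypothetical, and the toy Lloyd iteration stands in for the real `kmeans_single`.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_single_run(args):
    """One toy Lloyd run from a seeded random init (hypothetical helper).

    Returns (inertia, centers, labels).
    """
    points, k, seed, max_iter = args
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(
        squared_distance(p, centers[lab]) for p, lab in zip(points, labels)
    )
    return inertia, centers, labels


def best_of_restarts(points, k, seeds, max_iter=100, map_fn=map):
    """Run one independent k-means per seed and keep the lowest inertia.

    map_fn is any map-like callable: the built-in map runs the restarts
    sequentially, while an executor's .map runs them concurrently.
    """
    tasks = [(points, k, seed, max_iter) for seed in seeds]
    results = list(map_fn(kmeans_single_run, tasks))
    return min(results, key=lambda r: r[0])


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    with ThreadPoolExecutor(max_workers=2) as ex:
        inertia, centers, labels = best_of_restarts(
            pts, 2, seeds=range(4), map_fn=ex.map
        )
    print(inertia, labels)
```

Because each restart depends only on its own seed, the sequential and concurrent versions return the same result. In pure Python a thread pool will not speed this up because of the GIL, so a process pool's `map` could be passed instead. Note also that the loop quoted in the issue already passes `n_threads` to each single run, so adding a second, outer level of parallelism on top would compete for the same cores, which may be part of why it is "not as easy as it looks".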
