
KMeans not running in parallel when init='random' #12949

Closed
fwillo opened this issue Jan 10, 2019 · 4 comments · Fixed by #12955
Labels: Bug, good first issue (Easy with clear instructions to resolve), help wanted, Regression
Milestone: 0.20.3

Comments

fwillo commented Jan 10, 2019

Description

Dear all,

I see a difference in the behaviour of sklearn.cluster.KMeans between init='random' and init='k-means++' when combined with n_jobs=-1 (or any value other than 1). With init='random', n_jobs=-1 and n_clusters > 1, not all CPUs are used; I monitored this with htop. With init='k-means++' all CPUs are used. Interestingly, this happens only on Linux (tested on Red Hat and Ubuntu; the Versions section below lists the Ubuntu setup). The behaviour is not observable on my Windows machine, monitored there with Task Manager.

Steps/Code to Reproduce

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from tqdm import tqdm  # to watch the behaviour as the cluster count changes

# (X, y) tuple: 60000 samples, 48 features, 8 centers
A = make_blobs(60000, 48, 8)

# i=1 runs on all cores (monitored with htop); i > 1 uses only one core
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='random').fit(A[0])

# For every i this uses all cores
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='k-means++').fit(A[0])
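
As a quick cross-check without htop (a hypothetical timing comparison, not part of the original report), one can compare wall-clock time for n_jobs=1 versus n_jobs=-1; while the bug is present, n_jobs never reaches the parallel code path, so both runs with init='random' should take roughly the same time. Note that this relies on the 0.20-era n_jobs parameter, which was later removed from KMeans.

import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=20000, n_features=16, centers=4, random_state=0)

# If the n_init runs really fan out across cores, n_jobs=-1 should be
# noticeably faster than n_jobs=1 on a multi-core machine.
for n_jobs in (1, -1):
    t0 = time.time()
    KMeans(n_clusters=4, n_init=20, init='random', n_jobs=n_jobs,
           random_state=0).fit(X)
    print('n_jobs=%2d: %.2f s' % (n_jobs, time.time() - t0))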

Expected Results

No difference in core usage between 'random' and 'k-means++'.

Actual Results

With init='random', all cores are used only when n_clusters=1; for larger n_clusters only one core is used. With init='k-means++', all cores are used for every value of n_clusters.

Versions

Windows:

Could not locate executable g77
Could not locate executable f77
Could not locate executable ifort
Could not locate executable ifl
Could not locate executable f90
Could not locate executable DF
Could not locate executable efl

System:
    python: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
executable: C:\ProgramData\Miniconda3\pythonw.exe
   machine: Windows-7-6.1.7601-SP1

BLAS:
    macros: 
  lib_dirs: 
cblas_libs: cblas

Python deps:
       pip: 18.1
setuptools: 39.0.1
   sklearn: 0.20.2
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.28.2
    pandas: 0.23.4
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [blas]) or by setting
    the BLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [blas_src]) or by setting
    the BLAS_SRC environment variable.
  self.calc_info()

Linux:

System:
    python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)  [GCC 7.2.0]
executable: /cluster/programs/miniconda/envs/miniconda-36/bin/python
   machine: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /cluster/programs/miniconda/envs/miniconda-36/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.0
setuptools: 38.4.0
   sklearn: 0.20.2
     numpy: 1.14.2
     scipy: 1.1.0
    Cython: 0.27.3
    pandas: 0.23.4
jnothman added this to the 0.20.3 milestone Jan 11, 2019
jnothman added the Bug, good first issue, and help wanted labels Jan 11, 2019
jnothman (Member) commented:

It looks like 40e6c43 made an incorrect change to k_means_.py: the line

if effective_n_jobs(n_jobs):

should be

if effective_n_jobs(n_jobs) == 1:
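
For context, here is a minimal, self-contained sketch of the dispatch pattern in question (one_run is a stand-in for scikit-learn's internal single-run k-means helper, not the real function). Because effective_n_jobs(-1) returns the CPU count, which is always truthy, the unqualified `if effective_n_jobs(n_jobs):` always chose the sequential branch; comparing against 1 restores the parallel path.

import numpy as np
from joblib import Parallel, delayed, effective_n_jobs

def one_run(seed):
    # placeholder for a single k-means initialisation
    rng = np.random.RandomState(seed)
    return rng.rand()

def run_n_init(n_init=8, n_jobs=-1, random_state=None):
    random_state = np.random.RandomState(random_state)
    if effective_n_jobs(n_jobs) == 1:  # the buggy version omitted `== 1`
        # sequential path: one run after another on a single core
        return [one_run(random_state.randint(np.iinfo(np.int32).max))
                for _ in range(n_init)]
    # parallel path: one seed per initialisation, fanned out over workers
    seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
    return Parallel(n_jobs=n_jobs)(delayed(one_run)(s) for s in seeds)

print(run_n_init())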

nixphix (Contributor) commented Jan 11, 2019

Let me work on this; I just need some pointers on the regression test. I found an existing test case that checks the CPU count. Should I write a similar test case for this fix?

def test_multi_output_classification_partial_fit_parallelism():
    sgd_linear_clf = SGDClassifier(loss='log', random_state=1, max_iter=5)
    mor = MultiOutputClassifier(sgd_linear_clf, n_jobs=-1)
    mor.partial_fit(X, y, classes)
    est1 = mor.estimators_[0]
    mor.partial_fit(X, y)
    est2 = mor.estimators_[0]
    if cpu_count() > 1:
        # parallelism requires this to be the case for a sane implementation
        assert est1 is not est2
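
For comparison, here is one hypothetical shape such a regression test could take. It assumes the 0.20 module layout, where KMeans lives in sklearn.cluster.k_means_, and wraps that module's Parallel with a mock so the test can assert that the joblib branch is actually entered; the test that finally landed in #12955 may look different.

from unittest import mock

from sklearn.cluster import KMeans, k_means_
from sklearn.datasets import make_blobs

def test_kmeans_random_init_uses_parallel_path():
    X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)
    # Wrap the module-level Parallel so the real work still runs, while
    # recording whether the parallel branch was entered at all.
    with mock.patch.object(k_means_, 'Parallel',
                           wraps=k_means_.Parallel) as par:
        KMeans(n_clusters=3, n_init=4, init='random', n_jobs=2,
               random_state=0).fit(X)
    assert par.called, "n_jobs=2 should take the joblib-parallel branch"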

jnothman (Member) commented Jan 11, 2019 via email

fwillo (Author) commented Jan 11, 2019

I made the correction @nixphix suggested in #12955 and can confirm that it works as intended, thanks!
