
KMeans not running in parallel when init='random' #12949

Closed
fwillo opened this issue Jan 10, 2019 · 4 comments · Fixed by #12955
Labels: Bug, good first issue (Easy with clear instructions to resolve), help wanted, Regression
Milestone: 0.20.3

Comments

fwillo commented Jan 10, 2019

Description

Dear all,

I see a difference in the behaviour of sklearn.cluster.KMeans between init='random' and init='k-means++' when combined with n_jobs=-1 (or any value other than 1). With init='random', n_jobs=-1 and n_clusters > 1, not all CPUs are used; I monitored this with htop. With init='k-means++' all CPUs are used. Interestingly, this happens only on Linux (tested on Red Hat and Ubuntu; the Versions section below lists the Ubuntu setup). The behaviour is not observable on my Windows machine, monitored there with Task Manager.

Steps/Code to Reproduce

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from tqdm import tqdm  # to watch the behaviour as the cluster count changes

# (X, y) tuple: 60000 samples, 48 features, 8 centers
A = make_blobs(60000, 48, 8)

# i=1 runs on all cores (monitored with htop); i > 1 uses only one core
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='random').fit(A[0])

# For every i this uses all cores
for i in tqdm(range(1, 10)):
    model = KMeans(n_clusters=i, n_jobs=-1, n_init=200, max_iter=500, init='k-means++').fit(A[0])
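
As a quick cross-check without htop (a hypothetical timing comparison, not part of the original report), one can compare wall-clock time for n_jobs=1 versus n_jobs=-1; while the bug is present, n_jobs never reaches the parallel code path, so both runs with init='random' should take roughly the same time. Note that this relies on the 0.20-era n_jobs parameter, which was later removed from KMeans.

import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=20000, n_features=16, centers=4, random_state=0)

# If the n_init runs really fan out across cores, n_jobs=-1 should be
# noticeably faster than n_jobs=1 on a multi-core machine.
for n_jobs in (1, -1):
    t0 = time.time()
    KMeans(n_clusters=4, n_init=20, init='random', n_jobs=n_jobs,
           random_state=0).fit(X)
    print('n_jobs=%2d: %.2f s' % (n_jobs, time.time() - t0))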

Expected Results

No difference in core usage between 'random' and 'k-means++'.

Actual Results

With init='random', all cores are used only when n_clusters=1; for larger n_clusters only one core is used. With init='k-means++', all cores are used for every value of n_clusters.

Versions

Windows:

Could not locate executable g77
Could not locate executable f77
Could not locate executable ifort
Could not locate executable ifl
Could not locate executable f90
Could not locate executable DF
Could not locate executable efl

System:
    python: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
executable: C:\ProgramData\Miniconda3\pythonw.exe
   machine: Windows-7-6.1.7601-SP1

BLAS:
    macros: 
  lib_dirs: 
cblas_libs: cblas

Python deps:
       pip: 18.1
setuptools: 39.0.1
   sklearn: 0.20.2
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.28.2
    pandas: 0.23.4
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [blas]) or by setting
    the BLAS environment variable.
  self.calc_info()
C:\ProgramData\Miniconda3\lib\site-packages\numpy\distutils\system_info.py:625: UserWarning: 
    Blas (http://www.netlib.org/blas/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [blas_src]) or by setting
    the BLAS_SRC environment variable.
  self.calc_info()

Linux:

System:
    python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)  [GCC 7.2.0]
executable: /cluster/programs/miniconda/envs/miniconda-36/bin/python
   machine: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /cluster/programs/miniconda/envs/miniconda-36/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.0
setuptools: 38.4.0
   sklearn: 0.20.2
     numpy: 1.14.2
     scipy: 1.1.0
    Cython: 0.27.3
    pandas: 0.23.4
jnothman added this to the 0.20.3 milestone Jan 11, 2019
jnothman added the Bug, good first issue, and help wanted labels Jan 11, 2019
jnothman (Member) commented:

It looks like 40e6c43 made an incorrect change to k_means_.py: the line

if effective_n_jobs(n_jobs):

should be

if effective_n_jobs(n_jobs) == 1:
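
For context, here is a minimal, self-contained sketch of the dispatch pattern in question (one_run is a stand-in for scikit-learn's internal single-run k-means helper, not the real function). Because effective_n_jobs(-1) returns the CPU count, which is always truthy, the unqualified `if effective_n_jobs(n_jobs):` always chose the sequential branch; comparing against 1 restores the parallel path.

import numpy as np
from joblib import Parallel, delayed, effective_n_jobs

def one_run(seed):
    # placeholder for a single k-means initialisation
    rng = np.random.RandomState(seed)
    return rng.rand()

def run_n_init(n_init=8, n_jobs=-1, random_state=None):
    random_state = np.random.RandomState(random_state)
    if effective_n_jobs(n_jobs) == 1:  # the buggy version omitted `== 1`
        # sequential path: one run after another on a single core
        return [one_run(random_state.randint(np.iinfo(np.int32).max))
                for _ in range(n_init)]
    # parallel path: one seed per initialisation, fanned out over workers
    seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
    return Parallel(n_jobs=n_jobs)(delayed(one_run)(s) for s in seeds)

print(run_n_init())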

nixphix (Contributor) commented Jan 11, 2019

Let me work on this; I just need some pointers on the regression test. I found an existing test case that checks the CPU count. Should I write a similar test case for this fix?

def test_multi_output_classification_partial_fit_parallelism():
    sgd_linear_clf = SGDClassifier(loss='log', random_state=1, max_iter=5)
    mor = MultiOutputClassifier(sgd_linear_clf, n_jobs=-1)
    mor.partial_fit(X, y, classes)
    est1 = mor.estimators_[0]
    mor.partial_fit(X, y)
    est2 = mor.estimators_[0]
    if cpu_count() > 1:
        # parallelism requires this to be the case for a sane implementation
        assert est1 is not est2
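
For comparison, here is one hypothetical shape such a regression test could take. It assumes the 0.20 module layout, where KMeans lives in sklearn.cluster.k_means_, and wraps that module's Parallel with a mock so the test can assert that the joblib branch is actually entered; the test that finally landed in #12955 may look different.

from unittest import mock

from sklearn.cluster import KMeans, k_means_
from sklearn.datasets import make_blobs

def test_kmeans_random_init_uses_parallel_path():
    X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)
    # Wrap the module-level Parallel so the real work still runs, while
    # recording whether the parallel branch was entered at all.
    with mock.patch.object(k_means_, 'Parallel',
                           wraps=k_means_.Parallel) as par:
        KMeans(n_clusters=3, n_init=4, init='random', n_jobs=2,
               random_state=0).fit(X)
    assert par.called, "n_jobs=2 should take the joblib-parallel branch"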

jnothman (Member) commented Jan 11, 2019 via email

fwillo (Author) commented Jan 11, 2019

I made the correction @nixphix suggested in #12955 and can confirm that it works as intended, thanks!
