DBSCAN seems not to use multiple processors (n_jobs argument ignored) #8003
Comments
In #4009, we failed to find an implementation in which parallelism in
radius_neighbors for the spatial trees was effectively faster. Perhaps this
needs further experimentation.
On 8 December 2016 at 06:08, Andreas Mueller wrote:
you can set algorithm="brute" to use multiple cores but that will
probably make it slower. The neighbors module decides it wants to use a
tree, which we haven't parallelized yet.
How many cores do you have? And can you report times for the default
setting and for algorithm="brute"?
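Andreas's suggestion above can be sketched as follows. This is a minimal, hypothetical comparison (with a much smaller dataset than the report's 1,000,000 samples, so it runs quickly); it assumes only the public `DBSCAN` parameters already mentioned in this thread (`algorithm`, `n_jobs`).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset in the same shape as the reproduction below.
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=2000, centers=centers, cluster_std=0.4,
                  random_state=0)
X = StandardScaler().fit_transform(X)

# Default: the neighbors module picks a tree-based index, which (per the
# comments above) had not been parallelized at the time of this issue,
# so n_jobs is effectively ignored.
labels_tree = DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X).labels_

# Forcing brute-force neighbor search lets n_jobs fan the pairwise
# distance computation out across cores, though it may still be slower
# overall than a single-threaded tree query.
labels_brute = DBSCAN(eps=0.3, min_samples=10, algorithm="brute",
                      n_jobs=-1).fit(X).labels_

# Both settings compute exact radius neighbors, so the clustering
# should be identical.
print(np.array_equal(labels_tree, labels_brute))
```

Wrapping each `fit` call in a timer (e.g. `time.perf_counter`) is the comparison Andreas asked for.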
@pfaucon, a clarification in the documentation is welcome. Please submit a pull request. Otherwise I'm closing this as something we can't do much about.
Actually, as you seem to be requesting documentation changes, I'll leave it open and you or someone else can contribute a fix.
Hi, I'm new to scikit-learn, but I'd like to contribute to this.
Go ahead.
Is this issue still open? I'm new to scikit-learn and would like to try.
@kushagraagrawal no, PR at #8039 |
Just wondering if there's any update on this issue. |
Description
DBSCAN seems not to use multiple processors (n_jobs argument ignored)
It looks like DBSCAN hands its arguments off to the nearest-neighbors machinery, but the neighbors module only uses the n_jobs argument for certain algorithms (presumably not the ones DBSCAN selects by default). It would be good to document how to set things up so that n_jobs takes effect, and possibly to change the default values to make it useful.
Steps/Code to Reproduce
Code taken from:
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=1000000, centers=centers,
                            cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X)
Expected Results
The answer is correct, but the work should be split between processors and the time consumed should be significantly less.
Actual Results
It seems to run on only one processor.
Versions