Add get_n_leaves() and get_max_depth() to DesicionTrees #12300
Conversation
doc/whats_new/v0.21.rst (Outdated)

    on it, including :class:`tree.DecisionTreeClassifier`,
    :class:`tree.DecisionTreeRegressor`, :class:`tree.ExtraTreeClassifier`,
    and :class:`tree.ExtraTreeRegressor`.
    :issue:`7613` by :user:`Adrin Jalali <adrinjalali>`.
issue should link to your PR #12300, not the actual issue
nah, not really. There seem to be differing preferences on this one; I personally prefer to link to the issues rather than the PRs, and if you check the v0.20 whats_new page, there are 352 cases of links with "/issue/" in them and only 1 with "/pull/".
And if you click on most of them, those /issue/ links actually lead to a PR. It's misleading; I did the same before (referenced an issue number), but was corrected by @glemaitre.
Confusing, but :issue: should link to the PR (the PR will link to the issue).
The advantage of linking to the PR is:
- It's easier for us to track the difference between the change log and the commit log
- it's easier to find your way to all relevant history
sloria's sphinx-issues library, which we vendor here, recently added a separate :pr: role, so we could upgrade and start using that.
So should it be something like:
- |Feature| :issue:`7613` ``get_n_leaves()`` and ``get_max_depth()`` have been added to
:class:`tree.BaseDecisionTree` and consequently all estimators based
on it, including :class:`tree.DecisionTreeClassifier`,
:class:`tree.DecisionTreeRegressor`, :class:`tree.ExtraTreeClassifier`,
and :class:`tree.ExtraTreeRegressor`.
:pr:`12300` by :user:`Adrin Jalali <adrinjalali>`.
sklearn/tree/tests/test_tree.py (Outdated)

@@ -1233,8 +1232,7 @@ def test_max_leaf_nodes_max_depth():
     k = 4
     for name, TreeEstimator in ALL_TREES.items():
         est = TreeEstimator(max_depth=1, max_leaf_nodes=k).fit(X, y)
-        tree = est.tree_
-        assert_greater(tree.max_depth, 1)
+        assert_greater(est.get_max_depth(), 1)
I don't get this test. Why should it be greater than `max_depth`?
lol, don't hate me, I just changed a test to use the new functions. But I suppose that test is valid because `X` and `y` are manually designed in a way that we know the tree will be deeper if allowed, and this makes sure `max_depth` is ignored by the estimator if `max_leaf_nodes` is given. At least that's what the comment says.
ohh... I didn't read the comment. That behavior is... unexpected to me? Should we raise a warning in that case? Do we?
Also, there is no test that shows that the depth can be smaller than `max_depth`, right? We should have some slightly more explicit test, I think.
@amueller, this actually seems like a bug to me, which I think can be fixed this way.
On master/this branch, we have the following:
>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
2
>>> clf.get_n_leaves()
3
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
5
>>> clf.get_n_leaves()
9
Which shows that `max_depth` is not actually ignored; it's just an overshoot by 1. And the issue is here:
scikit-learn/sklearn/tree/_tree.pyx, line 453 (at 3804ccd):

    is_leaf = (depth > self.max_depth or
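To make the off-by-one concrete, here is a hypothetical toy sketch (not scikit-learn's actual Cython builder): a breadth-first expansion whose leaf test is passed in as a parameter, showing how `>` versus `>=` in that condition shifts the final depth by one. The function name and structure are made up for illustration.

```python
from collections import deque

def grown_depth(max_depth, leaf_test):
    """Expand a full binary tree breadth-first, stopping at nodes where
    leaf_test(depth, max_depth) is true; return the deepest node's depth."""
    deepest = 0
    queue = deque([0])  # start at the root, which sits at depth 0
    while queue:
        depth = queue.popleft()
        deepest = max(deepest, depth)
        if not leaf_test(depth, max_depth):
            queue.append(depth + 1)  # left child
            queue.append(depth + 1)  # right child
    return deepest

# The buggy condition lets nodes at depth == max_depth split once more:
print(grown_depth(1, lambda d, m: d > m))   # -> 2 (overshoots by one)
# The proposed fix stops splitting exactly at max_depth:
print(grown_depth(1, lambda d, m: d >= m))  # -> 1
```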
After the proposed fix, we'd have:
>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>>
>>> clf = DecisionTreeClassifier(random_state=0, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
5
>>> clf.get_n_leaves()
9
which then makes sense. I think that change needs a `Bug Fix` entry in whats_new, since it changes the behavior? And should I include it in this PR or create a new one?
Regarding the "overshoot", can we just be more explicit that the depth excludes the root...?
@jnothman, we could, but the issue is that we're not consistent at the moment. The DFS and BFS expansions don't agree with each other: BFS overshoots, DFS doesn't. The output of the following two cases should be the same:
I see. I'm not sure that there's a neat (i.e. backwards compatible) way to change this behaviour :(
I'm not sure how it's done here, but it's arguably a bug, and some bugfixes make things backward incompatible, don't they? We could raise a user warning if they reach that part of the code and tell them it's changed.
Yes, we break for bug fixes, although we still try to limit the damage and alert the user, or else make multiple breaking changes at the same time. Maybe it's not so bad if it's only BFS and most users are getting DFS?
I suppose predictions may not be much affected either, actually.
I have no idea how people use it. It's the BFS if they set …
doc/whats_new/v0.21.rst (Outdated)

@@ -56,6 +56,15 @@ Support for Python 3.4 and below has been officially dropped.
     order for further speed performances. :issue:`12251` by `Tom Dupre la Tour`_.

 :mod:`sklearn.tree`
 ...................

 - |Feature| ``get_n_leaves()`` and ``get_max_depth()`` have been added to
get_depth
sklearn/tree/tree.py (Outdated)

@@ -108,6 +108,19 @@ def __init__(self,
        self.class_weight = class_weight
        self.presort = presort

    def get_depth(self):
        """Returns the depth of the decision tree, i.e. the maximum
PEP 257 (https://www.python.org/dev/peps/pep-0257/#multi-line-docstrings): no line breaks in the first line of a docstring.
Looks good apart from docstring nitpick.
Otherwise LGTM
sklearn/tree/tree.py (Outdated)

    def get_depth(self):
        """Returns the depth of the decision tree.

        The depth of a tree is defined as the maximum depth of its leaves.
Maybe more explicit: "maximum number of nodes between the root and any leaf, inclusive of endpoints."
It's not inclusive of both endpoints though. The depth is correctly reported as 1 for a tree with a root and two leaves.
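To illustrate the convention being discussed (a root with two leaf children has depth 1, not 2), here is a hypothetical sketch that counts edges from the root down. The children arrays mimic scikit-learn's `tree_.children_left`/`tree_.children_right` attributes, with -1 marking a missing child; the function name is made up for illustration.

```python
LEAF = -1  # sentinel for "no child", as in scikit-learn's tree arrays

def tree_depth(children_left, children_right, node=0):
    """Return the maximum number of edges from `node` down to any leaf."""
    if children_left[node] == LEAF and children_right[node] == LEAF:
        return 0  # a lone leaf contributes no edges
    return 1 + max(
        tree_depth(children_left, children_right, children_left[node]),
        tree_depth(children_left, children_right, children_right[node]),
    )

# Root (node 0) with two leaves (nodes 1 and 2): depth is 1, not 2.
print(tree_depth([1, LEAF, LEAF], [2, LEAF, LEAF]))  # -> 1
```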
Good that I clarified then. "The number of branching nodes from the root to the leaf", then?
Wouldn't that be more confusing, though? How about "maximum distance between the root and any leaf"? I had changed it, but the push didn't go through and I didn't notice.
I guess that is clear enough, thanks
yay!
See #7613 (it technically fixes the original issue, but more discussion's going on there)
To count the leaves, I'm checking left AND right children of each node, which should be the case in a general tree. But in most of the tests, only the left or right children are checked, not both. I suppose it doesn't happen that only one of the children is set and not the other, but still, it can be a potential source of bugs, I think.
I couldn't really come up with a nice test for the two functions, so I've just replaced the equivalent ones in one of the existing tests. We could replace more of those in tests though.
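The both-children check described above could be sketched like this. This is a hypothetical plain-Python stand-in, not the PR's actual code: the lists mimic scikit-learn's `tree_.children_left`/`tree_.children_right` arrays, where the sentinel -1 marks an absent child, and the function name is made up.

```python
LEAF = -1  # sentinel for "no child", as in scikit-learn's tree arrays

def count_leaves(children_left, children_right):
    """Count nodes with no children at all."""
    # A node is a leaf when BOTH children are absent; checking both sides
    # guards against a malformed tree where only one child is set.
    return sum(
        left == LEAF and right == LEAF
        for left, right in zip(children_left, children_right)
    )

# A root (node 0) whose children are the leaves at nodes 1 and 2:
children_left = [1, LEAF, LEAF]
children_right = [2, LEAF, LEAF]
print(count_leaves(children_left, children_right))  # -> 2
```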