
Add get_n_leaves() and get_max_depth() to DecisionTrees #12300

Merged
merged 10 commits into scikit-learn:master from adrinjalali:feature/treefuncs on Oct 10, 2018

Conversation

adrinjalali (Member)

See #7613 (it technically fixes the original issue, but more discussion is going on there).

To count the leaves, I'm checking left AND right children of each node, which should be the case in a general tree. But in most of the tests, only the left or right children are checked, not both. I suppose it doesn't happen that only one of the children is set and not the other, but it can still be a potential source of a bug, I think.
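For illustration, a minimal sketch (my paraphrase, not the PR's actual implementation) of counting leaves that way on the fitted Cython tree, treating a node as a leaf only when both children are unset (TREE_LEAF, i.e. -1):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

tree = clf.tree_
# children_left/children_right hold -1 (TREE_LEAF) where a child is absent.
is_leaf = np.logical_and(tree.children_left == -1,
                         tree.children_right == -1)
print(int(is_leaf.sum()))  # number of leaves
print(tree.max_depth)      # the depth already stored on the Cython Tree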

I couldn't really come up with a nice test for the two functions, so I've just replaced the equivalent ones in one of the existing tests. We could replace more of those in tests though.

sklearn/tree/tree.py (outdated)
on it, including :class:`tree.DecisionTreeClassifier`,
:class:`tree.DecisionTreeRegressor`, :class:`tree.ExtraTreeClassifier`,
and :class:`tree.ExtraTreeRegressor`.
:issue:`7613` by :user:`Adrin Jalali <adrinjalali>`.
Member

:issue: should link to your PR #12300, not the actual issue.

Member Author

Nah, not really. There seem to be different preferences on this one, and I personally prefer to link to the issues and not the PRs. If you check the v0.20 whats_new page, there are 352 cases of links with "/issue/" in them, and only 1 with "/pull/".

Member

And if you click on most of them, those /issue/ links actually lead to a PR. It's misleading. I did the same before (referencing an issue number), but was corrected by @glemaitre.

Member

Confusing, but :issue: should link to the PR (the PR will link to the issue).

Member

The advantages of linking to the PR are:

  • It's easier for us to track the difference between the change log and the commit log
  • It's easier to find your way to all the relevant history

sloria's sphinx-issues library, which we vendor here, now supports a separate :pr: role, so we could upgrade and start using that.
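For reference, a hedged conf.py sketch of what enabling a :pr:-capable sphinx-issues could look like (illustrative only; scikit-learn vendors its own copy of the extension rather than configuring the upstream package like this):

extensions = [
    "sphinx_issues",  # sloria's extension providing the :issue:, :pr: and :user: roles
    # ... the project's other extensions
]

# Both :issue:`7613` and :pr:`12300` would then link into this repository.
issues_github_path = "scikit-learn/scikit-learn"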

Member Author

So should it be something like:

- |Feature| :issue:`7613` ``get_n_leaves()`` and ``get_max_depth()`` have been added to
  :class:`tree.BaseDecisionTree` and consequently all estimators based
  on it, including :class:`tree.DecisionTreeClassifier`,
  :class:`tree.DecisionTreeRegressor`, :class:`tree.ExtraTreeClassifier`,
  and :class:`tree.ExtraTreeRegressor`.
  :pr:`12300` by :user:`Adrin Jalali <adrinjalali>`.

@@ -1233,8 +1232,7 @@ def test_max_leaf_nodes_max_depth():
     k = 4
     for name, TreeEstimator in ALL_TREES.items():
         est = TreeEstimator(max_depth=1, max_leaf_nodes=k).fit(X, y)
-        tree = est.tree_
-        assert_greater(tree.max_depth, 1)
+        assert_greater(est.get_max_depth(), 1)
Member

I don't get this test. Why should it be greater than max_depth?

Member Author

lol, don't hate me, I just changed a test to use the new functions. But I suppose that test is valid because X and y are manually designed in a way that we know the tree will be deeper if allowed, and this makes sure max_depth is ignored by the estimator if max_leaf_nodes is given. At least that's what the comment says.

Member

ohh... I didn't read the comment. That behavior is... unexpected to me? Should we raise a warning in that case? Do we?

Member

Also, there is no test that shows that the depth can be smaller than max_depth, right? We should have some slightly more explicit test, I think.
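Something along these lines, perhaps (a hedged sketch of the kind of test being asked for, using the final get_depth() name; not the test actually added in this PR):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def test_depth_can_be_smaller_than_max_depth():
    X, y = load_iris(return_X_y=True)

    # With a generous cap the fitted tree stays much shallower than max_depth.
    deep = DecisionTreeClassifier(random_state=0, max_depth=50).fit(X, y)
    assert deep.get_depth() < 50

    # With a tight cap (and no max_leaf_nodes) the cap is respected exactly.
    shallow = DecisionTreeClassifier(random_state=0, max_depth=1).fit(X, y)
    assert shallow.get_depth() == 1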

Member Author

@amueller, this actually seems like a bug to me, which I think can be fixed this way.

On master/this branch, we have the following:

>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
2
>>> clf.get_n_leaves()
3
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
5
>>> clf.get_n_leaves()
9

This shows that max_depth is not actually ignored; it just overshoots by 1. And the issue is here:

is_leaf = (depth > self.max_depth or
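(A hedged illustration of the off-by-one, not the actual Cython code: in the best-first builder, depth is the depth of the node being considered for splitting, so with a strict > a node sitting exactly at max_depth can still be split and its children end up at max_depth + 1. The proposed fix presumably tightens that comparison.)

MAX_DEPTH = 1

def is_leaf_buggy(depth):
    # strict comparison: a node already at max_depth is still split
    return depth > MAX_DEPTH

def is_leaf_fixed(depth):
    # presumed fix: nodes at max_depth are not split any further
    return depth >= MAX_DEPTH

print(is_leaf_buggy(1))  # False -> the depth-1 node gets split, depth becomes 2
print(is_leaf_fixed(1))  # True  -> the tree stops at depth 1, matching the DFS builder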

After the proposed fix, we'd have:

>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_iris(return_X_y=True)
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
5
>>> clf.get_n_leaves()
9

which then makes sense. I think that change needs a Bug Fix entry in whats_new, since it changes the behavior. Should I include it in this PR or create a new one?

sklearn/tree/tree.py
@jnothman (Member) commented Oct 8, 2018

Regarding the "overshoot", can we just be more explicit that the depth excludes the root...?

@adrinjalali (Member Author)

@jnothman, we could, but the issue is that we're not consistent at the moment. The DFS and BFS expansions don't agree with each other. BFS overshoots, DFS doesn't. The output of the following two cases should be the same:

>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1, max_leaf_nodes=100)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
2
>>> clf.get_n_leaves()
3
>>> 
>>> clf = DecisionTreeClassifier(random_state=0, max_depth=1)
>>> clf = clf.fit(X, y)
>>> clf.get_max_depth()
1
>>> clf.get_n_leaves()
2

@jnothman (Member) commented Oct 8, 2018 via email

@adrinjalali (Member Author)

I'm not sure how it's done here, but it's arguably a bug, and some bugfixes make things backward incompatible, don't they? We could raise a user warning if they reach that part of the code and tell them it's changed.

@jnothman (Member) commented Oct 8, 2018 via email

@jnothman (Member) commented Oct 8, 2018 via email

@adrinjalali (Member Author)

@jnothman:

Yes, we break for bug fixes, although we still try to limit the damage and alert the user, or else make multiple breaking changes at the same time. Maybe it's not so bad if it's only BFS and most users are getting DFS?

I have no idea how people use it. It's the BFS builder if they set max_leaf_nodes. Anyhow, should we then get this one in, since it's not about that issue, and discuss/test the bug in another PR?

@@ -56,6 +56,15 @@ Support for Python 3.4 and below has been officially dropped.
order for further speed performances. :issue:`12251` by `Tom Dupre la Tour`_.


:mod:`sklearn.tree`
...................
- |Feature| ``get_n_leaves()`` and ``get_max_depth()`` have been added to
Member

get_depth

@@ -108,6 +108,19 @@ def __init__(self,
self.class_weight = class_weight
self.presort = presort

def get_depth(self):
"""Returns the depth of the decision tree, i.e. the maximum

@amueller (Member) left a comment

Looks good apart from docstring nitpick.

@jnothman (Member) left a comment

Otherwise LGTM

def get_depth(self):
"""Returns the depth of the decision tree.

The depth of a tree is defined as the maximum depth of its leaves.
Member

Maybe more explicit: "maximum number of nodes between the root and any leaf, inclusive of endpoints."

Member Author

It's not inclusive of both endpoints though. The depth is correctly reported as 1 for a tree with a root and two leaves.
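A quick check of that wording (assuming the renamed get_depth()): a decision stump, i.e. a root with two leaf children, reports a depth of 1, so the count cannot be inclusive of both endpoints.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(stump.get_depth())     # 1
print(stump.get_n_leaves())  # 2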

Member

Good that I clarified then. "The number of branching nodes from the root to the leaf", then?

Member Author

That may be more confusing, no? How about "maximum distance between the root and any leaf"? I had changed it, but the push didn't go through and I didn't notice.

@jnothman (Member)

I guess that is clear enough, thanks

@jnothman jnothman merged commit a80bbd9 into scikit-learn:master Oct 10, 2018
@adrinjalali adrinjalali deleted the feature/treefuncs branch October 10, 2018 11:46
@amueller (Member)

yay!
