Regularisation for Decision Tree. #1296
Conversation
Early stopping in the literature is the process of calculating the error on the validation set before each split, and only carrying out the split if the error drops by more than a threshold. However, for implementing post-pruning, what would be best to implement? The minimum-error tree, the smallest tree, or a weighted sum of the two? (All of these would be evaluated on the CV set.)
Thanks for the contribution. If you can fix the style and build issues and handle the comments, I think this is close to ready. Also, if you can add a test to decision_tree_test.cpp to ensure that the regularization is working, I think that would also be a good idea.
For implementing post-pruning, what shall be the best to implement? Minimum error tree or smallest tree or the weighted sum of the two? (All these will be performed on the CV set).
Let's not go with just one---when you implement that (and I think maybe a separate PR would be a good idea since it will be a big change), you should add a template parameter to the decision tree, so that users can decide which post-pruning algorithm they want.
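To make the idea concrete, here is a minimal sketch of the policy-template approach (all names here are hypothetical illustrations, not mlpack's actual API):

```cpp
#include <cstddef>

// Hypothetical pruning policies: each exposes a static Prune() with the same
// signature, so the tree can select the strategy at compile time.
struct MinimumErrorPruning
{
  template<typename TreeType, typename MatType, typename LabelsType>
  static void Prune(TreeType& /* tree */,
                    const MatType& /* cvData */,
                    const LabelsType& /* cvLabels */)
  {
    // Collapse any subtree whose removal does not increase CV error.
  }
};

struct SmallestTreePruning
{
  template<typename TreeType, typename MatType, typename LabelsType>
  static void Prune(TreeType&, const MatType&, const LabelsType&)
  {
    // Prefer the smallest tree whose CV error is within tolerance of the best.
  }
};

// The tree takes the policy as a template parameter and invokes it after
// training; FitnessFunction is kept only for symmetry with the real class.
template<typename FitnessFunction, typename PruningStrategy = MinimumErrorPruning>
class DecisionTree
{
 public:
  template<typename MatType, typename LabelsType>
  void Train(const MatType& data, const LabelsType& labels,
             const MatType& cvData, const LabelsType& cvLabels)
  {
    // ... grow the tree on (data, labels), then prune on the CV set ...
    PruningStrategy::Prune(*this, cvData, cvLabels);
  }
};
```

Users would then pick the post-pruning algorithm via the template argument, with no runtime dispatch.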
```diff
@@ -533,7 +542,7 @@ void DecisionTree<FitnessFunction,

   // Look through the list of dimensions and obtain the gain of the best split.
   // We'll cache the best numeric and categorical split auxiliary information in
-  // numericAux and categoricalAux (and clear them later if we make not split),
+  // numericAux and categoricalAux (and clear them later if we make no split),
```
Nice catch, thanks. :)
```diff
@@ -39,7 +39,8 @@ PROGRAM_INFO("Decision tree",
     "may not be specified when the " + PRINT_PARAM_STRING("training") + " "
     "parameter is specified. The " + PRINT_PARAM_STRING("minimum_leaf_size") +
     " parameter specifies the minimum number of training points that must fall"
-    " into each leaf for it to be split. If " +
+    " into each leaf for it to be split. The " + PRINT_PARAM_STRING("minimum_gain_split") +
+    " parameter specifies the minimum gain that is needed for the node to split. If " +
```
I think it might be useful to add a little intuition here---maybe we should point out that a larger minimum_gain_split is a form of regularization?
I think adding this intuition to the header file, in the function documentation, would be better, just like how it is done for the other parameters.
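For illustration, the header documentation could carry that intuition roughly like this (the wording is a sketch only, not the final text):

```cpp
/**
 * ...
 * @param minimumLeafSize Minimum number of points in each leaf node.
 * @param minimumGainSplit Minimum gain required for a node to split; larger
 *     values act as a form of regularization, producing smaller trees that
 *     are less likely to overfit the training data.
 */
```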
```diff
@@ -430,6 +436,7 @@ void DecisionTree<FitnessFunction,
     const size_t numClasses,
     WeightsType&& weights,
     const size_t minimumLeafSize,
+    const double minimumGainSplit = 1e-7,
```
I think it's not actually legal to set a default parameter when it's already been specified in the declaration, so no need for the = 1e-7 here.
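For context, a minimal self-contained example of the C++ rule being referenced: a default argument may be specified in the declaration or in the definition, but not in both.

```cpp
class DecisionTree
{
 public:
  // Declaration: the default argument is specified here, once.
  void Train(const double minimumGainSplit = 1e-7);
};

// Out-of-line definition: writing "= 1e-7" again here would be ill-formed,
// because a default argument may only be specified once for a parameter.
void DecisionTree::Train(const double /* minimumGainSplit */)
{
  // ...
}
```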
```diff
@@ -98,6 +102,7 @@ class DecisionTree :
    * @param numClasses Number of classes in the dataset.
    * @param weights The weight list of given label.
    * @param minimumLeafSize Minimum number of points in each leaf node.
+   * @param minimumGainSplit Minimum Gain for the node to split.
```
This is super picky (sorry!) but you can change Gain to gain here---there's no need to capitalize. :)
```diff
@@ -62,7 +62,8 @@ class AllCategoricalSplit
       const WeightVecType& weights,
       const size_t minimumLeafSize,
       arma::Col<typename VecType::elem_type>& classProbabilities,
-      AuxiliarySplitInfo<typename VecType::elem_type>& aux);
+      AuxiliarySplitInfo<typename VecType::elem_type>& aux,
+      const double minimumGainSplit);
```
This is also a bit picky, but we typically order const (input) parameters before output parameters, so I would put minimumGainSplit before classProbabilities (and similar changes through the rest of the code).
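Concretely, the suggested ordering would look roughly like this (a sketch with a simplified stand-in for the auxiliary type, not the exact mlpack signature):

```cpp
#include <armadillo>
#include <cstddef>

// Simplified stand-in for the split's auxiliary information type.
template<typename T>
struct AuxiliarySplitInfo { };

// Const inputs first (bestGain, data, weights, minimumLeafSize,
// minimumGainSplit); mutable outputs last (classProbabilities, aux).
template<typename VecType, typename WeightVecType>
double SplitIfBetter(const double bestGain,
                     const VecType& data,
                     const WeightVecType& weights,
                     const size_t minimumLeafSize,
                     const double minimumGainSplit,
                     arma::Col<typename VecType::elem_type>& classProbabilities,
                     AuxiliarySplitInfo<typename VecType::elem_type>& aux);
```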
@rcurtin

```cpp
  BOOST_REQUIRE_EQUAL(probabilitiesRegularised.n_elem, 3);
}
```

```cpp
  BOOST_REQUIRE_GT(count, 0);
```
In this test we need to make sure regularization is working (note that this is different from the main tests, where we only need to make sure that the option is being properly passed to the code), but it looks like this test just checks whether the predictions produced by the regularized tree differ from those of the non-regularized tree.
I think a more effective test might be to make sure that a tree built with a high minimum gain split has fewer nodes in it than a tree with a low minimum gain split. What do you think?
We could also test that the accuracy of a regularized tree on a training set is higher, but I am not sure that will always be true, so I might be hesitant about that testing strategy.
In fact, for this test here, I think it would be worthwhile to adapt it and move it to main_tests/decision_tree_test.cpp.
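A sketch of what that node-count test might look like (assuming the tree exposes NumChildren()/Child() accessors and the constructor ordering added in this PR; the test name and thresholds are hypothetical):

```cpp
#include <mlpack/methods/decision_tree/decision_tree.hpp>
#include <boost/test/unit_test.hpp>

using namespace mlpack::tree;

// Count the nodes of a tree recursively.
template<typename TreeType>
size_t CountNodes(const TreeType& tree)
{
  size_t count = 1;
  for (size_t i = 0; i < tree.NumChildren(); ++i)
    count += CountNodes(tree.Child(i));
  return count;
}

BOOST_AUTO_TEST_CASE(MinimumGainSplitTreeSizeTest)
{
  arma::mat data(10, 1000, arma::fill::randu);
  arma::Row<size_t> labels(1000);
  for (size_t i = 0; i < 1000; ++i)
    labels[i] = i % 3; // Three classes.

  // Low threshold: the tree grows nearly unconstrained.
  DecisionTree<> lowGainTree(data, labels, 3, 1 /* minimumLeafSize */, 1e-7);
  // High threshold: most candidate splits are rejected, so the tree is smaller.
  DecisionTree<> highGainTree(data, labels, 3, 1, 0.5);

  BOOST_REQUIRE_LT(CountNodes(highGainTree), CountNodes(lowGainTree));
}
```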
I agree that it is not a strict test for regularisation. I considered this hypothesis: since the unregularised tree gives 100% accuracy on the training data, the fact that count > 0 means that the tree is regularised and contains fewer nodes. Had it not been pruned to fewer nodes, it couldn't have count > 0. So what you said is indirectly tested; this is the reason I adopted this strategy.
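A self-contained sketch of that reasoning (variable names mirror the test under discussion; the constructor ordering is the one added in this PR):

```cpp
arma::mat data(10, 500, arma::fill::randu);
arma::Row<size_t> labels(500);
for (size_t i = 0; i < 500; ++i)
  labels[i] = i % 2;

// Unregularised: minimumLeafSize = 1 and a negligible gain threshold, so the
// tree can fit the training data perfectly.
DecisionTree<> tree(data, labels, 2, 1, 1e-7);
// Regularised: a large gain threshold suppresses most splits.
DecisionTree<> regularisedTree(data, labels, 2, 1, 0.5);

arma::Row<size_t> predictions, predictionsRegularised;
tree.Classify(data, predictions);                       // 100% training accuracy.
regularisedTree.Classify(data, predictionsRegularised);

// If the regularised tree were identical (no splits suppressed), every
// prediction would match and count would be 0; count > 0 implies pruning.
size_t count = 0;
for (size_t i = 0; i < predictions.n_elem; ++i)
  if (predictions[i] != predictionsRegularised[i])
    ++count;

BOOST_REQUIRE_GT(count, 0);
```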
Ah, clever. I can agree with that. 👍
I guess this should be good to go, then? Any views, @rcurtin?
Agreed, but I do still think it would be useful to add a similar test to main_tests/decision_tree_test.cpp as paranoia, just to ensure the regularization is being passed down properly.
Sure, I will add it then.
3df9216 to 66c7ac4 (Compare)
@rcurtin
Thanks, this looks good to me. The static code analysis issues don't need to be worried about---that is intentional integer division.
I'll go ahead and merge this in 4 days to leave time for any other feedback.
Thanks, yeah, I thought so about the static code analysis.
Looks good to me, no comment from my side.
Thanks for the contribution! :) I think that this is your first contribution, so if you want to add your name to the list of contributors in
Early stopping is a regularisation method for decision trees, i.e., it reduces overfitting. In this method, a node is prevented from splitting if the information gain is less than the given threshold.
This PR introduces the parameter minimumGainSplit to the decision tree API so that users can stop the growth of the tree when the gain is less than the threshold. Tests for early stopping and an implementation of post-pruning (another regularisation technique for decision trees) will be added in the following commits.
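A minimal usage sketch of the new parameter (assuming the constructor ordering added in this PR and the mlpack::tree namespace):

```cpp
#include <mlpack/methods/decision_tree/decision_tree.hpp>

using namespace mlpack::tree;

int main()
{
  arma::mat data(5, 500, arma::fill::randu);
  arma::Row<size_t> labels(500);
  for (size_t i = 0; i < 500; ++i)
    labels[i] = i % 2;

  // Any node whose best split improves the gain by less than 0.01 becomes a
  // leaf instead of splitting further.
  DecisionTree<> tree(data, labels, 2 /* numClasses */,
                      10 /* minimumLeafSize */, 0.01 /* minimumGainSplit */);

  arma::Row<size_t> predictions;
  tree.Classify(data, predictions);
}
```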