DATA100-L23: Decision Trees

Multiclass Classification

Multiclass classification, but without softmax 😢 /datal23/image.png

Decision Trees (conceptually)

/datal23/image-1.png

Decision Tree Demo

Creating Decision Trees in sklearn

/datal23/image-2.png See the lecture code for the visualization.
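A minimal sketch of fitting a tree in sklearn, assuming the lecture's iris running example (the variable names and the 25% test split are my choices, not from the lecture code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load the iris data and hold out a test set.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# Fit a decision tree with default (unrestricted) settings.
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)

# Draw the fitted tree; see the lecture code for the full visualization.
plot_tree(tree_model, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```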

Evaluating Tree Accuracy

/datal23/image-3.png
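A quick accuracy check, reusing `tree_model` and the train/test split from the sketch above:

```python
# Compare training vs. test accuracy; a large gap signals overfitting.
print("train accuracy:", tree_model.score(X_train, y_train))
print("test accuracy: ", tree_model.score(X_test, y_test))
```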

Overfit Decision Tree Example

The tree is too complex to generalize well to new data: it is too tall and narrow.

The more useful the features are, the simpler the tree structure may be 🤔

The Decision Tree Generation Algorithm

Intuitively Evaluating Split Quality

What makes one split "more clear-cut" than another?

Entropy

/datal23/image-4.png Does entropy keep decreasing as you move down the tree? It can increase! /datal23/image-5.png
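For reference, the entropy of a node with class proportions p_C is S = -Σ_C p_C log2(p_C). A minimal sketch (the helper name `entropy` is mine):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy S = -sum_C p_C * log2(p_C) of a node's class labels.

    A pure node (one class) has entropy 0; a 50/50 two-class node has entropy 1.
    """
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["A", "A", "B", "B"]))  # maximally impure 2-class node -> 1.0
```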

Generating Trees Using Entropy

Weighted entropy, unlike per-node entropy, never increases after a split!

Traditional decision tree generation algorithm:

  • All of the data starts in the root node.
  • Repeat until every node is either pure or unsplittable:
    • Pick the feature x and split value β that maximize ΔWS, the decrease in weighted entropy; e.g. x = petal_width, β = 0.8 has ΔWS = 0.91. (See the code sketch after the notes below.)
    • Split data into two nodes, one where x < β, and one where x ≥ β.

Notes: A node that contains samples from only one class is called a “pure” node. A node that contains overlapping data points from different classes, and thus cannot be split further, is called “unsplittable”.
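A minimal sketch of the greedy split search for a single feature, reusing the `entropy` helper from earlier (the function names and the exhaustive scan over observed values are my assumptions; real implementations are more efficient):

```python
import numpy as np

def weighted_entropy(y_left, y_right):
    """Entropy of the two children, weighted by their share of the samples."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)

def best_split(x, y):
    """Scan candidate thresholds beta for one feature x (numpy array) and
    return the beta minimizing weighted entropy, i.e. maximizing Delta WS."""
    best_beta, best_ws = None, float("inf")
    for beta in np.unique(x):
        left, right = y[x < beta], y[x >= beta]
        if len(left) == 0 or len(right) == 0:  # skip degenerate splits
            continue
        ws = weighted_entropy(left, right)
        if ws < best_ws:
            best_beta, best_ws = beta, ws
    return best_beta, best_ws

x = np.array([0.2, 0.3, 1.4, 1.5])
y = np.array([0, 0, 1, 1])
print(best_split(x, y))  # splitting at 1.4 gives two pure children (weighted entropy 0)
```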

Avoiding Overfitting

Regularization doesn't work here? (There is no parametric loss function to add a penalty term to.)

Heuristically Restricting Decision Tree Complexity

/datal23/image-6.png Approach 2: allow the tree to grow fully, then prune it.
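Both approaches map onto sklearn options; a minimal sketch (the hyperparameter values below are arbitrary examples, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Approach 1: hard constraints that stop growth early.
constrained = DecisionTreeClassifier(
    max_depth=4,           # cap the tree's height
    min_samples_split=10,  # don't split nodes with fewer than 10 samples
)

# Approach 2: grow fully, then prune via cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```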

Embracing Decision Tree Complexity with Random Forests

🪵 😋 /datal23/image-7.png Bagging: Short for Bootstrap AGGregatING.

  • Generate bootstrap resamples of training data.
  • Fit one model for each resample.
  • Final model = average predictions of each small model.
  • Invented by Leo Breiman in 1994 (Berkeley Statistics!).

/datal23/image-8.png /datal23/image-9.png
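A minimal random-forest sketch in sklearn, reusing the iris train/test split from earlier (hyperparameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# Bagging + a random subset of features at each split = a random forest.
forest = RandomForestClassifier(
    n_estimators=100,     # one tree per bootstrap resample
    max_features="sqrt",  # features considered at each split
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # the trees vote; the majority wins
```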

Characteristics of heuristics

These ideas are generally “heuristic”

  • Not provably best or mathematically optimal.
  • Instead, they are just ideas that somebody thought sounded good, implemented, and then found to work acceptably well in practice.

/datal23/image-10.png

Summary and Context

/datal23/image-11.png

Decision trees provide an alternate non-linear framework for classification and regression.

  • The underlying principle is fundamentally different.
  • Decision boundaries can be more complex.
  • Danger of overfitting is high.
  • Small decision trees are very easy to interpret.
  • Doing regression with a tree is straightforward; see the StatQuest video and the sketch below.
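A minimal regression-tree sketch on toy data (the data and hyperparameters are my assumptions): each leaf predicts the mean target of the training points that fall into it, so the fitted function is piecewise constant.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data: noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Each leaf predicts the mean of its training targets.
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # piecewise-constant prediction
```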

Keeping complexity under control is not nearly as mathematically elegant and relies on heuristic rules.

  • Hard constraints.
  • Pruning rules.
  • Random forests:
    • Generate multiple trees using bootstrap.
    • Have the trees vote on the outcome.