Lab 13: Decision Trees and Random Forests
Objectives
In this assignment, we will train a multi-class classifier with three different models (one-vs-rest logistic regression, decision trees, and random forests) and compare the accuracies and decision boundaries each produces.
[Tutorial] Dataset, EDA, and Classification Task
We’ll be looking at a dataset of per-game stats for all NBA players in the 2018-19 season. This dataset comes from basketball-reference.com.
|   | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|----|--------|-----|-----|----|---|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----|-----|
| 0 | 1 | Álex Abrines\abrinal01 | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | ... | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| 1 | 2 | Quincy Acy\acyqu01 | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | ... | 0.700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 2 | 3 | Jaylen Adams\adamsja01 | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | ... | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| 3 | 4 | Steven Adams\adamsst01 | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | ... | 0.500 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| 4 | 5 | Bam Adebayo\adebaba01 | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | ... | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |

5 rows × 30 columns
Our goal will be to predict a player’s position given several other features. The 5 positions in basketball are PG, SG, SF, PF, and C (which stand for point guard, shooting guard, small forward, power forward, and center; Wikipedia).
This information is contained in the `Pos` column:
Pos
SG 176
PF 147
PG 139
C 120
SF 118
PF-SF 2
SF-SG 2
SG-PF 1
C-PF 1
SG-SF 1
PF-C 1
Name: count, dtype: int64
There are several features we could use to predict this position; check the Basketball statistics page of Wikipedia for more details on the statistics themselves.
Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
'3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
dtype='object')
In this lab, we will restrict our exploration to two inputs: rebounds (`TRB`) and assists (`AST`). A two-feature model keeps our 2-D visualizations straightforward.
3-class classification
While we could set out to try and perform 5-class classification, the results (and visualizations) are slightly more interesting if we try to categorize players into 1 of 3 categories: Guard, Forward, and Center. The below code will take the `Pos` column of our dataframe and use it to create a new column `Pos3` that consists of the values `'G'`, `'F'`, and `'C'` (which stand for Guard, Forward, and Center).
Pos3
G 315
F 273
C 120
Name: count, dtype: int64
Data Cleaning and Visualization
Furthermore, since there are many players in the NBA (in the 2018-19 season there were 530 unique players), our visualizations can get noisy and messy. Let’s restrict our data to only contain rows for players that averaged 10 or more points per game.
Now, let’s look at a scatterplot of rebounds (`TRB`) vs. assists (`AST`).
As you can see, when using just rebounds and assists as our features, we get pretty decent cluster separation; that is, Guards, Forwards, and Centers appear in distinct regions of the plot.
Question 1: Evaluating Split Quality
We will explore different ways to evaluate split quality for classification and regression trees in this question.
Question 1a: Entropy
In lecture we defined the entropy $S$ of a node as:
$$ S = -\sum_{C} p_C \log_{2} p_C $$
where $p_C$ is the proportion of data points in a node with label $C$. This function is a measure of the unpredictability of a node in a decision tree.
Implement the `entropy` function, which outputs the entropy of a node with a given set of labels. The `labels` parameter is a list of labels in our dataset. For example, `labels` could be `['G', 'G', 'F', 'F', 'C', 'C']`.
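The implementation cell isn’t shown here; a minimal sketch consistent with the formula above (assuming NumPy is imported as `np`) could look like the following. The value shown below presumably comes from evaluating the function on the lab’s `Pos3` labels.

```python
import numpy as np

def entropy(labels):
    """Entropy S = -sum_C p_C * log2(p_C) of a node with the given labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()        # proportion p_C of each label C
    return -np.sum(p * np.log2(p))
```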
np.float64(1.521555567956027)
Question 1b: Gini impurity
Another metric for determining the quality of a split is Gini impurity: the probability that a sample would be misclassified if it were randomly labeled according to the distribution of labels in the node. Gini impurity is a popular alternative to entropy for determining the best split at a node, and it is in fact the default criterion for scikit-learn’s `DecisionTreeClassifier`.
We can calculate the Gini impurity of a node with the formula ($p_C$ is the proportion of data points in a node with label $C$):
$$ G = 1 - \sum_{C} {p_C}^2 $$
Note that no logarithms are involved in the calculation of Gini impurity, which can make it faster to compute compared to entropy.
Implement the `gini_impurity` function, which outputs the Gini impurity of a node with a given set of labels. The `labels` parameter is defined similarly to the previous part.
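Again, a minimal sketch of one possible implementation, mirroring the `entropy` function above:

```python
def gini_impurity(labels):
    """Gini impurity G = 1 - sum_C p_C^2 of a node with the given labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()        # proportion p_C of each label C
    return 1 - np.sum(p ** 2)
```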
np.float64(0.6383398017253514)
As an optional exercise in probability, try to think of a way to derive the formula for Gini impurity.
[Tutorial] Variance
Are there other splitting metrics beyond entropy and Gini impurity? Yes! A third metric is variance (yes, that variance), which is often used for regression trees, or decision tree regressors, which split data based on a continuous response variable. It makes little sense to use entropy/Gini impurity for regression, as both metrics assume that there are discrete probabilities of responses (and therefore are more suited to classification).
Recall that the variance is defined as:
$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$
where $\mu$ is the mean, $N$ is the total number of data points, and $x_i$ is the value of each data point.
Run the below cell to define the `variance` function.
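The original cell is not shown; a definition consistent with the formula above would be:

```python
def variance(values):
    """Population variance of the continuous values in a node."""
    values = np.asarray(values)
    return np.mean((values - np.mean(values)) ** 2)
```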
np.float64(21.023148263588652)
Question 1c: Weighted Metrics
In lecture, we used weighted entropy as a loss function to help us determine the best split. Recall that the weighted entropy is given by:
$$ L = \frac{N_1 S(X) + N_2 S(Y)}{N_1 + N_2} $$
$N_1$ is the number of samples in the left node $X$, and $N_2$ is the number of samples in the right node $Y$. This notion of a weighted average can be extended to other metrics such as Gini impurity and variance simply by changing the $S$ (entropy) function to $G$ (Gini impurity) or $\sigma^2$ (variance).
First, implement the `weighted_metric` function. The `left` parameter is a list of labels or values in the left node $X$, and the `right` parameter is a list of labels or values in the right node $Y$. The `metric` parameter is a function, which can be `entropy`, `gini_impurity`, or `variance`. For `entropy` and `gini_impurity`, you may assume that `left` and `right` contain discrete labels. For `variance`, you may assume that `left` and `right` contain continuous values.
Then, assign `we_pos3_age_30` to the weighted entropy (in the `Pos3` column) of a split that partitions `nba_data` into two groups: players who are 30 years old or older, and players who are younger than 30.
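A sketch of one possible `weighted_metric` implementation, reusing the `entropy`, `gini_impurity`, and `variance` functions from above; the Series below presumably shows the `Pos3` labels of the 30-and-older group:

```python
def weighted_metric(left, right, metric):
    """Weighted average of `metric` across a left/right split."""
    n1, n2 = len(left), len(right)
    return (n1 * metric(left) + n2 * metric(right)) / (n1 + n2)

# Inspect the labels of the 30-and-older group.
nba_data.loc[nba_data['Age'] >= 30, 'Pos3']
```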
3 C
7 C
10 C
19 F
21 F
..
695 G
698 F
699 G
700 C
703 C
Name: Pos3, Length: 223, dtype: object
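The weighted entropy of the age-30 split would then be computed along these lines:

```python
we_pos3_age_30 = weighted_metric(
    nba_data.loc[nba_data['Age'] >= 30, 'Pos3'],
    nba_data.loc[nba_data['Age'] < 30, 'Pos3'],
    entropy,
)
we_pos3_age_30
```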
np.float64(1.521489768014793)
We will not go over the entire decision tree fitting process in this assignment, but you now have the basic tools to fit a decision tree. As an optional exercise, try to think about how you would extend these tools to fit a decision tree from scratch.
Question 2: Classification
Let’s switch gears to classification.
Before fitting any models, let’s first split `nba_data` into a training set and a test set.
One-vs-Rest Logistic Regression
We only discussed binary logistic regression in class, but there is a natural extension to binary logistic regression called one-vs-rest logistic regression for multiclass classification. In essence, one-vs-rest logistic regression simply builds one binary logistic regression classifier for each of the $N$ classes (in this scenario $N = 3$). We then predict the class corresponding to the classifier that gives the highest probability among the $N$ classes.
Question 2a
In the cell below, set `logistic_regression_model` to be a one-vs-rest logistic regression model. Then, fit that model using the `AST` and `TRB` columns (in that order) from `nba_train` as our features, and `Pos3` as our response variable.
Remember, `sklearn.linear_model.LogisticRegression` (documentation) has already been imported for you. There is a `multi_class` parameter that you need to specify in order to make your model a multi-class one-vs-rest classifier. See the documentation for more details.
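The solution cell isn’t shown; a sketch that satisfies the description above (assuming the usual import) would be:

```python
from sklearn.linear_model import LogisticRegression

# One binary classifier per class, highest-probability class wins.
logistic_regression_model = LogisticRegression(multi_class='ovr')
logistic_regression_model.fit(nba_train[['AST', 'TRB']], nba_train['Pos3'])
```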
q2a passed! 🙌
[Tutorial] Visualizing Performance
To see our classifier in action, we can use `logistic_regression_model.predict` and see what it outputs.
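Something along these lines likely produced the table below (the displayed rows all happen to be centers, so the lab presumably showed a center-only subset):

```python
# Add a column of predicted positions alongside the true positions.
nba_train['Predicted (OVRLR) Pos3'] = logistic_regression_model.predict(
    nba_train[['AST', 'TRB']]
)
nba_train.loc[nba_train['Pos3'] == 'C',
              ['AST', 'TRB', 'Pos3', 'Predicted (OVRLR) Pos3']].head(15)
```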
|     | AST | TRB  | Pos3 | Predicted (OVRLR) Pos3 |
|-----|-----|------|------|------------------------|
| 655 | 1.4 | 8.6  | C | C |
| 644 | 2.0 | 10.2 | C | C |
| 703 | 0.8 | 4.5  | C | F |
| 652 | 1.6 | 7.2  | C | F |
| 165 | 1.4 | 7.5  | C | C |
| 122 | 2.4 | 8.4  | C | C |
| 353 | 7.3 | 10.8 | C | C |
| 367 | 1.4 | 8.6  | C | C |
| 408 | 1.2 | 4.9  | C | F |
| 161 | 3.9 | 12.0 | C | C |
| 647 | 3.4 | 12.4 | C | C |
| 308 | 4.2 | 6.7  | C | G |
| 362 | 3.0 | 11.4 | C | C |
| 146 | 3.6 | 8.2  | C | C |
| 233 | 4.4 | 7.9  | C | C |
Our model does decently well here, as you can see visually above. Below, we compute the training accuracy; remember that `model.score()` computes accuracy.
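The call is presumably just:

```python
logistic_regression_model.score(nba_train[['AST', 'TRB']], nba_train['Pos3'])
```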
0.7964071856287425
We can compute the test accuracy as well by looking at `nba_test` instead of `nba_train`:
0.6428571428571429
Now, let’s draw the decision boundary for this logistic regression classifier, and see how the classifier performs on both the training and test data.
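The plotting cell isn’t shown, but a typical sketch evaluates the classifier on a grid of feature values and contours the predictions; the helper name and styling here are illustrative, not the lab’s exact code. The same helper works for the decision tree and random forest boundaries later on.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_decision_boundary(model, X, y, title=''):
    """Shade the model's predicted class over a grid of (AST, TRB) values."""
    xx, yy = np.meshgrid(
        np.linspace(X['AST'].min() - 1, X['AST'].max() + 1, 200),
        np.linspace(X['TRB'].min() - 1, X['TRB'].max() + 1, 200),
    )
    # Predict on a raw ndarray grid (sklearn may warn about missing
    # feature names, since the model was fitted on a DataFrame).
    labels = model.predict(np.c_[xx.ravel(), yy.ravel()])
    class_to_int = {c: i for i, c in enumerate(model.classes_)}
    zz = np.vectorize(class_to_int.get)(labels).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)
    for pos3, group in X.assign(Pos3=y.values).groupby('Pos3'):
        plt.scatter(group['AST'], group['TRB'], label=pos3, s=15)
    plt.xlabel('AST'); plt.ylabel('TRB'); plt.title(title); plt.legend()
    plt.show()

plot_decision_boundary(logistic_regression_model,
                       nba_train[['AST', 'TRB']], nba_train['Pos3'],
                       title='OVR logistic regression (train)')
```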
Our one-vs-rest logistic regression was able to find a linear decision boundary between the three classes. It generally classifies centers as players with a lot of rebounds, forwards as players with a medium number of rebounds and a low number of assists, and guards as players with a low number of rebounds.
Note: In practice we would use many more features – we only used 2 here just so that we could visualize the decision boundary.
Decision Trees
Question 2b
Let’s now create a decision tree classifier on the same training data `nba_train`, and look at the resulting decision boundary.
In the following cell, first use `tree.DecisionTreeClassifier` (documentation) to fit a model using the same features and response as above, and call this model `decision_tree_model`. Set the `random_state` and `criterion` parameters to 42 and `'entropy'`, respectively.
Hint: Your code will be mostly the same as the previous part.
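A sketch of a fit that matches the description (assuming `tree` is imported from `sklearn`):

```python
from sklearn import tree

decision_tree_model = tree.DecisionTreeClassifier(criterion='entropy',
                                                  random_state=42)
decision_tree_model.fit(nba_train[['AST', 'TRB']], nba_train['Pos3'])
```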
q2b passed! 🌟
[Tutorial] Decision Tree Performance
Now, let’s draw the decision boundary for this decision tree classifier, and see how the classifier performs on both the training and test data.
We compute the training and test accuracies of the decision tree model below.
(0.9940119760479041, 0.5714285714285714)
Random Forests
Question 2c
Let’s now create a random forest classifier on the same training data `nba_train` and look at the resulting decision boundary.
In the following cell, use `ensemble.RandomForestClassifier` (documentation) to fit a model using the same features and response as above, and call this model `random_forest_model`. Use 20 trees in your random forest classifier; set the `random_state` and `criterion` parameters to 42 and `'entropy'`, respectively.
Hint: Your code will be mostly the same as the first few parts of this question.
Hint: Look at the `n_estimators` parameter of `ensemble.RandomForestClassifier`.
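A sketch following the same pattern as the previous parts (assuming `ensemble` is imported from `sklearn`):

```python
from sklearn import ensemble

random_forest_model = ensemble.RandomForestClassifier(n_estimators=20,
                                                      criterion='entropy',
                                                      random_state=42)
random_forest_model.fit(nba_train[['AST', 'TRB']], nba_train['Pos3'])
```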
q2c passed! 🙌
[Tutorial] Random Forest Performance
Now, let’s draw the decision boundary for this random forest classifier, and see how the classifier performs on both the training and test data.
We compute the training and test accuracies of the random forest model below.
(0.9760479041916168, 0.6964285714285714)
Compare/Contrast
How do the three models you created (multiclass one-vs-rest logistic regression, decision tree, random forest) compare to each other?
Decision boundaries: Run the below cell for your convenience. It overlays the decision boundaries for the train and test sets for each of the models you created.
Performance Metrics: Run the below cell for your convenience. It summarizes the train and test accuracies for the three models you created.
Question 2d
Looking at the three models, which model performed the best on the training set, and which model performed the best on the test set? How are the training and test accuracy related for the three models, and how do the decision boundaries generated for each of the three models relate to the model’s performance?
The decision tree performed best on the training set (99.4% accuracy), while the random forest performed best on the test set (69.6%). The decision tree’s near-perfect training accuracy comes with the worst test accuracy (57.1%), and its jagged decision boundary carves out tiny regions to fit individual training points: classic overfitting. The one-vs-rest logistic regression’s linear boundaries are simpler but generalize reasonably (79.6% train, 64.3% test), and the random forest averages many trees into a smoother boundary, trading a little training accuracy for the best test performance.
[ungraded] Question 3: Regression Trees
In Project 1, we used linear regression to predict housing prices in Cook County, Illinois. However, what would happen if we tried to use a different prediction method?
Try fitting a regression tree (also known as a decision tree regressor) to predict housing prices. Here’s one in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
What do you notice about the training error and the test error for the decision tree regressor? Is one significantly larger than the other? If so, what methods could we use to make this error lower?
Now, try fitting a random forest regressor instead of a single decision tree. What do you notice about the training error and the test error for the random forest, and how does this compare to the training and test error of a single decision tree?
Try this yourself using the Cook County data and train/test split from Project 1.
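A sketch of what this comparison could look like, assuming `X_train`, `X_test`, `y_train`, and `y_test` are hypothetical names for the design matrices and responses from your Project 1 housing pipeline:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test: hypothetical names for the
# Project 1 housing features and sale prices.
for model in (DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # A lone tree typically drives training error to ~0 while test error
    # stays large; the forest narrows that gap.
    print(f'{type(model).__name__}: train MSE={train_mse:.0f}, '
          f'test MSE={test_mse:.0f}')
```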