
DATA100-lab8: Model Selection, Regularization, and Cross-Validation

# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

Lab 8: Model Selection, Regularization, and Cross-Validation

In this lab, you will practice using scikit-learn to generate models of various complexity. You’ll then use the holdout method and K-fold cross-validation to select the models that generalize best.

# Run this cell to set up your notebook
import seaborn as sns
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.set_context("talk")

from IPython.display import display, Latex, Markdown

Introduction

For this lab, we will use a toy dataset to predict the house prices in Boston with data provided by the sklearn.datasets package. There are more interesting datasets in the package if you want to explore them during your free time!

Run the following cell to load the data. It unpickles the object originally returned by load_boston(), a dictionary-like object which includes keys for:

  • data : the covariates (X)
  • target : the response vector (Y)
  • feature_names: the column names
  • DESCR : a full description of the data
  • filename: name of the csv file
import pickle
boston_data = pickle.load(open("boston_data.pickle", "rb")) 


print(boston_data.keys())
sum(boston_data.data)
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])





array([1.82844292e+03, 5.75000000e+03, 5.63521000e+03, 3.50000000e+01,
       2.80675700e+02, 3.18002500e+03, 3.46989000e+04, 1.92029160e+03,
       4.83200000e+03, 2.06568000e+05, 9.33850000e+03, 6.40245000e+03])
print(boston_data['DESCR'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 12 numeric/categorical predictive. Median Value (attribute 13) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

A look at the DESCR attribute tells us the data contains these features:

1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 
             25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds 
             river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per 10,000 USD
11. PTRATIO  pupil-teacher ratio by town
12. LSTAT    % lower status of the population

Let’s now convert this data into a pandas DataFrame.

boston = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
boston.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 5.33

Question 1

Let’s model this housing price data! Before we can do this, however, we need to split the data into training and test sets. Remember that the response vector (housing prices) lives in the target attribute. A random seed is set here so that we can deterministically generate the same splitting in the future if we want to test our result again and find potential bugs.

Use the train_test_split function to split out 10% of the data for the test set. Call the resulting splits X_train, X_holdout, Y_train, Y_holdout. Here “holdout” refers to the fact that we’re going to hold this data out when training our model.

from sklearn.model_selection import train_test_split
np.random.seed(45)

X = boston
Y = pd.Series(boston_data['target'])

X_train, X_holdout, Y_train, Y_holdout = train_test_split(X, Y, test_size=0.1)
grader.check("q1")

q1
passed! 🚀

Question 2

As a warmup, fit a linear model to describe the relationship between the housing price and all available covariates. We’ve imported sklearn.linear_model as lm, so you can use that instead of typing out the whole module name. Fill in the cells below to fit a linear regression model to the covariates and create a scatter plot for our predictions vs. the true prices.

import sklearn.linear_model as lm

linear_model = lm.LinearRegression()

# Fit your linear model
linear_model.fit(X_train, Y_train)

# Predict housing prices on the test set
Y_pred = linear_model.predict(X_holdout)

# Plot predicted vs true prices
plt.scatter(Y_holdout, Y_pred, alpha=0.5)
plt.xlabel("Prices $(y)$")
plt.ylabel(r"Predicted Prices $(\hat{y})$")
plt.title("Prices vs Predicted Prices");

[Figure: scatter plot of true prices vs. predicted prices]

Briefly analyze the scatter plot above. Do you notice any outliers? Write your answer in the cell below.

Ideally, the points should be distributed evenly and fairly tightly along the line y = x.

Alternatively, we can plot the residuals vs. our model predictions. Ideally they’d all be zero. Given the inevitability of noise, we’d at least like them to be scattered randomly across the line where the residual is zero. By contrast, there appears to be a possible pattern, with our model consistently underestimating prices for both very low and very high values, and possibly consistently overestimating prices towards the middle range.

plt.scatter(Y_pred, Y_holdout - Y_pred, alpha=0.5)
plt.ylabel(r"Residual $(y - \hat{y})$")
plt.xlabel(r"Predicted Prices $(\hat{y})$")
plt.title("Residuals vs Predicted Prices")
plt.axhline(y = 0, color='r');

[Figure: residuals vs. predicted prices, with a horizontal reference line at zero]

Question 3

As we can see from the scatter plot, our model is not perfect. If it were, every point would lie on the identity line (i.e. a line of slope 1). Compute the root mean squared error (RMSE) of the predicted responses:

$$ \textbf{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 } $$

Fill out the function below and compute the RMSE for our predictions on both the training data X_train and the held out set X_holdout. Your implementation should not use for loops.

def rmse(actual_y, predicted_y):
    """
    Args:
        actual_y: an array of the ground truth labels
        predicted_y: an array of the predictions from the model

    Returns:
        The root mean squared error between the predictions and the ground truth
    """
    return np.sqrt(np.mean((predicted_y - actual_y)**2))

train_error = rmse(Y_train, linear_model.predict(X_train))
holdout_error = rmse(Y_holdout, Y_pred)

print("Training RMSE:", train_error)
print("Holdout RMSE:", holdout_error)
Training RMSE: 4.633297105625516
Holdout RMSE: 5.685160866583937
grader.check("q3")

q3
passed! 🙌

Is your training error lower than the error on the data the model never got to see? If so, why could this be happening? Answer in the cell below.

Yes; slight overfitting, perhaps?

Overfitting

Sometimes we can get even higher accuracy by adding more features. For example, the code below adds the square, square root, and hyperbolic tangent of every feature to the design matrix. We’ve chosen these bizarre features specifically to highlight overfitting.

boston_with_extra_features = boston.copy()
for feature_name in boston.columns:
    boston_with_extra_features[feature_name + "^2"] = boston_with_extra_features[feature_name] ** 2
    boston_with_extra_features["sqrt" + feature_name] = np.sqrt(boston_with_extra_features[feature_name])
    boston_with_extra_features["tanh" + feature_name] = np.tanh(boston_with_extra_features[feature_name])
    
boston_with_extra_features.head(5)

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX ... tanhRAD TAX^2 sqrtTAX tanhTAX PTRATIO^2 sqrtPTRATIO tanhPTRATIO LSTAT^2 sqrtLSTAT tanhLSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 ... 0.761594 87616.0 17.204651 1.0 234.09 3.911521 1.0 24.8004 2.231591 0.999905
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 ... 0.964028 58564.0 15.556349 1.0 316.84 4.219005 1.0 83.5396 3.023243 1.000000
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 ... 0.964028 58564.0 15.556349 1.0 316.84 4.219005 1.0 16.2409 2.007486 0.999368
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 ... 0.995055 49284.0 14.899664 1.0 349.69 4.324350 1.0 8.6436 1.714643 0.994426
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 ... 0.995055 49284.0 14.899664 1.0 349.69 4.324350 1.0 28.4089 2.308679 0.999953

5 rows × 48 columns

We split up our data again and refit the model. From this cell forward, we append 2 to the variable names X_train, X_holdout, Y_train, Y_holdout, train_error, holdout_error in order to maintain our original data. Make sure you use these variable names from this cell forward, at least until we get to the part where we create version 3 of each of these.

np.random.seed(25)
X = boston_with_extra_features
X_train2, X_holdout2, Y_train2, Y_holdout2 = train_test_split(X, Y, test_size = 0.10)
linear_model.fit(X_train2, Y_train2);

Looking at our training and holdout RMSE, we see that they are lower than those you computed earlier. This strange model appears to be better, even though it includes seemingly useless features like the hyperbolic tangent of the average number of rooms per dwelling.

train_error2 = rmse(Y_train2, linear_model.predict(X_train2)) 
holdout_error2 = rmse(Y_holdout2, linear_model.predict(X_holdout2))

print("Training RMSE:", train_error2)
print("Holdout RMSE:", holdout_error2)
Training RMSE: 3.3514483036916287
Holdout RMSE: 5.410120414381265

The code below generates the training and holdout RMSE for 48 different models and stores the results in a DataFrame. The first model uses only the first feature “CRIM”. The second model uses the first two features “CRIM” and “ZN”, and so forth.

errors_vs_N = pd.DataFrame(columns = ["N", "Training Error", "Holdout Error"])
range_of_num_features = range(1, X_train2.shape[1] + 1)

for N in range_of_num_features:
    X_train_first_N_features = X_train2.iloc[:, :N]    
    
    linear_model.fit(X_train_first_N_features, Y_train2)
    train_error_overfit = rmse(Y_train2, linear_model.predict(X_train_first_N_features))
    
    X_holdout_first_N_features = X_holdout2.iloc[:, :N]
    holdout_error_overfit = rmse(Y_holdout2, linear_model.predict(X_holdout_first_N_features))    
    errors_vs_N.loc[len(errors_vs_N)] = [N, train_error_overfit, holdout_error_overfit]
    
errors_vs_N

N Training Error Holdout Error
0 1.0 8.536340 7.825177
1 2.0 8.085693 7.637465
2 3.0 7.776942 7.213870
3 4.0 7.643897 6.391482
4 5.0 7.634894 6.372166
5 6.0 5.698878 7.635694
6 7.0 5.689554 7.585860
7 8.0 5.399034 7.158563
8 9.0 5.379679 7.281769
9 10.0 5.318218 7.231629
10 11.0 5.088829 6.922974
11 12.0 4.680294 5.437528
12 13.0 4.679671 5.443388
13 14.0 4.664717 5.448438
14 15.0 4.627661 5.479720
15 16.0 4.613226 5.488425
16 17.0 4.580971 5.389309
17 18.0 4.580622 5.391183
18 19.0 4.507301 5.185114
19 20.0 4.482925 5.194924
20 21.0 4.482412 5.188007
21 22.0 4.482412 5.188007
22 23.0 4.482412 5.188007
23 24.0 4.482412 5.188007
24 25.0 4.482224 5.191621
25 26.0 4.471079 5.256722
26 27.0 4.460457 5.308239
27 28.0 3.909139 4.582940
28 29.0 3.889483 4.951462
29 30.0 3.728687 6.963946
30 31.0 3.697400 6.986045
31 32.0 3.672688 7.081944
32 33.0 3.672674 7.082046
33 34.0 3.638518 6.916599
34 35.0 3.570632 6.535278
35 36.0 3.515723 6.417958
36 37.0 3.513539 6.420029
37 38.0 3.502764 6.453024
38 39.0 3.502689 6.452923
39 40.0 3.498222 6.373783
40 41.0 3.450329 6.493727
41 42.0 3.450329 6.493727
42 43.0 3.448549 6.486777
43 44.0 3.448549 6.486974
44 45.0 3.448549 6.486974
45 46.0 3.420689 5.823307
46 47.0 3.353809 5.398784
47 48.0 3.351448 5.410120

If we plot the training and holdout error as we add each additional feature, our training error gets lower and lower (since the model’s complexity keeps growing), and in fact it’s possible to prove with linear algebra that the training error will decrease monotonically.

By contrast, the error on unseen held out data is higher for the models with more parameters, since the lessons learned from these last 20+ features aren’t actually useful when applied to unseen data. That is, these models aren’t generalizable.
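Before plotting, we can sanity-check that claim directly from the errors_vs_N table (a minimal sketch, not part of the lab):

# Differences between consecutive training errors as features are added; adding a
# feature should never (meaningfully) increase the training error.
diffs = np.diff(errors_vs_N["Training Error"].astype(float))
print(np.all(diffs <= 1e-9))  # expect True, up to floating-point noise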

import plotly.express as px
px.line(errors_vs_N, x = "N", y = ["Training Error", "Holdout Error"])

Note that this diagram resembles the cartoon from Lecture 15.

This plot is a useful tool for model selection: the best model is the one with the lowest error on the holdout set, i.e. the one that uses only the first 28 features.
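We can read that choice off programmatically (a small sketch using the errors_vs_N table above):

# Row with the smallest holdout error; its N column tells us how many of the
# first features the best model used (N = 28 in the table above).
errors_vs_N.loc[errors_vs_N["Holdout Error"].astype(float).idxmin()]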

Regularization

As an alternative and more realistic example, instead of using only the first N features, we can vary the regularization strength. For example, for a really low regularization strength (e.g. $\alpha = 10^{-5}$), we get a model that is nearly identical to our linear regression model.

from sklearn.linear_model import Ridge
regularized_model = Ridge(alpha = 10**-5)
regularized_model.fit(X_train2, Y_train2)
regularized_model.coef_
d:\miniconda3\Lib\site-packages\sklearn\linear_model\_ridge.py:216: LinAlgWarning:

Ill-conditioned matrix (rcond=6.11696e-19): result may not be accurate.






array([ 4.44044277e-01, -3.00268517e-02,  2.03776925e+00,  3.54247206e-01,
       -1.19704083e+02,  1.63780073e+01, -3.10555372e-01, -1.31182539e+01,
        2.87010751e+00,  7.68411439e-01,  2.43201974e+01,  2.09160420e+00,
       -1.17012738e-03, -5.60565882e+00,  6.79680723e+00,  1.02949752e-03,
       -1.31223400e+00,  6.99621340e+00, -3.55165065e-02, -7.66339676e+00,
       -2.53950130e+00,  3.54247186e-01,  3.54247186e-01,  2.69792455e-01,
        1.91778126e+00,  3.11293526e+02, -1.53815298e+02,  8.03364965e-01,
       -1.17792246e+02,  3.25883430e+02,  1.08476149e-03,  2.42998443e+00,
        2.52462516e+02,  3.55080093e-01,  3.78504405e+01, -8.11283072e+01,
       -5.18073808e-02, -8.51699934e+00,  1.14213610e+01, -2.86248788e-04,
       -2.10606164e+01,  0.00000000e+00, -1.85988225e-01, -1.54605184e+02,
        5.73422430e-06, -1.79546600e-02, -1.53342390e+01, -4.25637232e+01])
linear_model.fit(X_train2, Y_train2)
linear_model.coef_
array([ 3.65647144e-01,  7.96329260e-02,  1.50196461e+00,  3.72759210e-01,
       -1.82281287e+03,  6.19862020e+02, -2.86690023e-01, -1.29491141e+01,
        1.68693762e+00,  7.86841735e-01,  1.62893036e+01,  1.95113824e+00,
       -9.11835586e-04, -5.02513063e+00,  5.90016774e+00,  6.12889765e-04,
       -2.21247181e+00,  8.90275845e+00, -2.73913970e-02, -5.40098561e+00,
       -4.23462112e+00,  3.72978675e-01,  3.72978861e-01,  2.84060205e-01,
        5.41748851e+02,  4.88274463e+02,  1.16998609e+03, -1.36350124e+01,
       -2.23299632e+03,  5.18647024e+04,  1.04162650e-03,  2.14549424e+00,
        4.31003519e+02,  3.51263646e-01,  3.77337190e+01, -8.06896603e+01,
       -2.88295129e-02, -4.52779826e+00,  8.15771554e+00, -2.99443268e-04,
       -2.14061912e+01,  3.63797881e-12, -1.15683673e-01, -1.07968511e+02,
        1.52846060e-03, -2.03166630e-02, -1.38532349e+01, -4.22894414e+01])

However, if we pick a large regularization strength, e.g. $\alpha = 10^4$, we see that the resulting parameters are much smaller in magnitude.

from sklearn.linear_model import Ridge
regularized_model = Ridge(alpha = 10**4)
regularized_model.fit(X_train2, Y_train2)
regularized_model.coef_
array([-2.64236947e-02, -9.32767913e-03, -2.42925745e-02,  5.47079848e-03,
       -2.54276859e-03,  1.92843599e-02, -5.85037883e-02, -2.06397155e-02,
        2.62611572e-02, -4.16712719e-02, -1.95840395e-03, -1.91841765e-01,
       -1.08846586e-03, -4.28805626e-03,  1.70791430e-03,  6.51767238e-04,
        1.71133790e-03,  1.07486010e-03, -1.19407955e-03, -7.15970642e-03,
       -7.29287455e-04,  5.47079848e-03,  5.47079848e-03,  4.16652815e-03,
       -3.60910235e-03, -1.50954020e-03, -1.59681172e-03,  3.35928833e-01,
        3.11186224e-03, -2.79750628e-06,  4.48782500e-04, -5.71759051e-03,
        2.22943575e-06, -6.59740404e-02, -7.01191670e-03, -1.58200606e-03,
        1.32454447e-03,  8.15878522e-03,  1.17645581e-03,  3.59660322e-05,
       -2.54207413e-03,  0.00000000e+00, -2.57499245e-02, -3.15683513e-04,
       -8.10128212e-15, -6.45893053e-03, -4.20286900e-02, -2.29035441e-04])

Standard Scaling

Normalization (standardizing the features).

Recall from lecture that in order to properly regularize a model, the features should be at the same scale. Otherwise the model has to spend more of its parameter budget to use “small” features (e.g. lengths in inches) compared to “large” features (e.g. lengths in kilometers).

To do this we can use a Standard Scaler to create a new version of the DataFrame where every column has zero mean and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(boston_with_extra_features)
boston_with_extra_features_scaled = pd.DataFrame(ss.transform(boston_with_extra_features), columns = boston_with_extra_features.columns)
boston_with_extra_features_scaled

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX ... tanhRAD TAX^2 sqrtTAX tanhTAX PTRATIO^2 sqrtPTRATIO tanhPTRATIO LSTAT^2 sqrtLSTAT tanhLSTAT
0 -0.419782 0.284830 -1.287909 -0.272599 -0.144217 0.413672 -0.120013 0.140214 -0.982843 -0.666608 ... -4.863216 -0.682024 -0.644166 0.0 -1.458429 -1.453573 0.135095 -0.789529 -1.202689 0.103530
1 -0.417339 -0.487722 -0.593381 -0.272599 -0.740262 0.194274 0.367166 0.557160 -0.867883 -0.987329 ... -0.521299 -0.866530 -1.053383 0.0 -0.373078 -0.266921 0.179012 -0.540454 -0.399953 0.128396
2 -0.417342 -0.487722 -0.593381 -0.272599 -0.740262 1.282714 -0.265812 0.557160 -0.867883 -0.987329 ... -0.521299 -0.866530 -1.053383 0.0 -0.373078 -0.266921 0.179012 -0.825825 -1.429933 -0.037847
3 -0.416750 -0.487722 -1.306878 -0.272599 -0.835284 1.016303 -0.809889 1.077737 -0.752922 -1.106115 ... 0.144191 -0.925467 -1.216415 0.0 0.057783 0.139631 0.179251 -0.858040 -1.726876 -1.338649
4 -0.412482 -0.487722 -1.306878 -0.272599 -0.835284 1.228577 -0.511180 1.077737 -0.752922 -1.106115 ... 0.144191 -0.925467 -1.216415 0.0 0.057783 0.139631 0.179251 -0.774228 -1.124522 0.116050
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 -0.413229 -0.487722 0.115738 -0.272599 0.158124 0.439316 0.018673 -0.625796 -0.982843 -0.803212 ... -4.863216 -0.765138 -0.813468 0.0 1.255407 1.136187 0.179299 -0.498180 -0.312324 0.128400
502 -0.415249 -0.487722 0.115738 -0.272599 0.158124 -0.234548 0.288933 -0.716639 -0.982843 -0.803212 ... -4.863216 -0.765138 -0.813468 0.0 1.255407 1.136187 0.179299 -0.545089 -0.410031 0.128395
503 -0.413447 -0.487722 0.115738 -0.272599 0.158124 0.984960 0.797449 -0.773684 -0.982843 -0.803212 ... -4.863216 -0.765138 -0.813468 0.0 1.255407 1.136187 0.179299 -0.759808 -1.057406 0.121757
504 -0.407764 -0.487722 0.115738 -0.272599 0.158124 0.725672 0.736996 -0.668437 -0.982843 -0.803212 ... -4.863216 -0.765138 -0.813468 0.0 1.255407 1.136187 0.179299 -0.716638 -0.884300 0.127164
505 -0.415000 -0.487722 0.115738 -0.272599 0.158124 -0.362767 0.434732 -0.613246 -0.982843 -0.803212 ... -4.863216 -0.765138 -0.813468 0.0 1.255407 1.136187 0.179299 -0.631389 -0.619088 0.128327

506 rows × 48 columns
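As a quick sanity check (a minimal sketch, not part of the lab), the rescaled columns should now have mean roughly 0 and, for every non-constant column, standard deviation roughly 1:

# Column means should be essentially zero; the population standard deviations (ddof=0,
# which is what StandardScaler uses) should be 1 except for columns that were constant.
print(boston_with_extra_features_scaled.mean().abs().max())
print(boston_with_extra_features_scaled.std(ddof=0).describe())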

Let’s now regenerate the training and holdout sets using this new rescaled dataset.

np.random.seed(25)
X = boston_with_extra_features_scaled
X_train3, X_holdout3, Y_train3, Y_holdout3 = train_test_split(X, Y, test_size = 0.10)

Fitting our regularized model with $\alpha = 10^2$ on this scaled data, we now see that our coefficients are all of roughly the same magnitude. This is because all of our features are now on around the same scale, whereas in the unscaled data, some of the features like TAX^2 were much larger than others.

from sklearn.linear_model import Ridge
regularized_model = Ridge(alpha = 10**2)
regularized_model.fit(X_train3, Y_train3)
regularized_model.coef_
array([-0.61501301, -0.04142115, -0.13765546,  0.11847529, -0.48559141,
        1.08393358, -0.11193453, -0.6446524 ,  0.25956768, -0.41922265,
       -0.48366805, -1.23850023, -0.22227015, -0.51281683,  0.40952134,
        0.2537374 , -0.07390569,  0.06674777,  0.11386252, -0.32684806,
       -0.39658025,  0.11847529,  0.11847529,  0.11847529, -0.67728184,
       -0.385382  , -0.36114118,  1.652695  ,  0.78959095, -1.09450355,
       -0.02430294, -0.14153645,  0.11511136, -0.41673303, -0.72747143,
       -1.36478486,  0.21308676,  0.30241207,  0.45131889, -0.16799052,
       -0.59340155,  0.        , -0.43637213, -0.50878723, -0.16529828,
       -0.04194842, -1.94295189, -0.70807685])

Finding an Optimum Alpha

In the cell below, write code that generates a DataFrame with the training and holdout error for the range of alphas given. Make sure you’re using the 3rd training and holdout sets, which have been rescaled!

Note: You should use all 48 features for every single model that you fit, i.e. you’re not going to be keeping only the first N features.

error_vs_alpha = pd.DataFrame(columns = ["alpha", "Training Error", "Holdout Error"])
range_of_alphas = 10**np.linspace(-5, 4, 40)

for alpha in range_of_alphas:
    linear_model = Ridge(alpha=alpha)
    linear_model.fit(X_train3, Y_train3)
    training_error = rmse(Y_train3, linear_model.predict(X_train3))
    holdout_error = rmse(Y_holdout3, linear_model.predict(X_holdout3))
    error_vs_alpha.loc[len(error_vs_alpha)] = [alpha, training_error, holdout_error]
error_vs_alpha

alpha Training Error Holdout Error
0 0.000010 3.344803 5.389722
1 0.000017 3.344885 5.362696
2 0.000029 3.345093 5.318839
3 0.000049 3.345588 5.249551
4 0.000084 3.346672 5.144906
5 0.000143 3.348827 4.997596
6 0.000242 3.352670 4.810448
7 0.000412 3.358709 4.603154
8 0.000702 3.366898 4.408047
9 0.001194 3.376490 4.252523
10 0.002031 3.386611 4.144918
11 0.003455 3.396946 4.077740
12 0.005878 3.407582 4.038919
13 0.010000 3.418347 4.018141
14 0.017013 3.428713 4.007542
15 0.028943 3.438401 4.001021
16 0.049239 3.447793 3.994133
17 0.083768 3.457708 3.984607
18 0.142510 3.468839 3.972858
19 0.242446 3.481455 3.962098
20 0.412463 3.495804 3.958457
21 0.701704 3.512882 3.971376
22 1.193777 3.534575 4.011992
23 2.030918 3.562638 4.086328
24 3.455107 3.597518 4.187414
25 5.878016 3.638674 4.296469
26 10.000000 3.686303 4.392487
27 17.012543 3.742258 4.458995
28 28.942661 3.809021 4.486227
29 49.238826 3.889335 4.478730
30 83.767764 3.989339 4.470314
31 142.510267 4.121409 4.524381
32 242.446202 4.300992 4.693465
33 412.462638 4.541284 4.968124
34 701.703829 4.854189 5.289802
35 1193.776642 5.251478 5.615333
36 2030.917621 5.733147 5.946439
37 3455.107295 6.275742 6.304280
38 5878.016072 6.841884 6.698886
39 10000.000000 7.394722 7.119279

Below we plot the training and holdout set error for the range of alphas given. You should see a figure similar to the one from lecture, where training error goes down as model complexity increases, but the error on the held out set is large for extreme values of alpha and minimized for some intermediate value.

Note that on this plot, the x-axis runs in the inverse of complexity! In other words, small-alpha models (on the left) are the most complex, because there is almost no regularization. That’s why the training error is lowest on the left side of the plot, as this is where overfitting occurs.

px.line(error_vs_alpha, x = "alpha", y = ["Training Error", "Holdout Error"], log_x=True)

From the plot above, what is the best alpha to use?

Make the training error as small as possible while also keeping the hold-out error as small as possible ==> somewhere around 0.01 to 1.
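We can pick that value out programmatically (a small sketch): take the row of error_vs_alpha whose holdout error is smallest.

# alpha minimizing the holdout-set RMSE; from the table above this is roughly 0.41.
error_vs_alpha.loc[error_vs_alpha["Holdout Error"].astype(float).idxmin()]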

REMINDER: Test Set vs. Validation Set (a.k.a. Development Set)

In the plots above, we trained our models on a training set, and plotted the resulting RMSE on the training set in blue. We also held out a set of data, and plotted the error on this holdout set in red, calling it the “holdout set error”.

For the example above, since we used the holdout set to pick a hyperparameter, we’d call the holdout set a “validation set” or “development set”. These terms are exactly synonymous.

It would not be accurate to call this line the “test set error”, because we did not use this dataset as a test set. While it is true that your code never supplied X_holdout3 or Y_holdout3 to the fit function of the ridge regression models, once you decide to use the holdout set to select between different models, different hyperparameters, or different sets of features, you are no longer using that dataset as a “test set”.

That is, since we’ve used this holdout set for picking alpha, the resulting errors are no longer unbiased estimates of our performance on unseen data: the true error on an unseen dataset is likely to be somewhat higher than the error on the validation set. After all, we trained 40 models and picked the best one!

In many real world contexts, model builders will split their data into three sets: training, validation, and test sets, where the test set is only ever used once. That is, there are two holdout sets: one used as a development set (for model selection), and one used as a test set (for providing an unbiased estimate of error).
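A minimal sketch of such a three-way split, reusing train_test_split twice (the 80/10/10 proportions and the variable names here are illustrative assumptions, not part of this lab):

# Carve off 10% as a test set that is touched only once, at the very end,
# then split the remaining 90% into training and validation sets.
X_rest, X_test, Y_rest, Y_test = train_test_split(X, Y, test_size=0.10)
X_tr, X_val, Y_tr, Y_val = train_test_split(X_rest, Y_rest, test_size=1/9)  # roughly 10% of the original data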

An Alternate Strategy for Hyper Parameter Selection: K-Fold Cross Validation

Earlier we used the holdout method for model selection (the holdout method is also sometimes called “simple cross validation”). Another approach is K-fold cross validation. This allows us to use more data for training instead of having to set aside some specifically for hyperparameter selection. However, doing so requires more computational resources, as we’ll have to fit K models per hyperparameter choice.

In our course Data 100, there’s really no reason not to use cross validation. However, in environments where models are very expensive to train (e.g. deep learning), you’ll typically prefer using a holdout set (simple cross validation) rather than K-fold cross validation.

To emphasize what K-fold cross validation actually means, we’re going to manually carry out the procedure. Recall the approach looks something like the figure below for 4-fold cross validation:

When we use K-fold cross validation, rather than using a held out set for model selection, we instead use the training set for model selection. To select between various features, various models, or various hyperparameters, we split the training set further into multiple temporary train and validation sets (each split is called a “fold”, hence k-fold cross validation). We will use the average validation error across all k folds to make our optimal feature, model, and hyperparameter choices. In this example, we’ll only use this procedure for hyperparameter selection, specifically to choose the best alpha.
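To see concretely what KFold.split returns, here is a small sketch on a toy array (not part of the lab solution):

from sklearn.model_selection import KFold

toy = np.arange(8)
for train_idx, valid_idx in KFold(n_splits=4).split(toy):
    # split yields positional indices: each fold holds out 1/4 of the rows for
    # validation and keeps the remaining 3/4 for training.
    print("train:", train_idx, "validation:", valid_idx)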

Question 4 (the key point here is how the dataset gets split!)

Scikit-learn has built-in support for cross validation. However, to better understand how cross validation works, complete the following function, which cross-validates a given model.

  1. Use the KFold.split function to get 4 splits on the training data. Note that split returns the indices of the data for that split.
  2. For each split:
    1. Select out the training and validation rows and columns based on the split indices and features.
    2. Compute the RMSE on the validation split.
  3. Return the average error across all cross validation splits.
from sklearn.model_selection import KFold

def compute_CV_error(model, X_train, Y_train):
    '''
    Split the training data into 4 subsets.
    For each subset, 
        fit a model holding out that subset
        compute the RMSE on that subset (the validation set)
    You should be fitting 4 models total.
    Return the average RMSE of these 4 folds.

    Args:
        model: an sklearn model with fit and predict functions 
        X_train (data_frame): Training data
        Y_train (data_frame): Label 

    Return:
        the average validation RMSE for the 4 splits.
    '''
    kf = KFold(n_splits=4)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
        split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]

        # Fit the model on the training split
        model.fit(split_X_train, split_Y_train)
        
        # Compute the RMSE on the validation split
        error = rmse(model.predict(split_X_valid), split_Y_valid)


        validation_errors.append(error)
        
    return np.mean(validation_errors)
grader.check("q4")

Question 5

Use compute_CV_error to add a new column to error_vs_alpha which gives the 4-fold cross validation error for the given choice of alpha.

cv_errors = []
range_of_alphas = 10**np.linspace(-5, 4, 40)

for alpha in range_of_alphas:
    cv_error = compute_CV_error(Ridge(alpha=alpha), X_train3, Y_train3)
    cv_errors.append(cv_error)

error_vs_alpha["CV Error"] = cv_errors
error_vs_alpha

alpha Training Error Holdout Error CV Error
0 0.000010 3.344803 5.389722 10.763338
1 0.000017 3.344885 5.362696 10.578003
2 0.000029 3.345093 5.318839 10.254709
3 0.000049 3.345588 5.249551 9.756308
4 0.000084 3.346672 5.144906 9.054988
5 0.000143 3.348827 4.997596 8.147759
6 0.000242 3.352670 4.810448 7.069916
7 0.000412 3.358709 4.603154 5.905299
8 0.000702 3.366898 4.408047 4.810950
9 0.001194 3.376490 4.252523 4.104387
10 0.002031 3.386611 4.144918 4.080071
11 0.003455 3.396946 4.077740 4.240810
12 0.005878 3.407582 4.038919 4.224883
13 0.010000 3.418347 4.018141 4.086858
14 0.017013 3.428713 4.007542 3.956585
15 0.028943 3.438401 4.001021 3.889772
16 0.049239 3.447793 3.994133 3.867618
17 0.083768 3.457708 3.984607 3.858856
18 0.142510 3.468839 3.972858 3.850327
19 0.242446 3.481455 3.962098 3.842001
20 0.412463 3.495804 3.958457 3.837080
21 0.701704 3.512882 3.971376 3.838459
22 1.193777 3.534575 4.011992 3.848340
23 2.030918 3.562638 4.086328 3.867120
24 3.455107 3.597518 4.187414 3.893089
25 5.878016 3.638674 4.296469 3.924624
26 10.000000 3.686303 4.392487 3.962520
27 17.012543 3.742258 4.458995 4.009721
28 28.942661 3.809021 4.486227 4.070020
29 49.238826 3.889335 4.478730 4.149246
30 83.767764 3.989339 4.470314 4.257353
31 142.510267 4.121409 4.524381 4.406670
32 242.446202 4.300992 4.693465 4.607861
33 412.462638 4.541284 4.968124 4.870040
34 701.703829 4.854189 5.289802 5.203950
35 1193.776642 5.251478 5.615333 5.617020
36 2030.917621 5.733147 5.946439 6.099994
37 3455.107295 6.275742 6.304280 6.625052
38 5878.016072 6.841884 6.698886 7.158474
39 10000.000000 7.394722 7.119279 7.665518

The code below shows the holdout error that we computed in the previous problem as well as the 4-fold cross validation error. Note that the cross validation error shows a similar dependency on alpha relative to the holdout error. This is because they are both doing the same thing, namely trying to estimate the expected error on unseen data drawn from the distribution from which the training set was drawn.

In other words, this figure compares the holdout method with 4-fold cross validation.

Note: I don’t know why the CV error is so much higher for very small alphas (i.e. very complex models). Let me know if you figure out why. I suspect it’s just random noise.

px.line(error_vs_alpha, x = "alpha", y = ["Holdout Error", "CV Error"], log_x=True)

Extra: Using GridSearchCV to Automatically Find the Best Hyperparameters

Above, we manually performed a search of the space of possible hyperparameters. In this section we’ll discuss how to use sklearn to automatically perform such a search. The code below automatically tries out all alpha values in the given range.

from sklearn.model_selection import GridSearchCV
params = {'alpha': 10**np.linspace(-5, 4, 40)}

grid_search = GridSearchCV(Ridge(), params, cv = 4, scoring = "neg_root_mean_squared_error")
grid_search.fit(X_train3, Y_train3)

We can get the average RMSE for the four folds for each of the values of alpha with the code below. In other words, this array is the same as the one you computed earlier when you created the “CV Error” column.

grid_search.cv_results_['mean_test_score']
array([-10.7633381 , -10.57800314, -10.25470921,  -9.75630755,
        -9.05498816,  -8.14775947,  -7.06991566,  -5.90529929,
        -4.8109505 ,  -4.10438693,  -4.08007128,  -4.24080956,
        -4.22488284,  -4.08685828,  -3.95658497,  -3.88977241,
        -3.86761841,  -3.85885628,  -3.85032722,  -3.8420014 ,
        -3.83707965,  -3.83845914,  -3.84833967,  -3.86711956,
        -3.89308871,  -3.92462404,  -3.96251959,  -4.00972106,
        -4.07002011,  -4.14924607,  -4.25735297,  -4.4066697 ,
        -4.60786131,  -4.87004045,  -5.20394987,  -5.61702004,
        -6.09999442,  -6.62505185,  -7.15847442,  -7.66551837])

We can specifically see the lowest RMSE with best_score_:

grid_search.best_score_
np.float64(-3.8370796510062055)

And we can get the best model with best_estimator_, which you’ll note is a Ridge regression model with alpha = 0.412.

grid_search.best_estimator_

We can even add the errors from GridSearchCV to our error_vs_alpha DataFrame and compare the results of our manual 4-fold cross validation with sklearn’s implementation:

error_vs_alpha["sklearn CV Score"] = grid_search.cv_results_['mean_test_score']
px.line(error_vs_alpha, x = "alpha", y = ["CV Error", "sklearn CV Score"], log_x=True)

You’ll notice they are exactly the same except that the sklearn CV score is the negative of the error. This is because GridSearchCV is conceptualized as a “maximizer”, where the goal is to get the highest possible score, whereas our code was a “minimizer”, where the goal was to get the lowest possible error. In other words, the error is just the negative of the score. That’s why the two curves are mirror images of each other.
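We can confirm this numerically (a quick sketch):

# The sklearn scores should be, up to floating point, the negatives of the
# 4-fold cross-validation errors we computed by hand.
np.allclose(error_vs_alpha["CV Error"], -error_vs_alpha["sklearn CV Score"])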

Extra: Examining the Residuals of our Optimal Alpha Model

The code below plots the residuals of our best model (Ridge with alpha = 0.412) on the test set. Note that they now seem to be better distributed on either side of the line and are generally closer to the line, though with a few more extreme outliers.

plt.figure(figsize=(10, 6))
predicted_values_on_holdout3 = grid_search.best_estimator_.predict(X_holdout3)
plt.scatter(predicted_values_on_holdout3, Y_holdout3 - predicted_values_on_holdout3, alpha = 0.5)
plt.ylabel(r"Residual $(y - \hat{y})$")
plt.xlabel(r"Predicted Prices $(\hat{y})$")
plt.title("Residuals vs Predicted Prices")
plt.axhline(y = 0, color='r');

[Figure: residuals vs. predicted prices for the best Ridge model, with a horizontal reference line at zero]

Lastly we can compute the RMSE on the test set. This estimates the error we should expect on a new unseen data point that may come to us in the future from the same distribution as our training set.

test_rmse = rmse(grid_search.best_estimator_.predict(X_holdout3), Y_holdout3)
test_rmse
np.float64(3.9584573514348387)

Extra: LASSO Regression

The code below finds an optimal Lasso model. Note that Lasso regression generally behaves more poorly numerically, so you’ll probably get a bunch of warnings.

from sklearn.linear_model import Lasso
params = {'alpha': 10**np.linspace(-5, 4, 40)}

grid_search_lasso = GridSearchCV(Lasso(), params, cv = 4, scoring = "neg_root_mean_squared_error")
grid_search_lasso.fit(X_train3, Y_train3)

The best lasso model is below:

grid_search_lasso.best_estimator_

Its error on the same test set as our best Ridge model is shown below:

test_rmse_lasso = rmse(grid_search_lasso.best_estimator_.predict(X_holdout3), Y_holdout3)
test_rmse_lasso
np.float64(4.054830916690993)

Note that if we tried to use this test error to decide between Ridge and LASSO, then our holdout set is now being used as a validation set, not a test set!! In other words, you get to either use the holdout set to decide between models, or to provide an unbiased estimate of error, but not both!

If we look at the best estimator’s parameters, we’ll see that many of the coefficients are zero, due to the inherent feature-selecting nature of a LASSO model.

grid_search_lasso.best_estimator_.coef_
array([-0.00000000e+00, -6.85384379e-01,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00, -0.00000000e+00, -7.38599400e-02, -5.29374425e-02,
        5.54295757e-01, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        4.37063521e-01, -3.80592597e+00,  1.61080715e+00,  6.37366884e-01,
       -0.00000000e+00,  2.22834586e-01,  6.01812381e-03, -9.40700489e-02,
       -4.02630887e-01,  3.07990173e-01,  3.72360525e-14,  0.00000000e+00,
       -2.51811102e+00,  0.00000000e+00,  0.00000000e+00,  9.85248689e+00,
       -7.21033868e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
        1.36076846e-01, -0.00000000e+00, -2.00907909e+00, -1.68923341e+00,
        1.77833261e+00,  2.55936962e-01,  5.04076324e-01,  0.00000000e+00,
       -1.82804827e+00,  0.00000000e+00, -0.00000000e+00, -1.53173793e+00,
       -7.84893470e-03,  1.09000336e+00, -5.21734363e+00, -3.87962203e-01])
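A quick count of how many coefficients were zeroed out (a small sketch, not part of the lab):

# Number of the 48 features whose LASSO coefficient is exactly zero.
lasso_coefs = grid_search_lasso.best_estimator_.coef_
print(np.sum(lasso_coefs == 0), "of", len(lasso_coefs), "coefficients are zero")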

We can also stick these parameters in a Series showing us both the weights and the names, which makes them a little more interpretable:

lasso_weights = pd.Series(grid_search_lasso.best_estimator_.coef_, 
             index = boston_with_extra_features_scaled.columns)
lasso_weights
CRIM          -0.000000e+00
ZN            -6.853844e-01
INDUS          0.000000e+00
CHAS           0.000000e+00
NOX           -0.000000e+00
RM            -0.000000e+00
AGE           -7.385994e-02
DIS           -5.293744e-02
RAD            5.542958e-01
TAX           -0.000000e+00
PTRATIO       -0.000000e+00
LSTAT          0.000000e+00
CRIM^2         4.370635e-01
sqrtCRIM      -3.805926e+00
tanhCRIM       1.610807e+00
ZN^2           6.373669e-01
sqrtZN        -0.000000e+00
tanhZN         2.228346e-01
INDUS^2        6.018124e-03
sqrtINDUS     -9.407005e-02
tanhINDUS     -4.026309e-01
CHAS^2         3.079902e-01
sqrtCHAS       3.723605e-14
tanhCHAS       0.000000e+00
NOX^2         -2.518111e+00
sqrtNOX        0.000000e+00
tanhNOX        0.000000e+00
RM^2           9.852487e+00
sqrtRM        -7.210339e+00
tanhRM        -0.000000e+00
AGE^2         -0.000000e+00
sqrtAGE       -0.000000e+00
tanhAGE        1.360768e-01
DIS^2         -0.000000e+00
sqrtDIS       -2.009079e+00
tanhDIS       -1.689233e+00
RAD^2          1.778333e+00
sqrtRAD        2.559370e-01
tanhRAD        5.040763e-01
TAX^2          0.000000e+00
sqrtTAX       -1.828048e+00
tanhTAX        0.000000e+00
PTRATIO^2     -0.000000e+00
sqrtPTRATIO   -1.531738e+00
tanhPTRATIO   -7.848935e-03
LSTAT^2        1.090003e+00
sqrtLSTAT     -5.217344e+00
tanhLSTAT     -3.879622e-01
dtype: float64

Or, sorting by the relative importance of each feature, we see that about a third of the parameters didn’t end up getting used at all by the LASSO model.

lasso_weights.sort_values(key = abs, ascending = False)
RM^2           9.852487e+00
sqrtRM        -7.210339e+00
sqrtLSTAT     -5.217344e+00
sqrtCRIM      -3.805926e+00
NOX^2         -2.518111e+00
sqrtDIS       -2.009079e+00
sqrtTAX       -1.828048e+00
RAD^2          1.778333e+00
tanhDIS       -1.689233e+00
tanhCRIM       1.610807e+00
sqrtPTRATIO   -1.531738e+00
LSTAT^2        1.090003e+00
ZN            -6.853844e-01
ZN^2           6.373669e-01
RAD            5.542958e-01
tanhRAD        5.040763e-01
CRIM^2         4.370635e-01
tanhINDUS     -4.026309e-01
tanhLSTAT     -3.879622e-01
CHAS^2         3.079902e-01
sqrtRAD        2.559370e-01
tanhZN         2.228346e-01
tanhAGE        1.360768e-01
sqrtINDUS     -9.407005e-02
AGE           -7.385994e-02
DIS           -5.293744e-02
tanhPTRATIO   -7.848935e-03
INDUS^2        6.018124e-03
sqrtCHAS       3.723605e-14
INDUS          0.000000e+00
RM            -0.000000e+00
NOX           -0.000000e+00
PTRATIO       -0.000000e+00
TAX           -0.000000e+00
LSTAT          0.000000e+00
CRIM          -0.000000e+00
CHAS           0.000000e+00
sqrtNOX        0.000000e+00
sqrtZN        -0.000000e+00
tanhCHAS       0.000000e+00
sqrtAGE       -0.000000e+00
AGE^2         -0.000000e+00
tanhRM        -0.000000e+00
tanhNOX        0.000000e+00
TAX^2          0.000000e+00
DIS^2         -0.000000e+00
PTRATIO^2     -0.000000e+00
tanhTAX        0.000000e+00
dtype: float64

Submission

Congratulations! You are finished with this assignment.