DATA100-lab8: Model Selection, Regularization, and Cross-Validation
|  |  | 
Lab 8: Model Selection, Regularization, and Cross-Validation
In this lab, you will practice using scikit-learn to generate models of various complexity. You’ll then use the holdout method and K-fold cross-validation to select the models that generalize best.
|  |  | 
Introduction
For this lab, we will use a toy dataset to predict the house prices in Boston with data provided by the sklearn.datasets package. There are more interesting datasets in the package if you want to explore them during your free time!
Run the following cell to load the data. load_boston() will return a dictionary object which includes keys for:
- data: the covariates (X)
- target: the response vector (Y)
- feature_names: the column names
- DESCR: a full description of the data
- filename: name of the csv file
|  |  | 
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])
array([1.82844292e+03, 5.75000000e+03, 5.63521000e+03, 3.50000000e+01,
       2.80675700e+02, 3.18002500e+03, 3.46989000e+04, 1.92029160e+03,
       4.83200000e+03, 2.06568000e+05, 9.33850000e+03, 6.40245000e+03])
|  |  | 
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**  
    :Number of Instances: 506 
    :Number of Attributes: 12 numeric/categorical predictive. Median Value (attribute 13) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References
   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
A look at the DESCR attribute tells us the data contains these features:
1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 
             25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds 
             river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per 10,000 USD
11. PTRATIO  pupil-teacher ratio by town
12. LSTAT    % lower status of the population
Let’s now convert this data into a pandas DataFrame.
|  |  | 
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 4.98 | 
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 9.14 | 
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 4.03 | 
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 2.94 | 
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 5.33 | 
Question 1
Let’s model this housing price data! Before we can do this, however, we need to split the data into training and test sets. Remember that the response vector (housing prices) lives in the target attribute. A random seed is set here so that we can deterministically generate the same splitting in the future if we want to test our result again and find potential bugs.
Use the train_test_split function to split out 10% of the data for the test set. Call the resulting splits X_train, X_holdout, Y_train, Y_holdout. Here “holdout” refers to the fact that we’re going to hold this data our when training our model.
|  |  | 
|  |  | 
q1
 passed! 🚀
Question 2
As a warmup, fit a linear model to describe the relationship between the housing price and all available covariates. We’ve imported sklearn.linear_model as lm, so you can use that instead of typing out the whole module name. Fill in the cells below to fit a linear regression model to the covariates and create a scatter plot for our predictions vs. the true prices.
|  |  | 
<>:14: SyntaxWarning: invalid escape sequence '\h'
<>:14: SyntaxWarning: invalid escape sequence '\h'
C:\Users\86135\AppData\Local\Temp\ipykernel_6688\1494534656.py:14: SyntaxWarning: invalid escape sequence '\h'
  plt.ylabel("Predicted Prices $(\hat{y})$")

Briefly analyze the scatter plot above. Do you notice any outliers? Write your answer in the cell below.
理想情况应该是均匀?且较窄分布于y=x直线上
Alternately, we can plot the residuals vs. our model predictions. Ideally they’d all be zero. Given the inevitably of noise, we’d at least like them to be scatter randomly across the line where the residual is zero. By contrast, there appears to be a possible pattern, with our model consistently underestimating prices for both very low and very high values, and possibly consistently overestimating prices towards the middle range.
|  |  | 
<>:2: SyntaxWarning: invalid escape sequence '\h'
<>:3: SyntaxWarning: invalid escape sequence '\h'
<>:2: SyntaxWarning: invalid escape sequence '\h'
<>:3: SyntaxWarning: invalid escape sequence '\h'
C:\Users\86135\AppData\Local\Temp\ipykernel_6688\2491234216.py:2: SyntaxWarning: invalid escape sequence '\h'
  plt.ylabel("Residual $(y - \hat{y})$")
C:\Users\86135\AppData\Local\Temp\ipykernel_6688\2491234216.py:3: SyntaxWarning: invalid escape sequence '\h'
  plt.xlabel("Predicted Prices $(\hat{y})$")

Question 3
As we find from the scatter plot, our model is not perfect. If it were perfect, we would see the identity line (i.e. a line of slope 1). Compute the root mean squared error (RMSE) of the predicted responses:
$$ \textbf{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 } $$
Fill out the function below and compute the RMSE for our predictions on both the training data X_train and the held out set X_holdout.  Your implementation should not use for loops.
|  |  | 
Training RMSE: 4.633297105625516
Holdout RMSE: 5.685160866583937
|  |  | 
q3
 passed! 🙌
Is your training error lower than the error on the data the model never got to see? If so, why could this be happening? Answer in the cell below.
稍微过拟合?
Overfitting
Sometimes we can get even higher accuracy by adding more features. For example, the code below adds the square, square root, and hyperbolic tangent of every feature to the design matrix. We’ve chosen these bizarre features specifically to highlight overfitting.
|  |  | 
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | ... | tanhRAD | TAX^2 | sqrtTAX | tanhTAX | PTRATIO^2 | sqrtPTRATIO | tanhPTRATIO | LSTAT^2 | sqrtLSTAT | tanhLSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | ... | 0.761594 | 87616.0 | 17.204651 | 1.0 | 234.09 | 3.911521 | 1.0 | 24.8004 | 2.231591 | 0.999905 | 
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | ... | 0.964028 | 58564.0 | 15.556349 | 1.0 | 316.84 | 4.219005 | 1.0 | 83.5396 | 3.023243 | 1.000000 | 
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | ... | 0.964028 | 58564.0 | 15.556349 | 1.0 | 316.84 | 4.219005 | 1.0 | 16.2409 | 2.007486 | 0.999368 | 
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | ... | 0.995055 | 49284.0 | 14.899664 | 1.0 | 349.69 | 4.324350 | 1.0 | 8.6436 | 1.714643 | 0.994426 | 
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | ... | 0.995055 | 49284.0 | 14.899664 | 1.0 | 349.69 | 4.324350 | 1.0 | 28.4089 | 2.308679 | 0.999953 | 
5 rows × 48 columns
We split up our data again and refit the model. From this cell forward, we append 2 to the variable names X_train, X_holdout, Y_train, Y_holdout, train_error, holdout_error in order to maintain our original data. Make sure you use these variable names from this cell forward, at least until we get to the part where we create version 3 of each of these.
|  |  | 
Looking at our training and test RMSE, we see that they are lower than you computed earlier. This strange model is seemingly better, even though it includes seemingly useless features like the hyperbolic tangent of the average number of rooms per dwelling.
|  |  | 
Training RMSE: 3.3514483036916287
Holdout RMSE: 5.410120414381265
The code below generates the training and holdout RMSE for 49 different models stores the results in a DataFrame. The first model uses only the first feature “CRIM”. The second model uses the first two features “CRIM” and “ZN”, and so forth.
|  |  | 
| N | Training Error | Holdout Error | |
|---|---|---|---|
| 0 | 1.0 | 8.536340 | 7.825177 | 
| 1 | 2.0 | 8.085693 | 7.637465 | 
| 2 | 3.0 | 7.776942 | 7.213870 | 
| 3 | 4.0 | 7.643897 | 6.391482 | 
| 4 | 5.0 | 7.634894 | 6.372166 | 
| 5 | 6.0 | 5.698878 | 7.635694 | 
| 6 | 7.0 | 5.689554 | 7.585860 | 
| 7 | 8.0 | 5.399034 | 7.158563 | 
| 8 | 9.0 | 5.379679 | 7.281769 | 
| 9 | 10.0 | 5.318218 | 7.231629 | 
| 10 | 11.0 | 5.088829 | 6.922974 | 
| 11 | 12.0 | 4.680294 | 5.437528 | 
| 12 | 13.0 | 4.679671 | 5.443388 | 
| 13 | 14.0 | 4.664717 | 5.448438 | 
| 14 | 15.0 | 4.627661 | 5.479720 | 
| 15 | 16.0 | 4.613226 | 5.488425 | 
| 16 | 17.0 | 4.580971 | 5.389309 | 
| 17 | 18.0 | 4.580622 | 5.391183 | 
| 18 | 19.0 | 4.507301 | 5.185114 | 
| 19 | 20.0 | 4.482925 | 5.194924 | 
| 20 | 21.0 | 4.482412 | 5.188007 | 
| 21 | 22.0 | 4.482412 | 5.188007 | 
| 22 | 23.0 | 4.482412 | 5.188007 | 
| 23 | 24.0 | 4.482412 | 5.188007 | 
| 24 | 25.0 | 4.482224 | 5.191621 | 
| 25 | 26.0 | 4.471079 | 5.256722 | 
| 26 | 27.0 | 4.460457 | 5.308239 | 
| 27 | 28.0 | 3.909139 | 4.582940 | 
| 28 | 29.0 | 3.889483 | 4.951462 | 
| 29 | 30.0 | 3.728687 | 6.963946 | 
| 30 | 31.0 | 3.697400 | 6.986045 | 
| 31 | 32.0 | 3.672688 | 7.081944 | 
| 32 | 33.0 | 3.672674 | 7.082046 | 
| 33 | 34.0 | 3.638518 | 6.916599 | 
| 34 | 35.0 | 3.570632 | 6.535278 | 
| 35 | 36.0 | 3.515723 | 6.417958 | 
| 36 | 37.0 | 3.513539 | 6.420029 | 
| 37 | 38.0 | 3.502764 | 6.453024 | 
| 38 | 39.0 | 3.502689 | 6.452923 | 
| 39 | 40.0 | 3.498222 | 6.373783 | 
| 40 | 41.0 | 3.450329 | 6.493727 | 
| 41 | 42.0 | 3.450329 | 6.493727 | 
| 42 | 43.0 | 3.448549 | 6.486777 | 
| 43 | 44.0 | 3.448549 | 6.486974 | 
| 44 | 45.0 | 3.448549 | 6.486974 | 
| 45 | 46.0 | 3.420689 | 5.823307 | 
| 46 | 47.0 | 3.353809 | 5.398784 | 
| 47 | 48.0 | 3.351448 | 5.410120 | 
If we plot the training and holdout error as we add each additional feature, our training error gets lower and lower (since our model bias is increasing), and in fact it’s possible to prove with linear algebra that the training error will decrease monotonically.
By contrast, the error on unseen held out data is higher for the models with more parameters, since the lessons learned from these last 20+ features aren’t actually useful when applied to unseen data. That is, these models aren’t generalizable.
|  |  | 
Note that this diagram resembles are cartoon from Lecture 15.
This plot is a useful tool for model selection: the best model is the one the lowest error on the holdout set, i.e. the one that includes parameters 1 through 28.
Regularization
As an alternative and more realistic example, instead of using only the first N features, we can use various different regularization strengths. For example, for really low regularization strengths (e.g. $\alpha = 10^{-3}$), we get a model that is very identical to our linear regression model.
|  |  | 
d:\miniconda3\Lib\site-packages\sklearn\linear_model\_ridge.py:216: LinAlgWarning:
Ill-conditioned matrix (rcond=6.11696e-19): result may not be accurate.
array([ 4.44044277e-01, -3.00268517e-02,  2.03776925e+00,  3.54247206e-01,
       -1.19704083e+02,  1.63780073e+01, -3.10555372e-01, -1.31182539e+01,
        2.87010751e+00,  7.68411439e-01,  2.43201974e+01,  2.09160420e+00,
       -1.17012738e-03, -5.60565882e+00,  6.79680723e+00,  1.02949752e-03,
       -1.31223400e+00,  6.99621340e+00, -3.55165065e-02, -7.66339676e+00,
       -2.53950130e+00,  3.54247186e-01,  3.54247186e-01,  2.69792455e-01,
        1.91778126e+00,  3.11293526e+02, -1.53815298e+02,  8.03364965e-01,
       -1.17792246e+02,  3.25883430e+02,  1.08476149e-03,  2.42998443e+00,
        2.52462516e+02,  3.55080093e-01,  3.78504405e+01, -8.11283072e+01,
       -5.18073808e-02, -8.51699934e+00,  1.14213610e+01, -2.86248788e-04,
       -2.10606164e+01,  0.00000000e+00, -1.85988225e-01, -1.54605184e+02,
        5.73422430e-06, -1.79546600e-02, -1.53342390e+01, -4.25637232e+01])
|  |  | 
|  |  | 
array([ 3.65647144e-01,  7.96329260e-02,  1.50196461e+00,  3.72759210e-01,
       -1.82281287e+03,  6.19862020e+02, -2.86690023e-01, -1.29491141e+01,
        1.68693762e+00,  7.86841735e-01,  1.62893036e+01,  1.95113824e+00,
       -9.11835586e-04, -5.02513063e+00,  5.90016774e+00,  6.12889765e-04,
       -2.21247181e+00,  8.90275845e+00, -2.73913970e-02, -5.40098561e+00,
       -4.23462112e+00,  3.72978675e-01,  3.72978861e-01,  2.84060205e-01,
        5.41748851e+02,  4.88274463e+02,  1.16998609e+03, -1.36350124e+01,
       -2.23299632e+03,  5.18647024e+04,  1.04162650e-03,  2.14549424e+00,
        4.31003519e+02,  3.51263646e-01,  3.77337190e+01, -8.06896603e+01,
       -2.88295129e-02, -4.52779826e+00,  8.15771554e+00, -2.99443268e-04,
       -2.14061912e+01,  3.63797881e-12, -1.15683673e-01, -1.07968511e+02,
        1.52846060e-03, -2.03166630e-02, -1.38532349e+01, -4.22894414e+01])
However, if we pick a large regularization strength, e.g. $\alpha = 10^4$, we see that the resulting parameters are much smaller in magnitude.
|  |  | 
array([-2.64236947e-02, -9.32767913e-03, -2.42925745e-02,  5.47079848e-03,
       -2.54276859e-03,  1.92843599e-02, -5.85037883e-02, -2.06397155e-02,
        2.62611572e-02, -4.16712719e-02, -1.95840395e-03, -1.91841765e-01,
       -1.08846586e-03, -4.28805626e-03,  1.70791430e-03,  6.51767238e-04,
        1.71133790e-03,  1.07486010e-03, -1.19407955e-03, -7.15970642e-03,
       -7.29287455e-04,  5.47079848e-03,  5.47079848e-03,  4.16652815e-03,
       -3.60910235e-03, -1.50954020e-03, -1.59681172e-03,  3.35928833e-01,
        3.11186224e-03, -2.79750628e-06,  4.48782500e-04, -5.71759051e-03,
        2.22943575e-06, -6.59740404e-02, -7.01191670e-03, -1.58200606e-03,
        1.32454447e-03,  8.15878522e-03,  1.17645581e-03,  3.59660322e-05,
       -2.54207413e-03,  0.00000000e+00, -2.57499245e-02, -3.15683513e-04,
       -8.10128212e-15, -6.45893053e-03, -4.20286900e-02, -2.29035441e-04])
Standard Scaling
归一化
Recall from lecture that in order to properly regularize a model, the features should be at the same scale. Otherwise the model has to spend more of its parameter budget to use “small” features (e.g. lengths in inches) compared to “large” features (e.g. lengths in kilometers).
To do this we can use a Standard Scaler to create a new version of the DataFrame where every column has zero mean and a standard deviation of 1.
|  |  | 
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | ... | tanhRAD | TAX^2 | sqrtTAX | tanhTAX | PTRATIO^2 | sqrtPTRATIO | tanhPTRATIO | LSTAT^2 | sqrtLSTAT | tanhLSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.419782 | 0.284830 | -1.287909 | -0.272599 | -0.144217 | 0.413672 | -0.120013 | 0.140214 | -0.982843 | -0.666608 | ... | -4.863216 | -0.682024 | -0.644166 | 0.0 | -1.458429 | -1.453573 | 0.135095 | -0.789529 | -1.202689 | 0.103530 | 
| 1 | -0.417339 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 0.194274 | 0.367166 | 0.557160 | -0.867883 | -0.987329 | ... | -0.521299 | -0.866530 | -1.053383 | 0.0 | -0.373078 | -0.266921 | 0.179012 | -0.540454 | -0.399953 | 0.128396 | 
| 2 | -0.417342 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 1.282714 | -0.265812 | 0.557160 | -0.867883 | -0.987329 | ... | -0.521299 | -0.866530 | -1.053383 | 0.0 | -0.373078 | -0.266921 | 0.179012 | -0.825825 | -1.429933 | -0.037847 | 
| 3 | -0.416750 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.016303 | -0.809889 | 1.077737 | -0.752922 | -1.106115 | ... | 0.144191 | -0.925467 | -1.216415 | 0.0 | 0.057783 | 0.139631 | 0.179251 | -0.858040 | -1.726876 | -1.338649 | 
| 4 | -0.412482 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.228577 | -0.511180 | 1.077737 | -0.752922 | -1.106115 | ... | 0.144191 | -0.925467 | -1.216415 | 0.0 | 0.057783 | 0.139631 | 0.179251 | -0.774228 | -1.124522 | 0.116050 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 501 | -0.413229 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.439316 | 0.018673 | -0.625796 | -0.982843 | -0.803212 | ... | -4.863216 | -0.765138 | -0.813468 | 0.0 | 1.255407 | 1.136187 | 0.179299 | -0.498180 | -0.312324 | 0.128400 | 
| 502 | -0.415249 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | -0.234548 | 0.288933 | -0.716639 | -0.982843 | -0.803212 | ... | -4.863216 | -0.765138 | -0.813468 | 0.0 | 1.255407 | 1.136187 | 0.179299 | -0.545089 | -0.410031 | 0.128395 | 
| 503 | -0.413447 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.984960 | 0.797449 | -0.773684 | -0.982843 | -0.803212 | ... | -4.863216 | -0.765138 | -0.813468 | 0.0 | 1.255407 | 1.136187 | 0.179299 | -0.759808 | -1.057406 | 0.121757 | 
| 504 | -0.407764 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.725672 | 0.736996 | -0.668437 | -0.982843 | -0.803212 | ... | -4.863216 | -0.765138 | -0.813468 | 0.0 | 1.255407 | 1.136187 | 0.179299 | -0.716638 | -0.884300 | 0.127164 | 
| 505 | -0.415000 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | -0.362767 | 0.434732 | -0.613246 | -0.982843 | -0.803212 | ... | -4.863216 | -0.765138 | -0.813468 | 0.0 | 1.255407 | 1.136187 | 0.179299 | -0.631389 | -0.619088 | 0.128327 | 
506 rows × 48 columns
Let’s now regenerate the training and holdout sets using this new rescaled dataset.
|  |  | 
Fitting our regularized model with $\alpha = 10^4$ on this scaled data, we now see that our coefficients are of about the same magnitude. This is because all of our features are of around the same magnitude, whereas in the unscaled data, some of the features like TAX^2 were much larger than others.
|  |  | 
array([-0.61501301, -0.04142115, -0.13765546,  0.11847529, -0.48559141,
        1.08393358, -0.11193453, -0.6446524 ,  0.25956768, -0.41922265,
       -0.48366805, -1.23850023, -0.22227015, -0.51281683,  0.40952134,
        0.2537374 , -0.07390569,  0.06674777,  0.11386252, -0.32684806,
       -0.39658025,  0.11847529,  0.11847529,  0.11847529, -0.67728184,
       -0.385382  , -0.36114118,  1.652695  ,  0.78959095, -1.09450355,
       -0.02430294, -0.14153645,  0.11511136, -0.41673303, -0.72747143,
       -1.36478486,  0.21308676,  0.30241207,  0.45131889, -0.16799052,
       -0.59340155,  0.        , -0.43637213, -0.50878723, -0.16529828,
       -0.04194842, -1.94295189, -0.70807685])
Finding an Optimum Alpha
In the cell below, write code that generates a DataFrame with the training and holdout error for the range of alphas given. Make sure you’re using the 3rd training and holdout sets, which have been rescaled!
Note: You should use all 48 features for every single model that you fit, i.e. you’re not going to be keeping only the first N features.
|  |  | 
| alpha | Training Error | Holdout Error | |
|---|---|---|---|
| 0 | 0.000010 | 3.344803 | 5.389722 | 
| 1 | 0.000017 | 3.344885 | 5.362696 | 
| 2 | 0.000029 | 3.345093 | 5.318839 | 
| 3 | 0.000049 | 3.345588 | 5.249551 | 
| 4 | 0.000084 | 3.346672 | 5.144906 | 
| 5 | 0.000143 | 3.348827 | 4.997596 | 
| 6 | 0.000242 | 3.352670 | 4.810448 | 
| 7 | 0.000412 | 3.358709 | 4.603154 | 
| 8 | 0.000702 | 3.366898 | 4.408047 | 
| 9 | 0.001194 | 3.376490 | 4.252523 | 
| 10 | 0.002031 | 3.386611 | 4.144918 | 
| 11 | 0.003455 | 3.396946 | 4.077740 | 
| 12 | 0.005878 | 3.407582 | 4.038919 | 
| 13 | 0.010000 | 3.418347 | 4.018141 | 
| 14 | 0.017013 | 3.428713 | 4.007542 | 
| 15 | 0.028943 | 3.438401 | 4.001021 | 
| 16 | 0.049239 | 3.447793 | 3.994133 | 
| 17 | 0.083768 | 3.457708 | 3.984607 | 
| 18 | 0.142510 | 3.468839 | 3.972858 | 
| 19 | 0.242446 | 3.481455 | 3.962098 | 
| 20 | 0.412463 | 3.495804 | 3.958457 | 
| 21 | 0.701704 | 3.512882 | 3.971376 | 
| 22 | 1.193777 | 3.534575 | 4.011992 | 
| 23 | 2.030918 | 3.562638 | 4.086328 | 
| 24 | 3.455107 | 3.597518 | 4.187414 | 
| 25 | 5.878016 | 3.638674 | 4.296469 | 
| 26 | 10.000000 | 3.686303 | 4.392487 | 
| 27 | 17.012543 | 3.742258 | 4.458995 | 
| 28 | 28.942661 | 3.809021 | 4.486227 | 
| 29 | 49.238826 | 3.889335 | 4.478730 | 
| 30 | 83.767764 | 3.989339 | 4.470314 | 
| 31 | 142.510267 | 4.121409 | 4.524381 | 
| 32 | 242.446202 | 4.300992 | 4.693465 | 
| 33 | 412.462638 | 4.541284 | 4.968124 | 
| 34 | 701.703829 | 4.854189 | 5.289802 | 
| 35 | 1193.776642 | 5.251478 | 5.615333 | 
| 36 | 2030.917621 | 5.733147 | 5.946439 | 
| 37 | 3455.107295 | 6.275742 | 6.304280 | 
| 38 | 5878.016072 | 6.841884 | 6.698886 | 
| 39 | 10000.000000 | 7.394722 | 7.119279 | 
Below we plot your training and holdout set error for the range of alphas given. You should see a figure similar to this one from lecture, where training error goes down as model complexity increases, but the error on the held out set is large for extreme values of alpha, and minimized for some intermediate value.
Note that on your plot, the x-axis is in the inverse of complexity! In other words, small alpha models (on the left) are complex, because there is no regularization. That’s why the training error is lowest on the left side of the plot, as this is where overfitting occurs.
|  |  | 
From the plot above, what is the best alpha to use?
training error尽可能小,同时hold-out error尽可能小 ==> 0.01~1左右
REMINDER: Test Set vs. Validation Set (a.k.a. Development Set)
In the plots above, we trained our models on a training set, and plotted the resulting RMSE on the training set in blue. We also held out a set of data, and plotted the error on this holdout set in red, calling it the “holdout set error”.
For the example above, since we used the holdout set to pick a hyperparameter, we’d call the holdout set a “validation set” or “development set”. These terms are exactly synonomous.
It would not be accurate to call this line the “test set error”, because we did not use this dataset as a test set. While it is true that your code never supplied X_test3 or Y_test3 to the fit function of the ridge regression models, once you decide to use the holdout set to select between different models, different hyperparameters, or different sets of features, then we are not using that dataset as a “test set”.
That is, since we’ve used this holdout set for picking alpha, the resulting errors are no longer unbiased predictors of our performance on unseen models – the true error on an unseen dataset is likely to be somewhat higher than the validation set. After all, we trained 40 models and picked the best one!
In many real world contexts, model builders will split their data into three sets: training, validation, and test sets, where the test set is only ever used once. That is, there are two holdout sets: One used as a development set (for model selection), and one used a test set (for providing an unbiased estimate of error).
An Alternate Strategy for Hyper Parameter Selection: K-Fold Cross Validation
Earlier we used the holdout method for model selection (the holdout method is also sometimes called “simple cross validation”). Another approach is K-fold cross validation. This allows us to use more data for training instead of having to set aside some specifically for hyperparameter selection. However, doing so requires more computation resources as we’ll have to fit K models per hyperparameter choice.
In our course Data 100, there’s really no reason not to use cross validation. However, in environments where models are very expensive to train (e.g. deep learning), you’ll typically prefer using a holdout set (simple cross validation) rather than K-fold cross validation.
To emphasize what K-fold cross validation actually means, we’re going to manually carry out the procedure. Recall the approach looks something like the figure below for 4-fold cross validation:
 
When we use K-fold cross validation, rather than using a held out set for model selection, we instead use the training set for model selection. To select between various features, various models, or various hyperparameters, we split the training set further into multiple temporary train and validation sets (each split is called a “fold”, hence k-fold cross validation). We will use the average validation error across all k folds to make our optimal feature, model, and hyperparameter choices. In this example, we’ll only use this procedure for hyperparameter selection, specifically to choose the best alpha.
Question 4 重点在于怎么切分数据集!
Scikit-learn has built-in support for cross validation. However, to better understand how cross validation works complete the following function which cross validates a given model.
- Use the KFold.splitfunction to get 4 splits on the training data. Note thatsplitreturns the indices of the data for that split.
- For each split:
- Select out the training and validation rows and columns based on the split indices and features.
- Compute the RMSE on the validation split.
- Return the average error across all cross validation splits.
 
|  |  | 
|  |  | 
Question 5
Use compute_CV_error to add a new column to error_vs_alpha which gives the 4-fold cross validation error for the given choice of alpha.
|  |  | 
|  |  | 
| alpha | Training Error | Holdout Error | CV Error | |
|---|---|---|---|---|
| 0 | 0.000010 | 3.344803 | 5.389722 | 10.763338 | 
| 1 | 0.000017 | 3.344885 | 5.362696 | 10.578003 | 
| 2 | 0.000029 | 3.345093 | 5.318839 | 10.254709 | 
| 3 | 0.000049 | 3.345588 | 5.249551 | 9.756308 | 
| 4 | 0.000084 | 3.346672 | 5.144906 | 9.054988 | 
| 5 | 0.000143 | 3.348827 | 4.997596 | 8.147759 | 
| 6 | 0.000242 | 3.352670 | 4.810448 | 7.069916 | 
| 7 | 0.000412 | 3.358709 | 4.603154 | 5.905299 | 
| 8 | 0.000702 | 3.366898 | 4.408047 | 4.810950 | 
| 9 | 0.001194 | 3.376490 | 4.252523 | 4.104387 | 
| 10 | 0.002031 | 3.386611 | 4.144918 | 4.080071 | 
| 11 | 0.003455 | 3.396946 | 4.077740 | 4.240810 | 
| 12 | 0.005878 | 3.407582 | 4.038919 | 4.224883 | 
| 13 | 0.010000 | 3.418347 | 4.018141 | 4.086858 | 
| 14 | 0.017013 | 3.428713 | 4.007542 | 3.956585 | 
| 15 | 0.028943 | 3.438401 | 4.001021 | 3.889772 | 
| 16 | 0.049239 | 3.447793 | 3.994133 | 3.867618 | 
| 17 | 0.083768 | 3.457708 | 3.984607 | 3.858856 | 
| 18 | 0.142510 | 3.468839 | 3.972858 | 3.850327 | 
| 19 | 0.242446 | 3.481455 | 3.962098 | 3.842001 | 
| 20 | 0.412463 | 3.495804 | 3.958457 | 3.837080 | 
| 21 | 0.701704 | 3.512882 | 3.971376 | 3.838459 | 
| 22 | 1.193777 | 3.534575 | 4.011992 | 3.848340 | 
| 23 | 2.030918 | 3.562638 | 4.086328 | 3.867120 | 
| 24 | 3.455107 | 3.597518 | 4.187414 | 3.893089 | 
| 25 | 5.878016 | 3.638674 | 4.296469 | 3.924624 | 
| 26 | 10.000000 | 3.686303 | 4.392487 | 3.962520 | 
| 27 | 17.012543 | 3.742258 | 4.458995 | 4.009721 | 
| 28 | 28.942661 | 3.809021 | 4.486227 | 4.070020 | 
| 29 | 49.238826 | 3.889335 | 4.478730 | 4.149246 | 
| 30 | 83.767764 | 3.989339 | 4.470314 | 4.257353 | 
| 31 | 142.510267 | 4.121409 | 4.524381 | 4.406670 | 
| 32 | 242.446202 | 4.300992 | 4.693465 | 4.607861 | 
| 33 | 412.462638 | 4.541284 | 4.968124 | 4.870040 | 
| 34 | 701.703829 | 4.854189 | 5.289802 | 5.203950 | 
| 35 | 1193.776642 | 5.251478 | 5.615333 | 5.617020 | 
| 36 | 2030.917621 | 5.733147 | 5.946439 | 6.099994 | 
| 37 | 3455.107295 | 6.275742 | 6.304280 | 6.625052 | 
| 38 | 5878.016072 | 6.841884 | 6.698886 | 7.158474 | 
| 39 | 10000.000000 | 7.394722 | 7.119279 | 7.665518 | 
The code below shows the holdout error that we computed in the previous problem as well as the 4-fold cross validation error. Note that the cross validation error shows a similar dependency on alpha relative to the holdout error. This is because they are both doing the same thing, namely trying to estimate the expected error on unseen data drawn from distribution from which the training set was drawn.
In other words, this figure compares the holdout method with 4-fold cross validation.
Note: I don’t know why the CV error is so much higher for very small (i.e. very complex) models. Let me know if you figure out why. I suspec ti’ts just random noise.
|  |  | 
Extra: Using GridSearchCV 自主找到最佳超参数
Above, we manually performed a search of the space of possible hyperparameters. In this section we’ll discuss how to use sklearn to automatically perform such a search. The code below automatically tries out all alpha values in the given range.
|  |  | 
We can get the average RMSE for the four folds for each of the values of alpha with the code below. In other words, this array is the same as the one you computed earlier when you created the “CV Error” column.
|  |  | 
array([-10.7633381 , -10.57800314, -10.25470921,  -9.75630755,
        -9.05498816,  -8.14775947,  -7.06991566,  -5.90529929,
        -4.8109505 ,  -4.10438693,  -4.08007128,  -4.24080956,
        -4.22488284,  -4.08685828,  -3.95658497,  -3.88977241,
        -3.86761841,  -3.85885628,  -3.85032722,  -3.8420014 ,
        -3.83707965,  -3.83845914,  -3.84833967,  -3.86711956,
        -3.89308871,  -3.92462404,  -3.96251959,  -4.00972106,
        -4.07002011,  -4.14924607,  -4.25735297,  -4.4066697 ,
        -4.60786131,  -4.87004045,  -5.20394987,  -5.61702004,
        -6.09999442,  -6.62505185,  -7.15847442,  -7.66551837])
We can specifically see the lowest RMSE with best_score_:
|  |  | 
np.float64(-3.8370796510062055)
And we can get the best model with best_estimator_, which you’ll note is a Ridge regression model with alpha = 0.412.
|  |  | 
We can even add the errors from GridSearchCV to our error_vs_alpha DataFrame and compare the results of our manual 4-fold cross validation with sklearn’s implementation:
|  |  | 
|  |  | 
You’ll notice they are exactly the same except that the sklearn CV score is the negative of the error. This is because GridSearchCV is conceptualized as a “maximizer”, where the goal is to get the highest possible score, whereas our code was a “minimizer”, where the goal was to get the lowest possible error. In other words, the error is just the negative of the score. 镜像由来
Extra: Examining the Residuals of our Optimal Alpha Model
The code below plots the residuals of our best model (Ridge with alpha = 0.412) on the test set. Note that they now seem to be better distributed on either size of the line and are generally closer the line, though with a few more extreme outliers.
|  |  | 
<>:4: SyntaxWarning:
invalid escape sequence '\h'
<>:5: SyntaxWarning:
invalid escape sequence '\h'
<>:4: SyntaxWarning:
invalid escape sequence '\h'
<>:5: SyntaxWarning:
invalid escape sequence '\h'
C:\Users\86135\AppData\Local\Temp\ipykernel_6688\3088448444.py:4: SyntaxWarning:
invalid escape sequence '\h'
C:\Users\86135\AppData\Local\Temp\ipykernel_6688\3088448444.py:5: SyntaxWarning:
invalid escape sequence '\h'

Lastly we can compute the RMSE on the test set. This gives the expected squared error on a new unseen data point that may come to us in the future from the same distribution as our training set.
|  |  | 
np.float64(3.9584573514348387)
Extra: LASSO Regression
The code below finds an optimal Lasso model. Note that Lasso regression generalize behaves more poorly numerically, so you’ll probably get a bunch of warnings.
|  |  | 
The best lasso model is below:
|  |  | 
It’s error on the same test set as our best Ridge model is shown below:
|  |  | 
np.float64(4.054830916690993)
Note that if we tried to use this test error to decide between Ridge and LASSO, then our holdout set is now being used as a validation set, not a test set!! In other words, you get to either use the holdout set to decide between models, or to provide an unbiased estimate of error, but not both!
If we look at the best estimator’s parameters, we’ll see that many of the parameters are zero, due to the inherent feature selecting nature of a LASSO model.
|  |  | 
array([-0.00000000e+00, -6.85384379e-01,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00, -0.00000000e+00, -7.38599400e-02, -5.29374425e-02,
        5.54295757e-01, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        4.37063521e-01, -3.80592597e+00,  1.61080715e+00,  6.37366884e-01,
       -0.00000000e+00,  2.22834586e-01,  6.01812381e-03, -9.40700489e-02,
       -4.02630887e-01,  3.07990173e-01,  3.72360525e-14,  0.00000000e+00,
       -2.51811102e+00,  0.00000000e+00,  0.00000000e+00,  9.85248689e+00,
       -7.21033868e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
        1.36076846e-01, -0.00000000e+00, -2.00907909e+00, -1.68923341e+00,
        1.77833261e+00,  2.55936962e-01,  5.04076324e-01,  0.00000000e+00,
       -1.82804827e+00,  0.00000000e+00, -0.00000000e+00, -1.53173793e+00,
       -7.84893470e-03,  1.09000336e+00, -5.21734363e+00, -3.87962203e-01])
We can also stick these parameters in a Series showing us both the weights and the names: 可解释性强一点点
|  |  | 
CRIM          -0.000000e+00
ZN            -6.853844e-01
INDUS          0.000000e+00
CHAS           0.000000e+00
NOX           -0.000000e+00
RM            -0.000000e+00
AGE           -7.385994e-02
DIS           -5.293744e-02
RAD            5.542958e-01
TAX           -0.000000e+00
PTRATIO       -0.000000e+00
LSTAT          0.000000e+00
CRIM^2         4.370635e-01
sqrtCRIM      -3.805926e+00
tanhCRIM       1.610807e+00
ZN^2           6.373669e-01
sqrtZN        -0.000000e+00
tanhZN         2.228346e-01
INDUS^2        6.018124e-03
sqrtINDUS     -9.407005e-02
tanhINDUS     -4.026309e-01
CHAS^2         3.079902e-01
sqrtCHAS       3.723605e-14
tanhCHAS       0.000000e+00
NOX^2         -2.518111e+00
sqrtNOX        0.000000e+00
tanhNOX        0.000000e+00
RM^2           9.852487e+00
sqrtRM        -7.210339e+00
tanhRM        -0.000000e+00
AGE^2         -0.000000e+00
sqrtAGE       -0.000000e+00
tanhAGE        1.360768e-01
DIS^2         -0.000000e+00
sqrtDIS       -2.009079e+00
tanhDIS       -1.689233e+00
RAD^2          1.778333e+00
sqrtRAD        2.559370e-01
tanhRAD        5.040763e-01
TAX^2          0.000000e+00
sqrtTAX       -1.828048e+00
tanhTAX        0.000000e+00
PTRATIO^2     -0.000000e+00
sqrtPTRATIO   -1.531738e+00
tanhPTRATIO   -7.848935e-03
LSTAT^2        1.090003e+00
sqrtLSTAT     -5.217344e+00
tanhLSTAT     -3.879622e-01
dtype: float64
Or sorting by the relative importance of each feature, we see that about a third of the parmaeters didn’t end up getting used at all by the LASSO model.
|  |  | 
RM^2           9.852487e+00
sqrtRM        -7.210339e+00
sqrtLSTAT     -5.217344e+00
sqrtCRIM      -3.805926e+00
NOX^2         -2.518111e+00
sqrtDIS       -2.009079e+00
sqrtTAX       -1.828048e+00
RAD^2          1.778333e+00
tanhDIS       -1.689233e+00
tanhCRIM       1.610807e+00
sqrtPTRATIO   -1.531738e+00
LSTAT^2        1.090003e+00
ZN            -6.853844e-01
ZN^2           6.373669e-01
RAD            5.542958e-01
tanhRAD        5.040763e-01
CRIM^2         4.370635e-01
tanhINDUS     -4.026309e-01
tanhLSTAT     -3.879622e-01
CHAS^2         3.079902e-01
sqrtRAD        2.559370e-01
tanhZN         2.228346e-01
tanhAGE        1.360768e-01
sqrtINDUS     -9.407005e-02
AGE           -7.385994e-02
DIS           -5.293744e-02
tanhPTRATIO   -7.848935e-03
INDUS^2        6.018124e-03
sqrtCHAS       3.723605e-14
INDUS          0.000000e+00
RM            -0.000000e+00
NOX           -0.000000e+00
PTRATIO       -0.000000e+00
TAX           -0.000000e+00
LSTAT          0.000000e+00
CRIM          -0.000000e+00
CHAS           0.000000e+00
sqrtNOX        0.000000e+00
sqrtZN        -0.000000e+00
tanhCHAS       0.000000e+00
sqrtAGE       -0.000000e+00
AGE^2         -0.000000e+00
tanhRM        -0.000000e+00
tanhNOX        0.000000e+00
TAX^2          0.000000e+00
DIS^2         -0.000000e+00
PTRATIO^2     -0.000000e+00
tanhTAX        0.000000e+00
dtype: float64
Submission
Congratulations! You are finished with this assignment.