Lab 7: Gradient Descent and Feature Engineering
In this lab, we will work through the process of:
- Defining loss functions
- Feature engineering
- Minimizing loss functions using numeric methods and analytical methods
- Understanding what happens if we use the analytical solution for OLS on a matrix with redundant features
- Computing a gradient for a nonlinear model
- Using gradient descent to optimize the nonlinear model
This lab will continue using the toy tips calculation dataset used in Labs 5 and 6.
Loading the Tips Dataset
To begin, let’s load the tips dataset from the seaborn library. This dataset contains records of tips, total bill, and information about the person who paid the bill. As earlier, we’ll be trying to predict tips from the other data.
Number of Records: 244
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Intro to Feature Engineering
So far, we’ve only considered models of the form $\hat{y} = f_{\theta}(x) = \sum_{j=0}^d x_j\theta_j$, where $\hat{y}$ is quantitative continuous.
We call this a linear model because it is a linear combination of the features (the $x_j$). However, our features don’t need to be numbers: we could have categorical values such as names. Additionally, the true relationship doesn’t have to be linear, as we could have a relationship that is quadratic, such as the relationship between the height of a projectile and time.
In these cases, we often apply feature functions, functions that take in some value and output another value. This might look like converting a string into a number, combining multiple numeric values, or creating a boolean value from some filter.
Then, if we call $\phi$ (“phi”) our “phi”-ture function, our model takes the form $\hat{y} = f_{\theta}(x) = \sum_{j=0}^d \phi(x)_j\theta_j$.
Example feature functions (both are sketched in code after this list; aside: is encoding always an a priori feature-engineering choice, or can it be learned, as with autoencoders?)
- One-hot encoding
- converts a single categorical feature into many binary features, each of which represents one of the possible values in the original column
- each of the binary feature columns produced contains a 1 for rows that had that column’s label in the original column, and 0 elsewhere
- Polynomial features
- create polynomial combinations of features
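To make these concrete, here is a minimal sketch of both feature functions applied to a hypothetical toy DataFrame (the column names here are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: one categorical column, one numeric column
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat"], "total_bill": [10.0, 20.0, 30.0]})

# One-hot encoding: one binary column per distinct value of `day`
one_hot = pd.get_dummies(toy["day"])

# Polynomial features: bias, x, and x^2 columns built from the numeric column
poly = PolynomialFeatures(degree=2).fit_transform(toy[["total_bill"]])
```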
Question 1: Defining the Model and Feature Engineering
In Lab 6 we used the constant model. Now let’s make a more complicated model that utilizes other features in our dataset. You can imagine that we might want to use the features with an equation that looks as shown below:
$$ \text{Tip} = \theta_1 \cdot \text{total}_\text{bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size} $$
Unfortunately, that’s not possible because some of these features like “day” are not numbers, so it doesn’t make sense to multiply by a numerical parameter.
Let’s start by converting some of these non-numerical values into numerical values. Before we do this, let’s separate out the tips and the features into two separate variables.
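A minimal sketch of that separation, assuming the loaded DataFrame is named data (an assumed name):

```python
# `data` is the assumed name of the tips DataFrame loaded earlier
tips = data["tip"]              # the response we want to predict
X = data.drop(columns="tip")    # every remaining column is a feature
```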
Question 1a: Feature Engineering
First, let’s convert our features to numerical values. A straightforward approach is to map some of these non-numerical features into numerical ones.
For example, we can treat the day as a value from 1-7. However, one disadvantage of directly translating categories to numeric values is that we unintentionally assign certain values disproportionate weight. Consider assigning Sunday the numeric value 7 and Monday the numeric value 1. In our linear model, Sunday will have 7 times the influence of Monday, which can lower the accuracy of our model.
Instead, let’s use one-hot encoding to better represent these features!
As you will learn in lecture, one-hot encoding is a way that we can produce a binary vector to indicate non-numeric features.
In the tips dataset, for example, we encode Sunday as the vector [0 0 0 1] because our dataset only contains bills from Thursday through Sunday. This assigns a more even weight across each category in non-numeric features. Complete the code below to one-hot encode our dataset. This dataframe holds our “featurized” data, which is also often denoted by $\phi$.
Hint: You may find the pd.get_dummies function or the DictVectorizer class useful when doing your one-hot encoding.
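One possible completion is a minimal sketch built on pd.get_dummies (this assumes the features live in X from above; the exact column order and dtypes in your result may differ from the table shown):

```python
def one_hot_encode(data):
    """Return a copy of `data` where each categorical column is replaced
    by one binary column per category."""
    return pd.get_dummies(data)

one_hot_X = one_hot_encode(X)
one_hot_X.head()
```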
| | total_bill | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 10.34 | 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 21.01 | 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 23.68 | 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 24.59 | 4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
Question 1b: Defining the Model
Now that all of our data is numeric, we can begin to define our model function. Notice that after one-hot encoding our data, we now have 12 features instead of 6. Therefore, our linear model now looks like:
$$ \text{Tip} = \theta_1 \cdot \text{size} + \theta_2 \cdot \text{total}_\text{bill} + \theta_3 \cdot \text{day}_\text{Thur} + \theta_4 \cdot \text{day}_\text{Fri} + … + \theta_{11} \cdot \text{time}_\text{Lunch} + \theta_{12} \cdot \text{time}_\text{Dinner} $$
We can represent the linear combination above as a matrix-vector product. Implement the linear_model function to evaluate this product.

Below, we create a MyLinearModel class with two methods, predict and fit. As provided, fit does nothing useful: it simply sets all 12 of the model’s parameters to zero.
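The scaffold presumably resembles the sketch below; the class body is an illustration consistent with the description, with predict evaluating the matrix-vector product and fit as the useless placeholder:

```python
import numpy as np

def linear_model(thetas, X):
    """Evaluate the linear combination of features as a matrix-vector product."""
    return X @ thetas

class MyLinearModel:
    def fit(self, X, y):
        # Placeholder fit: sets every parameter to zero, which is not useful
        self._thetas = np.zeros((X.shape[1], 1))

    def predict(self, X):
        return linear_model(self._thetas, X)
```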
array([[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]])
Question 2: Fitting a Linear Model using scipy.optimize.minimize Methods
Recall that in Lab 5 and in Lecture 12 we defined multiple loss functions and found the optimal theta using the scipy.optimize.minimize function. Adapt the code below to implement the fit method of the linear model.

Note that we’ve added a loss_function parameter so that the model is fit using the desired loss function, i.e. not necessarily the L2 loss. Example loss functions are given as l1 and l2.
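Here is one way that adaptation might look; a sketch in which the l1 and l2 helpers are written from their descriptions and the fitted parameters are found numerically:

```python
from scipy.optimize import minimize

def l1(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def l2(y, y_hat):
    return np.mean((y - y_hat) ** 2)

class MyLinearModel:
    def fit(self, X, y, loss_function=l2):
        # Numerically search for the thetas minimizing the average loss
        average_loss = lambda thetas: loss_function(y, linear_model(thetas, X))
        self._thetas = minimize(average_loss, x0=np.zeros(X.shape[1])).x

    def predict(self, X):
        return linear_model(self._thetas, X)

model = MyLinearModel()
model.fit(one_hot_X, tips)
model._thetas
```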
array([ 0.09448702, 0.17599315, 0.31373886, 0.34618029, -0.22256393,
-0.13615575, 0.30569628, 0.46797868, 0.34653012, 0.44250887,
0.1939605 , 0.1258012 ])
q2 passed! 💯
The MSE for your model above should be just slightly larger than 1:
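A quick way to verify that claim, reusing the names assumed in the sketches above:

```python
# Mean squared error of the fitted model on the training data
np.mean((tips - model.predict(one_hot_X)) ** 2)
```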
np.float64(1.0103535612506567)
Question 3: Fitting the Model using Analytic Methods
Let’s also fit our model analytically for the L2 loss function. Recall from lecture that with a linear model, we are solving the following optimization problem for least squares:
$$\min_{\theta} ||\Bbb{X}\theta - \Bbb{y}||^2$$
We showed in Lecture 11 that when $X^TX$ is invertible, the optimal $\hat{\theta}$ is given by the equation: $\hat{\theta} = (X^TX)^{-1}X^Ty$
Question 3a: Analytic Solution Using Explicit Inverses
For this problem, implement the analytic solution above using np.linalg.inv to compute the inverse of $X^TX$.

Reminder: To compute the transpose of a matrix, you can use X.T or X.transpose()
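A direct translation of the formula into NumPy might look like this sketch (the function name is an assumption):

```python
def get_analytical_sol(X, y):
    """Return (X^T X)^{-1} X^T y, computed with an explicit inverse."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.inv(X.T @ X) @ X.T @ y
```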
Now, run the cell below to find the analytical solution for the tips dataset. Depending on the machine that you run your code on, you should either see a singular matrix error or end up with thetas that are nonsensical (magnitudes greater than $10^{15}$). This is not good!
array([ 9.66544413e+00, -1.89677732e+02, -8.30149679e+17, -8.30149679e+17,
8.30149679e+17, 8.30149679e+17, -2.56000000e+02, 0.00000000e+00,
-3.20000000e+01, 3.20000000e+01, -8.00000000e+00, 0.00000000e+00])
In the cell below, explain why we got the error (or the nonsensical parameters) above when trying to calculate the analytical solution for our one-hot encoded tips dataset.
Essentially, it is because the matrix is not invertible: one-hot encoding makes some columns linear combinations of others (each encoded group, e.g. sex_Male + sex_Female, sums to the all-ones vector), so it is easy to see that $X^TX$ is not full rank.
Question 3b: Fixing our One-Hot Encoding
Now, let’s fix our one-hot encoding approach from question 1 so we don’t get the error we saw in question 3a. Complete the code below to one-hot-encode our dataset such that one_hot_X_revised has no redundant features.
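One way to remove the redundancy is pandas’ drop_first flag, which drops the first category of each encoded feature; a sketch (your scaffold may expect the columns to be dropped differently):

```python
def one_hot_encode_revised(data):
    """One-hot encode, dropping one category per feature so that no
    column is a linear combination of the others."""
    return pd.get_dummies(data, drop_first=True)

one_hot_X_revised = one_hot_encode_revised(X)
```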
Our numerical model's loss is: 1.0255082437778105
Our analytical model's loss is: 1.0255082436053506
Question 3c: Analyzing our new One-Hot Encoding
Why did removing redundancies in our one-hot encoding fix the problem we had in 3a?
Because we no longer one-hot encode every category: dropping one column per feature avoids the linear dependence among columns, so $X^TX$ is full rank and invertible.
Note: An alternate approach is to use np.linalg.solve instead of np.linalg.inv. For the example above, even with the redundant features, np.linalg.solve will work well. Though in general, it’s best to drop redundant features anyway.

In case you want to learn more, here is a relevant Stack Overflow post: https://stackoverflow.com/questions/31256252/why-does-numpy-linalg-solve-offer-more-precise-matrix-inversions-than-numpy-li
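For reference, the np.linalg.solve formulation of the normal equations looks like the sketch below, which follows the note’s claim that it behaves well even on the redundant encoding:

```python
# Solve the normal equations (X^T X) theta = X^T y without forming an inverse
theta_via_solve = np.linalg.solve(one_hot_X.T @ one_hot_X, one_hot_X.T @ tips)
```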
Question 4: Gradient Descent
| | x | y |
|---|---|---|
| 0 | -5.000000 | -7.672309 |
| 1 | -4.966555 | -7.779735 |
| 2 | -4.933110 | -7.995938 |
| 3 | -4.899666 | -8.197059 |
| 4 | -4.866221 | -8.183883 |
If we plot this data, we see that there is a clear sinusoidal relationship between x and y.
In this exercise, we’ll show gradient descent is so powerful it can even optimize a nonlinear model. Specifically, we’re going to model the relationship of our data by:
$$\Large f_{\boldsymbol{\theta}}(x) = \theta_1 x + \sin(\theta_2 x) $$
Our model is parameterized by both $\theta_1$ and $\theta_2$, which we can represent in the vector, $\boldsymbol{\theta}$.
Note that a general sine function $a\sin(bx+c)$ has three parameters: amplitude scaling parameter $a$, frequency parameter $b$ and phase shifting parameter $c$.
Here, we’re assuming the amplitude $a$ is around 1, and the phase shifting parameter $c$ is around zero. We do not attempt to justify this assumption and you’re welcome to see what happens if you ignore this assumption at the end of this lab.
You might ask why we don’t just create a linear model like we did earlier with a sinusoidal feature. The issue is that the theta is INSIDE the sin function. In other words, linear models use their parameters to adjust the scale of each feature, but $\theta_2$ in this model adjusts the frequency of the feature. There are tricks we could play to use our linear model framework here, but we won’t attempt this in our lab.
We define the sin_model function below that predicts $\textbf{y}$ (the $y$-values) using $\textbf{x}$ (the $x$-values) based on our new equation.
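From the equation above, sin_model is presumably:

```python
def sin_model(x, theta):
    """Predict y = theta_1 * x + sin(theta_2 * x)."""
    theta_1, theta_2 = theta
    return theta_1 * x + np.sin(theta_2 * x)
```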
Question 4a: Computing the Gradient of the MSE With Respect to Theta on the Sin Model
Recall $\hat{\theta}$ is the value of $\theta$ that minimizes our loss function. One way of solving for $\hat{\theta}$ is by computing the gradient of our loss function with respect to $\theta$, like we did in lecture: https://docs.google.com/presentation/d/1j9ESgjn-aeZSOX5ON1wjkF5WBZHc4IN7XvTpYnX1pFs/edit#slide=id.gfc76b62ec3_0_27. Recall that the gradient is a column vector of two partial derivatives.
Write/derive the expressions for the following values and use them to fill in the functions below.
- $L(\textbf{x}, \textbf{y}, \theta_1, \theta_2)$: our loss function, the mean squared error
- $\frac{\partial L }{\partial \theta_1}$: the partial derivative of $L$ with respect to $\theta_1$
- $\frac{\partial L }{\partial \theta_2}$: the partial derivative of $L$ with respect to $\theta_2$
Recall that $L(\textbf{x}, \textbf{y}, \theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (\textbf{y}_i - \hat{\textbf{y}}_i)^2$
Specifically, the functions sin_MSE, sin_MSE_dt1 and sin_MSE_dt2 should compute $L$, $\frac{\partial L }{\partial \theta_1}$ and $\frac{\partial L }{\partial \theta_2}$ respectively. Use the expressions you wrote for $\frac{\partial L }{\partial \theta_1}$ and $\frac{\partial L }{\partial \theta_2}$ to implement these functions. In the functions below, the parameter theta is a vector that looks like $\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$. We have completed sin_MSE_gradient, which calls sin_MSE_dt1 and sin_MSE_dt2 and returns the gradient dt for you.
Notes:
- Keep in mind that we are still working with our original set of data, df.
- To keep your code a bit more concise, be aware that np.mean does the same thing as np.sum divided by the length of the numpy array. (Pay attention to the level at which the mean is taken.)
- Another way to keep your code more concise is to use the function sin_model we defined, which computes the output of the model.

One possible implementation is sketched after these notes.
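Writing $\hat{y}_i = \theta_1 x_i + \sin(\theta_2 x_i)$ and differentiating the loss gives $\frac{\partial L}{\partial \theta_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - \hat{y}_i)$ and $\frac{\partial L}{\partial \theta_2} = -\frac{2}{n}\sum_{i=1}^{n} x_i \cos(\theta_2 x_i) (y_i - \hat{y}_i)$. A sketch of these functions under those expressions (the argument order (theta, x, y) is an assumption):

```python
def sin_MSE(theta, x, y):
    """Mean squared error of sin_model with parameters theta."""
    return np.mean((y - sin_model(x, theta)) ** 2)

def sin_MSE_dt1(theta, x, y):
    """Partial derivative of the MSE with respect to theta_1."""
    return np.mean(-2 * x * (y - sin_model(x, theta)))

def sin_MSE_dt2(theta, x, y):
    """Partial derivative of the MSE with respect to theta_2."""
    return np.mean(-2 * x * np.cos(theta[1] * x) * (y - sin_model(x, theta)))

def sin_MSE_gradient(theta, x, y):
    """Gradient of the MSE: a vector of both partial derivatives."""
    return np.array([sin_MSE_dt1(theta, x, y), sin_MSE_dt2(theta, x, y)])
```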
Question 4b: Implementing Gradient Descent and Using It to Optimize the Sin Model
Let’s now implement gradient descent.
Note that the function you’re implementing here is somewhat different than the gradient descent function we created in lecture. The version in lecture was gradient_descent(df, initial_guess, alpha, n), where df was the gradient of the function we are minimizing and initial_guess are the starting parameters for that function. Here our signature is a bit different (described below) than the gradient_descent implementation from lecture.
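The scaffold isn’t reproduced here, so the sketch below assumes a signature consistent with the printed log: the loss, its gradient, an initial theta, and the data, plus a learning rate and iteration count whose values are guesses chosen to mirror the 20-line log that follows:

```python
def gradient_descent(loss_f, grad_loss_f, theta, x, y, alpha=0.05, num_iter=20):
    """Repeatedly step opposite the gradient, recording each theta visited."""
    theta_history = []
    for _ in range(num_iter):
        print(f"theta: {theta}, Loss: {loss_f(theta, x, y)}")
        theta_history.append(theta)
        theta = theta - alpha * grad_loss_f(theta, x, y)
    return np.array(theta_history)

theta_history = gradient_descent(sin_MSE, sin_MSE_gradient,
                                 np.array([0, 0]), df["x"], df["y"])
```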
theta: [0 0], Loss: 20.859191416422235
theta: [2.60105745 2.60105745], Loss: 9.285008173048666
theta: [0.90342728 2.59100602], Loss: 4.680169273815357
theta: [2.05633644 2.9631291 ], Loss: 2.6242517936325833
theta: [1.15892347 2.86687431], Loss: 1.4765157174727774
theta: [1.79388042 3.07275573], Loss: 0.9073271435862448
theta: [1.32157494 3.00146569], Loss: 0.541531643291128
theta: [1.64954491 3.02910866], Loss: 0.3775841142469479
theta: [1.42325294 2.98820303], Loss: 0.2969750688130759
theta: [1.58295041 3.01033846], Loss: 0.2590425421375732
theta: [1.47097255 2.98926519], Loss: 0.23973439443291833
theta: [1.55040965 3.0017442 ], Loss: 0.23034782416254634
theta: [1.49439132 2.99135194], Loss: 0.2255775832667724
theta: [1.5341564 2.99797824], Loss: 0.22321772191904068
theta: [1.50603995 2.99286671], Loss: 0.22202363967204045
theta: [1.52598919 2.99628665], Loss: 0.22142811500262397
theta: [1.51186655 2.99375531], Loss: 0.22112776381775168
theta: [1.52188208 2.99549617], Loss: 0.22097741373654575
theta: [1.51478773 2.99423497], Loss: 0.22090173185683032
theta: [1.51981739 2.99511516], Loss: 0.2208637810584589
If you pass the tests above, you’re done coding for this lab, though there are some cool visualizations below we’d like you to think about.
Let’s visually inspect our results of running gradient descent to optimize $\boldsymbol\theta$. The code below plots our $x$-values with our model’s predicted $\hat{y}$-values over the original scatter plot. You should notice that gradient descent successfully optimized $\boldsymbol\theta$.
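The plotting cell might look something like this sketch (assuming matplotlib and the theta_history computed above, whose last row holds the fitted parameters):

```python
import matplotlib.pyplot as plt

theta_hat = theta_history[-1]  # final parameters from gradient descent

plt.scatter(df["x"], df["y"], s=10, label="observations")
x_grid = np.linspace(df["x"].min(), df["x"].max(), 200)
plt.plot(x_grid, sin_model(x_grid, theta_hat), color="red", label="fitted model")
plt.legend()
plt.show()
```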
Plotting our model output over our observations shows that gradient descent did a great job finding both the overall increase (slope) of the data, as well as the oscillation frequency.
Visualizing Loss (Extra)
Let’s visualize our loss function and gain some insight into how gradient descent optimizes our model parameters.

In the previous plot we saw the loss decrease with each iteration. In this part, we’ll see the trajectory of the algorithm as it travels the loss surface. Run the following cells to see a visualization of this trajectory.
array([[0. , 0. ],
[2.60105745, 2.60105745],
[0.90342728, 2.59100602],
[2.05633644, 2.9631291 ],
[1.15892347, 2.86687431],
[1.79388042, 3.07275573],
[1.32157494, 3.00146569],
[1.64954491, 3.02910866],
[1.42325294, 2.98820303],
[1.58295041, 3.01033846],
[1.47097255, 2.98926519],
[1.55040965, 3.0017442 ],
[1.49439132, 2.99135194],
[1.5341564 , 2.99797824],
[1.50603995, 2.99286671],
[1.52598919, 2.99628665],
[1.51186655, 2.99375531],
[1.52188208, 2.99549617],
[1.51478773, 2.99423497],
[1.51981739, 2.99511516]])
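One way the loss-surface visualization could be produced; a sketch whose grid ranges are assumptions chosen to cover the trajectory above:

```python
import matplotlib.pyplot as plt

# Evaluate the MSE on a grid of (theta_1, theta_2) values
t1 = np.linspace(0, 3, 100)
t2 = np.linspace(0, 4, 100)
T1, T2 = np.meshgrid(t1, t2)
loss_grid = np.array([[sin_MSE([a, b], df["x"], df["y"]) for a in t1] for b in t2])

plt.contourf(T1, T2, loss_grid, levels=30)
plt.colorbar(label="MSE")
# Overlay the path gradient descent took across the surface
plt.plot(theta_history[:, 0], theta_history[:, 1], "w.-")
plt.xlabel("theta_1")
plt.ylabel("theta_2")
plt.show()
```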
As we can see, gradient descent is able to navigate even this fairly complex loss space and find a nice minimum.