Lab 5: Modeling, Loss Functions, and Summary Statistics
Predicting Restaurant Tips
In this lab, you will try to predict restaurant tips from a set of data in several ways:
A. Without being given any additional information, use a constant model with L2 loss to predict the tip $\hat{y}$ as a summary statistic, $\theta$.
B. Given one piece of information, the total bill $x$, use a linear model with L2 loss to predict the tip $\hat{y}$ as a linear function of $x$.
C. See if a constant model with L1 loss changes our predictions.
First, let’s load in the data.
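A minimal loading sketch, assuming the data is the classic tips dataset bundled with seaborn (the DataFrame name tips_df is our choice):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the restaurant tips data (assumed source: seaborn's bundled dataset).
tips_df = sns.load_dataset("tips")
tips_df.head()
```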
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Quick EDA: Note that this dataset is likely from the United States. The plot below shows the distribution of tips in this dataset, both in absolute amounts ($) and as a fraction of the total bill (post-tax, but pre-tip).
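A rough sketch of such a plot, assuming the tips_df DataFrame from the loading sketch above:

```python
# Sketch: histograms of tips in dollars and as a fraction of the total bill.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(tips_df["tip"], ax=axes[0])
axes[0].set_xlabel("Tip ($)")
sns.histplot(tips_df["tip"] / tips_df["total_bill"], ax=axes[1])
axes[1].set_xlabel("Tip (fraction of total bill)")
plt.show()
```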
In this lab we’ll estimate the tip in absolute amounts ($). The above plot is just to confirm your expectations about the tips dataset.
Part A: Tips as a Summary Statistic
Let’s first predict any restaurant tip using one single number: in other words, let’s try to find the best statistic $\hat{\theta}$ to represent (i.e., summarize) the tips from our dataset.
Each actual tip in our dataset is $y$, which is what we call the observed value. We want to predict each observed value as $\hat{y}$. We’ll save the observed tip values in a NumPy array y_tips:
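A sketch of that cell, assuming tips_df from above:

```python
# Pull the observed tips into a NumPy array and check its shape.
y_tips = tips_df["tip"].to_numpy()
y_tips.shape
```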
(244,)
Recall the three-step process for modeling as covered in lecture:
1. Define a model.
2. Define a loss function and the associated risk on our training dataset (i.e., average loss).
3. Find the best value of $\theta$, known as $\hat{\theta}$, that minimizes risk.
We’ll go through each step of this process next.
A.1: Define the model
We will define our model as the constant model:
$$\Large \hat{y} = \theta $$
In other words, regardless of any other details (i.e., features) about their meal, we will always predict our tip $\hat{y}$ as one single value: $\theta$.
$\theta$ is what we call a parameter. Our modeling goal is to find the value of our parameter(s) that best fit our data. We have a choice over which $\theta$ we pick (using the data at hand), but ultimately we can only pick one to report, so we want to find the optimal parameter(s) $\hat{\theta}$.
We call the constant model a summary statistic, as we are determining one number that best “summarizes” a set of values.
No code to write here!
A.2: Define the loss function and risk
Next, in order to pick our $\theta$, we need to define a loss function, which is a measure of how well a model is able to predict the expected outcome. In other words, it measures the deviation of a predicted value $\hat{y}$ from the observed value $y$.
We will use squared loss (also known as the $L_2$ loss, pronounced “ell-two”). For an observed tip value $y$ (i.e., the real tip), our prediction of the tip $\hat{y}$ would give an $L_2$ loss of:
$$\Large L_2(y, \hat{y}) = (y - \hat{y})^2$$
Question 1
In our constant model $\hat{y} = \theta$, we always predict the tip as $\theta$. Therefore our $L_2$ loss for some actual, observed value $y$ can be rewritten as:
$$\Large L_2(y, \theta) = (y - \theta)^2$$
Use the function description below to implement the squared loss function for this single datapoint, assuming the constant model. Your answer should not use any loops.
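A sketch of one possible implementation (the argument names and their order are assumptions; follow the lab's own skeleton if it differs):

```python
def squared_loss(y_obs, theta):
    """L2 loss for a single observed tip y_obs under the constant model theta."""
    return (y_obs - theta) ** 2
```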
We just defined loss for a single datapoint. Let’s extend the above loss function to our entire dataset by taking the average loss across the dataset.
Let the dataset $\mathcal{D}$ be the set of observations: $\mathcal{D} = \{y_1, \ldots, y_n\}$, where $y_i$ is the $i^{th}$ tip (this is the y_tips array defined at the beginning of Part A).
We can define the average loss (aka risk) over the dataset as:
$$\Large R\left(\theta\right) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{y_i}) $$
If we use $L_2$ loss per datapoint ($L = L_2$), then the risk is also known as mean squared error (MSE). For the constant model $\hat{y}=\theta$:
$$\Large R\left(\theta\right) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta)^2 $$
Question 2
Define the mse_tips_constant function which computes $R(\theta)$ as the mean squared error on the tips data for a constant model with parameter $\theta$.
Notes/Hints:
- This function takes in one parameter, theta; data is defined for you as a NumPy array that contains the observed tip values in the data.
- Use the squared_loss function you defined in the previous question.
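A sketch of one possible implementation; here y_tips plays the role of the array the skeleton calls data, and the value shown below is this risk evaluated at one particular theta:

```python
def mse_tips_constant(theta):
    """Mean squared error of the constant model theta on the observed tips."""
    return np.mean(squared_loss(y_tips, theta))
```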
np.float64(7.204529508196728)
A.3: Find the $\theta$ that minimizes risk
Question 3
Now we can go about choosing our “best” value of $\theta$, which we call $\hat{\theta}$, that minimizes our defined risk (which we defined as mean squared error). There are several approaches to computing $\hat{\theta}$ that we’ll explore in this problem.
Question 3a: Visual Solution
In the cell below we plot the mean squared error for different thetas:
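A sketch of that plotting cell (the range of theta values is an assumption):

```python
# Sweep candidate thetas and plot the MSE of the constant model at each one.
thetas = np.linspace(0, 6, 200)
plt.plot(thetas, [mse_tips_constant(t) for t in thetas])
plt.xlabel("theta")
plt.ylabel("MSE")
plt.show()
```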
Find the value of theta that minimizes the mean squared error via observation of the plot above. Round your answer to the nearest integer.
3
Numerically computing $\hat{\theta}$
scipy.optimize.minimize is a powerful method that can determine the optimal value of a variety of different functions. In practice, it is used to minimize functions that have no (or difficult-to-obtain) analytical solutions (it is a numerical method).
It is overkill for our simple example, but nonetheless, we will show you how to use it, as it will become useful in the near future.
The cell below plots some arbitrary 4th degree polynomial function.
By looking at the plot, we see that the x that minimizes the function is slightly larger than -2. What if we want the exact value? We will demonstrate how to grab the minimum value and the optimal x in the following cell.
The function minimize from scipy.optimize will attempt to minimize any function you throw at it. Try running the cell below, and you will see that minimize seems to get the answer correct.
Note: For today, we’ll let minimize work as if by magic. We’ll discuss how minimize works later in the course.
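The lab's own polynomial isn't reproduced here; the sketch below uses an invented convex quartic as a stand-in, so the result printed after it (which comes from the lab's polynomial) won't match these coefficients exactly.

```python
from scipy.optimize import minimize

def f(x):
    # Stand-in 4th-degree polynomial: equals (x + 2)**4 + 9, so it is convex
    # with a single minimum near x = -2 (coefficients invented for illustration).
    return x**4 + 8 * x**3 + 24 * x**2 + 32 * x + 25

# Ask scipy to minimize f, starting the search at x = 1.1.
minimize(f, x0=1.1)
```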
message: Optimization terminated successfully.
success: True
status: 0
fun: 8.728505719866614
x: [-1.747e+00]
nit: 6
jac: [ 1.192e-07]
hess_inv: [[ 5.088e-01]]
nfev: 16
njev: 8
Notes:
[1] fun: the minimum value of the function.
[2] x: the x which minimizes the function. We can index into the object returned by minimize to get these values. We have to add the additional [0] at the end because the minimizing x is returned as an array, but this is not necessarily the case for other attributes (i.e., fun), as shown in the cell below.
Note [2] means that minimize can also minimize multivariable functions, which we’ll see in the second half of this lab.
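Continuing the stand-in sketch above, the indexing looks like this (again, the printed values below come from the lab's own polynomial):

```python
# Grab the minimum value and the minimizing x; x is returned as an array,
# hence the extra [0].
optimal = minimize(f, x0=1.1)
(optimal['fun'], optimal['x'][0])
```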
(8.728505719866614, np.float64(-1.746827786380178))
Initial guess: The parameter x0 that we passed to the minimize function is where the minimize function starts looking as it tries to find the minimum. For example, above, minimize started its search at x = 1.1 because that’s where we told it to start. For the function above, it doesn’t really matter what x we start at because the function is nice and has only a single local minimum. More technically, the function is nice because it is convex, a property of functions that we will discuss later in the course.
Local minima: minimize isn’t perfect. If we give it a function with many valleys (also known as local minima), it can get stuck. For example, consider the function below:
If we start the minimization at w = 6.5, we’ll get stuck in the local minimum at w = 7.03. Note that no matter what your actual variable is called in your function (w in this case), the minimize routine still expects a starting point parameter called x0.
message: Optimization terminated successfully.
success: True
status: 0
fun: 22.594302881719713
x: [ 7.038e+00]
nit: 4
jac: [-3.815e-06]
hess_inv: [[ 1.231e-01]]
nfev: 12
njev: 6
Question 3b: Numerical Solution
Using the minimize function, find the value of theta that minimizes the mean squared error for our tips dataset. In other words, you want to find the exact minimum of the plot that you saw in the previous part.
Notes:
- You should use the function you defined earlier: mse_tips_constant.
- For autograding purposes, assign min_scipy to the value of theta that minimizes the MSE according to the minimize function, called with initial x0=0.0.
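A sketch of one possible answer:

```python
# Minimize the constant-model MSE numerically, starting the search at 0.0,
# and keep the minimizing theta (returned as a one-element array).
min_scipy = minimize(mse_tips_constant, x0=0.0)['x'][0]
min_scipy
```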
np.float64(2.9982777037277204)
Question 3c: Analytical Solution
In lecture, we used calculus to show that the value of theta that minimizes the mean squared error for the constant model is the average (mean) of the data. Assign min_computed to the mean of the observed y_tips data, and compare this to the values you observed in questions 3a and 3b.
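A sketch of one possible answer:

```python
# The mean minimizes mean squared error for the constant model.
min_computed = np.mean(y_tips)
min_computed
```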
np.float64(2.99827868852459)
Reflecting on the lab so far, we used a 3-step approach to find the “best” summary statistic $\theta$:
1. Define the constant model $\hat{y}=\theta$.
2. Define “best”: Define loss per datapoint (L2 loss) and consequently define risk $R(\theta)$ over a given data array as the mean squared error, or MSE.
3. Find the $\theta = \hat{\theta}$ that minimizes the MSE $R(\theta)$ in several ways:
   - Visually: Create a plot of $R(\theta)$ vs. $\theta$ and eyeball the minimizing $\hat{\theta}$.
   - Numerically: Create a function that returns $R(\theta)$, the MSE for the given data array for a given $\theta$, and use the scipy minimize function to find the minimizing $\hat{\theta}$.
   - Analytically: Simply compute $\hat{\theta}$ as the mean of the given data array, since this minimizes the defined $R(\theta)$.
   - (A fourth, analytical option) Use calculus to find the $\hat{\theta}$ that minimizes the MSE $R(\theta)$.
At this point, you’ve hopefully convinced yourself that the mean of the data is the summary statistic that minimizes mean squared error.
Our prediction for every meal’s tip:
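A sketch of that cell, reusing min_computed from Question 3c:

```python
# Report the constant-model prediction, formatted to two decimal places.
print(f"No matter what meal you have, Part A's modeling process\n"
      f"predicts that you will pay a tip of ${min_computed:.2f}.")
```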
No matter what meal you have, Part A's modeling process
predicts that you will pay a tip of $3.00.
Part B: Tips as a Linear Function of Total Bill
In this section, you will follow the exact same modeling process but instead use total bill to predict tip.
We’ll save the observed total bill values (post-tax but pre-tip) and the observed tip values in two NumPy arrays, x_total_bills and y_tips:
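A sketch of that cell, assuming tips_df from Part A:

```python
x_total_bills = tips_df["total_bill"].to_numpy()
y_tips = tips_df["tip"].to_numpy()
print("total bills", x_total_bills.shape)
print("tips", y_tips.shape)
```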
total bills (244,)
tips (244,)
B.1 Define the model
We will define our model as the linear model that takes a single input feature, total_bill ($x$):
$$\Large \hat{y} = a + b x $$
Our “parameter” $\theta$ is actually two parameters: $a$ and $b$. You may see this written as $\theta = (a, b)$.
Our modeling task is then to pick the best values $a = \hat{a}$ and $b = \hat{b}$ from our data. Then, given the total bill $x$, we can predict the tip as $\hat{y} = \hat{a} + \hat{b} x$.
No code to write here!
B.2: Define the loss function and risk
Next, we’ll define our loss function $L(y, \hat{y})$ and consequently our risk function $R(\theta) = R(a, b)$.
Similar to our approach in Part A, we’ll use L2 loss and mean squared error. Let the dataset $\mathcal{D}$ be the set of observations: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $(x_i, y_i)$ are the $i^{th}$ total bill and tip, respectively, in our dataset.
Our L2 Loss and Mean Squared Error are therefore:
\begin{align} \large L_2(y, \hat{y}) = \large (y - \hat{y})^2 &= \large (y - (a + bx))^2 \\ \large R(a, b) = \large \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{y_i}) &= \large \frac{1}{n} \sum_{i = 1}^n (y_i - (a + b x_i))^2 \end{align}
Notice that because our model is now the linear model $\hat{y} = a + bx$, our final expressions for Loss and MSE are different from Part A.
Question 4
Define the mse_tips_linear function which computes $R(a, b)$ as the mean squared error on the tips data for a linear model with parameters $a$ and $b$.
Notes:
- This function takes in two parameters, a and b.
- You should use the NumPy arrays x_total_bills and y_tips defined at the beginning of Part B.
- We’ve included some skeleton code, but feel free to write your own as well.
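A sketch of one possible implementation (the skeleton code mentioned above isn't reproduced; the value shown below is this risk evaluated at one particular pair of a and b):

```python
def mse_tips_linear(a, b):
    """Mean squared error of the linear model tip = a + b * total_bill."""
    return np.mean((y_tips - (a + b * x_total_bills)) ** 2)
```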
np.float64(1.052336405737705)
B.3: Find the $\theta$ that minimizes risk
Similar to before, we’d like to try out different approaches to finding the optimal parameters $\hat{a}$ and $\hat{b}$ that minimize MSE.
Question 5: Analytical Solution
In lecture, we derived the following expression for the line of best fit:
$$\Large \hat{y_i} = \bar{y} + r \frac{SD(y)}{SD(x)} (x_i - \bar{x})$$
where $\bar{x}$, $\bar{y}$, $SD(x)$, $SD(y)$ correspond to the means and standard deviations of $x$ and $y$, respectively, and $r$ is the correlation coefficient.
Question 5a
Assign x_bar, y_bar, std_x, std_y, and r for our dataset. Note: Make sure to use np.std, and not <Series name>.std().
- Hint: Remember, in our case, y is y_tips, and x is x_total_bills.
- Hint: You may find np.corrcoef (documentation) handy in computing r. Note that the output of np.corrcoef is a matrix, not a number, so you’ll need to collect the correlation coefficient by indexing into the returned matrix.
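A sketch of one possible answer:

```python
x_bar = np.mean(x_total_bills)
y_bar = np.mean(y_tips)
std_x = np.std(x_total_bills)   # np.std, not the pandas Series .std()
std_y = np.std(y_tips)
# np.corrcoef returns a 2x2 correlation matrix; take an off-diagonal entry.
r = np.corrcoef(x_total_bills, y_tips)[0, 1]
```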
Question 5b
Now, set b_hat and a_hat correctly, in terms of the variables you defined above.
Hints:
- Try and match the slope and intercept in $\hat{y_i} = \hat{a} + \hat{b}x_i$ to the slope and intercept in $\hat{y_i} = \bar{y} + r \frac{SD(y)}{SD(x)} (x_i - \bar{x})$.
- You may want to define a_hat in terms of b_hat.
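A sketch of one possible answer, matching the slope and intercept of the two forms of the line:

```python
b_hat = r * std_y / std_x        # slope
a_hat = y_bar - b_hat * x_bar    # intercept, written in terms of b_hat
```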
Question 5c
Now, use a_hat and b_hat to implement the predict_tip_linear function, which predicts the tip for a total bill amount of bill.
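A sketch of one possible implementation, along with the kind of check that produces the message below:

```python
def predict_tip_linear(bill):
    """Predict the tip for a given total bill using the fitted line."""
    return a_hat + b_hat * bill

print(f"If you have a $20 bill, Part B's modeling process\n"
      f"predicts that you will pay a tip of ${predict_tip_linear(20):.2f}.")
```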
If you have a $20 bill, Part B's modeling process
predicts that you will pay a tip of $3.02.
Numerically computing $\hat{\theta}$
The minimize function we introduced earlier can also minimize functions of multiple variables (useful for numerically computing $\hat{a}$ and $\hat{b}$). There’s one quirk, however, which is that the function has to accept its parameters as a single list.
For example, consider the multivariate $f(u, v) = u^2 - 2 u v - 3 v + 2 v^2$. It turns out this function’s minimum is at $(1.5, 1.5)$. To minimize this function, we create f.
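A sketch of that cell (the starting guess x0 is our assumption):

```python
# f must take its two parameters bundled in a single list.
def f(params):
    u, v = params
    return u**2 - 2 * u * v - 3 * v + 2 * v**2

minimize(f, x0=[0.0, 0.0])
```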
message: Optimization terminated successfully.
success: True
status: 0
fun: -2.2499999999999982
x: [ 1.500e+00 1.500e+00]
nit: 3
jac: [-5.960e-08 0.000e+00]
hess_inv: [[ 1.000e+00 5.000e-01]
[ 5.000e-01 5.000e-01]]
nfev: 12
njev: 4
Question 6: Numerical Solution
Question 6a
Implement the mse_tips_linear_list function, which is exactly like mse_tips_linear defined in Question 4 except that it takes in a single list of 2 variables rather than two separate variables. For example, mse_tips_linear_list([2, 3]) should return the same value as mse_tips_linear(2, 3).
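A sketch of one possible answer:

```python
def mse_tips_linear_list(theta):
    """Same MSE as mse_tips_linear, but with the parameters as one list [a, b]."""
    a, b = theta
    return mse_tips_linear(a, b)
```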
Question 6b
Now, set min_scipy_linear to the result of calling minimize to optimize the risk function you just implemented.
Hint: Make sure to set x0, say, to [0.0, 0.0].
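A sketch of one possible answer:

```python
# Minimize the linear-model MSE over [a, b], starting the search from [0.0, 0.0].
min_scipy_linear = minimize(mse_tips_linear_list, x0=[0.0, 0.0])
min_scipy_linear
```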
message: Optimization terminated successfully.
success: True
status: 0
fun: 1.036019442011604
x: [ 9.203e-01 1.050e-01]
nit: 3
jac: [ 1.490e-08 0.000e+00]
hess_inv: [[ 2.980e+00 -1.253e-01]
[-1.253e-01 6.335e-03]]
nfev: 15
njev: 5
Based on the above output from your call to minimize, running the following cell will set and print the values of a_hat and b_hat.
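A sketch of that cell (the _scipy suffix matches the comparison printout further below):

```python
# Unpack the fitted intercept and slope from the scipy result.
a_hat_scipy, b_hat_scipy = min_scipy_linear['x']
(a_hat_scipy, b_hat_scipy)
```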
(np.float64(0.9202707061277714), np.float64(0.1050244640398299))
The following cell will print out the values of a_hat and b_hat computed from both methods (“manual” refers to the analytical solution in Question 5; “scipy” refers to the numerical solution in Question 6). If you’ve done everything correctly, these should be very close to one another.
a_hat_scipy: 0.9202707061277714
a_hat_manual: 0.9202696135546735
b_hat_scipy: 0.1050244640398299
b_hat_manual: 0.10502451738435334
Interactive visualization
Visual Solution (not graded): Feel free to interact with the below plot and verify that the $\hat{a}$ and $\hat{b}$ you computed using either method above minimize the MSE. In the cell below we plot the mean squared error for different parameter values. Note that now that we have two parameters, we have a 3D MSE surface plot.
Rotate the data around and zoom in and out using your trackpad or the controls at the top right of the figure. If you get an error that your browser does not support webgl, you may need to restart your kernel and/or browser.
Comparing Constant Model vs Linear Model
At this point, we can actually compare our two models! Both the linear model and constant model were optimized using the same L2 loss function, but they make different tip predictions.
Run the cell below:
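A sketch of such a comparison plot, reusing min_computed from Part A and a_hat, b_hat from Part B:

```python
plt.scatter(x_total_bills, y_tips, alpha=0.5, label="observed tips")
xs = np.linspace(x_total_bills.min(), x_total_bills.max(), 100)
plt.plot(xs, np.full_like(xs, min_computed), label="constant model")
plt.plot(xs, a_hat + b_hat * xs, label="linear model")
plt.xlabel("Total bill ($)")
plt.ylabel("Tip ($)")
plt.legend()
plt.show()
```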
Note that while we plot tip by total bill, the constant model doesn’t use the total bill in its prediction and therefore shows up as a horizontal line.
Thought question: For predicting tip on this data, would you rather use the constant model or the linear model, assuming an L2 loss function for both? This might be more fun with a partner. Note, your answer will not be graded, so don’t worry about writing a detailed answer. If you want to see our answer, see the very end of this lab notebook.
In the not-so-distant future of this class, you will learn more quantitative metrics to compare model performance. Stay tuned!
Part C: Using a Different Loss Function
In this last (short) section, we’ll consider how the optimal parameters for the constant model would change if we used a different loss function.
We will now use absolute loss (also known as the $L_1$ loss, pronounced “ell-one”). For an observed tip value $y$ (i.e., the real tip), our prediction of the tip $\hat{y}$ would give an $L_1$ loss of:
$$\Large L_1(y, \hat{y}) = |y - \hat{y}|$$
While we still define risk as average loss, since we now use $L_1$ loss per datapoint in our dataset $\mathcal{D} = \{y_1, \ldots, y_n\}$, our risk is now known as mean absolute error (MAE).
For the constant model $\hat{y} = \theta$ (i.e., we predict our tip as a summary statistic):
\begin{align} \Large R\left(\theta\right) &= \Large \frac{1}{n} \sum_{i=1}^n L_1(y_i, \hat{y_i}) \\ &= \Large \frac{1}{n} \sum_{i=1}^n |y_i - \theta| \end{align}
Note: the last line results from using the constant model for $\hat{y}$. If we decided to use the linear model, we would have a different expression.
Question 7
Question 7a
Define the mae_tips_constant function which computes $R(\theta)$ as the mean absolute error (MAE) on the tips data for a constant model with parameter $\theta$.
Hint: You may want to check out your solution from Question 2, which computed mean squared error (MSE).
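A sketch of one possible implementation (the value shown below is this risk evaluated at one particular theta):

```python
def mae_tips_constant(theta):
    """Mean absolute error of the constant model theta on the observed tips."""
    return np.mean(np.abs(y_tips - theta))
```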
np.float64(2.4527868852459016)
Question 7b
In lecture, we saw that the value of theta that minimizes mean absolute error for the constant model is the median of the data. Assign min_computed_mae to the median of the observed y_tips data.
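A sketch of one possible answer:

```python
# The median minimizes mean absolute error for the constant model.
min_computed_mae = np.median(y_tips)
min_computed_mae
```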
np.float64(2.9)
Comparing MAE to MSE
Now run the below cell to compare MAE with MSE on the constant model.
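A sketch of such a comparison, plotting both risk curves over the same range of theta:

```python
thetas = np.linspace(0, 6, 200)
fig, axes = plt.subplots(2, 1, figsize=(6, 7), sharex=True)
axes[0].plot(thetas, [mse_tips_constant(t) for t in thetas])
axes[0].set_ylabel("MSE")
axes[1].plot(thetas, [mae_tips_constant(t) for t in thetas])
axes[1].set_ylabel("MAE")
axes[1].set_xlabel("theta")
plt.show()
```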
Thought question: You should see that the MAE plot (below) looks somewhat similar to the MSE plot (above). Try to identify any key differences you observe and write them down below. This might be more fun with a partner. Note, your answer will not be graded, so don’t worry about writing a detailed answer. If you want to see our answer, see the very end of this lab notebook.
Write your answer here, replacing this text.
Congratulations! You finished the lab!
Extra Notes
Our Observations on Constant Model vs Linear Model
Earlier in this lab, we said we’d describe our observations about whether to use Constant Model or Linear Model (both trained with MSE). Here are some thoughts:
Recall that $r$ is the correlation coefficient, where values closer to -1 or 1 imply a very linear relationship:
np.float64(0.6757341092113641)
The relationship between $x$ and $y$ is only somewhat linear; you can see this in the scatter plot, where many points don’t fall close to the linear model’s line. With this in mind:
- The linear model seems to work well for most bills.
- However, as bills get bigger, some datapoints seem to suggest that the constant model works better.
- In the wild, a tip predictor may use a combination of both the constant and linear models we trained: an average prediction, or a random coin flip to pick the model, or some heuristic decision to choose one model if the total bill exceeds a certain threshold (an ensemble-style idea).
In the not-so-distant future of this class, you will learn more quantitative metrics to compare model performance. You will have an opportunity to explore your own models in a future assignment!
Our Observations on Differences Between MAE vs. MSE
Earlier in this lab, we said we’d describe our observations about the differences between the MAE and MSE.
There are three key differences that we identified between the plots of the MSE and MAE.
- The minimizing $\theta = \hat{\theta}$ is different.
- The plot for MAE increases linearly instead of quadratically as we move far away from the minimizing $\theta$.
- The plot for MAE is piecewise linear instead of smooth. Each change in slope happens at the same $\theta$ value as a data point in our dataset.