DATA100-lab3: Data Cleaning and EDA
|
|
Lab 3: Data Cleaning and EDA
In this lab you will be working on visualizing a dataset from the City of Berkeley containing data on calls to the Berkeley Police Department. Information about the dataset can be found at this link.
Content Warning
This lab includes an analysis of crime in Berkeley. If you feel uncomfortable with this topic, please contact your GSI or the instructors.
Setup
In this lab, we’ll perform Exploratory Data Analysis and learn some preliminary tips for working with matplotlib (a Python plotting library). Note that we configure a custom default figure size. Virtually every default aspect of matplotlib can be customized.
Collaborators: list names here
|
|
|
|
<Figure size 1200x900 with 0 Axes>
Part 1: Acquire the Data
1. Obtain data
To retrieve the dataset, we will use the ds100_utils.fetch_and_cache
utility.
|
|
Using cached version that was downloaded (UTC): Sun Jul 28 01:04:24 2024
Located at data\lab03_data_sp22.zip
2. Unzip file
We will now directly unzip the ZIP archive and start working with the uncompressed files.
|
|
Note: There is no single right answer regarding whether to work with compressed files in their compressed state or to uncompress them on disk permanently. For example, if you need to work with multiple tools on the same files, or write many notebooks to analyze them—and they are not too large—it may be more convenient to uncompress them once. But you may also have situations where you find it preferable to work with the compressed data directly.
Python gives you tools for both approaches, and you should know how to perform both tasks in order to choose the one that best suits the problem at hand.
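For instance, the permanent-uncompression approach can be sketched with the standard-library zipfile module (the helper name here is illustrative, not part of the lab):

```python
import zipfile
from pathlib import Path

def unzip_to(zip_path, dest_dir):
    """Extract every member of a ZIP archive into dest_dir and
    return the sorted names of the extracted entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

For the compressed-in-place approach, `zipfile.ZipFile.open` lets you read a single member as a file object without extracting anything to disk.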
3. View files
Now, we’ll use the os
package to list all files in the data
directory. os.walk()
recursively traverses the directory, and os.path.join()
creates the full pathname of each file.
If you’re interested in learning more, check out the Python3 documentation pages for os.walk
(link) and os.path
(link).
We use Python 3 format strings to nicely format the printed variables dpath
and fpath
.
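A minimal sketch of this traversal pattern (the helper name is illustrative):

```python
import os

def list_files(root):
    """Recursively collect the full path of every file under root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # os.path.join builds the full pathname from the directory
            # currently being visited and the file name.
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```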
|
|
data\secret
data\ben_kurtovic.py
data\Berkeley_PD_-_Calls_for_Service.csv
data\dummy.txt
data\hello_world.py
data\lab03_data_sp22.zip
data\secret\do_not_readme.md
In this Lab, we’ll be working with the Berkeley_PD_-_Calls_for_Service.csv
file. Feel free to check out the other files, though.
Part 2: Clean and Explore the Data
Let’s now load the CSV file we have into a pandas.DataFrame
object and start exploring the data.
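The loading step follows the usual pandas pattern; here is a minimal, self-contained sketch using inline data rather than the lab's CSV file (`demo_calls` is a hypothetical name):

```python
import io
import pandas as pd

# A tiny CSV mimicking a few of the lab's columns.
csv_text = """CASENO,OFFENSE,EVENTTM
21014296,THEFT MISD. (UNDER $950),10:58
21090179,BURGLARY AUTO,6:20
"""
demo_calls = pd.read_csv(io.StringIO(csv_text))
print(demo_calls.shape)  # (2, 3)
```

In the lab itself you would pass the path to `Berkeley_PD_-_Calls_for_Service.csv` instead of a `StringIO` buffer.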
|
|
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 04/19/2021 12:00:00 AM | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 02/13/2021 12:00:00 AM | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 02/08/2021 12:00:00 AM | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
We see that the fields include a case number, the offense type, the date and time of the offense, the “CVLEGEND” which appears to be related to the offense type, a “CVDOW” which has no apparent meaning, a date added to the database, and the location spread across four fields. We can read more about each field from the City of Berkeley’s open dataset webpage.
Let’s also check some basic information about this DataFrame using the DataFrame.info
(documentation) and DataFrame.describe
methods (documentation).
|
|
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2632 entries, 0 to 2631
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CASENO 2632 non-null int64
1 OFFENSE 2632 non-null object
2 EVENTDT 2632 non-null object
3 EVENTTM 2632 non-null object
4 CVLEGEND 2632 non-null object
5 CVDOW 2632 non-null int64
6 InDbDate 2632 non-null object
7 Block_Location 2632 non-null object
8 BLKADDR 2612 non-null object
9 City 2632 non-null object
10 State 2632 non-null object
dtypes: int64(2), object(9)
memory usage: 226.3+ KB
Note that the BLKADDR
column only has 2612 non-null entries, while the other columns all have 2632. This is because BLKADDR
contains 20 missing (null) values, and the .info()
method reports only non-null counts.
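You can also count the missing entries directly with `isnull().sum()`; a small sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CASENO": [1, 2, 3],
    "BLKADDR": ["2100 BLOCK HASTE ST", np.nan, "2700 BLOCK GARBER ST"],
})
# isnull() marks missing cells True; summing counts them per column.
missing = df.isnull().sum()
print(missing["BLKADDR"])  # 1
```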
|
|
CASENO | CVDOW | |
---|---|---|
count | 2.632000e+03 | 2632.000000 |
mean | 2.095978e+07 | 3.071049 |
std | 2.452665e+05 | 1.984136 |
min | 2.005721e+07 | 0.000000 |
25% | 2.100568e+07 | 1.000000 |
50% | 2.101431e+07 | 3.000000 |
75% | 2.102256e+07 | 5.000000 |
max | 2.109066e+07 | 6.000000 |
Notice that the functions above reveal type information for the columns, as well as some basic statistics about the numerical columns found in the DataFrame. However, we still need more information about what each column represents. Let’s explore the data further in Question 1.
Before we go over the fields to see their meanings, the cell below will verify that all the events happened in Berkeley by grouping on the City
and State
columns. You should see that all of our data falls into one group.
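The verification pattern looks like this on toy data (the column names match the lab's; the demo frame is illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "City": ["Berkeley"] * 3,
    "State": ["CA"] * 3,
    "CASENO": [1, 2, 3],
})
# Grouping on City and State: one row in the result per distinct pair.
grouped = demo.groupby(["City", "State"]).count()
print(len(grouped))  # 1 -> every record shares the same City/State
```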
|
|
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | ||
---|---|---|---|---|---|---|---|---|---|---|
City | State | |||||||||
Berkeley | CA | 2632 | 2632 | 2632 | 2632 | 2632 | 2632 | 2632 | 2632 | 2612 |
When we called head()
on the DataFrame calls
, it seemed like OFFENSE
and CVLEGEND
both contained information about the type of event reported. What is the difference in meaning between the two columns? One way to probe this is to look at the value_counts
for each Series.
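On toy data, `value_counts` behaves like this:

```python
import pandas as pd

offense = pd.Series(["LARCENY", "ROBBERY", "LARCENY", "VANDALISM", "LARCENY"])
# value_counts tallies each distinct value, most frequent first.
counts = offense.value_counts()
print(counts.iloc[0])  # 3 -- the most common value sorts to the top
```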
|
|
OFFENSE
THEFT MISD. (UNDER $950) 559
VEHICLE STOLEN 277
BURGLARY AUTO 218
THEFT FELONY (OVER $950) 215
DISTURBANCE 204
BURGLARY RESIDENTIAL 178
VANDALISM 166
THEFT FROM AUTO 163
ASSAULT/BATTERY MISD. 116
ROBBERY 90
Name: count, dtype: int64
|
|
CVLEGEND
LARCENY 782
MOTOR VEHICLE THEFT 277
BURGLARY - VEHICLE 218
DISORDERLY CONDUCT 204
BURGLARY - RESIDENTIAL 178
VANDALISM 166
LARCENY - FROM VEHICLE 163
ASSAULT 150
FRAUD 93
ROBBERY 90
Name: count, dtype: int64
It seems like OFFENSE
is more specific than CVLEGEND
; e.g., the CVLEGEND category “LARCENY” covers offenses such as “THEFT FELONY (OVER $950)”. If you’re unfamiliar with the term, “larceny” is a legal term for theft of personal property.
To get a sense of how many subcategories there are for each OFFENSE
, we will set calls_by_cvlegend_and_offense
equal to a multi-indexed series where the data is first indexed on the CVLEGEND
and then on the OFFENSE
, and the data is equal to the number of offenses in the database that match the respective CVLEGEND
and OFFENSE
. As you can see, calls_by_cvlegend_and_offense["LARCENY", "THEFT FROM PERSON"]
returns 8 which means there are 8 instances of larceny with offense of type “THEFT FROM PERSON” in the database.
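The multi-indexed counts described above can be built with `groupby` followed by `size`; a sketch on toy data (the variable names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "CVLEGEND": ["LARCENY", "LARCENY", "LARCENY", "ROBBERY"],
    "OFFENSE": ["THEFT FROM PERSON", "THEFT FROM PERSON",
                "THEFT MISD. (UNDER $950)", "ROBBERY"],
})
# size() on a two-column groupby yields a Series indexed first by
# CVLEGEND, then by OFFENSE.
by_cvlegend_and_offense = demo.groupby(["CVLEGEND", "OFFENSE"]).size()
print(by_cvlegend_and_offense["LARCENY", "THEFT FROM PERSON"])  # 2
```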
|
|
np.int64(8)
Question 1
In the cell below, set answer1
equal to a list of strings corresponding to the possible (i.e., unique 🤔) values for OFFENSE
when CVLEGEND
is “LARCENY”. You can type the answer manually, or you can create an expression that automatically extracts the names.
|
|
['THEFT MISD. (UNDER $950)', 'THEFT FELONY (OVER $950)', 'THEFT FROM PERSON']
|
|
q1
passed! 🌟
Part 3: Visualize the Data
Matplotlib demo
You’ve seen some matplotlib
in this class already, but now we will explain how to work with the object-oriented plotting API described in this matplotlib.pyplot tutorial. In matplotlib, plotting occurs on a set of Axes, which are associated with a Figure. An analogy is that on a blank canvas (Figure), you choose a location to plot (Axes
) and then fill it in (plot).
There are two approaches to labeling and manipulating figure contents, which we’ll discuss below. Approach 1 is closest to the plotting paradigm of MATLAB, the namesake of matplotlib; Approach 2 is also common because many matplotlib-based packages (such as Seaborn) explicitly return the current set of axes after plotting data. Both are essentially equivalent, and at the end of this class you’ll be comfortable with both.
Approach 1: matplotlib (or Seaborn) will auto-plot onto the current set of Axes or (if none exists) create a new figure/set of default axes. You can plot data using methods from plt
, which is shorthand for the matplotlib.pyplot
package. Then subsequent plt
calls all edit the same set of default-created axes.
Approach 2:
After creating the initial plot, you can also use plt.gca()
to explicitly get the current set of axes, and then edit those specific axes using axes methods. Note the method naming is slightly different!
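A minimal sketch contrasting the two approaches (using the non-interactive Agg backend so it runs headless; the data is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window needed
import matplotlib.pyplot as plt

data = [0, 1, 4, 9]

# Approach 1: pyplot state-machine style -- every plt call edits the
# current (implicitly created) Axes.
plt.figure()
plt.plot(data)
plt.title("Approach 1")
plt.xlabel("x")

# Approach 2: grab the current Axes explicitly and call its methods.
ax = plt.gca()
ax.set_ylabel("y")       # note: set_ylabel, not ylabel
print(ax.get_title())    # Approach 1 -- both styles edited the same Axes
```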
As an example of the built-in plotting functionality of pandas, the following example uses the plot
method of the Series
class to generate a barh
(horizontal bar) plot that visually displays the value counts for CVLEGEND
.
There are also many other plots that we will explore throughout the lab.
Side note: Pandas also offers basic functionality for plotting. For example, the DataFrame
and Series
classes both have a plot
method, which uses matplotlib under the hood. For now we’ll focus on matplotlib itself so you get used to the syntax, but just know that convenient Pandas plotting methods exist for your own future data science exploration.
Below, we show both approaches by generating a horizontal bar plot to visually display the value counts for CVLEGEND
. See the barh
documentation for more details.
|
|
Demo 1: Using plt methods to update plot
An Additional Note on Plotting in Jupyter Notebooks
You may have noticed that many of our plotting code cells end with a semicolon ;
or plt.show()
. The former suppresses the printed representation of the cell’s last expression; the latter explicitly renders and displays the figure. Try adding one of these to your own code in the following questions!
Question 2
Now it is your turn to make a plot using matplotlib
. Let’s start by transforming the data so that it is easier to work with.
The CVDOW
field isn’t named helpfully and it is hard to see the meaning from the data alone. According to the website linked at the top of this notebook, CVDOW
is actually indicating the day that events happened. 0->Sunday, 1->Monday … 6->Saturday.
Question 2a
Add a new column Day
into the calls
dataframe that contains the weekday name as a string (e.g., ‘Sunday’) for the corresponding value in CVDOW. For example, if the first 3 values of CVDOW
are [3, 6, 0]
, then the first 3 values of the Day
column should be ["Wednesday", "Saturday", "Sunday"]
.
Hint: Try using the Series.map function on calls["CVDOW"]
. Can you assign this to the new column calls["Day"]
?
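As a sketch of the hinted `Series.map` pattern on the example values from the prompt (building the mapping with `dict(enumerate(...))` is one of several reasonable choices):

```python
import pandas as pd

day_names = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]
codes = pd.Series([3, 6, 0])
# dict(enumerate(day_names)) maps 0 -> "Sunday", ..., 6 -> "Saturday".
days = codes.map(dict(enumerate(day_names)))
print(days.tolist())  # ['Wednesday', 'Saturday', 'Sunday']
```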
|
|
|
|
q2a
passed! 🌟
|
|
Challenge (OPTIONAL): You could also accomplish this part as a table left join with pd.merge
(documentation), instead of using Series.map
. You would need to merge calls
with a new dataframe that just contains the days of the week. If you have time, try it out in the below cell!
The merge operation is covered in detail in an upcoming lab! 😉
|
|
Ellipsis
Question 2b
Now let’s look at the EVENTTM
column which indicates the time for events. Since it contains hour and minute information, let’s extract the hour info and create a new column named Hour
in the calls
dataframe. You should save the hour as an int
.
Hint: Your code should only require one line.
Hint 2: The vectorized Series.str[ind]
performs integer indexing on an array entry.
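One way to sketch the extraction on toy times (using `str.split` rather than fixed-position indexing, since the hour may be one or two digits):

```python
import pandas as pd

times = pd.Series(["10:58", "6:20", "17:00"])
# Split on ":", take the piece before the colon, convert to int.
hours = times.str.split(":").str[0].astype(int)
print(hours.tolist())  # [10, 6, 17]
```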
|
|
0 10
1 10
2 12
3 17
4 6
..
2627 12
2628 15
2629 0
2630 18
2631 2
Name: Hour, Length: 2632, dtype: int64
|
|
Question 2c
Using matplotlib
, construct a line plot of the number of calls (entries in the table) for each hour of the day, ordered by time (e.g., 12:00 AM
, 1:00 AM
, …). Please use the provided variable hours
in your answer. Be sure that your axes are labeled and that your plot is titled.
Hint: Check out the plt.plot
method in the matplotlib tutorial, as well as our demo above.
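The counting step can be sketched as follows (the plotting call is left as a comment, since the labels and title are for you to choose):

```python
import pandas as pd

hour_col = pd.Series([10, 10, 12, 17, 6, 10])  # toy stand-in for calls["Hour"]
# value_counts tallies calls per hour; sort_index orders by hour of day.
counts_by_hour = hour_col.value_counts().sort_index()
print(counts_by_hour.loc[10])  # 3
# A line plot would then be, e.g.:
#   plt.plot(counts_by_hour.index, counts_by_hour.values)
```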
|
|
|
|
q2c
passed! 🌟
To better understand the time of day a report occurs, we could stratify the analysis by day of the week. To do this, we will use violin plots (a variation of a box plot), which you will learn about in more detail next week.
For now, just know that a violin plot shows an estimated distribution of quantitative data (e.g., distribution of calls by hour) over a categorical variable (day of the week). More calls occur in hours corresponding to the fatter part of each violin; the median hour of all calls in a particular day is marked by the white dot in the corresponding violin.
|
|
Question 2d
Based on your line plot and our violin plot above, what observations can you make about the patterns of calls? Here are some dimensions to consider:
- Are there more calls in the day or at night?
- What are the most and least popular times?
- Do call patterns vary by day of the week?
Question 3
In this last part of the lab, let’s extract the GPS coordinates (latitude, longitude) from the Block_Location
of each record.
|
|
'2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066, -122.253407)'
Question 3a: Regular Expressions
Use regular expressions to create a dataframe calls_lat_lon
that has two columns titled Lat
and Lon
, containing the respective latitude and longitude of each record in calls
. You should use the Block_Location
column to extract the latitude and longitude coordinates.
Hint: Check out the Series.str.extract
documentation.
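A sketch of the hinted extraction on a single example string (the regular expression shown is one reasonable choice, not the only one):

```python
import pandas as pd

block_location = pd.Series([
    "2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066, -122.253407)",
])
# Named groups become column names in the extracted DataFrame.
lat_lon = block_location.str.extract(r"\((?P<Lat>[-\d\.]+), (?P<Lon>[-\d\.]+)\)")
print(lat_lon.loc[0, "Lat"])  # 37.86066 (as a string; use .astype(float) if needed)
```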
|
|
2632
|
|
q3a
passed! 🚀
Question 3b: Join Tables
Let’s include the GPS data into our calls
data. In the below cell, use calls_lat_lon
to add two new columns called Lat
and Lon
to the calls
dataframe.
Hint: pd.merge
(documentation) could be useful here. Note that the order of records in calls
and calls_lat_lon
are the same.
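Since the two frames share the same row order, an index-aligned merge works; a toy sketch:

```python
import pandas as pd

left = pd.DataFrame({"CASENO": [1, 2]})
right = pd.DataFrame({"Lat": [37.86, 37.87], "Lon": [-122.25, -122.27]})
# Joining on the row index pairs up rows in their original order.
merged = left.merge(right, left_index=True, right_index=True)
print(merged.shape)  # (2, 3)
```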
|
|
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
|
|
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | Day | Hour | Lat | Lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1173 | 21024375 | DISTURBANCE | 06/02/2021 12:00:00 AM | 9:00 | DISORDERLY CONDUCT | 3 | 06/15/2021 12:00:00 AM | 2020 KITTREDGE ST\nBerkeley, CA\n(37.868356, -... | 2020 KITTREDGE ST | Berkeley | CA | Wednesday | 9 | 37.868356 | -122.268904 |
2573 | 21005894 | DISTURBANCE | 02/11/2021 12:00:00 AM | 14:12 | DISORDERLY CONDUCT | 4 | 06/15/2021 12:00:00 AM | 2400 BLOCK DWIGHT WAY\nBerkeley, CA\n(37.86482... | 2400 BLOCK DWIGHT WAY | Berkeley | CA | Thursday | 14 | 37.864826 | -122.260719 |
990 | 21022043 | ROBBERY | 05/18/2021 12:00:00 AM | 20:15 | ROBBERY | 2 | 06/15/2021 12:00:00 AM | 2521 TELEGRAPH AVE\nBerkeley, CA\n(37.864705, ... | 2521 TELEGRAPH AVE | Berkeley | CA | Tuesday | 20 | 37.864705 | -122.258463 |
908 | 21017272 | THEFT MISD. (UNDER $950) | 04/19/2021 12:00:00 AM | 20:20 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2800 BLOCK ADELINE ST\nBerkeley, CA\n(37.85811... | 2800 BLOCK ADELINE ST | Berkeley | CA | Monday | 20 | 37.858116 | -122.268002 |
597 | 21090359 | BURGLARY AUTO | 03/26/2021 12:00:00 AM | 0:00 | BURGLARY - VEHICLE | 5 | 06/15/2021 12:00:00 AM | 2100 BLOCK 5TH ST\nBerkeley, CA\n(37.86626, -1... | 2100 BLOCK 5TH ST | Berkeley | CA | Friday | 0 | 37.86626 | -122.298335 |
|
|
q3b
passed! 🚀
Question 3c: Check for Missing Values
It seems like every record has valid GPS coordinates:
|
|
Lat 1.0
Lon 1.0
dtype: float64
However, a closer examination of the data reveals something else. Here’s the first few records of our data again:
|
|
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | Day | Hour | Lat | Lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Thursday | 10 | 37.869058 | -122.270455 |
1 | 21014391 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Thursday | 10 | 37.869058 | -122.270455 |
2 | 21090494 | THEFT MISD. (UNDER $950) | 04/19/2021 12:00:00 AM | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA | Monday | 12 | 37.864908 | -122.267289 |
3 | 21090204 | THEFT FELONY (OVER $950) | 02/13/2021 12:00:00 AM | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA | Saturday | 17 | 37.863934 | -122.250262 |
4 | 21090179 | BURGLARY AUTO | 02/08/2021 12:00:00 AM | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA | Monday | 6 | 37.86066 | -122.253407 |
There is another field that tells us whether we have a valid Block_Location
entry per record—i.e., with GPS coordinates (latitude, longitude) that match the listed block location. What is it?
In the below cell, use the field you found to create a new dataframe, missing_lat_lon
, that contains only the rows of calls
that have invalid latitude and longitude data. Your new dataframe should have all the same columns of calls
.
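The general pattern for keeping only the rows with a null entry in some column looks like this (the column name below is generic so as not to give the answer away):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "address": ["2100 BLOCK HASTE ST", np.nan, np.nan],
    "CASENO": [1, 2, 3],
})
# Boolean filtering: isnull() marks missing cells, and indexing with
# that mask keeps only the matching rows (all columns are retained).
missing_rows = demo[demo["address"].isnull()]
print(len(missing_rows))  # 2
```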
|
|
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | Day | Hour | Lat | Lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Thursday | 10 | 37.869058 | -122.270455 |
1 | 21014391 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Thursday | 10 | 37.869058 | -122.270455 |
215 | 21019124 | BURGLARY RESIDENTIAL | 04/30/2021 12:00:00 AM | 10:00 | BURGLARY - RESIDENTIAL | 5 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Friday | 10 | 37.869058 | -122.270455 |
260 | 21000289 | VEHICLE STOLEN | 01/01/2021 12:00:00 AM | 12:00 | MOTOR VEHICLE THEFT | 5 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Friday | 12 | 37.869058 | -122.270455 |
633 | 21013362 | BURGLARY AUTO | 03/27/2021 12:00:00 AM | 4:20 | BURGLARY - VEHICLE | 6 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA | Saturday | 4 | 37.869058 | -122.270455 |
|
|
Question 3d: Check Missing Values
Now let us explore if there is a pattern to which types of records have missing latitude and longitude entries.
We’ve implemented the plotting code for you below, but read through it and verify you understand what we’re doing (we’ve thrown in a bonus plt.subplots()
call, documentation here).
|
|
Based on the plots above, are there any patterns among entries that are missing latitude/longitude data? The dataset information linked at the top of this notebook may also give more context.
Type your answer here, replacing this text.
Question 3e: Explore
The below cell plots a map of calls by GPS coordinates (latitude, longitude); rows with missing location data are dropped.
|
|
Important: To make sure the test cases run correctly, click Kernel>Restart & Run All
and make sure all of the test cases are still passing. Doing so will submit your code for you.
If your test cases are no longer passing after restarting, it’s likely because you’re missing a variable, or the modifications that you’d previously made to your DataFrame are no longer taking place (perhaps because you deleted a cell).
You may submit this assignment as many times as you’d like before the deadline.
You must restart and run all cells before submitting. Otherwise, you may pass test cases locally but not on our servers. We will not entertain regrade requests of the form, “my code passed all of my local test cases, but failed the autograder.”
Congratulations!
Congrats! You are finished with this lab.
To double-check your work, the cell below will rerun all of the autograder tests.
|
|
Submission
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!
|
|