DATA100-lab10: SQL

2024-08-13 5685 words 27 minutes

Contents

1
2
3


# Initialize Otter
import otter
grader = otter.Notebook("lab10.ipynb")

Lab 10: SQL

In this lab, we are going to practice viewing, sorting, grouping, and merging tables with SQL. We will explore two datasets:

A “minified” version of the Internet Movie Database (IMDb). This SQLite database (~10MB) is a tiny sample of the much larger database (more than a few GBs). As a result, disclaimer that we may get wildly different results than if we use the whole database!
The money donated during the 2016 election using the Federal Election Commission (FEC)’s public records. You will be connecting to a SQLite database containing the data. The data we will be working with in this lab is quite small (~16MB); however, it is a sample taken from a much larger database (more than a few GBs).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import plotly.express as px
import sqlalchemy
from ds100_utils import fetch_and_cache
from pathlib import Path
%load_ext sql

# Unzip the data.
!unzip -o data.zip

Archive:  data.zip
  inflating: imdbmini.db             
  inflating: fec_nyc.db

SQL Query Syntax

Throughout this lab, you will become familiar with the following syntax for the SELECT query:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


SELECT [DISTINCT] 
    {* | expr [[AS] c_alias] 
    {,expr [[AS] c_alias] ...}}
FROM tableref {, tableref}
[[INNER | LEFT ] JOIN table_name
    ON qualification_list]
[WHERE search_condition]
[GROUP BY colname {,colname...}]
[HAVING search condition]
[ORDER BY column_list]
[LIMIT number]
[OFFSET number of rows];

Part 0 [Tutorial]: Writing SQL in Jupyter Notebooks

1. `%%sql` cell magic

In lecture, we used the sql extension to call %%sql cell magic, which enables us to connect to SQL databses and issue SQL commands within Jupyter Notebooks.

Run the below cell to connect to a mini IMDb database.

1

%sql sqlite:///imdbmini.db

Above, prefixing our single-line command with %sql means that the entire line will be treated as a SQL command (this is called “line magic”). In this class we will most often write multi-line SQL, meaning we need “cell magic”, where the first line has %%sql (note the double % operator).

The database imdbmini.db includes several tables, one of which is Title. Running the below cell will return first 5 lines of that table. Note that %%sql is on its own line.

We’ve also included syntax for single-line comments, which are surrounded by --.

1
2
3
4
5
6
7
8
9


%%sql
/*
 * This is a
 * multi-line comment.
 */
-- This is a single-line/inline comment. --
SELECT *
FROM Name
LIMIT 5;

 * sqlite:///imdbmini.db
Done.

nconst	primaryName	birthYear	deathYear	primaryProfession
1	Fred Astaire	1899	1987	soundtrack,actor,miscellaneous
2	Lauren Bacall	1924	2014	actress,soundtrack
3	Brigitte Bardot	1934	None	actress,soundtrack,music_department
4	John Belushi	1949	1982	actor,soundtrack,writer
5	Ingmar Bergman	1918	2007	writer,director,actor

2. The Pandas command `pd.read_sql`

As of 2022, the %sql magic for Jupyter Notebooks is still in development (check out its GitHub. It is still missing many features that would justify real-world use with Python. In particular, its returned tables are not Pandas dataframes (for example, the query result from the above cell is missing an index).

The rest of this section describes how data scientists use SQL and Python in practice, using the Pandas command pd.read_sql (documentation). You will see both %sql magic and pd.read_sql in this course.

The below cell connects to the same database using the SQLAlchemy Python library, which can connect to several different database management systems, including sqlite3, MySQL, PostgreSQL, and Oracle. The library also supports an advanced feature for generating queries called an object relational mapper or ORM, which we won’t discuss in this course but is quite useful for application development.

1
2
3
4
5
6


# important!!! run this cell
import sqlalchemy

# create a SQL Alchemy connection to the database
engine = sqlalchemy.create_engine("sqlite:///imdbmini.db")
connection = engine.connect()

With the SQLAlchemy object connection, we can then call pd.read_sql which takes in a query string. Note the """ to define our multi-line string, which allows us to have a query span multiple lines. The resulting df DataFrame stores the results of the same SQL query from the previous section.

1
2
3
4
5
6
7
8
9


# just run this cell
query = """
SELECT *
FROM Title
LIMIT 5;
"""

df = pd.read_sql(query, engine)
df

	tconst	titleType	primaryTitle	originalTitle	startYear	endYear	runtimeMinutes	genres
0	417	short	A Trip to the Moon	Le voyage dans la lune	1902	None	13	Action,Adventure,Comedy
1	4972	movie	The Birth of a Nation	The Birth of a Nation	1915	None	195	Drama,History,War
2	10323	movie	The Cabinet of Dr. Caligari	Das Cabinet des Dr. Caligari	1920	None	76	Fantasy,Horror,Mystery
3	12349	movie	The Kid	The Kid	1921	None	68	Comedy,Drama,Family
4	13442	movie	Nosferatu	Nosferatu, eine Symphonie des Grauens	1922	None	94	Fantasy,Horror

Long error messages: Given that the SQL query is now in the string, the errors become more unintelligible. Consider the below (incorrect) query, which has a semicolon in the wrong place.

1
2
3
4
5
6
7
8


# uncomment the below code and check out the error

# query = """
# SELECT *
# FROM Title;
# LIMIT 5
# """
# pd.read_sql(query, engine)

---------------------------------------------------------------------------

ProgrammingError                          Traceback (most recent call last)

File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:1967, in Connection._exec_single_context(self, dialect, context, statement, parameters)
   1966     if not evt_handled:
-> 1967         self.dialect.do_execute(
   1968             cursor, str_statement, effective_parameters, context
   1969         )
   1971 if self._has_events or self.engine._has_events:


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\default.py:924, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    923 def do_execute(self, cursor, statement, parameters, context=None):
--> 924     cursor.execute(statement, parameters)


ProgrammingError: You can only execute one statement at a time.


The above exception was the direct cause of the following exception:


ProgrammingError                          Traceback (most recent call last)

Cell In[7], line 8
      1 # uncomment the below code and check out the error
      3 query = """
      4 SELECT *
      5 FROM Title;
      6 LIMIT 5
      7 """
----> 8 pd.read_sql(query, engine)


File d:\miniconda3\Lib\site-packages\pandas\io\sql.py:734, in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize, dtype_backend, dtype)
    724     return pandas_sql.read_table(
    725         sql,
    726         index_col=index_col,
   (...)
    731         dtype_backend=dtype_backend,
    732     )
    733 else:
--> 734     return pandas_sql.read_query(
    735         sql,
    736         index_col=index_col,
    737         params=params,
    738         coerce_float=coerce_float,
    739         parse_dates=parse_dates,
    740         chunksize=chunksize,
    741         dtype_backend=dtype_backend,
    742         dtype=dtype,
    743     )


File d:\miniconda3\Lib\site-packages\pandas\io\sql.py:1836, in SQLDatabase.read_query(self, sql, index_col, coerce_float, parse_dates, params, chunksize, dtype, dtype_backend)
   1779 def read_query(
   1780     self,
   1781     sql: str,
   (...)
   1788     dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
   1789 ) -> DataFrame | Iterator[DataFrame]:
   1790     """
   1791     Read SQL query into a DataFrame.
   1792 
   (...)
   1834 
   1835     """
-> 1836     result = self.execute(sql, params)
   1837     columns = result.keys()
   1839     if chunksize is not None:


File d:\miniconda3\Lib\site-packages\pandas\io\sql.py:1659, in SQLDatabase.execute(self, sql, params)
   1657 args = [] if params is None else [params]
   1658 if isinstance(sql, str):
-> 1659     return self.con.exec_driver_sql(sql, *args)
   1660 return self.con.execute(sql, *args)


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:1779, in Connection.exec_driver_sql(self, statement, parameters, execution_options)
   1774 execution_options = self._execution_options.merge_with(
   1775     execution_options
   1776 )
   1778 dialect = self.dialect
-> 1779 ret = self._execute_context(
   1780     dialect,
   1781     dialect.execution_ctx_cls._init_statement,
   1782     statement,
   1783     None,
   1784     execution_options,
   1785     statement,
   1786     distilled_parameters,
   1787 )
   1789 return ret


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:1846, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1844     return self._exec_insertmany_context(dialect, context)
   1845 else:
-> 1846     return self._exec_single_context(
   1847         dialect, context, statement, parameters
   1848     )


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:1986, in Connection._exec_single_context(self, dialect, context, statement, parameters)
   1983     result = context._setup_result_proxy()
   1985 except BaseException as e:
-> 1986     self._handle_dbapi_exception(
   1987         e, str_statement, effective_parameters, cursor, context
   1988     )
   1990 return result


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:2353, in Connection._handle_dbapi_exception(self, e, statement, parameters, cursor, context, is_sub_exec)
   2351 elif should_wrap:
   2352     assert sqlalchemy_exception is not None
-> 2353     raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
   2354 else:
   2355     assert exc_info[1] is not None


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\base.py:1967, in Connection._exec_single_context(self, dialect, context, statement, parameters)
   1965                 break
   1966     if not evt_handled:
-> 1967         self.dialect.do_execute(
   1968             cursor, str_statement, effective_parameters, context
   1969         )
   1971 if self._has_events or self.engine._has_events:
   1972     self.dispatch.after_cursor_execute(
   1973         self,
   1974         cursor,
   (...)
   1978         context.executemany,
   1979     )


File d:\miniconda3\Lib\site-packages\sqlalchemy\engine\default.py:924, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    923 def do_execute(self, cursor, statement, parameters, context=None):
--> 924     cursor.execute(statement, parameters)


ProgrammingError: (sqlite3.ProgrammingError) You can only execute one statement at a time.
[SQL: 
SELECT *
FROM Title;
LIMIT 5
]
(Background on this error at: https://sqlalche.me/e/20/f405)

Now that’s an unruly error message!

3. A suggested workflow for writing SQL in Jupyter Notebooks

Which approach is better, %sql magic or pd.read_sql?

The SQL database generally contains much more data than what you would analyze in detail. As a Python-fluent data scientist, you will often query SQL databases to perform initial exploratory data analysis, a subset of which you load into Python for further processing.

In practice, you would likely use a combination of the two approaches. First, you’d try out some SQL queries with %sql magic to get an interesting subset of data. Then, you’d copy over the query into a pd.read_sql command for visualization, modeling, and export with Pandas, sklearn, and other Python libraries.

For SQL assignments in this course, to minimize unruly error messages while maximizing Python compatibility, we suggest the following “sandboxed” workflow:

Create a %%sql magic cell below the answer cell. You can copy in the below code:
1 2

%% sql -- This is a comment. Put your code here... --

Work on the SQL query in the %%sql cell; e.g., SELECT ... ;
Then, once you’re satisfied with your SQL query, copy it into the multi-string query in the answer cell (the one that contains the pd.read_sql call).

You don’t have to follow the above workflow to get full credit on assignments, but we suggest it to reduce debugging headaches. We’ve created the scratchwork %%sql cells for you in this assignment, but do not add cells between this %%sql cell and the Python cell right below it. It will cause errors when we run the autograder, and it will sometimes cause a failure to generate the PDF file.

Part 1: The IMDb (mini) Dataset

Let’s explore a miniature version of the IMDb Dataset. This is the same dataset that we will use for the upcoming homework.

Let’s load in the database in two ways (using both Python and cell magic) so that we can flexibly explore the database.

1
2


engine = sqlalchemy.create_engine("sqlite:///imdbmini.db")
connection = engine.connect()

1

%sql sqlite:///imdbmini.db

1
2


%%sql
SELECT * FROM sqlite_master WHERE type='table';

 * sqlite:///imdbmini.db
Done.

type	name	tbl_name	rootpage	sql
table	Title	Title	2	CREATE TABLE "Title" ( "tconst" INTEGER, "titleType" TEXT, "primaryTitle" TEXT, "originalTitle" TEXT, "isAdult" TEXT, "startYear" TEXT, "endYear" TEXT, "runtimeMinutes" TEXT, "genres" TEXT )
table	Name	Name	12	CREATE TABLE "Name" ( "nconst" INTEGER, "primaryName" TEXT, "birthYear" TEXT, "deathYear" TEXT, "primaryProfession" TEXT )
table	Role	Role	70	CREATE TABLE "Role" ( tconst INTEGER, ordering TEXT, nconst INTEGER, category TEXT, job TEXT, characters TEXT )
table	Rating	Rating	41	CREATE TABLE "Rating" ( tconst INTEGER, averageRating TEXT, numVotes TEXT )

From running the above cell, we see the database has 4 tables: Name, Role, Rating, and Title.

[Click to Expand] See descriptions of each table's schema.

Name -Contains the following information for names of people.

nconst (text) - alphanumeric unique identifier of the name/person
primaryName (text)-name by which the person is most often credited
birthYear (integer) -in YYYY format
deathYear (integer) -in YYYY format

Role -Contains the principal cast/crew for titles.

tconst (text) - alphanumeric unique identifier of the title
ordering (integer) -a number to uniquely identify rows for a given tconst
nconst (text) - alphanumeric unique identifier of the name/person
category (text) - the category of job that person was in
characters (text) - the name of the character played if applicable, else ‘\N’

Rating -Contains the IMDb rating and votes information for titles.

tconst (text) - alphanumeric unique identifier of the title
averageRating (real) -weighted average of all the individual user ratings
numVotes (integer) - number of votes (i.e., ratings) the title has received

Title - Contains the following information for titles.

tconst (text) - alphanumeric unique identifier of the title
titleType (text) - the type/format of the title
primaryTitle (text) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
isAdult (text) - 0: non-adult title; 1: adult title
year (YYYY) -represents the release year of a title.
runtimeMinutes (integer) -primary runtime of the title, in minutes

From the above descriptions, we can conclude the following:

Name.nconst and Title.tconst are primary keys of the Name and Title tables, respectively.
that Role.nconst and Role.tconst are foreign keys that point to Name.nconst and Title.tconst, respectively.

Question 1

What are the different kinds of titleTypes included in the Title table? Write a query to find out all the unique titleTypes of films using the DISTINCT keyword. (You may not use GROUP BY.)

1
2
3
4
5
6
7
8
9


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
SELECT DISTINCT titleType FROM Title;

 * sqlite:///imdbmini.db
Done.

titleType
short
movie
tvSeries
tvMovie
tvMiniSeries
video
videoGame
tvEpisode
tvSpecial

1
2
3
4
5
6


query_q1 = """
SELECT DISTINCT titleType FROM Title;
"""

res_q1 = pd.read_sql(query_q1, engine)
res_q1

	titleType
0	short
1	movie
2	tvSeries
3	tvMovie
4	tvMiniSeries
5	video
6	videoGame
7	tvEpisode
8	tvSpecial

1

grader.check("q1")

q1

passed! 🍀

Question 2

Before we proceed we want to get a better picture of the kinds of jobs that exist. To do this examine the Role table by computing the number of records with each job category. Present the results in descending order by the total counts.

The top of your table should look like this (however, you should have more rows):

	category	total
0	actor	21665
1	writer	13830
2	…	…

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
SELECT category, COUNT(*) AS total
FROM Role
GROUP BY category
ORDER BY total DESC;

 * sqlite:///imdbmini.db
Done.

category	total
actor	21665
writer	13830
actress	12175
producer	11028
director	6995
composer	4123
cinematographer	2747
editor	1558
self	623
production_designer	410
archive_footage	66
archive_sound	6

1
2
3
4
5
6
7
8
9


query_q2 = """
SELECT category, COUNT(*) AS total
FROM Role
GROUP BY category
ORDER BY total DESC;
"""

res_q2 = pd.read_sql(query_q2, engine)
res_q2

	category	total
0	actor	21665
1	writer	13830
2	actress	12175
3	producer	11028
4	director	6995
5	composer	4123
6	cinematographer	2747
7	editor	1558
8	self	623
9	production_designer	410
10	archive_footage	66
11	archive_sound	6

1

grader.check("q2")

q2

passed! 🍀

If we computed the results correctly we should see a nice horizontal bar chart of the counts per category below:

1
2


# just run this cell
px.bar(res_q2, x="total", y="category", orientation='h')

Question 3

Now that we have a better sense of the basics of our data, we can ask some more interesting questions.

The Rating table has the numVotes and the averageRating for each title. Which 10 films have the most ratings?

Write a SQL query that outputs three fields: the title, numVotes, and averageRating for the 10 films that have the highest number of ratings. Sort the result in descending order by the number of votes.

Hint: The numVotes in the Rating table is not an integer! Use CAST(Rating.numVotes AS int) AS numVotes to convert the attribute to an integer.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
SELECT primaryTitle, 
CAST(numVotes AS int) AS numVotes,
averageRating
FROM Rating, Title
WHERE Rating.tconst = Title.tconst
ORDER BY numVotes DESC;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


query_q3 = """
SELECT primaryTitle AS title, 
CAST(numVotes AS int) AS numVotes,
averageRating
FROM Rating, Title
WHERE Rating.tconst = Title.tconst
ORDER BY numVotes DESC
LIMIT 10;
"""


res_q3 = pd.read_sql(query_q3, engine)
res_q3

	title	numVotes	averageRating
0	The Shawshank Redemption	2462686	9.3
1	The Dark Knight	2417875	9.0
2	Inception	2169255	8.8
3	Fight Club	1939312	8.8
4	Pulp Fiction	1907561	8.9
5	Forrest Gump	1903969	8.8
6	Game of Thrones	1874040	9.2
7	The Matrix	1756469	8.7
8	The Lord of the Rings: The Fellowship of the Ring	1730296	8.8
9	The Lord of the Rings: The Return of the King	1709023	8.9

1

grader.check("q3")

Part 2: Election Donations in New York City

Finally, let’s analyze the Federal Election Commission (FEC)’s public records. We connect to the database in two ways (using both Python and cell magic) so that we can flexibly explore the database.

1
2
3
4
5


# important!!! run this cell and the next one
import sqlalchemy
# create a SQL Alchemy connection to the database
engine = sqlalchemy.create_engine("sqlite:///fec_nyc.db")
connection = engine.connect()

1

%sql sqlite:///fec_nyc.db

Table Descriptions

Run the below cell to explore the schemas of all tables saved in the database.

If you’d like, you can consult the below linked FEC pages for the descriptions of the tables themselves.

cand (link): Candidates table. Contains names and party affiliation.
comm (link): Committees table. Contains committee names and types.
indiv_sample_nyc (link): All individual contributions from New York City .

1
2
3


%%sql
/* just run this cell */
SELECT sql FROM sqlite_master WHERE type='table';

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

sql
CREATE TABLE "cand" ( cand_id character varying(9), cand_name text, cand_pty_affiliation character varying(3), cand_election_yr integer, cand_office_st character varying(2), cand_office character(1), cand_office_district integer, cand_ici character(1), cand_status character(1), cand_pcc character varying(9), cand_st1 text, cand_st2 text, cand_city text, cand_st character varying(2), cand_zip character varying(10) )
CREATE TABLE "comm"( "cmte_id" TEXT, "cmte_nm" TEXT, "tres_nm" TEXT, "cmte_st1" TEXT, "cmte_st2" TEXT, "cmte_city" TEXT, "cmte_st" TEXT, "cmte_zip" TEXT, "cmte_dsgn" TEXT, "cmte_tp" TEXT, "cmte_pty_affiliation" TEXT, "cmte_filing_freq" TEXT, "org_tp" TEXT, "connected_org_nm" TEXT, "cand_id" TEXT )
CREATE TABLE indiv_sample_nyc ( cmte_id character varying(9), amndt_ind character(1), rpt_tp character varying(3), transaction_pgi character(5), image_num bigint, transaction_tp character varying(3), entity_tp character varying(3), name text, city text, state character(2), zip_code character varying(12), employer text, occupation text, transaction_dt character varying(9), transaction_amt integer, other_id text, tran_id text, file_num bigint, memo_cd text, memo_text text, sub_id bigint )

Let’s look at the indiv_sample_nyc table. The below cell displays individual donations made by residents of the state of New York. We use LIMIT 5 to avoid loading and displaying a huge table.

1
2
3
4
5
6
7
8


%%sql
/* just run this cell */
SELECT comm.cmte_id, cmte_nm, sum(transaction_amt) as total
FROM indiv_sample_nyc, comm
WHERE indiv_sample_nyc.cmte_id = comm.cmte_id AND name LIKE '%TRUMP%' AND name LIKE '%DONALD%'
GROUP BY cmte_nm
ORDER BY transaction_amt 
LIMIT 5;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cmte_id	cmte_nm	total
C00608489	GREAT AMERICA PAC	75
C00369033	TEXANS FOR SENATOR JOHN CORNYN INC	1000
C00494229	HELLER FOR SENATE	2000
C00554949	FRIENDS OF DAVE BRAT INC.	2600
C00230482	GRASSLEY COMMITTEE INC	5200

You can write a SQL query to return the id and name of the first five candidates from the Democratic party, as below:

1
2
3
4
5
6


%%sql
/* just run this cell */
SELECT cand_id, cand_name
FROM cand
WHERE cand_pty_affiliation = 'DEM'
LIMIT 5;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cand_id	cand_name
H0AL05049	CRAMER, ROBERT E "BUD" JR
H0AL07086	SEWELL, TERRYCINA ANDREA
H0AL07094	HILLIARD, EARL FREDERICK JR
H0AR01091	GREGORY, JAMES CHRISTOPHER
H0AR01109	CAUSEY, CHAD

[Tutorial] Matching Text with `LIKE`

First, let’s look at 2016 election contributions made by Donald Trump, who was a New York (NY) resident during that year. The following SQL query returns the cmte_id, transaction_amt, and name for every contribution made by any donor with “DONALD” and “TRUMP” in their name in the indiv_sample_nyc table.

Notes:

We use the WHERE ... LIKE '...' to match fields with text patterns. The % wildcard represents at least zero characters. Compare this to what you know from regex!
We use pd.read_sql syntax here because we will do some EDA on the result res.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


# just run this cell
example_query = """
SELECT 
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%';
"""

example_res = pd.read_sql(example_query, engine)
example_res

	cmte_id	transaction_amt	name
0	C00230482	2600	DONALD, TRUMP
1	C00230482	2600	DONALD, TRUMP
2	C00014498	9000	TRUMP, DONALD
3	C00494229	2000	TRUMP, DONALD MR
4	C00571869	2700	TRUMP, DONALD J.
...	...	...	...
152	C00608489	5	DONALD J TRUMP FOR PRESIDENT INC
153	C00608489	5	DONALD J TRUMP FOR PRESIDENT INC
154	C00608489	5	DONALD J TRUMP FOR PRESIDENT INC
155	C00608489	5	DONALD J TRUMP FOR PRESIDENT INC
156	C00608489	5	DONALD J TRUMP FOR PRESIDENT INC

157 rows × 3 columns

If we look at the list above, it appears that some donations were not by Donald Trump himself, but instead by an entity called “DONALD J TRUMP FOR PRESIDENT INC”. Fortunately, we see that our query only seems to have picked up one such anomalous name.

1
2


# just run this cell
example_res['name'].value_counts()

name
TRUMP, DONALD J.                    133
DONALD J TRUMP FOR PRESIDENT INC     15
TRUMP, DONALD                         4
DONALD, TRUMP                         2
TRUMP, DONALD MR                      1
TRUMP, DONALD J MR.                   1
TRUMP, DONALD J MR                    1
Name: count, dtype: int64

Question 4

Revise the above query so that the 15 anomalous donations made by “DONALD J TRUMP FOR PRESIDENT INC” do not appear. Your resulting table should have 142 rows.

Hints:

Consider using the above query as a starting point, or checking out the SQL query skeleton at the top of this lab.
The NOT keyword may also be useful here.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
SELECT 
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%' AND name NOT LIKE '%DONALD J TRUMP FOR PRESIDENT INC%';

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


query_q4 = """
SELECT 
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%' AND name NOT LIKE '%DONALD J TRUMP FOR PRESIDENT INC%';
"""

res_q4 = pd.read_sql(query_q4, engine)
res_q4

	cmte_id	transaction_amt	name
0	C00230482	2600	DONALD, TRUMP
1	C00230482	2600	DONALD, TRUMP
2	C00014498	9000	TRUMP, DONALD
3	C00494229	2000	TRUMP, DONALD MR
4	C00571869	2700	TRUMP, DONALD J.
...	...	...	...
137	C00580100	9752	TRUMP, DONALD J.
138	C00580100	2574	TRUMP, DONALD J.
139	C00580100	23775	TRUMP, DONALD J.
140	C00580100	2000000	TRUMP, DONALD J.
141	C00580100	2574	TRUMP, DONALD J.

142 rows × 3 columns

1

grader.check("q4")

q4

passed!

Question 5: `JOIN`ing Tables

Let’s explore the other two tables in our database: cand and comm.

The cand table contains summary financial information about each candidate registered with the FEC or appearing on an official state ballot for House, Senate or President.

1
2
3
4
5


%%sql
/* just run this cell */
SELECT *
FROM indiv_sample_nyc
LIMIT 5;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	zip_code	employer	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_text	sub_id
C00445015	N	Q1	P	15951128130	15	IND	SINGER, TRIPP MR.	NEW YORK	NY	100214505	ATLANTIC MAILBOXES, INC.	OWNER	01302015	1000		A-CF13736	1002485		4041420151241812398
C00510461	N	Q1	P	15951129284	15E	IND	SIMON, DANIEL A	NEW YORK	NY	100237940	N/A	RETIRED	03292015	400	C00401224	VN8JBDDJBA8	1002590	* EARMARKED CONTRIBUTION: SEE BELOW	4041420151241813640
C00422410	N	Q1	P	15970352211	15	IND	ABDUL RAUF, FEISAL	NEW YORK	NY	101150010	THE CORDOBA INITIATIVE	CHAIRMAN	03042015	250		VN8A3DBSYG6	1003643		4041620151241914560
C00510461	N	Q1	P	15951129280	15	IND	SCHWARZER, FRANK	NEW YORK	NY	100145135	METRO HYDRAULIC JACK CO	SALES	01162015	100		VN8JBDAP4C4	1002590	* EARMARKED CONTRIBUTION: SEE BELOW	4041420151241813630
C00510461	N	Q1	P	15951129281	15	IND	SCHWARZER, FRANK	NEW YORK	NY	100145135	METRO HYDRAULIC JACK CO	SALES	02162015	100		VN8JBDBRDG3	1002590	* EARMARKED CONTRIBUTION: SEE BELOW	4041420151241813632

The comm table contains summary financial information about each committee registered with the FEC. Committees are organizations that spend money for political action or parties, or spend money for or against political candidates.

1
2
3
4
5


%%sql
/* just run this cell */
SELECT *
FROM comm
LIMIT 5;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cmte_id	cmte_nm	tres_nm	cmte_st1	cmte_st2	cmte_city	cmte_st	cmte_zip	cmte_dsgn	cmte_tp	cmte_pty_affiliation	cmte_filing_freq	org_tp	connected_org_nm
C00000059	HALLMARK CARDS PAC	ERIN BROWER	2501 MCGEE	MD#288	KANSAS CITY	MO	64108	U	Q	UNK	M	C
C00000422	AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE	WALKER, KEVIN	25 MASSACHUSETTS AVE, NW	SUITE 600	WASHINGTON	DC	20001	B	Q		M	M	AMERICAN MEDICAL ASSOCIATION
C00000489	D R I V E POLITICAL FUND CHAPTER 886	TOM RITTER	3528 W RENO		OKLAHOMA CITY	OK	73107	U	N		Q	L	TEAMSTERS LOCAL UNION 886
C00000547	KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE	C. RICHARD BONEBRAKE, M.D.	623 SW 10TH AVE		TOPEKA	KS	66612	U	Q	UNK	Q	T
C00000638	INDIANA STATE MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE	VIDYA KORA, M.D.	322 CANAL WALK, CANAL LEVEL		INDIANAPOLIS	IN	46202	U	Q		Q	M

Question 5a

Notice that both the cand and comm tables have a cand_id column. Let’s try joining these two tables on this column to print out committee information for candidates.

List the first 5 candidate names (cand_name) in reverse lexicographic order by cand_name, along with their corresponding committee names. Only select rows that have a matching cand_id in both tables.

Your output should look similar to the following:

	cand_name	cmte_nm
0	ZUTLER, DANIEL PAUL MR	CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
1	ZUMWALT, JAMES	ZUMWALT FOR CONGRESS
…	…	…

Consider starting from the following query skeleton, which uses the AS keyword to rename the cand and comm tables to c1 and c2, respectively. Which join is most appropriate?

SELECT ...
FROM cand AS c1
    [INNER | {LEFT |RIGHT | FULL } {OUTER}] JOIN comm AS c2
    ON ...
...
...;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
    SELECT cand_name, cmte_nm
    FROM cand AS c1
        INNER JOIN comm AS c2
        ON c1.cand_id = c2.cand_id
    ORDER BY cand_name DESC;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


query_q5a = """
    SELECT cand_name, cmte_nm
    FROM cand AS c1
        INNER JOIN comm AS c2
        ON c1.cand_id = c2.cand_id
    ORDER BY cand_name DESC
    LIMIT 5;
"""

res_q5a = pd.read_sql(query_q5a, engine)
res_q5a

	cand_name	cmte_nm
0	ZUTLER, DANIEL PAUL MR	CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
1	ZUMWALT, JAMES	ZUMWALT FOR CONGRESS
2	ZUKOWSKI, ANDREW GEORGE	ZUKOWSKI FOR CONGRESS
3	ZUCCOLO, JOE	JOE ZUCCOLO FOR CONGRESS
4	ZORN, ROBERT ERWIN	CONSTITUTIONAL COMMITTEE

1

grader.check("q5a")

q5a

passed! 🎉

Question 5b

Suppose we modify the query from the previous part to include all candidates, including those that don’t have a committee.

List the first 5 candidate names (cand_name) in reverse lexicographic order by cand_name, along with their corresponding committee names. If the candidate has no committee in the comm table, then cmte_nm should be NULL (or None in the Python representation).

Your output should look similar to the following:

	cand_name	cmte_nm
0	ZUTLER, DANIEL PAUL MR	CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
…	…	…
4	ZORNOW, TODD MR	None

Hint: Start from the same query skeleton as the previous part. Which join is most appropriate?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
    SELECT cand_name, cmte_nm
    FROM cand AS c1
        LEFT JOIN comm AS c2
        ON c1.cand_id = c2.cand_id
    ORDER BY cand_name DESC
    LIMIT 5;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cand_name	cmte_nm
ZUTLER, DANIEL PAUL MR	CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
ZUMWALT, JAMES	ZUMWALT FOR CONGRESS
ZUKOWSKI, ANDREW GEORGE	ZUKOWSKI FOR CONGRESS
ZUCCOLO, JOE	JOE ZUCCOLO FOR CONGRESS
ZORNOW, TODD MR	None

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


query_q5b = """
    SELECT cand_name, cmte_nm
    FROM cand AS c1
        LEFT JOIN comm AS c2
        ON c1.cand_id = c2.cand_id
    ORDER BY cand_name DESC
    LIMIT 5;
"""

res_q5b = pd.read_sql(query_q5b, engine)
res_q5b

	cand_name	cmte_nm
0	ZUTLER, DANIEL PAUL MR	CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
1	ZUMWALT, JAMES	ZUMWALT FOR CONGRESS
2	ZUKOWSKI, ANDREW GEORGE	ZUKOWSKI FOR CONGRESS
3	ZUCCOLO, JOE	JOE ZUCCOLO FOR CONGRESS
4	ZORNOW, TODD MR	None

1

grader.check("q5b")

q5b

passed! 🚀

Question 6: Subqueries and Grouping

If we return to our results from Question 4, we see that many of the contributions were to the same committee:

1
2
3


# Your SQL query result from Question 4
# reprinted for your convenience
res_q4

	cmte_id	transaction_amt	name
0	C00230482	2600	DONALD, TRUMP
1	C00230482	2600	DONALD, TRUMP
2	C00014498	9000	TRUMP, DONALD
3	C00494229	2000	TRUMP, DONALD MR
4	C00571869	2700	TRUMP, DONALD J.
...	...	...	...
137	C00580100	9752	TRUMP, DONALD J.
138	C00580100	2574	TRUMP, DONALD J.
139	C00580100	23775	TRUMP, DONALD J.
140	C00580100	2000000	TRUMP, DONALD J.
141	C00580100	2574	TRUMP, DONALD J.

142 rows × 3 columns

Create a new SQL query that returns the total amount that Donald Trump contributed to each committee.

Your table should have four columns: cmte_id, total_amount (total amount contributed to that committee), num_donations (total number of donations), and cmte_nm (name of the committee). Your table should be sorted in decreasing order of total_amount.

This is a hard question! Don’t be afraid to reference the lecture slides, or the overall SQL query skeleton at the top of this lab.

Here are some other hints:

Note that committee names are not available in indiv_sample_nyc, so you will have to obtain information somehow from the comm table (perhaps a JOIN would be useful).
Remember that you can compute summary statistics after grouping by using aggregates like COUNT(*), SUM() as output fields.
A subquery may be useful to break your question down into subparts. Consider the following query skeleton, which uses the WITH operator to store a subquery’s results in a temporary table named donations.
```
  WITH donations AS (
      SELECT ...
      ...
  )
  SELECT ...
  FROM donations
  GROUP BY ...
  ORDER BY ...;
```

1
2
3
4
5
6
7
8


%%sql
/* just run this cell */
SELECT comm.cmte_id, sum(transaction_amt) as total_amount, COUNT(*) as num_donations, cmte_nm
FROM indiv_sample_nyc, comm
WHERE indiv_sample_nyc.cmte_id = comm.cmte_id AND name LIKE '%TRUMP%' AND name LIKE '%DONALD%'
GROUP BY cmte_nm
ORDER BY total_amount DESC
LIMIT 10;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cmte_id	total_amount	num_donations	cmte_nm
C00580100	18633157	131	DONALD J. TRUMP FOR PRESIDENT, INC.
C00055582	10000	1	NY REPUBLICAN FEDERAL CAMPAIGN COMMITTEE
C00014498	9000	1	REPUBLICAN PARTY OF IOWA
C00571869	5400	2	DONOVAN FOR CONGRESS
C00230482	5200	2	GRASSLEY COMMITTEE INC
C00034033	5000	1	SOUTH CAROLINA REPUBLICAN PARTY
C00136457	5000	1	NEW HAMPSHIRE REPUBLICAN STATE COMMITTEE
C00554949	2600	1	FRIENDS OF DAVE BRAT INC.
C00494229	2000	1	HELLER FOR SENATE
C00369033	1000	1	TEXANS FOR SENATOR JOHN CORNYN INC

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


%%sql
/*
 * Code in this scratchwork cell is __not graded.__
 * Copy over any SQL queries you write here into the below Python cell.
 * Do __not__ insert any new cells in between the SQL/Python cells!
 * Doing so may break the autograder.
 */
-- Write below this comment. --
        WITH donations AS (
            SELECT c1.cmte_id,transaction_amt, cmte_nm
            FROM indiv_sample_nyc AS c1, comm AS c2
            WHERE c1.cmte_id = c2.cmte_id AND name LIKE '%TRUMP%' AND name LIKE '%DONALD%' AND transaction_amt > 0
        )
        SELECT cmte_id, SUM(transaction_amt) AS total_amount, count(*) AS num_donations, cmte_nm
        FROM donations
        GROUP BY cmte_nm
        ORDER BY total_amount
        LIMIT 10;

 * sqlite:///fec_nyc.db
   sqlite:///imdbmini.db
Done.

cmte_id	total_amount	num_donations	cmte_nm
C00608489	75	15	GREAT AMERICA PAC
C00369033	1000	1	TEXANS FOR SENATOR JOHN CORNYN INC
C00494229	2000	1	HELLER FOR SENATE
C00554949	2600	1	FRIENDS OF DAVE BRAT INC.
C00136457	5000	1	NEW HAMPSHIRE REPUBLICAN STATE COMMITTEE
C00034033	5000	1	SOUTH CAROLINA REPUBLICAN PARTY
C00230482	5200	2	GRASSLEY COMMITTEE INC
C00571869	5400	2	DONOVAN FOR CONGRESS
C00014498	9000	1	REPUBLICAN PARTY OF IOWA
C00055582	10000	1	NY REPUBLICAN FEDERAL CAMPAIGN COMMITTEE

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


query_q6 = """
SELECT comm.cmte_id, sum(transaction_amt) as total_amount, COUNT(*) as num_donations, cmte_nm
FROM indiv_sample_nyc, comm
WHERE indiv_sample_nyc.cmte_id = comm.cmte_id AND name LIKE '%TRUMP%' AND name LIKE '%DONALD%'
GROUP BY cmte_nm
ORDER BY total_amount DESC
LIMIT 10;
"""


res_q6 = pd.read_sql(query_q6, engine)
res_q6

	cmte_id	total_amount	num_donations	cmte_nm
0	C00580100	18633157	131	DONALD J. TRUMP FOR PRESIDENT, INC.
1	C00055582	10000	1	NY REPUBLICAN FEDERAL CAMPAIGN COMMITTEE
2	C00014498	9000	1	REPUBLICAN PARTY OF IOWA
3	C00571869	5400	2	DONOVAN FOR CONGRESS
4	C00230482	5200	2	GRASSLEY COMMITTEE INC
5	C00034033	5000	1	SOUTH CAROLINA REPUBLICAN PARTY
6	C00136457	5000	1	NEW HAMPSHIRE REPUBLICAN STATE COMMITTEE
7	C00554949	2600	1	FRIENDS OF DAVE BRAT INC.
8	C00494229	2000	1	HELLER FOR SENATE
9	C00369033	1000	1	TEXANS FOR SENATOR JOHN CORNYN INC

1

grader.check("q6")

q6

passed! 🎉

Congratulations! You finished the lab!

Contents

DATA100-lab10: SQL

Lab 10: SQL

SQL Query Syntax

Part 0 [Tutorial]: Writing SQL in Jupyter Notebooks

1. %%sql cell magic

2. The Pandas command pd.read_sql

3. A suggested workflow for writing SQL in Jupyter Notebooks

Part 1: The IMDb (mini) Dataset

Question 1

Question 2

Question 3

Part 2: Election Donations in New York City

Table Descriptions

[Tutorial] Matching Text with LIKE

Question 4

Question 5: JOINing Tables

Question 5a

Question 5b

Question 6: Subqueries and Grouping

Congratulations! You finished the lab!

1. `%%sql` cell magic

2. The Pandas command `pd.read_sql`

[Tutorial] Matching Text with `LIKE`

Question 5: `JOIN`ing Tables