Contents

DATA100-disc02

Discussion 2: Pandas Practice

We will begin our discussion of Pandas. You will practice:

  • Selecting columns
  • Filtering with boolean conditions
  • Counting with value_counts
1
2
import pandas as pd
import numpy as np

Pandas Practise

In the first Pandas question, we will be working with the elections dataset from lecture.

1
2
elections = pd.read_csv("elections.csv") # read in the elections data into a pandas dataframe!
elections.head(5)

Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Question 5

We want to select the “Popular vote” column as a pd.Series. Which of the following lines of code will error?

  • elections['Popular vote']
  • elections.iloc['Popular vote']
  • elections.loc['Popular vote']
  • elections.loc[:, 'Popular vote']
  • elections.iloc[:, 'Popular vote']

Run each line in the cell below and see for yourself!

1
2
3
4
5
# elections.iloc['Popular vote'] # wrong
# elections.iloc[:, 'popular votes'] # wrong
# elections['Popular vote'] # right
# elections.loc['Popular vote'] # ket error
# elections.loc[:,'Popular vote'] # right
0        151271
1        113142
2        642806
3        500897
4        702735
         ...   
173    62984828
174      732273
175     4489235
176    65853514
177     1457226
Name: Popular vote, Length: 178, dtype: int64

Question 6

Write one line of Pandas code that returns a pd.DataFrame that only contains election results from the 1900s.

1
elections[(elections['Year'] >= 1900) & (elections['Year'] < 2000)] # 注意是 &  

Year Candidate Party Popular vote Result %
54 1900 John G. Woolley Prohibition 210864 loss 1.526821
55 1900 William Jennings Bryan Democratic 6370932 loss 46.130540
56 1900 William McKinley Republican 7228864 win 52.342640
57 1904 Alton B. Parker Democratic 5083880 loss 37.685116
58 1904 Eugene V. Debs Socialist 402810 loss 2.985897
... ... ... ... ... ... ...
146 1996 Harry Browne Libertarian 485759 loss 0.505198
147 1996 Howard Phillips Taxpayers 184656 loss 0.192045
148 1996 John Hagelin Natural Law 113670 loss 0.118219
149 1996 Ralph Nader Green 685297 loss 0.712721
150 1996 Ross Perot Reform 8085294 loss 8.408844

97 rows × 6 columns

Question 7

Write one line of Pandas code that returns a pd.Series, where the index is the Party, and the values are how many times that party won an election. Hint: use value_counts.

1
2
# Your answer here
elections['Party'].value_counts()
Party
Democratic               46
Republican               40
Prohibition              11
Libertarian              11
Socialist                10
Independent               6
Whig                      6
Green                     6
Progressive               4
Populist                  3
Constitution              3
American Independent      3
American                  2
National Republican       2
Democratic-Republican     2
Reform                    2
Free Soil                 2
Anti-Masonic              1
National Union            1
Constitutional Union      1
National Democratic       1
Union Labor               1
Greenback                 1
Anti-Monopoly             1
Liberal Republican        1
Southern Democratic       1
Northern Democratic       1
Farmer–Labor              1
Dixiecrat                 1
States' Rights            1
Communist                 1
Union                     1
Taxpayers                 1
New Alliance              1
Citizens                  1
Natural Law               1
Name: count, dtype: int64

Grading Assistance (Bonus)

Fernando is writing a grading script to compute grades for students in Data 101. Recall that many factors go into computing a student’s final grade, including homework, discussion, exams, and labs. In this question, we will help Fernando compute the homework grades for all students using a DataFrame, hw_grades, provided by Gradescope.

The Pandas DataFrame hw_grades contains homework grades for all students for all homework assignments, with one row for each combination of student and homework assignment. Any assignments that are incomplete are denoted by NaN (missing) values, and any late assignments are denoted by a True boolean value in the Late column. You may assume that the names of students are unique. Below is a sample of hw_grades.

1
2
hw_grades = pd.read_csv("hw_grades.csv")
hw_grades.sample(5, random_state = 0)

Name Assignment Grade Late
28 Sid Homework 9 82.517998 True
11 Ash Homework 2 78.264844 True
10 Ash Homework 1 98.421049 False
41 Emily Homework 2 62.900313 False
2 Meg Homework 3 89.785619 False

Question 8a

Assuming there is a late penalty that causes a 10% grade reduction to the student’s current score (i.e. a 65% score would become a 65% - 6.5% = 58.5%), write a line of Pandas code to calculate all the homework grades, including the late penalty if applicable, and store it in a column named ’LPGrade’.

1
2
3
# Your answer here
hw_grades['LPGrade'] = hw_grades['Grade'] * (1 - hw_grades['Late'] * 0.1) # 用个隐式转换
hw_grades.head()

Name Assignment Grade Late LPGrade
0 Meg Homework 1 NaN False NaN
1 Meg Homework 2 64.191844 False 64.191844
2 Meg Homework 3 89.785619 False 89.785619
3 Meg Homework 4 74.420033 False 74.420033
4 Meg Homework 5 74.372434 True 66.935190

Question 8b

Which of the following expressions outputs the students’ names and number of late assignments, from least to greatest number of late assignments?

  • hw_grades.groupby([’Name’]).sum().sort_values()
  • hw_grades.groupby([’Name’, ’Late’]).sum().sort_values()
  • hw_grades.groupby([’Name’]).sum()[’Late’].sort_values()
  • hw_grades.groupby([’Name’]).sum().sort_values()[’Late’]
1
2
3
# Your answer here
# hw_grades.groupby(['Name']).sum().sort_values() # <---- Try to sort on df, but have to  give 'by=...' into sort_values()
hw_grades.groupby(['Name']).sum()['Late'].sort_values()
Name
Sid      1
Emily    2
Meg      2
Ash      3
Smith    3
Name: Late, dtype: int64

Question 8c

If each assignment is weighted equally, fill in the blanks below to calculate each student’s overall homework grade, including late penalties for any applicable assignments.

Hint: Recall that incomplete assignments have NaN values. How can we use fillna to replace these null values?

1
2
3
hw_grades._________(_______) \
         .groupby(___________)[____________] \
         .agg(____________)
1
2
3
4
5
# Your answer here
hw_grades.fillna(0)\
    .groupby(['Name'])['LPGrade']\
    .agg('mean')
# Python中,反斜杠 \ 用作行续字符,它允许你将一行代码分割成多行,以提高代码的可读性。这在编写较长的一行代码时特别有用,可以避免代码过于拥挤,使得代码更易于阅读和维护。
Name
Ash      80.830657
Emily    84.297725
Meg      69.218137
Sid      63.020729
Smith    58.332233
Name: LPGrade, dtype: float64

Question 8d

Of all the homework assignments, which are the most difficult in terms of the median grade? Order by the median grade, from lowest to greatest. Do not consider incomplete assignments or late penalties in this calculation.

Fill in the blanks below to answer this question.

Hint: Recall that incomplete assignments have NaN values. How can we use dropna to remove these null values?

1
2
3
4
hw_grades._________() \
         .groupby(___________)[____________] \
         .agg(____________) \
         .sort_values()
1
2
3
4
5
# Your answer here
hw_grades.dropna()\
    .groupby('Assignment')['Grade']\
    .agg('median')\
    .sort_values()
Assignment
Homework 2     64.160918
Homework 10    66.366211
Homework 5     74.372434
Homework 8     76.362904
Homework 4     78.207572
Homework 3     78.348163
Homework 9     82.517998
Homework 6     84.369535
Homework 1     85.473281
Homework 7     92.200688
Name: Grade, dtype: float64