DATA100-disc02

2024-07-15 1043 words 5 minutes

Contents

Discussion 2: Pandas Practice

We will begin our discussion of Pandas. You will practice:

Selecting columns
Filtering with boolean conditions
Counting with value_counts

1
2


import pandas as pd
import numpy as np

Pandas Practise

In the first Pandas question, we will be working with the elections dataset from lecture.

1
2


elections = pd.read_csv("elections.csv") # read in the elections data into a pandas dataframe!
elections.head(5)

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122
1	1824	John Quincy Adams	Democratic-Republican	113142	win	42.789878
2	1828	Andrew Jackson	Democratic	642806	win	56.203927
3	1828	John Quincy Adams	National Republican	500897	loss	43.796073
4	1832	Andrew Jackson	Democratic	702735	win	54.574789

Question 5

We want to select the “Popular vote” column as a pd.Series. Which of the following lines of code will error?

elections['Popular vote']
elections.iloc['Popular vote']
elections.loc['Popular vote']
elections.loc[:, 'Popular vote']
elections.iloc[:, 'Popular vote']

Run each line in the cell below and see for yourself!

1
2
3
4
5


# elections.iloc['Popular vote'] # wrong
# elections.iloc[:, 'popular votes'] # wrong
# elections['Popular vote'] # right
# elections.loc['Popular vote'] # ket error
# elections.loc[:,'Popular vote'] # right

0        151271
1        113142
2        642806
3        500897
4        702735
         ...   
173    62984828
174      732273
175     4489235
176    65853514
177     1457226
Name: Popular vote, Length: 178, dtype: int64

Question 6

Write one line of Pandas code that returns a pd.DataFrame that only contains election results from the 1900s.

1

elections[(elections['Year'] >= 1900) & (elections['Year'] < 2000)] # 注意是 &

	Year	Candidate	Party	Popular vote	Result	%
54	1900	John G. Woolley	Prohibition	210864	loss	1.526821
55	1900	William Jennings Bryan	Democratic	6370932	loss	46.130540
56	1900	William McKinley	Republican	7228864	win	52.342640
57	1904	Alton B. Parker	Democratic	5083880	loss	37.685116
58	1904	Eugene V. Debs	Socialist	402810	loss	2.985897
...	...	...	...	...	...	...
146	1996	Harry Browne	Libertarian	485759	loss	0.505198
147	1996	Howard Phillips	Taxpayers	184656	loss	0.192045
148	1996	John Hagelin	Natural Law	113670	loss	0.118219
149	1996	Ralph Nader	Green	685297	loss	0.712721
150	1996	Ross Perot	Reform	8085294	loss	8.408844

97 rows × 6 columns

Question 7

Write one line of Pandas code that returns a pd.Series, where the index is the Party, and the values are how many times that party won an election. Hint: use value_counts.

1
2


# Your answer here
elections['Party'].value_counts()

Party
Democratic               46
Republican               40
Prohibition              11
Libertarian              11
Socialist                10
Independent               6
Whig                      6
Green                     6
Progressive               4
Populist                  3
Constitution              3
American Independent      3
American                  2
National Republican       2
Democratic-Republican     2
Reform                    2
Free Soil                 2
Anti-Masonic              1
National Union            1
Constitutional Union      1
National Democratic       1
Union Labor               1
Greenback                 1
Anti-Monopoly             1
Liberal Republican        1
Southern Democratic       1
Northern Democratic       1
Farmer–Labor              1
Dixiecrat                 1
States' Rights            1
Communist                 1
Union                     1
Taxpayers                 1
New Alliance              1
Citizens                  1
Natural Law               1
Name: count, dtype: int64

Grading Assistance (Bonus)

Fernando is writing a grading script to compute grades for students in Data 101. Recall that many factors go into computing a student’s final grade, including homework, discussion, exams, and labs. In this question, we will help Fernando compute the homework grades for all students using a DataFrame, hw_grades, provided by Gradescope.

The Pandas DataFrame hw_grades contains homework grades for all students for all homework assignments, with one row for each combination of student and homework assignment. Any assignments that are incomplete are denoted by NaN (missing) values, and any late assignments are denoted by a True boolean value in the Late column. You may assume that the names of students are unique. Below is a sample of hw_grades.

1
2


hw_grades = pd.read_csv("hw_grades.csv")
hw_grades.sample(5, random_state = 0)

	Name	Assignment	Grade	Late
28	Sid	Homework 9	82.517998	True
11	Ash	Homework 2	78.264844	True
10	Ash	Homework 1	98.421049	False
41	Emily	Homework 2	62.900313	False
2	Meg	Homework 3	89.785619	False

Question 8a

Assuming there is a late penalty that causes a 10% grade reduction to the student’s current score (i.e. a 65% score would become a 65% - 6.5% = 58.5%), write a line of Pandas code to calculate all the homework grades, including the late penalty if applicable, and store it in a column named ’LPGrade’.

1
2
3


# Your answer here
hw_grades['LPGrade'] = hw_grades['Grade'] * (1 - hw_grades['Late'] * 0.1) # 用个隐式转换
hw_grades.head()

	Name	Assignment	Grade	Late	LPGrade
0	Meg	Homework 1	NaN	False	NaN
1	Meg	Homework 2	64.191844	False	64.191844
2	Meg	Homework 3	89.785619	False	89.785619
3	Meg	Homework 4	74.420033	False	74.420033
4	Meg	Homework 5	74.372434	True	66.935190

Question 8b

Which of the following expressions outputs the students’ names and number of late assignments, from least to greatest number of late assignments?

hw_grades.groupby([’Name’]).sum().sort_values()
hw_grades.groupby([’Name’, ’Late’]).sum().sort_values()
hw_grades.groupby([’Name’]).sum()[’Late’].sort_values()
hw_grades.groupby([’Name’]).sum().sort_values()[’Late’]

1
2
3


# Your answer here
# hw_grades.groupby(['Name']).sum().sort_values() # <---- Try to sort on df, but have to  give 'by=...' into sort_values()
hw_grades.groupby(['Name']).sum()['Late'].sort_values()

Name
Sid      1
Emily    2
Meg      2
Ash      3
Smith    3
Name: Late, dtype: int64

Question 8c

If each assignment is weighted equally, fill in the blanks below to calculate each student’s overall homework grade, including late penalties for any applicable assignments.

Hint: Recall that incomplete assignments have NaN values. How can we use fillna to replace these null values?

1
2
3


hw_grades._________(_______) \
         .groupby(___________)[____________] \
         .agg(____________)

1
2
3
4
5


# Your answer here
hw_grades.fillna(0)\
    .groupby(['Name'])['LPGrade']\
    .agg('mean')
# Python中，反斜杠 \ 用作行续字符，它允许你将一行代码分割成多行，以提高代码的可读性。这在编写较长的一行代码时特别有用，可以避免代码过于拥挤，使得代码更易于阅读和维护。

Name
Ash      80.830657
Emily    84.297725
Meg      69.218137
Sid      63.020729
Smith    58.332233
Name: LPGrade, dtype: float64

Question 8d

Of all the homework assignments, which are the most difficult in terms of the median grade? Order by the median grade, from lowest to greatest. Do not consider incomplete assignments or late penalties in this calculation.

Fill in the blanks below to answer this question.

Hint: Recall that incomplete assignments have NaN values. How can we use dropna to remove these null values?

1
2
3
4


hw_grades._________() \
         .groupby(___________)[____________] \
         .agg(____________) \
         .sort_values()

1
2
3
4
5


# Your answer here
hw_grades.dropna()\
    .groupby('Assignment')['Grade']\
    .agg('median')\
    .sort_values()

Assignment
Homework 2     64.160918
Homework 10    66.366211
Homework 5     74.372434
Homework 8     76.362904
Homework 4     78.207572
Homework 3     78.348163
Homework 9     82.517998
Homework 6     84.369535
Homework 1     85.473281
Homework 7     92.200688
Name: Grade, dtype: float64