DATA100-lab14: Clustering
|
|
Ungraded Lab 14: Clustering
In this lab you will explore K-Means, Agglomerative Clustering, and Spectral Clustering. Spectral Clustering is out of scope for Spring 2022.
Note: This is an ungraded assignment. There is no Gradescope submission for this assignment. As this is a bonus and ungraded assignment, there will also be more limited staff office hours devoted to this ungraded homework. We may prioritize students who have other questions.
|
|
Note: you may need to restart the kernel to use updated packages.
In the first part of this lab, we work with three different toy datasets, all with different clustering characteristics. In the second part, we explore a real-world dataset from the World Bank.
Toy Data 1: Balanced Clusters
Let us begin with a toy dataset with three groups that are completely separated with the variables given. There are the same number of points per group and the same variance within each group.
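The data-generation cell is not shown in this export. A minimal sketch of how such balanced clusters might be produced; the sample count, centers, and spread here are illustrative assumptions, not the lab's actual values:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three well-separated groups with the same number of points
# and the same variance per group (parameters are assumptions).
toy_data_1, true_labels = make_blobs(
    n_samples=300, centers=3, cluster_std=0.5, random_state=42
)
print(toy_data_1.shape)
```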
|
|
Below, we create a cluster.KMeans
object (documentation) which implements the K-Means algorithm.
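The fitting cell is omitted here; a sketch of what it might contain, assuming the toy data is a 2-D array (regenerated below with assumed parameters):

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs

# Stand-in for the lab's toy dataset (parameters are assumptions).
toy_data_1, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Fit K-Means with three clusters; random_state makes the run reproducible,
# and n_init controls how many random restarts K-Means performs.
kmeans = cluster.KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(toy_data_1)
print(np.unique(kmeans.labels_))  # the three cluster labels
```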
|
|
We observe that K-Means is able to accurately pick out the three initial clusters.
Question 1: Initial Centers
In the previous example, the K-Means algorithm was able to accurately find the three initial clusters. However, changing the starting centers for K-Means can change the final clusters that K-Means gives us. Change the initial centers to the points [0, 1]
, [1, 1]
, and [2, 2]
, and fit a cluster.KMeans
object (documentation) called kmeans_q1
on the toy dataset from the previous example. Keep the random_state
parameter as 42 and the n_clusters
parameter as 3.
Hint: You will need to change the init
and n_init
parameters in cluster.KMeans
.
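A possible solution sketch (the toy data is regenerated with assumed parameters). Passing an explicit array to `init` fixes the starting centers, and `n_init=1` tells scikit-learn to run only once from those centers:

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs

# Stand-in for the lab's toy dataset (parameters are assumptions).
toy_data_1, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Explicit starting centers from the question.
init_centers = np.array([[0, 1], [1, 1], [2, 2]])
kmeans_q1 = cluster.KMeans(
    n_clusters=3, init=init_centers, n_init=1, random_state=42
)
kmeans_q1.fit(toy_data_1)
```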
|
|
|
|
q1
passed! 🍀
Running the K-Means algorithm with these centers gives us a different result from before, and this particular run of K-Means was unable to accurately find the three initial clusters.
|
|
Toy Data 2: Clusters of Different Sizes
Sometimes, K-Means will have a difficult time finding the “correct” clusters even with ideal starting centers. For example, consider the data below.
|
|
There are two groups of different sizes in two different senses: variability (i.e., spread) and number of data points. The smaller group has both lower variability and fewer data points, while the larger group is more diffuse and more populated.
Question 2
Question 2a: K-Means
Fit a cluster.KMeans
object (documentation) called kmeans_q2a
on the dataset above with two clusters and a random_state
parameter of 42.
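A sketch of a possible solution; since the lab's data-generation cell is not shown, the two unequal groups are simulated below with assumed locations, spreads, and sizes:

```python
import numpy as np
from sklearn import cluster

# Two groups of different size and spread (all parameters are assumptions):
# a small tight group and a large diffuse group.
rng = np.random.default_rng(42)
small = rng.normal(loc=[-2, -2], scale=0.3, size=(50, 2))
large = rng.normal(loc=[2, 2], scale=1.5, size=(250, 2))
toy_data_2 = np.vstack([small, large])

kmeans_q2a = cluster.KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_q2a.fit(toy_data_2)
```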
|
|
|
|
q2a
passed! 🍀
(For notational simplicity, we will call the initial cluster on the bottom left $A$ and the initial cluster on the top right $B$. We will call the bottom-left cluster found by K-Means cluster $a$ and the top-right cluster found by K-Means cluster $b$.)
As seen below, K-Means is unable to find the two initial clusters because cluster $a$ includes points from cluster $B$. Recall that K-Means attempts to minimize inertia, so it makes sense that points in the bottom left of cluster $B$ would prefer to be in cluster $a$ rather than cluster $b$. If these points were in cluster $b$ instead, then the resulting cluster assignments would have a larger distortion.
|
|
Agglomerative Clustering: The Linkage Criterion
It turns out agglomerative clustering works better for this task, as long as we choose the right definition of distance between two clusters. Recall that agglomerative clustering starts with every data point in its own cluster and iteratively joins the two closest clusters until there are $k$ clusters remaining. However, the “distance” between two clusters is ambiguous.
In lecture, we used the maximum distance between a point in the first cluster and a point in the second as this notion of distance, but there are other ways to define the distance between two clusters.
Our choice of definition for the distance is sometimes called the “linkage criterion.” We will discuss three linkage criteria, each of which is a different definition of “distance” between two clusters:
- Complete linkage considers the distance between two clusters as the maximum distance between a point in the first cluster and a point in the second. This is what you will see in Lecture 26.
- Single linkage considers the distance between two clusters as the minimum distance between a point in the first cluster and a point in the second.
- Average linkage considers the distance between two clusters as the average distance between a point in the first cluster and a point in the second.
Below, we fit a cluster.AgglomerativeClustering
object (documentation) called agg_complete
on the dataset above with two clusters, using the complete linkage criterion.
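The fitting cell is omitted here; a sketch of it, with the unequal-groups data simulated under the same assumptions as before:

```python
import numpy as np
from sklearn import cluster

# Stand-in for the lab's two unequal groups (parameters are assumptions).
rng = np.random.default_rng(42)
small = rng.normal(loc=[-2, -2], scale=0.3, size=(50, 2))
large = rng.normal(loc=[2, 2], scale=1.5, size=(250, 2))
toy_data_2 = np.vstack([small, large])

# linkage="complete" merges the two clusters whose MAXIMUM
# pairwise point distance is smallest.
agg_complete = cluster.AgglomerativeClustering(n_clusters=2, linkage="complete")
agg_complete.fit(toy_data_2)
```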
|
|
Below we visualize the results:
|
|
It looks like complete linkage agglomerative clustering has the same issue as K-Means! The bottom left cluster found by complete linkage agglomerative clustering includes points from the top right cluster. However, we can remedy this by picking a different linkage criterion.
Question 2b: Agglomerative Clustering
Now, use the single linkage criterion to fit a cluster.AgglomerativeClustering
object called agg_single
on the dataset above with two clusters.
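A possible solution sketch, again with the data simulated under assumed parameters; only the `linkage` argument changes relative to the complete-linkage fit:

```python
import numpy as np
from sklearn import cluster

# Stand-in for the lab's two unequal groups (parameters are assumptions).
rng = np.random.default_rng(42)
small = rng.normal(loc=[-2, -2], scale=0.3, size=(50, 2))
large = rng.normal(loc=[2, 2], scale=1.5, size=(250, 2))
toy_data_2 = np.vstack([small, large])

# linkage="single" merges the two clusters whose MINIMUM
# pairwise point distance is smallest.
agg_single = cluster.AgglomerativeClustering(n_clusters=2, linkage="single")
agg_single.fit(toy_data_2)
```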
|
|
|
|
q2b
passed! 🌟
Finally, we see that single linkage agglomerative clustering is able to find the two initial clusters.
|
|
You might be curious why single linkage “works” while complete linkage does not in this scenario; we will leave this as an exercise for students who are interested.
Toy Data 3: Oddly Shaped Clusters
Another case where K-Means fails is when the clusters have odd shapes. For example, consider the following dataset.
|
|
Looking at this data, we might say there are 3 clusters, one for each of the 3 concentric circles, all sharing the same center. However, K-Means will fail to find them.
|
|
(Bonus) Question 3: Spectral Clustering
(Note in Spring 2022 we did not go over Spectral Clustering. Spectral Clustering is out of scope for exams.)
Let’s try spectral clustering instead.
In the cell below, create and fit a cluster.SpectralClustering
object (documentation), and assign it to spectral
. Use 3 clusters, and make sure you set affinity
to "nearest_neighbors"
and a random_state
of 10.
Note: Ignore any warnings about the graph not being fully connected.
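A possible solution sketch. Since the lab's ring-generation cell is not shown, three concentric rings are simulated below; the radii, noise level, and point counts are assumptions:

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(10)

def ring(radius, n):
    """Sample n noisy points on a circle of the given radius (illustrative)."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.05, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Three concentric rings with a shared center.
toy_data_3 = np.vstack([ring(1, 100), ring(2, 100), ring(3, 100)])

# Spectral clustering on a nearest-neighbors similarity graph can follow
# the ring shapes, which K-Means (spherical clusters) cannot.
spectral = cluster.SpectralClustering(
    n_clusters=3, affinity="nearest_neighbors", random_state=10
)
spectral.fit(toy_data_3)
```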
|
|
d:\miniconda3\envs\ds100\Lib\site-packages\sklearn\manifold\_spectral_embedding.py:329: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
warnings.warn(
|
|
q3
passed! ✨
Below, we see that spectral clustering is able to find the three rings, where K-Means could not.
|
|
The World Bank Dataset
In the previous three questions, we looked at clustering on two-dimensional datasets. However, we can just as easily use clustering on data with more than two dimensions. For this, let us turn to a World Bank dataset containing various features for the world's countries.
This data comes from https://databank.worldbank.org/source/world-development-indicators#.
|
|
Age dependency ratio (% of working-age population) | Age dependency ratio, old (% of working-age population) | Age dependency ratio, young (% of working-age population) | Bird species, threatened | Business extent of disclosure index (0=less disclosure to 10=more disclosure) | Contributing family workers, female (% of female employment) (modeled ILO estimate) | Contributing family workers, male (% of male employment) (modeled ILO estimate) | Contributing family workers, total (% of total employment) (modeled ILO estimate) | Cost of business start-up procedures (% of GNI per capita) | Cost of business start-up procedures, female (% of GNI per capita) | ... | Unemployment, youth total (% of total labor force ages 15-24) (modeled ILO estimate) | Urban population | Urban population (% of total population) | Urban population growth (annual %) | Vulnerable employment, female (% of female employment) (modeled ILO estimate) | Vulnerable employment, male (% of male employment) (modeled ILO estimate) | Vulnerable employment, total (% of total employment) (modeled ILO estimate) | Wage and salaried workers, female (% of female employment) (modeled ILO estimate) | Wage and salaried workers, male (% of male employment) (modeled ILO estimate) | Wage and salaried workers, total (% of total employment) (modeled ILO estimate) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||||||||||
Algeria | 57.508032 | 10.021442 | 47.486590 | 15.0 | 4.0 | 2.720000 | 1.836 | 1.978000 | 0.0 | 11.8 | ... | 29.952999 | 30670086.0 | 72.629 | 2.804996 | 24.337001 | 27.227001 | 26.762000 | 73.734001 | 68.160004 | 69.056000 |
Afghanistan | 84.077656 | 4.758273 | 79.319383 | 16.0 | 8.0 | 71.780998 | 9.606 | 31.577999 | 0.0 | 6.4 | ... | 2.639000 | 9477100.0 | 25.495 | 3.350383 | 95.573997 | 85.993001 | 89.378998 | 4.282000 | 13.292000 | 10.108000 |
Albania | 45.810037 | 20.041214 | 25.768823 | 8.0 | 9.0 | 37.987000 | 20.795 | 28.076000 | 0.0 | 11.3 | ... | 30.979000 | 1728969.0 | 60.319 | 1.317162 | 54.663000 | 54.994001 | 54.854000 | 44.320999 | 41.542999 | 42.720001 |
American Samoa | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 48339.0 | 87.153 | -0.299516 | NaN | NaN | NaN | NaN | NaN | NaN |
Andorra | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 67813.0 | 88.062 | -0.092859 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 209 columns
There are some missing values. For the sake of convenience, and to keep the lab short, we will fill them all with zeros.
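The fill step itself is a one-liner with `fillna`; sketched here on a tiny stand-in frame (the real World Bank frame has 209 columns):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the World Bank frame, with some missing values.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# Replace every NaN with zero, as the lab does for convenience.
world_bank_filled = df.fillna(0)
print(world_bank_filled.isna().sum().sum())  # 0 — no missing values remain
```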
|
|
Like with PCA, it sometimes makes sense to center and scale our data so that features with higher variance don’t dominate the analysis. For example, without standardization, statistics like population will completely dominate features like “percent of total population that live in urban areas.” This is because the range of populations is on the order of billions, whereas percentages are always between 0 and 100. The ultimate effect is that many of our columns are not really considered by our clustering algorithm.
Question 4
Below, fit a cluster.KMeans
object called kmeans_q4
with four clusters and a random_state
parameter of 42.
Make sure you use a centered and scaled version of the World Bank data. By centered and scaled, we mean that each column should have zero mean and unit variance.
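A possible solution sketch on a stand-in frame (random columns with wildly different scales, mimicking the population-vs-percentage problem described above; the real data is the filled World Bank frame). The per-column standardization uses pandas `mean`/`std`; note pandas `std` uses the sample estimate (ddof=1), which is fine for this purpose:

```python
import numpy as np
import pandas as pd
from sklearn import cluster

# Stand-in for the World Bank frame: 5 columns spanning several
# orders of magnitude (the real frame has 209 numeric columns).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(50, 5)) * np.array([1, 10, 100, 1000, 1e6]))

# Center and scale: zero mean and unit variance per column, so no single
# large-magnitude column dominates the distance computation.
standardized = (df - df.mean()) / df.std()

kmeans_q4 = cluster.KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_q4.fit(standardized)
```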
|
|
|
|
[np.int64(3), np.int64(23), np.int64(85), np.int64(106)]
Note: the cluster sizes above differ from the reference run, in which all tests passed; the discrepancy is likely due to changes in scikit-learn's K-Means implementation between versions.
|
|
q4
results:
q4 - 1
result:
❌ Test case failed
Trying:
    sorted(np.unique(kmeans_q4.labels_, return_counts = True)[1]) == [3, 25, 90, 99]
Expecting:
    True
**********************************************************************
Line 1, in q4 0
Failed example:
    sorted(np.unique(kmeans_q4.labels_, return_counts = True)[1]) == [3, 25, 90, 99]
Expected:
    True
Got:
    False
Looking at these new clusters, they seem to correspond to the groups below (K-Means labels are arbitrary, so this numbering may not match the printed output):
0: Very small countries.
1: Developed countries.
2: Less developed countries.
3: Huge countries.
|
|
>>> Cluster 0:
['Afghanistan', 'Angola', 'Bangladesh', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Botswana', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Central African Republic', 'Chad', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', "Cote d'Ivoire", 'Djibouti', 'Ecuador', 'Egypt, Arab Rep.', 'Equatorial Guinea', 'Eritrea', 'Eswatini', 'Ethiopia', 'Fiji', 'Gabon', 'Gambia, The', 'Ghana', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Indonesia', 'Iraq', 'Kenya', 'Kiribati', 'Kyrgyz Republic', 'Lao PDR', 'Lesotho', 'Liberia', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Micronesia, Fed. Sts.', 'Mongolia', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Nicaragua', 'Niger', 'Nigeria', 'Pakistan', 'Papua New Guinea', 'Paraguay', 'Philippines', 'Rwanda', 'Samoa', 'Sao Tome and Principe', 'Senegal', 'Sierra Leone', 'Solomon Islands', 'Somalia', 'South Sudan', 'Sudan', 'Syrian Arab Republic', 'Tajikistan', 'Tanzania', 'Timor-Leste', 'Togo', 'Tonga', 'Uganda', 'Uzbekistan', 'Vanuatu', 'Venezuela, RB', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe']
>>> Cluster 1:
['China', 'India', 'United States']
>>> Cluster 2:
['American Samoa', 'Andorra', 'Bermuda', 'British Virgin Islands', 'Cayman Islands', 'Dominica', 'Faroe Islands', 'Gibraltar', 'Greenland', 'Isle of Man', 'Kosovo', 'Liechtenstein', 'Marshall Islands', 'Monaco', 'Nauru', 'Northern Mariana Islands', 'Palau', 'San Marino', 'Sint Maarten (Dutch part)', 'St. Kitts and Nevis', 'St. Martin (French part)', 'Turks and Caicos Islands', 'Tuvalu']
>>> Cluster 3:
['Algeria', 'Albania', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas, The', 'Bahrain', 'Barbados', 'Belarus', 'Belgium', 'Bosnia and Herzegovina', 'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Cabo Verde', 'Canada', 'Channel Islands', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'El Salvador', 'Estonia', 'Finland', 'France', 'French Polynesia', 'Georgia', 'Germany', 'Greece', 'Grenada', 'Guam', 'Hong Kong SAR, China', 'Hungary', 'Iceland', 'Iran, Islamic Rep.', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Korea, Dem. People’s Rep.', 'Korea, Rep.', 'Kuwait', 'Latvia', 'Lebanon', 'Libya', 'Lithuania', 'Luxembourg', 'Macao SAR, China', 'Malaysia', 'Maldives', 'Malta', 'Mauritius', 'Mexico', 'Moldova', 'Montenegro', 'Netherlands', 'New Caledonia', 'New Zealand', 'North Macedonia', 'Norway', 'Oman', 'Panama', 'Peru', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Romania', 'Russian Federation', 'Saudi Arabia', 'Serbia', 'Seychelles', 'Singapore', 'Slovak Republic', 'Slovenia', 'South Africa', 'Spain', 'Sri Lanka', 'St. Lucia', 'St. Vincent and the Grenadines', 'Suriname', 'Sweden', 'Switzerland', 'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Virgin Islands (U.S.)']