K-Means Clustering

Stacy Chandisingh, Leo Pena, Shaif Hossain

12/1/23

Introduction: What is K-Means Clustering?

  • An unsupervised machine learning technique
  • K-means partitions a dataset into a fixed number (k) of clusters

  • Used to draw conclusions about a dataset based on groups of observations with similar variable values

Introduction: What is K-Means Clustering?

  • There are different ways to determine ‘k’; we use the elbow method, discussed later

Image source: https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/c71ea970-0f3c-4973-8d3a-b09a7a6553c1.xhtml

Our focus?

  • We used a Spotify dataset of track-level attributes (for example, tempo and popularity) and clustered tracks with similar features together
  • We want to enhance the listener experience by recommending popular songs whose attributes are similar to the songs a listener already enjoys.

Methods: diving into the K-Means algorithm

  • K-means assumes the clusters are symmetric; it can perform poorly when clusters are asymmetrical.
  • The data points are assumed to be independent.
  • Properly scaled data is important for K-means; unscaled data can lead to uneven clusters.
  • The number of clusters is fixed in advance for K-means.
  • We use descriptive statistics, histograms, and a correlation matrix to visualize the spread of the data.

  • Preprocess and scale data for ease of comparability.

  • Libraries used: tidyverse, cluster, and factoextra.

  • Euclidean distances are calculated as the clustering distance measure

  • The distance data is visualized using the fviz_dist() function in R from the factoextra package, as sketched below.
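A minimal sketch of this step, assuming the scaled numeric features are stored in spotify_scaled (a hypothetical name; the scaling itself is sketched later):

library(factoextra)

# Euclidean distances between tracks; get_dist() wraps base R's dist().
# A small subsample keeps the plotted matrix readable.
dist_mat <- get_dist(spotify_scaled[1:100, ], method = "euclidean")

# Heatmap of the pairwise distances
fviz_dist(dist_mat)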

\[WCSS = \sum_{i=1}^{K}\sum_{j=1}^{n_i}\left \| x_{ij} - c_{i} \right \|^2\]

  • K is the number of clusters.
  • n_i is the number of data points in cluster i.
  • c_i is the centroid of cluster i.
  • x_ij is the jth data point in cluster i.

The WCSS measures how tightly the data within a cluster are grouped. The variables used from the dataset are then scaled, and once scaled, the next steps are:

Euclidean distance is calculated between each data point and the cluster center: \[ D_{euclidean}(x, k_{i}) = \sqrt{\sum_{j=1}^{n}\left(x_{j} - k_{ij}\right)^{2}} \]

Where:

  • x is a data point.
  • k_i is the center of cluster i.
  • n is the number of attributes.
  • x_j is the jth attribute of the data point.
  • k_ij is the jth attribute of the centroid of cluster i.
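Base R's kmeans() minimizes the WCSS defined above. A minimal sketch of the fit, again assuming scaled features in spotify_scaled (hypothetical name):

# Seed chosen arbitrarily for reproducibility
set.seed(123)

# k = 3 per the elbow analysis later; nstart tries several random starts
km <- kmeans(spotify_scaled, centers = 3, nstart = 25)

km$tot.withinss          # total WCSS
km$betweenss / km$totss  # BSS/TSS ratio reported in the results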

Analysis and Results: Data and Visualization

Data source: Spotify data from Kaggle

For our segmentation we will be using a Spotify dataset that contains the audio features of each track. There are 32,828 observations across 22 variables: 12 numerical variables and 10 character variables.

Variables Description
‘track_id’ Spotify ID for each track.
‘track_name’ Name of the track.
‘track_artist’ Artist name.
‘track_popularity’ The popularity of each track, measured 0-100.
‘track_album_id’ Key specific to each album.
‘track_album_name’ Name of the album.
‘track_album_release_date’ Date the album was released.
‘playlist_name’ Name of the playlist the track can be found in; contains genre and subgenre.
‘playlist_id’ Key to the playlist.
‘playlist_genre’ Main category of each playlist.
‘playlist_subgenre’ Secondary category of each playlist.
‘danceability’ How well a track is suited for dancing, based on a combination of tempo, rhythm, beat, and regularity, on a scale of 0 to 1.
‘energy’ A perceptual measure of intensity and activity, measured between 0 and 1.
‘key’ The musical key the track is in.
‘loudness’ How loud the track is, in decibels (dB).
‘mode’ Whether the scale is major or minor.
‘speechiness’ The level of spoken words in a track (similar to a podcast or talk show), measured on a scale of 0 to 1.
‘acousticness’ Confidence that the track is acoustic, on a scale of 0 to 1.
‘instrumentalness’ Measures the lack of vocals in a track.
‘liveness’ Detects the presence of an audience in the track, on a scale of 0 to 1.
‘valence’ The musical positiveness conveyed by a track, measured on a scale of 0 to 1.
‘tempo’ The tempo of a track, in beats per minute.
‘duration_ms’ The duration of the track in milliseconds.
# Load tidyverse for the data wrangling below
library(tidyverse)

# Convert tibble to data frame
spotify_df <- as.data.frame(spotify_df)

spotify_df <- spotify_df %>% select(-track_id, -playlist_id, -track_album_id) %>% 
  mutate_at(c("playlist_genre", "playlist_subgenre", "mode", "key"), as.factor)

# Check for missing data. 5 missing values found for three variables. 
colSums(is.na(spotify_df))

# Omit missing values
spotify_df <- na.omit(spotify_df)

# Verify missing data has been removed
colSums(is.na(spotify_df))

# Convert duration to minutes (from milliseconds)
spotify_df <- spotify_df %>% mutate(duration_ms = duration_ms/60000) %>% rename(duration_min = duration_ms)
head(spotify_df)
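The scaling step described in the methods might look like the following sketch (exactly which numeric columns were kept is our assumption):

# Standardize the numeric features so no single variable dominates the
# distance calculation
spotify_scaled <- spotify_df %>%
  select(where(is.numeric)) %>%
  scale()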

Results

  • Covariance is a statistical measure revealing the relationship between variables.
  • Positive values indicate a positive relationship; for example, the positive value for loudness and energy tells us that as the loudness of a track increases, so does its energy.
  • Negative values indicate an inverse relationship; the more negative the value, the stronger the inverse relationship, as between loudness and acousticness.

In the correlation plot, the darker the blue the greater the correlation between the variables. The chart shows a positive correlation between energy and loudness and a negative correlation between acousticness and energy.
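The plotting code for the correlation matrix isn't shown here; one possible sketch uses the corrplot package (our assumption, not necessarily the package used):

library(corrplot)

# Correlation matrix of the numeric variables, shaded by strength
num_vars <- spotify_df %>% select(where(is.numeric))
corrplot(cor(num_vars), method = "color", type = "upper", tl.col = "black")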

  • Histograms of the variables to assess distribution and normality, as sketched below
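A sketch of that histogram grid using ggplot2 (loaded with tidyverse):

# One histogram per numeric variable to eyeball distribution and skew
spotify_df %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ name, scales = "free")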

We want to understand which genre is the most popular in our dataset. To start, we split the tracks into popular (1) and unpopular (0) based on a popularity score above 57 (on the 0-100 popularity scale). We filter out the unpopular songs and build our graph by playlist genre. We see that pop, Latin, and rap have the highest counts of popular songs in our review.
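A sketch of that split (the popular flag column name is our choice):

# Flag tracks above the popularity threshold
spotify_df <- spotify_df %>%
  mutate(popular = if_else(track_popularity > 57, 1, 0))

# Count popular tracks per genre
spotify_df %>%
  filter(popular == 1) %>%
  count(playlist_genre, sort = TRUE)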

Visualizing the data clusters

  • The point at which the line starts to bend is where we get our “k” clusters, hence the name “elbow curve.”
  • The plot bends into an “elbow” because each increase in k yields a smaller reduction in the within-cluster sum of squares. This bend indicates the ideal number of clusters for our model; a sketch of the plot follows below.
  • In our plot, we see that 3 is the ideal number of clusters, meaning there are 3 centroids to which the data points are assigned based on closeness.
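A sketch of the elbow plot with factoextra, assuming spotify_scaled from the earlier sketches:

# Total within-cluster sum of squares for k = 1..10
fviz_nbclust(spotify_scaled, kmeans, method = "wss")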

  • Three distinct clusters are shown here.
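A sketch of that cluster plot, reusing the fitted model km from the methods sketch:

# Points colored by cluster, projected onto the first two principal components
fviz_cluster(km, data = spotify_scaled, geom = "point")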

What did we find?

  • In cluster 1, there are 1,202 records.

  • In cluster 2, there are 3,190 records.

  • In cluster 3, there are 2,173 records.

Cluster Centroid Positions:

  • The ratio of between-cluster sum of squares (BSS) to total sum of squares (TSS).

  • This measurement shows how well separated the clusters are, on a scale from 0 to 1.

  • The closer to 1, the more distinct the clusters are within the dataset.

  • In our model, the BSS/TSS ratio is 0.2039852, which is fairly low for this type of model. However, we determined that a small number of clusters was sufficient here, and fewer clusters also tend to produce a lower BSS/TSS ratio.

BSS/TSS Ratio: 0.2039852
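These figures come straight out of the kmeans fit (object names from the earlier sketch):

km$size                  # records per cluster
km$centers               # centroid positions, in scaled units
km$betweenss / km$totss  # BSS/TSS ratio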

  • This table shows which cluster has the minimum and maximum of each attribute.
  • This table briefly summarizes the types of songs we have in each cluster.
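A sketch of how such a per-cluster summary can be built, assuming the rows of spotify_df still align with the rows used in the kmeans fit:

# Attach cluster labels and average each numeric attribute per cluster
spotify_df %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), mean))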

Popularity Assessment

  • In cluster 1, there are 388 popular tracks.

  • In cluster 2, there are 1,400 popular tracks.

  • In cluster 3, there are 410 popular tracks.
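A sketch reproducing these counts, reusing the popular flag and cluster labels from the earlier sketches:

# Popular tracks per cluster
spotify_df %>%
  mutate(cluster = km$cluster) %>%
  filter(popular == 1) %>%
  count(cluster)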

Cluster 1:

  • The highest duration_min, energy, instrumentalness, liveness, loudness, and tempo
  • Top songs include Blinding Lights, Crazy Train, and Sweet Child O’ Mine
  • Top genres are R&B, rock, and pop

Cluster 2:

  • The highest danceability, speechiness, and valence
  • Top songs include Memories, Falling, and Everything I Wanted
  • Top genres are Latin, pop, and rap
  • Has the highest number of popular tracks

Cluster 3:

  • The highest attribute is acousticness
  • Top songs include Roxanne, The Box, and Circles
  • Top genres are rock, pop, and EDM
  • The lowest attributes are energy, liveness, loudness, and tempo

Conclusions

  • Unsupervised machine learning finds hidden structure in an unlabeled dataset; its aim is to find similarities within groups of data.
  • Three cluster groups were found.
  • Study objective is to enhance a listener’s experience by recommending popular songs with similar musical attributes to their favorite songs.
  • Explored the relationship between different musical genres and variables to understand musical preferences.
  • We discussed the versatility of the K-means algorithm.
  • Future work could explore its potential in various areas beyond user experience.