The researchers’ main focus for this project is to use an unsupervised machine learning technique, K-means clustering, to glean insights from data. Unsupervised machine learning finds hidden structure in an unlabeled dataset; its aim is to find similarities within groups of data. A central concern of the K-means technique is determining good initial centroids for the clusters (Zubair et al. 2022). The aim is to group similar data points into ‘K’ clusters based on the similarity of the data, measured as the distance to a centroid. One variant, global K-means, does not rely on an initial center but starts with one cluster and incrementally finds the best centroid in the dataset (Xie and Jiang 2010). The centroid is generally the average position of all the points in a cluster. We observed that an overarching theme of the K-means literature is improving the cost function and the efficiency of K-means clustering on real datasets (Žalik 2008).
In this paper, we explore the usefulness of the K-means clustering algorithm to gain insights from data obtained from the open-source website Kaggle. The data describes attributes, such as tempo and popularity, of music tracks from the streaming platform Spotify. We aim to determine the best number of clusters, or groups, to which these attributes belong. Identifying clusters helps us understand the nature of a musical track and lets us offer song recommendations based on similar features that are clustered together. We want to enhance the listener experience by recommending popular songs whose attributes are similar to those of the songs listeners already enjoy.
Methods of choosing the number of clusters K in K-means clustering include the elbow method, rules of thumb, information criterion and information theoretic approaches, the silhouette, and cross-validation (Kodinariya and Makwana 2013). Different statistical techniques, such as Principal Component Analysis (PCA), can also be used within a K-means workflow to improve the efficiency and accuracy of the algorithm. One effective way of finding the initial centroids and reducing the number of iterations uses PCA and percentiles: the data undergo PCA, a percentile is applied, the dataset is split and the mean of each attribute found, the centroids are determined, and finally the clusters are generated (Zubair et al. 2022). Other methods include the hash algorithm or canopy algorithm for improving the choice of cluster centers (Ashabi, Sahibuddin, and Haghighi 2020). A filtering algorithm is also discussed in (Kanungo et al. 2002).
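As an illustration of one of these approaches, below is a minimal sketch of choosing K via the average silhouette width in R. The scaled numeric matrix X is a hypothetical placeholder, and fviz_nbclust() comes from the factoextra package discussed later in this paper.
Code
# Hedged sketch: average silhouette width for candidate K on a hypothetical
# scaled matrix X; the K that maximizes the silhouette is a common choice
library(factoextra)
fviz_nbclust(X, kmeans, method = "silhouette", k.max = 10)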
Different variants of the K-means algorithm are best suited to certain datasets and use cases. One example is the fuzzy K-means algorithm, which is combined with other techniques that prepare and analyze the data for a particular use case; it can show transitions between zones and build a practical story from the associated patterns (Huang and Ng 1999). We observed that K-means algorithms are typically tailored to the use case being studied. Fuzzy K-means accommodates overlapping clusters, where data points share similar feature values (standard K-means does not), and is used as part of the process of identifying and diagnosing melanoma skin cancer in patients: a detection method combining deep learning and fuzzy K-means can positively impact early skin cancer detection by increasing detection accuracy (Nawaz et al. 2021). Deep learning was also involved in a two-step clustering technique for data reduction that combines the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm with K-means clustering (Kremers et al. 2023). In another application, bilateral filtering was used for edge-preserving denoising preprocessing of the target variable, and a mixed model was used to improve the K-means algorithm; the result was a performance appraisal scoring system for college counselors (Wang and Tian 2022). K-means clustering is also widely used in pattern recognition and machine learning, where the selection of initial cluster centers and the distance of points from the centroid are used for validation (Qin and Zheng 2009).
The researchers understand how the selection of the K value can affect the results of a study. The seismic data analyzed in (Xiong, Qiu, and Niu 2021) highlight that K=5 provides the least stable results, while K=4 is the optimal choice as supported by the elbow method.
The number of clusters can also be evaluated using tools like the Davies-Bouldin (DB) index, where the index value is computed for each candidate K. The DB index measures how good the clusters are by rewarding low within-cluster variation and high between-cluster separation; lower values indicate better clustering. A comparison of the DB index for an enhanced K-means algorithm, the Principal Component Analysis-Particle Swarm Optimization (PCA-PSO) K-means algorithm, against the general K-means algorithm found that PCA-PSO-K-means produced a lower DB index for all clusters, indicating better cluster performance (Sadeghi et al. 2022). When adapting the right measures for K-means clustering, other well-researched approaches come from data mining, information retrieval, machine learning, and statistics; these can be evaluated through the contingency matrix, which shows the frequency distribution between multiple variables and compares the “purity” of the clusters (Wu, Xiong, and Chen 2009). The general framework of the PCA-PSO-K-means algorithm includes computing the mean, centering the data, computing the covariance matrix, computing the eigenvalues and eigenvectors, finding the fraction of total variance, choosing the dimensionality, reducing the basis and dimensionality of the data, initializing particles, and calculating the fitness value (Sadeghi et al. 2022).
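To make the DB index concrete, here is a minimal manual sketch in R under the definition above. The data matrix X, cluster assignments, and centroid matrix are hypothetical inputs, not code from the cited studies.
Code
# Davies-Bouldin index: mean over clusters of the worst-case ratio of
# within-cluster scatter to between-centroid separation (lower is better)
db_index <- function(X, cluster, centers) {
  k <- nrow(centers)
  # S[i]: mean distance of cluster i's points to its centroid
  S <- sapply(seq_len(k), function(i) {
    pts <- X[cluster == i, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centers[i, ])^2)))
  })
  M <- as.matrix(dist(centers))   # distances between centroids
  R <- outer(S, S, "+") / M       # pairwise similarity ratios
  diag(R) <- -Inf                 # ignore each cluster paired with itself
  mean(apply(R, 1, max))
}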
After reviewing multiple peer-reviewed articles, the general trend we observed was the importance of cleaning and preprocessing data into a suitable format before running a K-means algorithm, including K-means clustering with outlier removal (Gan and Ng 2017). Normalization of the data is also a necessary step when validating K-means clustering algorithms (Wu, Xiong, and Chen 2009). This process can be carried out with R packages such as tidyverse, cluster, and factoextra. Missing values in the dataset should be considered: they can either be removed or replaced with an imputed value. Variables can be standardized to make them comparable. Euclidean distances are then calculated as the clustering distance measure, and the distance data can be visualized using the fviz_dist() function from the factoextra package. Descriptive statistics and bar graphs can also be generated to visualize the spread of the data (Abdullah et al. 2021). The researchers of this paper set out to explore the K-means clustering algorithm.
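A minimal sketch of that preprocessing pipeline in R, assuming a hypothetical numeric data frame df; get_dist() and fviz_dist() are from factoextra, following Abdullah et al. (2021).
Code
library(tidyverse)
library(factoextra)

df_clean  <- na.omit(df)                        # remove missing values
df_scaled <- scale(df_clean)                    # standardize for comparability
d <- get_dist(df_scaled, method = "euclidean")  # pairwise Euclidean distances
fviz_dist(d)                                    # visualize the distance matrix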
Methods
The focus is to apply the unsupervised machine learning technique of K-means clustering to the Spotify dataset. K-means clustering is an algorithm for grouping data into K clusters based on similarity, using the Euclidean distance metric (Xie and Jiang 2010). Initially, the centroids are determined; data points are then assigned to the clusters, and the centroids are updated until convergence (Zubair et al. 2022). The objective the algorithm minimizes is the within-cluster sum of squares (WCSS): \[ WCSS = \sum_{i=1}^{K}\sum_{x \in C_{i}} \left\| x - \mu_{i} \right\|^{2} \] where \(C_i\) is the set of points in cluster \(i\) and \(\mu_i\) is its centroid.
The WCSS measures how tightly the data within each cluster are grouped. The variables used from the dataset are first scaled. After scaling, the next steps are:
Euclidean distance is calculated between each data point and each cluster center: \[ D_{\text{euclidean}}(x, k_{i}) = \sqrt{\sum_{j=1}^{n}\left(x_{j} - k_{ij}\right)^{2}} \] Where:
x is a data point.
k_i is the center of cluster i.
n is the number of attributes.
x_j is the jth attribute of the data point.
k_ij is the jth attribute of the centroid of cluster i.
Recalculate the new cluster centers by averaging the attributes of the assigned points, and repeat until convergence. Different techniques, such as the silhouette score, can be used to identify the optimal number of clusters (Kodinariya and Makwana 2013).
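The assignment and update steps can be written compactly; the following is a hedged sketch with hypothetical names (X is an n-by-p data matrix, centers a K-by-p centroid matrix), not the implementation used in our analysis.
Code
# One assignment pass: each point goes to its nearest centroid (Euclidean)
assign_clusters <- function(X, centers) {
  apply(X, 1, function(x) {
    which.min(apply(centers, 1, function(ctr) sqrt(sum((x - ctr)^2))))
  })
}

# One update pass: each centroid becomes the mean of its assigned points
update_centers <- function(X, cluster, K) {
  t(sapply(seq_len(K), function(k) colMeans(X[cluster == k, , drop = FALSE])))
}
# Alternating these two passes until assignments stop changing is the
# convergence loop described above.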
Assumptions:
K-means assumes clusters are roughly spherical and symmetric; asymmetrical clusters can be handled poorly.
The data points are assumed to be independent.
Properly scaled data is important for K-means; unscaled variables can lead to uneven clusters.
K-means requires a fixed number of clusters chosen in advance.
The final step is to assess the quality of the clusters. We evaluate the performance of the K-means algorithm using visualizations and metrics such as the WCSS (Abdullah et al. 2021).
We employ the elbow method, a visual approach to determining the value of K. The elbow method uses a graph that is inspected for a pivot point: the value on the x-axis at the bend of the “elbow” is taken as the K value, as sketched below. In our Spotify analysis, we found the optimal grouping of similar data was 3 clusters.
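A generic sketch of the elbow method in R, assuming a hypothetical scaled matrix X: compute the total WCSS for each candidate K and look for the bend when the values are plotted.
Code
set.seed(123)  # hypothetical seed for reproducible random starts
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total WCSS")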
In this section, we aim to uncover the number of clusters that can be associated with distinct musical genres based on grouping similar Spotify variables. We aim to establish an efficient method of categorizing the Spotify musical selections available in our dataset.
Data inspection and visualization were conducted as part of this analysis to determine the structure of our Spotify dataset.
For our segmentation we use a Spotify dataset that contains the audio features of each track. There are 32,828 observations across 22 variables: 12 numerical variables and 10 character variables.
The style of a music track is described by the playlist variables, which contain its genre and subgenre. The variables are:
‘playlist_id’: Key to the style category.
‘playlist_genre’: Main category of each playlist.
‘playlist_subgenre’: Secondary category of each playlist.
‘danceability’: How well-suited a track is for dancing, based on a combination of tempo, rhythm, beat, and regularity, on a scale of 0 to 1.
‘energy’: A perceptual measure of intensity and activity, measured between 0 and 1.
‘key’: The musical key the track is in.
‘loudness’: How loud the track is, in decibels (dB).
‘mode’: Whether the scale is major or minor.
‘speechiness’: The level of spoken words in a track (similar to a podcast or talk show), measured on a scale of 0 to 1.
‘acousticness’: How confident we are that the track is acoustic, on a scale of 0 to 1.
‘instrumentalness’: A measure of the lack of vocals in a track.
‘liveness’: Detects the presence of an audience in the track, on a scale of 0 to 1.
‘valence’: The musical positiveness conveyed by a track, measured on a scale of 0 to 1.
‘tempo’: The tempo of a track, measured in beats per minute.
‘duration_ms’: The duration of the track in milliseconds.
Data Pre-Processing
To start, we convert the data set to a data frame to work with multiple classes of data.
There are columns that do not tell us anything about the popularity or sound profile of a track: the ID columns for the track, playlist, and album. They have been removed from the dataset for our analysis, because each data point can be identified by its track name.
Missing data will skew our results. When we check for missing values we notice there are 5 observations missing in track_name, track_artist, and track_album. We omit these observations, as 5 rows out of roughly 32,000 will not drastically change our model.
We will review which genres and subgenres are the most popular in our dataset, changing the data type to factors for analysis.
After removal, we verified that no missing values remain.
duration_ms is the length of a song in milliseconds. We want to work with a column that makes sense to the average person, so we convert milliseconds to minutes.
Code
# Convert tibble to data frame
spotify_df <- as.data.frame(spotify_df)

spotify_df <- spotify_df %>%
  select(-track_id, -playlist_id, -track_album_id) %>%
  mutate_at(c("playlist_genre", "playlist_subgenre", "mode", "key"), as.factor)

# Check for missing data. 5 missing values found for three variables.
# colSums(is.na(spotify_df))

# Omit missing values
spotify_df <- na.omit(spotify_df)

# Verify missing data has been removed
# colSums(is.na(spotify_df))

# Convert duration to minutes (from milliseconds)
spotify_df <- spotify_df %>%
  mutate(duration_ms = duration_ms / 60000) %>%
  rename(duration_min = duration_ms)
# head(spotify_df)
The summary function returns descriptive statistics for our numerical variables.
Code
# Remove duplicates based on the track_id column
# spotify_new_df <- spotify_df[!duplicated(spotify_df$track_id),]
# head(spotify_new_df)

# Filter only numeric variables for the table summary
spotify_num <- spotify_df %>%
  select(where(is.numeric))
# head(spotify_num)

# Summarize the numeric data with the gtsummary package
table1 <- spotify_num %>%
  tbl_summary()
# (include = c(track_popularity, danceability, energy, loudness, speechiness,
#              acousticness, instrumentalness, liveness, valence, tempo, duration_min))
table1
| Characteristic | N = 32,828¹ |
|---|---|
| track_popularity | 45 (24, 62) |
| danceability | 0.67 (0.56, 0.76) |
| energy | 0.72 (0.58, 0.84) |
| loudness | -6.17 (-8.17, -4.65) |
| speechiness | 0.06 (0.04, 0.13) |
| acousticness | 0.08 (0.02, 0.26) |
| instrumentalness | 0.00 (0.00, 0.00) |
| liveness | 0.13 (0.09, 0.25) |
| valence | 0.51 (0.33, 0.69) |
| tempo | 122 (100, 134) |
| duration_min | 3.60 (3.13, 4.23) |

¹ Median (IQR)
Explore Data
Covariance is a statistical measure revealing the relationship between two variables. Positive values indicate a positive relationship: the positive value for loudness and energy tells us that as the loudness of a track increases, so does its energy. Negative values indicate an inverse relationship, such as that between loudness and acousticness.
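The covariance and correlation tables referenced here can be produced with base R; a small sketch using the spotify_num object defined earlier:
Code
round(cov(spotify_num), 2)  # covariance: sign shows direction of co-movement
round(cor(spotify_num), 2)  # correlation: the scale-free values plotted below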
In the correlation plot, we visualize the values from the correlation table. The darker the blue, the stronger the positive correlation between two variables. The chart shows a positive correlation between energy and loudness and a negative correlation between acousticness and energy.
Code
# Show correlation matrix
corrplot(cor(spotify_num))
Next, we will explore the distribution of data for each variable using histograms:
Code
# Show histograms of variables to assess distribution and normality
spotify_num %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill = "skyblue")
We want to understand which genre is the most popular in our dataset. To start, we split the tracks between popular (1) and unpopular (0) based on a popularity score of 57 or higher (on a scale of 0-99). We filter out all of the unpopular songs and build our graph by playlist genre. We see that pop, latin, and rap have the highest counts of popular songs in our review.
Code
spotify_df <- spotify_df %>%
  mutate(popular = if_else(track_popularity >= 57, "1", "0"))

popular <- spotify_df %>%
  filter(popular == "1")

popular_genre <- popular %>%
  count(playlist_genre) %>%
  rename("total" = "n") %>%
  arrange(desc(total))

popular_genre %>%
  ggplot(aes(y = reorder(playlist_genre, total), x = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#007a33", high = "#004c97") +
  labs(title = "Most Popular Genre",
       y = "genre",
       x = "total of popular songs") +
  theme_minimal()
Music has several subgenre styles we can examine more deeply to see how these popular genres are composed. In our subgenre view, hip pop, post-teen pop, and urban contemporary are the most popular in our dataset.
Code
popular_subgenre <- popular %>%
  count(playlist_subgenre) %>%
  rename("total" = "n") %>%
  arrange(desc(total))

popular_subgenre %>%
  ggplot(aes(y = reorder(playlist_subgenre, total), x = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#007a33", high = "#004c97") +
  labs(title = "Most Popular SubGenre",
       y = "genre",
       x = "total of popular songs") +
  theme_minimal()
Statistical Modeling
Before we create our model, we take a 20% sample of the data and scale it; at the full dataset size R cannot allocate the vectors needed for clustering.
Code
# Scale the data; R cannot allocate vectors at the full dataset size,
# so we first sample 20% of the rows
spotify_new_df <- spotify_df %>%
  mutate(popular = as.factor(popular))

split <- sample(nrow(spotify_new_df), 0.2 * nrow(spotify_new_df))
reduced_kmeans <- spotify_new_df[split, ]

reduced_num <- reduced_kmeans %>%
  select(where(is.numeric))

scale <- scale(reduced_num)
The elbow method is a popular graphical method for finding the optimal K in K-means clustering. It plots the within-cluster sum of squared errors (WCSS) against the number of clusters K. The plot shows a bend, the “elbow”, because each additional cluster reduces the WCSS by a diminishing amount; this bend indicates the ideal number of clusters for our model. In our plot, we see that 3 is the ideal number of clusters, meaning there are 3 centroids to which the data points are assigned based on proximity. The within-cluster sum of squares (WSS) measures the compactness of the clustering, and we want it as low as possible for a given K.
We plot the WSS values against the number of clusters, as in the sketch below. The bend in the elbow plot is at 3, so this is the number of clusters we will set in our model.
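The elbow plot itself can be produced with factoextra; a hedged sketch assuming the scaled matrix scale created above (our reconstruction, not necessarily the exact original call):
Code
set.seed(123)  # hypothetical seed for reproducible cluster starts
fviz_nbclust(scale, kmeans, method = "wss", k.max = 10)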
We can view the clusters in a plot, with additional information on each cluster below.
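A minimal sketch of fitting the final model and plotting the clusters, assuming the scaled matrix scale; the object name km and the fviz_cluster() call are our reconstruction of this step.
Code
set.seed(123)
km <- kmeans(scale, centers = 3, nstart = 25)  # K = 3 from the elbow plot
fviz_cluster(km, data = scale, geom = "point") # clusters in PCA space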
BSS/TSS Ratio
The ratio of between-cluster sum of squares (BSS) to total sum of squares (TSS) shows how well separated the clusters are, on a scale from 0 to 1. The closer to 1, the more distinct the clusters within the dataset. Our BSS/TSS ratio is relatively low for this type of model; however, we determined that a low number of clusters was sufficient here, which also results in a low BSS/TSS ratio.
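The ratio comes directly from the kmeans fit; a one-line sketch assuming the hypothetical km object above:
Code
km$betweenss / km$totss  # BSS/TSS: closer to 1 means more distinct clusters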
This table shows which cluster has the minimum and maximum of each attribute, briefly summarizing the types of songs we have in each cluster.
Code
# Which cluster has the high and low for each variable
spotify_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(group_min = which.min(value),
            group_max = which.max(value))
We would like to see the number of popular songs in each cluster. Just like in our previous analysis, we take our Spotify review set and count the popular tracks in each cluster, based on our popularity score of 57 or higher. The graph compares the groups as we use this new only_pop data to see what types of popular tracks are in each cluster.
clusters <- only_pop %>%
  count(cluster) %>%
  rename("total" = "n") %>%
  arrange(desc(total))

clusters %>%
  ggplot(aes(y = reorder(cluster, total), x = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#007a33", high = "#004c97") +
  labs(title = "Popular Songs by Cluster",
       y = "Cluster",
       x = "Total of Popular Songs") +
  theme_minimal() +
  coord_flip()
We want to compare the audio attributes of our clusters to determine the types of clusters created. We create a new review tibble from the only_pop data to compare the popular songs in each cluster, focusing on the attributes that are roughly normally distributed; a sketch of this comparison follows. In the charts below you will see the average attributes for songs in each cluster, a plot of how each track compares to the average of our total population, and a graph comparing the genres of each cluster. In summary, our first cluster is an energy cluster, our second cluster is an acoustic cluster, and our third is a cheerful cluster.
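A hypothetical sketch of the per-cluster attribute comparison, assuming only_pop holds the popular tracks with their assigned cluster:
Code
cluster_profile <- only_pop %>%
  group_by(cluster) %>%
  summarize(across(where(is.numeric), mean))  # average attributes per cluster
cluster_profile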
In our K-means clustering model, cluster 1 has the highest duration_min, energy, instrumentalness, liveness, loudness, and tempo. These are fast, upbeat songs, including Blinding Lights, Crazy Train, and Sweet Child of Mine. The majority of genres are pop, rock, and EDM, which make up our most popular groups. These tracks are low on acousticness, danceability, speechiness, and valence. This cluster also has the lowest number of popular tracks.
Cluster 2 has the highest acousticness. These are soft rock or acoustic songs, including Roxanne, The Box, and Circles. The majority of genres are pop, latin, and rock. These tracks are low on duration_min, danceability, and instrumentalness.
Cluster 3 has the highest danceability, speechiness, track_popularity, and valence, and is low on energy, liveness, loudness, and tempo. These are cheerful, vocal songs, with Memories, Falling, and everything i wanted as its top tracks. This cluster has the highest number of popular tracks, and its top genres are latin, rap, and pop.
The goal of our study was to demonstrate the application of the K-means clustering algorithm; our question was how track popularity relates to audio attributes in the Spotify data. Data preparation required thoroughly cleaning and scaling the musical features of the large dataset, ensuring it was transformed into a format where the unsupervised K-means clustering algorithm could be applied accurately and effectively. The objective was to determine the number of clusters and to group similar musical attributes by their proximity in Euclidean distance.
Our analysis produced three distinct clusters. Each cluster is composed of musical features that correlate with different musical genres. The clusters categorize the tracks in the Spotify dataset along with their attributes. This categorization offers insight that could guide listeners to tracks they may enjoy, catering to their musical tastes and preferences and, in the end, improving and enhancing the listener experience on Spotify.
The K-means algorithm is multifaceted, and its application can help future research go deeper, not just improving user experience but other areas as well. For example, if Spotify listeners enjoy certain levels of tempo, acousticness, danceability, and so on, a specific playlist can be made to cater to those listeners. The study also faced data preparation challenges due to the sheer volume and variability of the data, including the removal of non-informative columns and missing values. We had to transform certain variables into more user-friendly representations to make the attributes easier to understand (e.g., the duration of songs).
The study explored beyond clustering to understand the relationship between genres and their variability, and to understand the musical preferences of listeners and what makes a certain track more popular than others. It helps us understand the intricacies of music through categorization and recommendation, enhancing and personalizing the experience for Spotify users.
References
Abdullah, Dahlan, S. Susilo, Ansari Saleh Ahmar, R. Rusli, and Rahmat Hidayat. 2021. “The Application of k-Means Clustering for Province Clustering in Indonesia of the Risk of the COVID-19 Pandemic Based on COVID-19 Data.” Quality and Quantity 56 (3): 1283–91. https://doi.org/10.1007/s11135-021-01176-w.
Ashabi, Ardavan, Shamsul Bin Sahibuddin, and Mehdi Salkhordeh Haghighi. 2020. “The Systematic Review of k-Means Clustering Algorithm.” In 2020 the 9th International Conference on Networks, Communication and Computing. ACM. https://doi.org/10.1145/3447654.3447657.
Gan, Guojun, and Michael Kwok-Po Ng. 2017. “K-Means Clustering with Outlier Removal.” Pattern Recognition Letters 90 (April): 8–14. https://doi.org/10.1016/j.patrec.2017.03.008.
Huang, Zhexue, and M. K. Ng. 1999. “A Fuzzy k-Modes Algorithm for Clustering Categorical Data.” IEEE Transactions on Fuzzy Systems 7 (4): 446–52. https://doi.org/10.1109/91.784206.
Wu, Junjie, Hui Xiong, and Jian Chen. 2009. “Adapting the Right Measures for k-Means Clustering.” Association for Computing Machinery. ACM. https://doi.org/10.1145/1557019.
Kanungo, T., D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. 2002. “An Efficient k-Means Clustering Algorithm: Analysis and Implementation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7): 881–92. https://doi.org/10.1109/tpami.2002.1017616.
Kremers, Bart J. J., Jonathan Citrin, Aaron Ho, and Karel L. van de Plassche. 2023. “Two-Step Clustering for Data Reduction Combining DBSCAN and k-Means Clustering.” Contributions to Plasma Physics 63 (5-6). https://doi.org/10.1002/ctpp.202200177.
Nawaz, Marriam, Zahid Mehmood, Tahira Nazir, Rizwan Ali Naqvi, Amjad Rehman, Munwar Iqbal, and Tanzila Saba. 2021. “Skin Cancer Detection from Dermoscopic Images Using Deep Learning and Fuzzy k-Means Clustering.” Microscopy Research and Technique 85 (1): 339–51. https://doi.org/10.1002/jemt.23908.
Qin, Xiaoping, and Shijue Zheng. 2009. “A New Method for Initialising the k-Means Clustering Algorithm.” In 2009 Second International Symposium on Knowledge Acquisition and Modeling. IEEE. https://doi.org/10.1109/kam.2009.20.
Sadeghi, Maryam, Mohammad Naderi Dehkordi, Behrang Barekatain, and Naser Khani. 2022. “Improve Customer Churn Prediction Through the Proposed PCA-PSO-k Means Algorithm in the Communication Industry.” The Journal of Supercomputing 79 (6): 6871–88. https://doi.org/10.1007/s11227-022-04907-4.
Wang, Zhichao, and Qing Tian. 2022. “Performance Appraisal and Automatic Scoring System for College Counselors Based on Kmeans Clustering.” Edited by Gengxin Sun. Mathematical Problems in Engineering 2022 (September): 1–12. https://doi.org/10.1155/2022/4167842.
Xie, Juanying, and Shuai Jiang. 2010. “A Simple and Fast Algorithm for Global k-Means Clustering.” In 2010 Second International Workshop on Education Technology and Computer Science. IEEE. https://doi.org/10.1109/etcs.2010.347.
Xiong, Neng, Hongrui Qiu, and Fenglin Niu. 2021. “Data-Driven Velocity Model Evaluation Using k-Means Clustering.” Geophysical Research Letters 48 (23). https://doi.org/10.1029/2021gl096040.
Zubair, Md., MD. Asif Iqbal, Avijeet Shil, M. J. M. Chowdhury, Mohammad Ali Moni, and Iqbal H. Sarker. 2022. “An Improved k-Means Clustering Algorithm Towards an Efficient Data-Driven Modeling.” Annals of Data Science, June. https://doi.org/10.1007/s40745-022-00428-2.