12/1/23
Image source: https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/c71ea970-0f3c-4973-8d3a-b09a7a6553c1.xhtml
Descriptive statistics/histograms/correlation matrix to visualize the spread of the data.
Preprocess and scale data for ease of comparability.
Libraries used: tidyverse, cluster, and factoextra.
Euclidean distances between observations are calculated to serve as the distance measure for clustering.
The distance data is visualized using the fviz_dist() function in R from the factoextra package.
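The scaling and distance steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `features_toy` and its values are made up, base R's `scale()` and `dist()` stand in for whatever preprocessing was really used, and the `fviz_dist()` call only runs if factoextra is installed.

```r
# Toy stand-in for the numeric audio features (values are made up).
set.seed(42)
features_toy <- data.frame(
  danceability = runif(20),
  energy       = runif(20),
  loudness     = runif(20, min = -20, max = 0),
  tempo        = runif(20, min = 60, max = 180)
)

# Standardize each column to mean 0 and standard deviation 1 so that
# variables on large scales (tempo) do not dominate the distances.
features_scaled <- scale(features_toy)

# Pairwise Euclidean distances between the scaled observations.
dist_euclid <- dist(features_scaled, method = "euclidean")

# Optional heatmap of the distance matrix, only if factoextra is available.
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::fviz_dist(dist_euclid))
}
```

Scaling before computing distances matters here because loudness (dB) and tempo (BPM) have much wider ranges than the 0 to 1 features.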
\[WCSS = \sum_{i=1}^{K}\sum_{j=1}^{n_i}\left\| x_{ij} - c_{i} \right\|^2\]
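As a rough check of the formula above, the WCSS can be computed by hand and compared with what `kmeans()` reports. The data below are random toy values, and the choice of three clusters is only for illustration.

```r
# 30 made-up observations with 2 scaled features.
set.seed(1)
x <- scale(matrix(rnorm(60), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 10)

# WCSS by the formula: for each cluster i, sum the squared distances
# between its points x_ij and its centroid c_i, then add up the clusters.
wcss_manual <- sum(sapply(seq_len(nrow(km$centers)), function(i) {
  members <- x[km$cluster == i, , drop = FALSE]
  sum(sweep(members, 2, km$centers[i, ])^2)
}))

# kmeans() stores the same quantity in tot.withinss.
all.equal(wcss_manual, km$tot.withinss)
```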
The WCSS measures how tightly the observations within each cluster are grouped around the cluster center; lower values indicate more compact clusters. The variables used from the dataset are first scaled. After scaling, the next steps are:
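The Euclidean distance is calculated between each observation and each cluster center: \[ D_{euclidean}(x, K_{j}) = \sqrt{\sum_{i=1}^{n}(x_{i} - k_{ij})^{2}} \]
Where \(x_{i}\) is the value of attribute \(i\) for observation \(x\), \(k_{ij}\) is the value of attribute \(i\) for cluster center \(K_{j}\), and \(n\) is the number of attributes.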
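A small worked example may help make the distance step concrete. The observation and the three cluster centers below are hypothetical values chosen for easy arithmetic, and `euclid_to_center` is a helper name introduced only for this sketch.

```r
# Hypothetical scaled observation with n = 3 attributes.
x <- c(0.5, -1.0, 2.0)

# Hypothetical cluster centers, one row per cluster.
centers <- rbind(
  K1 = c(0.0, 0.0, 0.0),
  K2 = c(1.0, -1.0, 2.0),
  K3 = c(-2.0, 1.0, 0.5)
)

# Distance from x to one center: square root of the summed squared
# attribute differences.
euclid_to_center <- function(x, center) sqrt(sum((x - center)^2))

# Apply the helper to every row of the centers matrix.
distances <- apply(centers, 1, euclid_to_center, x = x)

# k-means assigns the observation to the nearest center.
nearest <- names(which.min(distances))
```

Here the observation differs from `K2` in only the first attribute (by 0.5), so `K2` is the nearest center.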
Data source: Spotify data from Kaggle
For our segmentation we will be using a Spotify dataset that contains the audio features of each track. There are 32,828 observations across 22 variables. There are 12 numerical variables and 10 character variables.
Variables | Description |
---|---|
‘track_id’ | Spotify ID for each track. |
‘track_name’ | Name of the Track. |
‘track_artist’ | Artist Name. |
‘track_popularity’ | The popularity of each track, measured on a scale of 0 to 100. |
‘track_album_id’ | Key specific to each album. |
‘track_album_name’ | Name of the album the track appears on. |
‘track_album_release_date’ | Date the album was released. |
‘playlist_name’ | Playlist (style of music) the track can be found in. Contains genre and subgenre. |
‘playlist_id’ | Key to the style category. |
‘playlist_genre’ | Main category of each playlist. |
‘playlist_subgenre’ | Secondary category of each playlist. |
‘danceability’ | How well suited a track is for dancing, based on a combination of tempo, rhythm, beat, and regularity, on a scale of 0 to 1. |
‘energy’ | A perceptual measure of intensity and activity, on a scale of 0 to 1. |
‘key’ | The musical key the track is in. |
‘loudness’ | How loud the track is in decibels (dB). |
‘mode’ | Whether the scale is major or minor. |
‘speechiness’ | The presence of spoken words in a track (high values resemble a podcast or talk show), on a scale of 0 to 1. |
‘acousticness’ | Confidence that the track is acoustic, on a scale of 0 to 1. |
‘instrumentalness’ | Indicates the absence of vocals in a track. |
‘liveness’ | Detects the presence of an audience in the recording, on a scale of 0 to 1. |
‘valence’ | The musical positiveness conveyed in a track. Measured on a scale of 0 to 1. |
‘tempo’ | The tempo of a track measured in beats per minute. |
‘duration_ms’ | The duration of the track in milliseconds. |
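# Load the tidyverse for the pipe and the dplyr verbs used below
library(tidyverse)
# Convert tibble to data frame
spotify_df <- as.data.frame(spotify_df)
# Drop ID columns and convert categorical variables to factors
spotify_df <- spotify_df %>%
  select(-track_id, -playlist_id, -track_album_id) %>%
  mutate(across(c(playlist_genre, playlist_subgenre, mode, key), as.factor))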
# Check for missing data. 5 missing values found for three variables.
colSums(is.na(spotify_df))
# Omit missing values
spotify_df <- na.omit(spotify_df)
# Verify missing data has been removed
colSums(is.na(spotify_df))
# Convert duration from milliseconds to minutes and rename the column
spotify_df <- spotify_df %>%
  mutate(duration_ms = duration_ms / 60000) %>%
  rename(duration_min = duration_ms)
head(spotify_df)
In the correlation plot, the darker the blue the greater the correlation between the variables. The chart shows a positive correlation between energy and loudness and a negative correlation between acousticness and energy.
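A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses simulated data that was deliberately constructed so that loudness rises and acousticness falls with energy; it mimics the direction of the relationships described, not the actual values. The `corrplot` call is an assumption about the plotting package and only runs if that package is installed.

```r
# Simulated toy data: loudness increases and acousticness decreases
# with energy by construction (these are not the Spotify values).
set.seed(7)
n <- 200
energy <- runif(n)
toy <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(n, sd = 2),
  acousticness = 1 - energy + rnorm(n, sd = 0.2)
)

# Pearson correlation between every pair of numeric columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Optional colored plot of the matrix, only if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "color")
}
```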
We want to understand which genre is the most popular in our dataset. To start, we split the tracks into popular (1) and unpopular (0), where a track is popular if its popularity score is above 57 (on a scale of 0-100). We filter out all of the unpopular songs and build our graph by playlist genre. We see that pop, latin, and rap have the highest counts of popular songs in our review.
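The popularity split and the per-genre count could look roughly like this. The rows in `tracks_toy` are made up; only the column names and the cutoff of 57 come from the text. Base R is used so the sketch runs without extra packages, though the same steps map directly onto `mutate()`, `filter()`, and `count()` in dplyr.

```r
# Made-up rows using column names from the data dictionary.
tracks_toy <- data.frame(
  track_popularity = c(80, 30, 65, 58, 57, 90, 10, 72),
  playlist_genre   = c("pop", "rock", "latin", "pop", "rap", "rap", "edm", "pop")
)

# Flag tracks whose popularity score is strictly above the cutoff of 57.
tracks_toy$popular <- as.integer(tracks_toy$track_popularity > 57)

# Keep only popular tracks and count them per genre, largest first.
popular_toy  <- tracks_toy[tracks_toy$popular == 1, ]
genre_counts <- sort(table(popular_toy$playlist_genre), decreasing = TRUE)
print(genre_counts)
```

Note that a track with a score of exactly 57 is treated as unpopular under this rule.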
In cluster 1, there are 1,202 records.
In cluster 2, there are 3,190 records.
In cluster 3, there are 2,173 records.
Centroid Center Positions:
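For reference, cluster sizes and centroid positions are stored in the `size` and `centers` components of a `kmeans()` fit. The sketch below assumes three clusters, as in the results above, but runs on random toy data, so its numbers are unrelated to the reported ones.

```r
# 100 made-up observations with 3 scaled features.
set.seed(123)
features_toy <- scale(matrix(
  rnorm(300), ncol = 3,
  dimnames = list(NULL, c("danceability", "energy", "valence"))
))

# Fit k-means with 3 clusters; nstart repeats the random initialization
# and keeps the best solution.
km <- kmeans(features_toy, centers = 3, nstart = 25)

# Number of records in each cluster.
print(km$size)

# Centroid positions: one row per cluster, one column per feature.
print(km$centers)
```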
The BSS/TSS ratio is the between-cluster sum of squares (BSS) divided by the total sum of squares (TSS).
It takes a value between 0 and 1 and measures how well separated the clusters are.
The closer the ratio is to 1, the more distinct the clusters are within the dataset.
In our model, the BSS/TSS ratio is 0.2039852, which is low: the clusters account for only about 20% of the total variation in the data. However, we determined that a small number of clusters was sufficient for this model, and using few clusters also tends to produce a low BSS/TSS ratio.
BSS/TSS Ratio: 0.2039852
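A ratio like this can be read directly from a `kmeans()` fit, which stores BSS in `betweenss` and TSS in `totss`. The sketch below uses random toy data, so its ratio will differ from the value reported above.

```r
# 100 made-up observations with 4 scaled features.
set.seed(2023)
x <- scale(matrix(rnorm(400), ncol = 4))
km <- kmeans(x, centers = 3, nstart = 25)

# Share of the total variation that lies between clusters.
bss_tss <- km$betweenss / km$totss
print(bss_tss)

# The total sum of squares splits into between- and within-cluster parts.
all.equal(km$totss, km$betweenss + km$tot.withinss)
```

Because TSS = BSS + WCSS, a higher ratio means less variation is left inside the clusters.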
In cluster 1, there are 388 popular tracks.
In cluster 2, there are 1,400 popular tracks.
In cluster 3, there are 410 popular tracks.
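To compare clusters of different sizes, the popular-track counts can be expressed as a share of each cluster. The arithmetic below uses only the counts reported above and assumes the popular tracks were counted among the same records that were clustered.

```r
# Cluster sizes and popular-track counts as reported in the text.
cluster_size <- c(`1` = 1202, `2` = 3190, `3` = 2173)
popular_n    <- c(`1` = 388,  `2` = 1400, `3` = 410)

# Share of popular tracks within each cluster, in percent.
popular_share <- popular_n / cluster_size
print(round(100 * popular_share, 1))  # 32.3, 43.9, 18.9

# Cluster with the highest share of popular tracks.
best_cluster <- names(which.max(popular_share))
```

Under that assumption, cluster 2 has both the most popular tracks and the highest share of them.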