This demo uses an open-source dataset to explore different data wrangling and plotting techniques in R. Data organization is hugely important because it can impact the quality and accuracy of any statistical tests or visualizations. A popular collection of data organization packages is the tidyverse
, which all share an “underlying design philosophy, grammar, and data structure” (see more).
The basic structure of tidy data is every variable goes into a column and every column is a variable. Using this framework, we can manipulate the data to calculate new values, run statistical tests, and generate a graphic. In this demo, I will primarily use dplyr
, ggplot2
, and tidyr
to organize my data and make some beautiful plots. I will be using the Top Hits Spotify from 2000 - 2019 dataset, available on Kaggle.
First, I will load the packages I need. If you do not already have tidyverse packages installed on your computer, you should install them using install.packages('tidyverse')
first. The other packages I’m loading will be useful for customizing my plots. I’ll also set a global theme to theme_classic
.
library(ggpubr) # ggplot customization
library(Rmisc) # basic statistics helper
library(reactable) # format tables in a nice way
library(gganimate) # animate plots
library(scales) # ggplot scale customization
library(icons) # icon library
library(tidyverse) # load all tidyverse packages
theme_set(theme_classic()) # set classic theme
I will also read in the dataset as df_raw
and look at the first few rows to get a sense of the variables I’m working with.
# read in data
df_raw <- read.csv('./data/songs_normalize.csv')
# see first six rows of all variables
reactable(head(df_raw), compact = T)
Based on the data, it looks like I have 18 variables. These are further explained on the Kaggle page for this dataset. This is a lot of data, so it’s useful to break down and organize the data depending on my analysis questions. This is where dplyr
and tidyr
come in handy.
I also noticed that despite the dataset saying it includes songs from 2000 to 2019, I see some songs from before 2000 and after 2019 included. I will remove those observations.
df <- subset(df_raw, df_raw$year >= 2000 & df_raw$year <= 2019)
Since I have so much data, I’ll want to narrow down my analysis questions for this demo. The main questions I will explore are:
Which artists have the most hit songs?
Are positive songs more energetic and danceable than negative songs?
Do songs in major and minor scale change in popularity over time?
Are songs with explicit lyrics speechier than songs without explicit lyrics?
Does song tempo or duration influence song popularity?
Each of these questions will highlight a different data visualization method. In my experience, it can be helpful to test different plotting methods to find the best way to display results.
In order to find out which artists have the most hit songs, I need to count the number of songs by every artist in the data frame. I can easily do this using %>%
(pipe) notation, which allows me to express a sequence of multiple operations. The pipe comes from the magrittr
package, but tidyverse
loads it automatically. Pipes allow me to write a step-by-step command that is executed in a certain order.
Here, I gather the data contained in df
and I ultimately want to store it in a new data frame called artists
. To do this, I first group all the data by the unique artist name. I know that this is the first step because it’s the first statement that comes after my initial %>%
. Then, while grouping the data by artist, I can count the number of songs by grabbing the length of the song
variable.
Finally, I want to see a list of the top ten artists in descending order by the number of songs.
artists <- df %>% # create new data frame
group_by(artist) %>% # group by unique artist name
summarize(SongCount = length(song)) # count the number of songs
# sort list in descending order of number of songs
artists <- arrange(artists, desc(SongCount))
# print table of top 10 artists
reactable(head(artists, n = 10), compact = T)
To visualize this information in a plot, I can save this information as a data frame and make a very simple plot using ggplot2
.
# save top ten
TopTen <- head(artists, n = 10)
ggplot(TopTen, aes(x = artist, y = SongCount)) +
# outline bars in black, fill with light teal blue
geom_col(color = "black", fill = "#fcbc66", alpha = 0.8) +
# order bars in descending order and wrap text so last name appears on second line
scale_x_discrete(limits = TopTen$artist, labels = function(x) str_wrap(x, width = 10)) +
# label x axis
xlab(NULL) +
# label y axis
ylab("Number of Hit Songs") +
# write a descriptive title
ggtitle("Top Ten Artists with Hit Songs on Spotify from 2000 - 2019") +
# add song count value above each bar
geom_text(aes(label = SongCount), position = position_dodge(width = 0.9), vjust = -0.5)
In order to determine if positive songs are both more energetic and more “danceable” than negative songs, I first want to binarize valence into two categories – positive and negative. The Kaggle dataset mentions that songs with a valence greater than 0.5 are considered more positive, while songs with a valence that is less than 0.5 are considered more negative. With this in mind, I will create a new variable called valence.category
that reflects this binary split.
I also want to get some statistical measures for my plot. Using summarySE
from the Rmisc
package, I can calculate mean, standard deviation, standard error, and 95% confidence intervals for a measurement variable while grouping by another variable. In this case, I want to calculate these statistical measures for both danceability
and energy
while grouping by the newly created valence.category
.
# bin valence values into positive and negative categories
df$valence.category[df$valence >= 0.5] <- "Positive"
df$valence.category[df$valence < 0.5] <- "Negative"
# get stats for danceability and energy (mean, 95% confidence interval, etc.)
dance <- summarySE(df, measurevar = "danceability", groupvars = "valence.category")
energy <- summarySE(df, measurevar = "energy", groupvars = "valence.category")
Now I can create my plot! I will be making a violin plot to show not only the mean difference between my valence groups, but also what the distribution is within my two valence categories. I will create a plot for energy
and for danceability
, and combine those into one joint plot using ggarrange
.
# build violin plot for danceability
dance.plot <- ggplot(data = df, aes(x = valence.category, y = danceability,
fill = valence.category, color = valence.category)) +
# violin plot
geom_violin(scale = "area", alpha = 0.8) +
# fill with my selected colors
scale_fill_manual(values = c("#8dc6bf","#fcbc66")) +
scale_color_manual(values = c("#8dc6bf","#fcbc66")) +
# add point for mean of each valence category
geom_point(data = dance, aes(x = valence.category, y = danceability), color = "black") +
# add 95% confidence intervals
geom_errorbar(data = dance, aes(ymin = danceability-ci, ymax = danceability+ci),
width = 0.25, position = "dodge", color = "black") +
# label x axis
xlab(NULL) +
# label y axis
ylab("Danceability") +
# don't include legend
theme(legend.position = "none")
# build violin plot for energy
energy.plot <- ggplot(data = df, aes(x = valence.category, y = energy,
fill = valence.category, color = valence.category)) +
# violin plot
geom_violin(scale = "area", alpha = 0.8) +
# fill with my selected colors
scale_fill_manual(values = c("#8dc6bf","#fcbc66")) +
scale_color_manual(values = c("#8dc6bf","#fcbc66")) +
# add point for mean of each valence category
geom_point(data = energy, aes(x = valence.category, y = energy), color = "black") +
# add 95% confidence intervals
geom_errorbar(data = energy, aes(ymin = energy-ci, ymax = energy+ci),
width = 0.25, position = "dodge", color = "black") +
# label x axis
xlab(NULL) +
# label y axis
ylab("Energy") +
# don't include legend
theme(legend.position = "none")
# combine dance plot and energy plot using ggarrange
plot <- ggarrange(dance.plot, energy.plot, ncol = 2)
# add title and note to plot
annotate_figure(plot, top = text_grob("Danceability and Energy in Positive and Negative Songs",
color = "black", face = "bold", size = 14),
bottom = text_grob("Bars indicate 95% confidence intervals around the mean.",
color = "black", face = "italic", size = 8))
It looks like positive songs are both more energetic and more danceable than negative songs, which makes sense. Interestingly, negative songs extend to both the lower and higher ends of the energy and danceability scales, while positive songs tend to be more densely clustered toward the higher end of the scales.
Here I’m interested to see if songs that are in major versus minor scale change in popularity over time. The major scale is a more commonly used scale, especially in Western music. On the flip, side the minor scale is used when musicians want to evoke a feeling of eeriness or suspense (some examples include “Stairway to Heaven” by Led Zeppelin and “Scarborough Fair” by Simon & Garfunkel).
I first need to organize the data and calculate an average popularity score for each scale type (the mode
variable) and for each year. Then I will create a new character variable called mode.char
that indicates whether the scale is major or minor.
popularity <- df %>%
group_by(year,mode) %>%
summarize(MeanPopularity = mean(popularity))
popularity$mode.char[popularity$mode == 1] <- "Major"
popularity$mode.char[popularity$mode == 0] <- "Minor"
Now I can make my plot! Because I’m showing change over time, I decided to use the gganimate
package to reveal each data point sequentially on an animated plot.
# make plot
plot <- ggplot(popularity, aes(x = year, y = MeanPopularity, group = mode.char)) +
# animate to reveal points over time
transition_reveal(year) +
# add point for each year
geom_point(aes(color = mode.char), size = 2) +
# connect points with line
geom_line(aes(color = mode.char), size = 1.1) +
# show all years on the x-axis
scale_x_continuous(breaks = pretty_breaks(n=20)) +
# label x-axis and y-axis
ylab("Average Popularity") + xlab("Year") +
# add a descriptive title
ggtitle("Average Popularity of Major and Minor Songs from 2000 - 2019") +
# use my colors and change legend title
scale_color_manual(values = c("#8dc6bf","#fcbc66"), name = "Scale") +
# angle and move x-axis labels and text
theme(
axis.text.x = element_text(hjust = 1, angle = 45),
axis.title.x = element_text(vjust = -1)
)
# animate
animate(plot, duration = 8, fps = 20, renderer = gifski_renderer(), end_pause = 60)
In this dataset, “speechiness” is a measure of spoken words in a song. Songs with more exclusively speech-like contents (like a talk show, podcast, etc.) have scores between 0.66 and 1. Songs with a speechiness value between 0.33 and 0.66 describe tracks that contain both music and speech. This range is where I’d expect most of these songs within. Finally, songs with values below 0.33 represent instrumental songs and songs with more music than words.
First, I want to see the range of speechiness values in my dataset to see where songs in this dataset fall.
# find range of speechiness variable
min <- min(df$speechiness)
max <- max(df$speechiness)
# show histogram of speechiness scores
hist(df$speechiness, col = "#FF6347", xlab = "Speechiness", main = "Histogram of Speechiness Values")
# print range
paste0("Speechiness values range from ", min, " to ", max, " in this dataset.", sep = "")
## [1] "Speechiness values range from 0.0232 to 0.576 in this dataset."
Interesting! In this dataset, there are actually more songs that have more instrumental music. I wonder if songs that contain more speech also contain more explicit dialogue than songs with less speech. I can investigate this question using a density plot.
df$explicit.char[df$explicit == "False"] <- "No"
df$explicit.char[df$explicit == "True"] <- "Yes"
ggplot(df, aes(speechiness)) +
# create density plot to show distribution
geom_density(aes(fill = factor(explicit.char)), alpha = 0.8) +
# use my colors and rename legend
scale_fill_manual(values = c("#8dc6bf","#fcbc66"), name = "Explicit Lyrics") +
# label x-axis and y-axis
ylab("Density") + xlab("Average Track Speechiness") +
# add a descriptive title
ggtitle("Average Speechiness of Songs With and Without Explicit Lyrics")
From the density plot, we see that yes, songs with more speech and music tend to have more explicit language than songs with less speech!
Finally, I want to investigate if there is a relationship between song tempo and song popularity or between song duration and song popularity. To do this, I’m going to run two simple linear regressions using base R’s lm
function. I want to see if song tempo (independent, predictor variable) influences song popularity (dependent, response variable). I also want to see if song duration in minutes influences song popularity.
# does song tempo influence popularity?
ggplot(df, aes(x = tempo, y = popularity)) +
# show data points
geom_jitter(size = 1, shape = 1) +
# draw regression line
geom_smooth(method = "lm", color = "#8dc6bf", fill = "#8dc6bf") +
# label x-axis and y-axis
xlab("Tempo (Beats Per Minute)") + ylab("Average Popularity")
# run a linear regression
model1 <- lm(popularity ~ tempo, data = df)
summary(model1)
##
## Call:
## lm(formula = popularity ~ tempo, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.668 -3.518 5.860 13.439 29.151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.47742 2.21991 26.342 <2e-16 ***
## tempo 0.01106 0.01804 0.613 0.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.51 on 1956 degrees of freedom
## Multiple R-squared: 0.0001921, Adjusted R-squared: -0.000319
## F-statistic: 0.3759 on 1 and 1956 DF, p-value: 0.5399
Based on the regression results, there does not appear to be an effect of song tempo on popularity (p = .54).
# calculate song duration in minutes
df$duration_min <- df$duration_ms/60000
# does song duration influence popularity?
ggplot(df, aes(x = duration_min, y = popularity)) +
# show data points
geom_jitter(size = 1, shape = 1) +
# draw regression line
geom_smooth(method = "lm", color = "#fcbc66", fill = "#fcbc66") +
# label x-axis and y-axis
xlab("Song Duration (Minutes)") + ylab("Average Popularity")
# run a linear regression
model2 <- lm(popularity ~ duration_min, data = df)
summary(model2)
##
## Call:
## lm(formula = popularity ~ duration_min, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.951 -3.755 5.791 13.531 28.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.3904 2.8842 18.511 <2e-16 ***
## duration_min 1.6860 0.7472 2.256 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.49 on 1956 degrees of freedom
## Multiple R-squared: 0.002596, Adjusted R-squared: 0.002086
## F-statistic: 5.091 on 1 and 1956 DF, p-value: 0.02416
Based on the regression results, there is a significant effect of song duration on popularity (p = .024). As songs get longer, they increase in popularity.