This case study is the capstone project of the Google Data Analytics course. The capstone brings together everything I've learned.
I used the dataset provided in the course: Divvy_Trips_2019_Q1. This is a fictional dataset created for educational purposes, so its source and credibility cannot be verified. Fortunately, the data was mostly clean: there were no missing values or duplicates in the columns relevant to my analysis.
Following the case study steps, I created several new columns:
ride_length: Calculates the duration of each ride in HH:MM:SS format by subtracting start_time from end_time.
day_of_week: Assigns a numeric value to each day (1 for Sunday through 7 for Saturday).
day_of_week_str: Converts the numeric day into its name (e.g., "Monday").
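The same three columns could also be derived directly in R. The sketch below is one possible approach, not necessarily how the columns were built for this project. It assumes start_time and end_time are date-time values, runs on a made-up two-row sample, and introduces the helper names ride_secs and day_names purely for illustration.

```r
library(dplyr)

# Hypothetical two-row sample; the timestamps are invented for illustration.
sample_trips <- tibble(
  start_time = as.POSIXct(c("2019-01-01 00:04:37", "2019-01-06 23:50:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2019-01-01 00:11:07", "2019-01-07 01:05:30"), tz = "UTC")
)

# Fixed lookup table so the day names do not depend on the system locale.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

sample_trips <- sample_trips %>%
  mutate(
    # Ride duration in whole seconds (helper column)
    ride_secs = as.integer(as.numeric(difftime(end_time, start_time, units = "secs"))),
    # Duration formatted as HH:MM:SS
    ride_length = sprintf("%02d:%02d:%02d",
                          ride_secs %/% 3600L,
                          (ride_secs %% 3600L) %/% 60L,
                          ride_secs %% 60L),
    # POSIXlt stores Sunday as 0, so add 1 to get 1 = Sunday ... 7 = Saturday
    day_of_week = as.POSIXlt(start_time)$wday + 1L,
    day_of_week_str = day_names[day_of_week]
  )
```

Building the day name from a fixed vector keeps the result in English regardless of the machine's language settings.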
I used RStudio to visualize the data and generate insights. First, I installed and loaded the necessary packages:
if (!require("tidyverse")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require("ggplot2")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("ggplot2")
}
if (!require("plotly")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("plotly")
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(tidyverse)
library(ggplot2)
library(plotly)
library(readxl)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Then, I imported the dataset from Excel:
Divvy_Trips_2019_Q1 <- read_excel("C:/Users/User/OneDrive/Desktop/Cyclists Case Study/XLS/Divvy_Trips_2019_Q1.xlsx")
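The earlier statement about missing values and duplicates can be checked right after import. The following is a minimal sketch on a small made-up table; the rows, the trip_id values, and the names sample_trips, na_counts, and n_duplicates are invented for illustration and are not taken from the real file.

```r
# Hypothetical sample standing in for the imported data; all values are invented.
sample_trips <- data.frame(
  trip_id   = c(101, 102, 102, 104),
  usertype  = c("Subscriber", "Customer", "Customer", "Subscriber"),
  gender    = c("Male", NA, NA, "Female"),
  birthyear = c(1985, NA, NA, 1992)
)

# Number of missing values in each column
na_counts <- colSums(is.na(sample_trips))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(sample_trips))

print(na_counts)
print(n_duplicates)
```

Replacing sample_trips with Divvy_Trips_2019_Q1 would run the same checks on the imported data.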
Trips by User Type and Gender
user_plot <- Divvy_Trips_2019_Q1 %>%
  filter(!is.na(gender)) %>%
  ggplot(aes(x = usertype, fill = gender)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = label_comma()) +
  labs(
    title = "Trips by User Type and Gender",
    x = "User Type",
    y = "Number of Trips",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(user_plot)
Trips by Day of the Week
day_plot <- ggplot(Divvy_Trips_2019_Q1, aes(x = day_of_week_str)) +
  geom_bar(fill = "dark green") +
  labs(
    title = "Trips by Day of the Week",
    x = "Day of the Week",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(day_plot)
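If day_of_week_str is read in as text, ggplot2 orders the bars alphabetically, so the axis starts with Friday rather than Sunday. One possible fix, sketched here on a made-up sample, is to convert the column to a factor with the weekdays listed in calendar order. The names sample_days, week_order, and day_counts are hypothetical.

```r
library(dplyr)

# Hypothetical sample of day names; the real column has one value per trip.
sample_days <- tibble(
  day_of_week_str = c("Monday", "Sunday", "Wednesday", "Monday", "Saturday")
)

# Calendar order matching the 1 = Sunday ... 7 = Saturday convention
week_order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

day_counts <- sample_days %>%
  # A factor keeps its level order when counted or plotted
  mutate(day_of_week_str = factor(day_of_week_str, levels = week_order)) %>%
  count(day_of_week_str)

print(day_counts)
```

Applying the same mutate() step to the full dataset before calling ggplot() should place the bars in Sunday-to-Saturday order.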
Trips by Birth Year
First, to exclude implausible values, I created a table containing only trips with a birth year between 1910 and 2003.
Divvy_Trips_2019_Q1_clean <- Divvy_Trips_2019_Q1 %>%
  filter(birthyear >= 1910 & birthyear <= 2003)
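It is worth noting that filter() also drops rows where birthyear is missing, because the comparison evaluates to NA for those rows. The sketch below illustrates this on a made-up vector of years, using dplyr's between() as an equivalent shorthand for the two comparisons. The names sample_years and kept are hypothetical.

```r
library(dplyr)

# Hypothetical birth years, including out-of-range values and a missing one
sample_years <- tibble(birthyear = c(1900, 1910, 1985, 2003, 2014, NA))

kept <- sample_years %>%
  # between() is inclusive on both ends, like >= 1910 & <= 2003
  filter(between(birthyear, 1910, 2003))

print(kept)
```

Only 1910, 1985, and 2003 remain; the missing value is removed along with the out-of-range years.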
Then I plotted the cleaned data with plotly:
year_plot <- ggplot(Divvy_Trips_2019_Q1_clean, aes(x = birthyear)) +
  geom_bar(fill = "dark blue") +
  scale_x_continuous(limits = c(1940, 2000)) +
  labs(
    title = "Trips by Birth Year",
    x = "Birth Year",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.85),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(year_plot)
## Warning: Removed 331 rows containing non-finite outside the scale range
## (`stat_count()`).
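The warning appears because scale limits treat values outside 1940 to 2000 as missing, so those trips are dropped before the bars are counted. As a rough sketch, the number of affected rows can be counted ahead of time. The sample values and the names sample_birthyears, x_limits, and n_outside are invented for illustration.

```r
# Hypothetical birth years that already passed the 1910-2003 filter
sample_birthyears <- c(1921, 1939, 1940, 1975, 2000, 2001, 2003)

# Same limits as the x-axis of the plot
x_limits <- c(1940, 2000)

# Rows that the axis limits would exclude from the chart
n_outside <- sum(sample_birthyears < x_limits[1] | sample_birthyears > x_limits[2])

print(n_outside)
```

If the goal is only to zoom in, coord_cartesian(xlim = c(1940, 2000)) is an alternative that changes the visible range without discarding data.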
I also calculated summary statistics to better understand the dataset:
## maximum number of trips by usertype and gender
max_user <- Divvy_Trips_2019_Q1 %>%
  count(usertype, gender) %>%
  filter(!is.na(gender)) %>%
  pull(n) %>%
  max()
## maximum number of trips by day of the week
max_day <- Divvy_Trips_2019_Q1 %>%
  count(day_of_week_str) %>%
  pull(n) %>%
  max()
## average birth year
mean_year <- Divvy_Trips_2019_Q1_clean %>%
  summarise(mean_year = mean(birthyear, na.rm = TRUE)) %>%
  pull(mean_year) %>%
  round()
## average age
dataset_year <- 2019
mean_age <- dataset_year - mean_year
## maximum number of trips by birth year
max_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  pull(n) %>%
  max()
## minimum number of trips by birth year
min_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  pull(n) %>%
  min()
## the year with the maximum number of trips
top_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  slice_max(n, n = 1) %>%
  pull(birthyear)
## the year(s) with the minimum number of trips
## (uses the cleaned table so the result is consistent with min_year)
least_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  filter(n == min(n)) %>%
  pull(birthyear)
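The birth-year statistics above all start from the same count, so they could also be derived from a single intermediate table. The following is a minimal sketch of that idea on made-up data. The names sample_trips, year_counts, max_trips, min_trips, busiest_years, and quietest_years are hypothetical.

```r
library(dplyr)

# Hypothetical sample; each value is the birth year attached to one trip.
sample_trips <- tibble(birthyear = c(1985, 1985, 1985, 1990, 1990, 1975, 1960))

# Count trips per birth year once and reuse the result
year_counts <- sample_trips %>%
  count(birthyear)

max_trips <- max(year_counts$n)
min_trips <- min(year_counts$n)

# slice_max() and slice_min() keep ties by default,
# so more than one year can be returned
busiest_years <- year_counts %>%
  slice_max(n, n = 1) %>%
  pull(birthyear)

quietest_years <- year_counts %>%
  slice_min(n, n = 1) %>%
  pull(birthyear)
```

In this sample, 1960 and 1975 are tied for the fewest trips, so quietest_years contains both.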
Here are some key insights: