Cyclistic Case Study

This case study is the capstone project of the Google Data Analytics course and brings together everything I learned in it.

🧹 Data Cleaning and Organization

I used the dataset provided in the course: Divvy_Trips_2019_Q1. Cyclistic is a fictional company and the dataset is used here for educational purposes, so I treated its source and credibility as unverified. Fortunately, the data was mostly clean: there were no missing values or duplicates in the columns relevant to my analysis.
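The checks behind that statement are not shown in this write-up. As a minimal sketch, assuming the data sits in a data frame, missing values and duplicate rows could be counted as below. The toy `trips` table and its values are made up for illustration and are not taken from the Divvy file.

```r
# Toy data (hypothetical): one missing gender and one fully duplicated row
trips <- data.frame(
  trip_id  = c(1, 2, 3, 3),
  usertype = c("Subscriber", "Customer", "Subscriber", "Subscriber"),
  gender   = c("Male", NA, "Female", "Female"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(trips))

# Number of rows that exactly repeat an earlier row
dup_rows <- sum(duplicated(trips))

print(na_counts)
print(dup_rows)
```

On the real table, the same two calls would be applied to the imported data frame, optionally restricted to the columns used in the analysis.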

♻ Data Processing and Analysis

Following the case study steps, I created several new columns:

ride_length: Calculates the duration of each ride in HH:MM:SS format by subtracting start_time from end_time.

day_of_week: Assigns a numeric value to each day (1 for Sunday through 7 for Saturday).

day_of_week_str: Converts the numeric day into its name (e.g., “Monday”).
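How these columns were built is not shown here. For illustration only, the same three columns could be derived in base R roughly as follows. The two trips are hypothetical, and the column names `start_time` and `end_time` are assumed to match the dataset.

```r
# Toy data: two hypothetical trips (not taken from the Divvy file)
trips <- data.frame(
  start_time = as.POSIXct(c("2019-01-01 00:04:37", "2019-01-03 08:15:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2019-01-01 00:11:07", "2019-01-03 09:20:30"), tz = "UTC")
)

# Ride duration in whole seconds
secs <- as.integer(as.numeric(difftime(trips$end_time, trips$start_time, units = "secs")))

# ride_length: duration formatted as HH:MM:SS
trips$ride_length <- sprintf(
  "%02d:%02d:%02d",
  secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L
)

# day_of_week: POSIXlt$wday runs from 0 (Sunday) to 6 (Saturday),
# so adding 1 gives 1 for Sunday through 7 for Saturday
trips$day_of_week <- as.POSIXlt(trips$start_time)$wday + 1L

# day_of_week_str: look the name up in a fixed vector (locale-independent)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
trips$day_of_week_str <- day_names[trips$day_of_week]

print(trips[, c("ride_length", "day_of_week", "day_of_week_str")])
```

Using a fixed vector of day names, rather than a locale-dependent formatter, keeps the labels in English regardless of the machine's settings.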

📊 Data Visualization

I used RStudio to visualize the data and generate insights. First, I installed and loaded the necessary packages:

if (!require("tidyverse")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("tidyverse")
}

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

if (!require("ggplot2")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("ggplot2")
}

if (!require("plotly")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("plotly")
}

## Loading required package: plotly
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(tidyverse)
library(ggplot2)
library(plotly)
library(readxl)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Then, I imported the dataset from Excel:

Divvy_Trips_2019_Q1 <- read_excel("C:/Users/User/OneDrive/Desktop/Cyclists Case Study/XLS/Divvy_Trips_2019_Q1.xlsx")

👥 Trips by User Type and Gender

user_plot <- Divvy_Trips_2019_Q1 %>%
  filter(!is.na(gender)) %>%
  ggplot(aes(x= usertype, fill = gender)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = label_comma()) +
  labs(
    title = "Trips by User Type and Gender",
    x = "User Type",
    y = "Number of Trips",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )

ggplotly(user_plot)

📅 Trips by Day of the Week

day_plot <- ggplot(Divvy_Trips_2019_Q1, aes(x = day_of_week_str)) +
  geom_bar(fill = "dark green") + 
  labs (
    title = "Trips by Day of the Week",
    x = "Day of the Week",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme (
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )

ggplotly(day_plot)
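One thing to watch, assuming day_of_week_str is stored as plain text: ggplot2 orders a character axis alphabetically, so the bars would run Friday, Monday, Saturday and so on. A small sketch of one way to force calendar order with a factor, using made-up values rather than the Divvy data:

```r
# Calendar order, Sunday first to match day_of_week (1 = Sunday)
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# Hypothetical day names as they might appear in day_of_week_str
days <- c("Thursday", "Monday", "Thursday", "Sunday")

# A factor with explicit levels keeps this order in tables and on plot axes
days_ordered <- factor(days, levels = day_levels)

counts <- table(days_ordered)
print(counts)
```

Applying the same `factor(..., levels = day_levels)` conversion to day_of_week_str before plotting should give a Sunday-to-Saturday axis.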

🎂 Trips by Birth Year

First, to exclude implausible values, I created a table that keeps only rides with a birth year between 1910 and 2003.

Divvy_Trips_2019_Q1_clean <- Divvy_Trips_2019_Q1 %>%
  filter(birthyear >= 1910 & birthyear <= 2003)

Then I plotted the cleaned data with plotly:

year_plot <- ggplot(Divvy_Trips_2019_Q1_clean, aes(x = birthyear)) +
  geom_bar(fill = "dark blue") +
  scale_x_continuous(limits = c(1940, 2000)) +
  labs(
    title = "Trips by Birth Year",
    x = "Birth Year",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.85),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )

ggplotly(year_plot)

## Warning: Removed 331 rows containing non-finite outside the scale range
## (`stat_count()`).
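The warning is expected: the cleaned table keeps birth years from 1910 to 2003, while the x-axis limits are 1940 to 2000, so values outside the limits are dropped before counting. A minimal sketch, with made-up birth years rather than Divvy data, of how the number of dropped rows could be checked up front:

```r
# Hypothetical birth years, all inside the 1910-2003 cleaning range
birthyear <- c(1935, 1940, 1975, 1989, 2000, 2002)

# Axis limits used for the plot
limits <- c(1940, 2000)

# Rows that fall outside the axis limits and would be dropped
outside <- birthyear < limits[1] | birthyear > limits[2]
n_dropped <- sum(outside)

# Rows that remain available for counting
kept <- birthyear[!outside]

print(n_dropped)
print(kept)
```

If the goal is only to zoom the axis without discarding rows, ggplot2's `coord_cartesian(xlim = c(1940, 2000))` is one alternative to setting limits on the scale.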

🔍 Data Exploration Summary and Insights

I also calculated summary statistics to better understand the dataset:

## maximum number of trips by usertype and gender
max_user <- Divvy_Trips_2019_Q1 %>%
  count(usertype, gender) %>%
  filter(!is.na(gender)) %>%
  pull(n) %>%
  max()

## maximum number of trips by day of the week
max_day <- Divvy_Trips_2019_Q1 %>% 
  count(day_of_week_str) %>%
  pull(n) %>%
  max()

## average birth year
mean_year <- Divvy_Trips_2019_Q1_clean %>% 
  summarise(mean_year = mean(birthyear, na.rm = TRUE)) %>%
  pull(mean_year) %>%
  round()

## average age 
dataset_year <- 2019
mean_age <- dataset_year - mean_year

## maximum number of trips by birth year
max_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  pull(n) %>%
  max()

## minimum number of trips by birth year
min_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  pull(n) %>%
  min()

## the year with the maximum number of trips
top_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  slice_max(n, n = 1) %>%
  pull(birthyear)

## the year(s) with the minimum number of trips
least_year <- Divvy_Trips_2019_Q1_clean %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  filter(n == min(n)) %>%
  pull(birthyear)

Here are some key insights:

Thursday was the busiest day of the week, with 66,903 trips in total across the quarter.
The most frequent user type and gender combination was male Subscribers, with 274,380 trips.
The birth year with the most trips was 1989, totaling 21,014 rides.
The birth years with the fewest trips were 1938 and 1941, with 1 ride each.
The average birth year of riders was 1982, which makes the average rider about 37 years old as of 2019.
The Trips by User Type and Gender plot shows a large gap in trip volume between the two user types: Subscribers account for far more trips than Customers.
