This analysis is for case study 1 from the Google Data Analytics Certificate (Cyclistic). It’s originally based on the case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study).
We will be using the Divvy data set for the case
study. The purpose of this script is to consolidate downloaded Divvy
data into a single data frame and then conduct simple analysis to help
answer the key question:
“In what ways do members and
casual riders use Divvy bikes differently?”
using the following script https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy
Cyclistic case study
Install required packages
tidyverse for
data import and wrangling
libridate for date functions
ggplot
for visualization
Markdown for creating HTML report
install packages
install.packages("dplyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("markdown")
install.packages("data.table")
install.packages("kableExtra")
install.packages("knitr")
install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(markdown)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(knitr)
library(dplyr)
getwd()
## [1] "C:/Users/tsheef/Desktop/case study Cyclistic"
setwd("/Users/tsheef/Desktop/case study Cyclistic")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(q3_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q4_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q2_2019)
## [1] "01 - Rental Details Rental ID"
## [2] "01 - Rental Details Local Start Time"
## [3] "01 - Rental Details Local End Time"
## [4] "01 - Rental Details Bike ID"
## [5] "01 - Rental Details Duration In Seconds Uncapped"
## [6] "03 - Rental Start Station ID"
## [7] "03 - Rental Start Station Name"
## [8] "02 - Rental End Station ID"
## [9] "02 - Rental End Station Name"
## [10] "User Type"
## [11] "Member Gender"
## [12] "05 - Member Details Member Birthday Year"
colnames(q1_2020)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
(q4_2019 <- rename(q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
## # A tibble: 704,054 × 12
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20 2215 940
## 2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34 6328 258
## 3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43 3003 850
## 4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43 3275 2350
## 5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42 5294 1867
## 6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51 1891 373
## 7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45 1061 1072
## 8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16 1274 1458
## 9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18 6011 1437
## 10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46 2957 8306
## # ℹ 704,044 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## # end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## # gender <chr>, birthyear <dbl>
(q3_2019 <- rename(q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
## # A tibble: 1,640,718 × 12
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41 3591 1214
## 2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44 5353 1048
## 3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42 6180 1554
## 4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10 5540 1503
## 5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26 6014 1213
## 6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31 4941 310
## 7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12 3770 1248
## 8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16 5442 1550
## 9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57 2957 1583
## 10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14 6091 1589
## # ℹ 1,640,708 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## # end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## # gender <chr>, birthyear <dbl>
(q2_2019 <- rename(q2_2019
,ride_id = "01 - Rental Details Rental ID"
,rideable_type = "01 - Rental Details Bike ID"
,started_at = "01 - Rental Details Local Start Time"
,ended_at = "01 - Rental Details Local End Time"
,start_station_name = "03 - Rental Start Station Name"
,start_station_id = "03 - Rental Start Station ID"
,end_station_name = "02 - Rental End Station Name"
,end_station_id = "02 - Rental End Station ID"
,member_casual = "User Type"))
## # A tibble: 1,108,163 × 12
## ride_id started_at ended_at rideable_type
## <dbl> <dttm> <dttm> <dbl>
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123
## 7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41 6418
## 8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11 4513
## 9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44 3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39 5534
## # ℹ 1,108,153 more rows
## # ℹ 8 more variables: `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## # start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## # end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## # `05 - Member Details Member Birthday Year` <dbl>
str(q1_2020)
## spc_tbl_ [426,887 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : POSIXct[1:426887], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
## $ ended_at : POSIXct[1:426887], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
## $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
## $ start_lat : num [1:426887] 42 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:426887] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:426887] 42 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:426887] -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:426887] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(q4_2019)
## spc_tbl_ [704,054 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...
## $ started_at : POSIXct[1:704054], format: "2019-10-01 00:01:39" "2019-10-01 00:02:16" ...
## $ ended_at : POSIXct[1:704054], format: "2019-10-01 00:17:20" "2019-10-01 00:06:34" ...
## $ rideable_type : num [1:704054] 2215 6328 3003 3275 5294 ...
## $ tripduration : num [1:704054] 940 258 850 2350 1867 ...
## $ start_station_id : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...
## $ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
## $ end_station_id : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...
## $ end_station_name : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
## $ member_casual : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr [1:704054] "Male" "Male" "Female" "Male" ...
## $ birthyear : num [1:704054] 1987 1998 1991 1990 1987 ...
## - attr(*, "spec")=
## .. cols(
## .. trip_id = col_double(),
## .. start_time = col_datetime(format = ""),
## .. end_time = col_datetime(format = ""),
## .. bikeid = col_double(),
## .. tripduration = col_number(),
## .. from_station_id = col_double(),
## .. from_station_name = col_character(),
## .. to_station_id = col_double(),
## .. to_station_name = col_character(),
## .. usertype = col_character(),
## .. gender = col_character(),
## .. birthyear = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(q3_2019)
## spc_tbl_ [1,640,718 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...
## $ started_at : POSIXct[1:1640718], format: "2019-07-01 00:00:27" "2019-07-01 00:01:16" ...
## $ ended_at : POSIXct[1:1640718], format: "2019-07-01 00:20:41" "2019-07-01 00:18:44" ...
## $ rideable_type : num [1:1640718] 3591 5353 6180 5540 6014 ...
## $ tripduration : num [1:1640718] 1214 1048 1554 1503 1213 ...
## $ start_station_id : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...
## $ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
## $ end_station_id : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...
## $ end_station_name : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
## $ member_casual : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...
## $ gender : chr [1:1640718] "Male" NA NA NA ...
## $ birthyear : num [1:1640718] 1992 NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. trip_id = col_double(),
## .. start_time = col_datetime(format = ""),
## .. end_time = col_datetime(format = ""),
## .. bikeid = col_double(),
## .. tripduration = col_number(),
## .. from_station_id = col_double(),
## .. from_station_name = col_character(),
## .. to_station_id = col_double(),
## .. to_station_name = col_character(),
## .. usertype = col_character(),
## .. gender = col_character(),
## .. birthyear = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(q2_2019)
## spc_tbl_ [1,108,163 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...
## $ started_at : POSIXct[1:1108163], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
## $ ended_at : POSIXct[1:1108163], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
## $ rideable_type : num [1:1108163] 6251 6226 5649 4151 3270 ...
## $ 01 - Rental Details Duration In Seconds Uncapped: num [1:1108163] 446 1048 252 357 1007 ...
## $ start_station_id : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...
## $ start_station_name : chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ end_station_id : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...
## $ end_station_name : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ member_casual : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ Member Gender : chr [1:1108163] "Male" "Female" "Male" "Male" ...
## $ 05 - Member Details Member Birthday Year : num [1:1108163] 1975 1984 1990 1993 1992 ...
## - attr(*, "spec")=
## .. cols(
## .. `01 - Rental Details Rental ID` = col_double(),
## .. `01 - Rental Details Local Start Time` = col_datetime(format = ""),
## .. `01 - Rental Details Local End Time` = col_datetime(format = ""),
## .. `01 - Rental Details Bike ID` = col_double(),
## .. `01 - Rental Details Duration In Seconds Uncapped` = col_number(),
## .. `03 - Rental Start Station ID` = col_double(),
## .. `03 - Rental Start Station Name` = col_character(),
## .. `02 - Rental End Station ID` = col_double(),
## .. `02 - Rental End Station Name` = col_character(),
## .. `User Type` = col_character(),
## .. `Member Gender` = col_character(),
## .. `05 - Member Details Member Birthday Year` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
all_trips <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020)
all_trips <- all_trips %>%
select(-c(birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
colnames(all_trips)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "start_station_id" "start_station_name"
## [7] "end_station_id" "end_station_name" "member_casual"
## [10] "start_lat" "start_lng" "end_lat"
## [13] "end_lng"
nrow(all_trips)
## [1] 3879822
dim(all_trips)
## [1] 3879822 13
head(all_trips)
## # A tibble: 6 × 13
## ride_id started_at ended_at rideable_type start_station_id
## <chr> <dttm> <dttm> <chr> <dbl>
## 1 221785… 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 81
## 2 221785… 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 317
## 3 221785… 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 283
## 4 221785… 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 26
## 5 221785… 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 202
## 6 221785… 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 420
## # ℹ 8 more variables: start_station_name <chr>, end_station_id <dbl>,
## # end_station_name <chr>, member_casual <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>
str(all_trips)
## tibble [3,879,822 × 13] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
## $ started_at : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
## $ ended_at : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
## $ rideable_type : chr [1:3879822] "6251" "6226" "5649" "4151" ...
## $ start_station_id : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
## $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ end_station_id : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
## $ end_station_name : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ member_casual : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ start_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ start_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
summary(all_trips)
## ride_id started_at
## Length:3879822 Min. :2019-04-01 00:02:22.00
## Class :character 1st Qu.:2019-06-23 07:49:09.25
## Mode :character Median :2019-08-14 17:43:38.00
## Mean :2019-08-26 00:49:59.38
## 3rd Qu.:2019-10-12 12:10:21.00
## Max. :2020-03-31 23:51:34.00
##
## ended_at rideable_type start_station_id
## Min. :2019-04-01 00:09:48.00 Length:3879822 Min. : 1.0
## 1st Qu.:2019-06-23 08:20:27.75 Class :character 1st Qu.: 77.0
## Median :2019-08-14 18:02:04.00 Mode :character Median :174.0
## Mean :2019-08-26 01:14:37.06 Mean :202.9
## 3rd Qu.:2019-10-12 12:36:16.75 3rd Qu.:291.0
## Max. :2020-05-19 20:10:34.00 Max. :675.0
##
## start_station_name end_station_id end_station_name member_casual
## Length:3879822 Min. : 1.0 Length:3879822 Length:3879822
## Class :character 1st Qu.: 77.0 Class :character Class :character
## Mode :character Median :174.0 Mode :character Mode :character
## Mean :203.8
## 3rd Qu.:291.0
## Max. :675.0
## NA's :1
## start_lat start_lng end_lat end_lng
## Min. :42 Min. :-88 Min. :42 Min. :-88
## 1st Qu.:42 1st Qu.:-88 1st Qu.:42 1st Qu.:-88
## Median :42 Median :-88 Median :42 Median :-88
## Mean :42 Mean :-88 Mean :42 Mean :-88
## 3rd Qu.:42 3rd Qu.:-88 3rd Qu.:42 3rd Qu.:-88
## Max. :42 Max. :-88 Max. :42 Max. :-88
## NA's :3452935 NA's :3452935 NA's :3452936 NA's :3452936
table(all_trips$member_casual)
##
## casual Customer member Subscriber
## 48480 857474 378407 2595461
all_trips <- all_trips %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
table(all_trips$member_casual)
##
## casual member
## 905954 2973868
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$time_start <- format(as.POSIXct(all_trips$started_at,format="%H:%M:%S"),"%H:%M")
all_trips$time_end <- format(as.POSIXct(all_trips$ended_at,format="%H:%M:%S"),"%H:%M")
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
str(all_trips)
## tibble [3,879,822 × 21] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
## $ started_at : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
## $ ended_at : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
## $ rideable_type : chr [1:3879822] "6251" "6226" "5649" "4151" ...
## $ start_station_id : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
## $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ end_station_id : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
## $ end_station_name : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ member_casual : chr [1:3879822] "member" "member" "member" "member" ...
## $ start_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ start_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date[1:3879822], format: "2019-04-01" "2019-04-01" ...
## $ month : chr [1:3879822] "04" "04" "04" "04" ...
## $ day : chr [1:3879822] "01" "01" "01" "01" ...
## $ year : chr [1:3879822] "2019" "2019" "2019" "2019" ...
## $ day_of_week : chr [1:3879822] "Monday" "Monday" "Monday" "Monday" ...
## $ time_start : chr [1:3879822] "00:02" "00:03" "00:11" "00:13" ...
## $ time_end : chr [1:3879822] "00:09" "00:20" "00:15" "00:18" ...
## $ ride_length : 'difftime' num [1:3879822] 446 1048 252 357 ...
## ..- attr(*, "units")= chr "secs"
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
str(all_trips_v2)
## tibble [3,876,042 × 21] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:3876042] "22178529" "22178530" "22178531" "22178532" ...
## $ started_at : POSIXct[1:3876042], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
## $ ended_at : POSIXct[1:3876042], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
## $ rideable_type : chr [1:3876042] "6251" "6226" "5649" "4151" ...
## $ start_station_id : num [1:3876042] 81 317 283 26 202 420 503 260 211 211 ...
## $ start_station_name: chr [1:3876042] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ end_station_id : num [1:3876042] 56 59 174 133 129 426 500 499 211 211 ...
## $ end_station_name : chr [1:3876042] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ member_casual : chr [1:3876042] "member" "member" "member" "member" ...
## $ start_lat : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
## $ start_lng : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lat : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lng : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date[1:3876042], format: "2019-04-01" "2019-04-01" ...
## $ month : chr [1:3876042] "04" "04" "04" "04" ...
## $ day : chr [1:3876042] "01" "01" "01" "01" ...
## $ year : chr [1:3876042] "2019" "2019" "2019" "2019" ...
## $ day_of_week : chr [1:3876042] "Monday" "Monday" "Monday" "Monday" ...
## $ time_start : chr [1:3876042] "00:02" "00:03" "00:11" "00:13" ...
## $ time_end : chr [1:3876042] "00:09" "00:20" "00:15" "00:18" ...
## $ ride_length : num [1:3876042] 446 1048 252 357 1007 ...
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 412 712 1479 1289 9387024
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 3552.7502
## 2 member 850.0662
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1546
## 2 member 589
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 9387024
## 2 member 9056634
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 2
## 2 member 1
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Friday 3773.8351
## 2 member Friday 824.5305
## 3 casual Monday 3372.2869
## 4 member Monday 842.5726
## 5 casual Saturday 3331.9138
## 6 member Saturday 968.9337
## 7 casual Sunday 3581.4054
## 8 member Sunday 919.9746
## 9 casual Thursday 3682.9847
## 10 member Thursday 823.9278
## 11 casual Tuesday 3596.3599
## 12 member Tuesday 826.1427
## 13 casual Wednesday 3718.6619
## 14 member Wednesday 823.9996
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 3581.4054
## 2 member Sunday 919.9746
## 3 casual Monday 3372.2869
## 4 member Monday 842.5726
## 5 casual Tuesday 3596.3599
## 6 member Tuesday 826.1427
## 7 casual Wednesday 3718.6619
## 8 member Wednesday 823.9996
## 9 casual Thursday 3682.9847
## 10 member Thursday 823.9278
## 11 casual Friday 3773.8351
## 12 member Friday 824.5305
## 13 casual Saturday 3331.9138
## 14 member Saturday 968.9337
all_trips_v2 %>%
mutate(weekday = wday(started_at)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <int> <int> <dbl>
## 1 casual 1 181293 3581.
## 2 casual 2 103296 3372.
## 3 casual 3 90510 3596.
## 4 casual 4 92457 3719.
## 5 casual 5 102679 3683.
## 6 casual 6 122404 3774.
## 7 casual 7 209543 3332.
## 8 member 1 267965 920.
## 9 member 2 472196 843.
## 10 member 3 508445 826.
## 11 member 4 500329 824.
## 12 member 5 484177 824.
## 13 member 6 452790 825.
## 14 member 7 287958 969.
membership <- count(all_trips_v2,member_casual , name = "count")
ggplot(membership, aes(x = member_casual, y = count)) +
geom_col(fill="lightblue",width = 0.5)+
geom_text(aes(label=count), vjust = 2, fontface = "bold") +
scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))
all_trips_v2 %>%
mutate(weekday = wday(started_at)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = position_dodge(width = 0.5))+
scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_trips_v2 %>%
mutate(weekday = wday(started_at)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = position_dodge(width = 0.5))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
departure_station_members <- all_trips_v2 %>%
filter(member_casual == "member" ) %>%
group_by(start_station_name,.drop = FALSE, ) %>%
count(sort = TRUE , name = "count") %>%
head(5)
ggplot(departure_station_members, aes(x = count, y = start_station_name)) +
geom_col(fill="blue",width = 0.5)
departure_station_casual <- all_trips_v2 %>%
filter(member_casual == "casual" ) %>%
group_by(start_station_name,.drop = FALSE, ) %>%
count(sort = TRUE, name = "count") %>%
head(5)
ggplot(departure_station_casual, aes(x = count, y = start_station_name)) +
geom_col(fill="red",width = 0.5)
arrival_station_members <- all_trips_v2 %>%
filter(member_casual == "member" ) %>%
group_by(end_station_name,.drop = FALSE, ) %>%
count(sort = TRUE, name = "count") %>%
head(5)
ggplot(arrival_station_members, aes(x = count, y = end_station_name)) +
geom_col(fill="blue",width = 0.5)
arrival_station_casual <- all_trips_v2 %>%
filter(member_casual == "casual" ) %>%
group_by(end_station_name,.drop = FALSE, ) %>%
count(sort = TRUE, name = "count") %>%
head(5)
ggplot(arrival_station_casual, aes(x = count, y = end_station_name)) +
geom_col(fill="red",width = 0.5)
ggplot(data = all_trips) +
geom_bar(mapping = aes(x = time_start, fill = member_casual)) +
facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))
ggplot(data = all_trips) +
geom_bar(mapping = aes(x = time_end, fill = member_casual)) +
facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))
start_stations_location <- q1_2020 %>%
add_count(start_station_name, sort = TRUE) %>%
distinct(start_station_name, .keep_all = TRUE) %>%
select(c(start_station_name, n, start_lat,start_lng))
kable(head(start_stations_location,10),format = "html", align = "lcll") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
start_station_name | n | start_lat | start_lng |
---|---|---|---|
Canal St & Adams St | 7813 | 41.8793 | -87.6399 |
Clinton St & Madison St | 6797 | 41.8822 | -87.6411 |
Clinton St & Washington Blvd | 5941 | 41.8834 | -87.6412 |
Kingsbury St & Kinzie St | 4626 | 41.8892 | -87.6385 |
Columbus Dr & Randolph St | 4425 | 41.8847 | -87.6195 |
HQ QR | 3767 | 41.8899 | -87.6803 |
Franklin St & Monroe St | 3711 | 41.8803 | -87.6352 |
Canal St & Madison St | 3637 | 41.8821 | -87.6398 |
Clinton St & Lake St | 3580 | 41.8856 | -87.6418 |
Larrabee St & Kingsbury St | 3550 | 41.8978 | -87.6429 |
end_stations_location <- q1_2020 %>%
add_count(end_station_name, sort = TRUE) %>%
distinct(end_station_name, .keep_all = TRUE) %>%
select(c(end_station_name, n, end_lat,end_lng))
kable(head(end_stations_location,10),format = "html", align = "lcll") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name | n | end_lat | end_lng |
---|---|---|---|
Canal St & Adams St | 8323 | 41.8793 | -87.6399 |
Clinton St & Washington Blvd | 7166 | 41.8834 | -87.6412 |
Clinton St & Madison St | 6854 | 41.8822 | -87.6411 |
Kingsbury St & Kinzie St | 4615 | 41.8892 | -87.6385 |
Michigan Ave & Washington St | 4227 | 41.8840 | -87.6247 |
HQ QR | 3766 | 41.8899 | -87.6803 |
Franklin St & Monroe St | 3748 | 41.8803 | -87.6352 |
Canal St & Madison St | 3706 | 41.8821 | -87.6398 |
Clinton St & Lake St | 3646 | 41.8856 | -87.6418 |
St. Clair St & Erie St | 3618 | 41.8944 | -87.6227 |
Top5_departure_station_members <-left_join(departure_station_members,start_stations_location,by = "start_station_name") %>%
select(c(start_station_name, count, start_lat,start_lng))
kable(Top5_departure_station_members, format = "html", align = "lcll") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
start_station_name | count | start_lat | start_lng |
---|---|---|---|
Canal St & Adams St | 51948 | 41.8793 | -87.6399 |
Clinton St & Madison St | 46191 | 41.8822 | -87.6411 |
Clinton St & Washington Blvd | 43590 | 41.8834 | -87.6412 |
Columbus Dr & Randolph St | 31053 | 41.8847 | -87.6195 |
Franklin St & Monroe St | 30982 | 41.8803 | -87.6352 |
Top5_departure_station_casual <-left_join(departure_station_casual,start_stations_location,by = "start_station_name") %>%
select(c(start_station_name, count, start_lat,start_lng))
kable(Top5_departure_station_casual, format = "html", align = "lcll") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
start_station_name | count | start_lat | start_lng |
---|---|---|---|
Streeter Dr & Grand Ave | 53415 | 41.8923 | -87.6120 |
Lake Shore Dr & Monroe St | 39686 | 41.8810 | -87.6167 |
Millennium Park | 21901 | 41.8810 | -87.6241 |
Michigan Ave & Oak St | 21633 | 41.9010 | -87.6238 |
Shedd Aquarium | 20781 | 41.8672 | -87.6154 |
Top5_arrival_station_members <-left_join(arrival_station_members, end_stations_location,by = "end_station_name") %>%
select(c(end_station_name, count, end_lat,end_lng))
kable(Top5_arrival_station_members, format = "html", align = "lcll") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name | count | end_lat | end_lng |
---|---|---|---|
Canal St & Adams St | 48839 | 41.8793 | -87.6399 |
Clinton St & Washington Blvd | 47633 | 41.8834 | -87.6412 |
Clinton St & Madison St | 44285 | 41.8822 | -87.6411 |
Daley Center Plaza | 30845 | 41.8842 | -87.6296 |
Kingsbury St & Kinzie St | 30404 | 41.8892 | -87.6385 |
Top5_arrival_station_casual <-left_join(arrival_station_casual, end_stations_location,by = "end_station_name") %>%
select(c(end_station_name, count, end_lat,end_lng))
kable(Top5_arrival_station_casual, format = "html", align = "lcll") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name | count | end_lat | end_lng |
---|---|---|---|
Streeter Dr & Grand Ave | 67507 | 41.8923 | -87.6120 |
Lake Shore Dr & Monroe St | 31051 | 41.8810 | -87.6167 |
Millennium Park | 25509 | 41.8810 | -87.6241 |
Michigan Ave & Oak St | 23982 | 41.9010 | -87.6238 |
Lake Shore Dr & North Blvd | 23477 | 41.9117 | -87.6268 |
After Cleaning, organizing the data and analyzing each graph we had the following observations:
First number of casual customers is 905954 while number of members is 2973868
Second number of rides are far more on the members side than casuals.
Third casuals duration time is far more than members duration time
Forth visualizing the number of rides, it is clear that members take more rides than casuals, every day of the week.
Fifth visualizing the average duration we see that Casuals Spend much more time per trip than members
Sixth visualizing the Starting time for Casuals compared to members by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off peak times mostly in the afternoons and late at night. while on week ends casuals out number members most of the day.
Seventh visualizing the end time for members compared to Casuals by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off-peak times mostly again in the afternoons and late at night. while on week ends casuals out number members most of the day.
Eighth visualizing the top 5 departure Stations for members we check their locations on Google maps where it shows random type of departure points
Ninth visualizing the top 5 departure Stations for casuals. we check their locations on Google maps
where it shows random type of departure points.
Tenth visualizing the top 5 arrival Stations for members. we check their locations on Google maps
where it shows business type of arrival points
Eleventh visualizing the top 5 arrival Stations for casuals. we check their locations on Google maps
where it shows entertainment type of arrival points.
Members |
Casuals | |
Count |
2973868 | 905954 |
Count % |
77% | 23% |
Frequency of use |
much higher | |
Duration of use |
much higher | |
weekdays Departure Time |
mostly peak | mostly off peak |
weekdays Departure location |
random | random |
weekdays arrival Time |
mostly peak | mostly of peak |
weekdays arrival location |
mostly Business | mostly entertainment |
weekend departure time |
mostly early | mostly early |
weekend departure location |
random | random |
weekend arrival time |
mostly late | mostly late |
weekend arrival location |
entertainment | entertainment |
indicating that they mostly commute bet ween work and home |
indicating that they are mostly using bicycles for entertainment. |
Finally based on the above how can we convert casuals to members ?
firstly, we need to convince casuals that they can use these bikes not only for leisure but also for going to work by passing a massage like
OR