Cyclistic_Exercise_Full_Year_Analysis

This analysis is for case study 1 from the Google Data Analytics Certificate (Cyclistic). It’s originally based on the case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study).

We will be using the Divvy data set for the case study. The purpose of this script is to consolidate downloaded Divvy data into a single data frame and then conduct simple analysis to help answer the key question:
“In what ways do members and casual riders use Divvy bikes differently?”

using the following script https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Cyclistic case study
Install required packages
tidyverse for data import and wrangling
libridate for date functions
ggplot for visualization
Markdown for creating HTML report

install packages

install.packages("dplyr")

install.packages("lubridate")

install.packages("ggplot2")

install.packages("markdown")

install.packages("data.table")

install.packages("kableExtra")

install.packages("knitr")

install.packages("tidyverse")

load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)  
library(ggplot2)  
library(markdown)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(knitr)
library(dplyr)

Displays working directory and set new directory

getwd()
## [1] "C:/Users/tsheef/Desktop/case study Cyclistic"
setwd("/Users/tsheef/Desktop/case study Cyclistic")

STEP 1: COLLECT DATA

Upload Divvy data sets (csv files)
https://www.divvybikes.com/data-license-agreement
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num  (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE
Compare column names each of the files

While the names don’t have to be in the same order
They DO need to match perfectly before we can use a
Command to join them into one file
colnames(q3_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q4_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q2_2019)
##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"
colnames(q1_2020)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))
## # A tibble: 704,054 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20          2215          940
##  2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34          6328          258
##  3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43          3003          850
##  4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43          3275         2350
##  5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42          5294         1867
##  6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51          1891          373
##  7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45          1061         1072
##  8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16          1274         1458
##  9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18          6011         1437
## 10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46          2957         8306
## # ℹ 704,044 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## #   gender <chr>, birthyear <dbl>
(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))
## # A tibble: 1,640,718 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41          3591         1214
##  2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44          5353         1048
##  3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42          6180         1554
##  4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10          5540         1503
##  5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26          6014         1213
##  6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31          4941          310
##  7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12          3770         1248
##  8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16          5442         1550
##  9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57          2957         1583
## 10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14          6091         1589
## # ℹ 1,640,708 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## #   gender <chr>, birthyear <dbl>
(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))
## # A tibble: 1,108,163 × 12
##     ride_id started_at          ended_at            rideable_type
##       <dbl> <dttm>              <dttm>                      <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534
## # ℹ 1,108,153 more rows
## # ℹ 8 more variables: `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## #   start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## #   `05 - Member Details Member Birthday Year` <dbl>

Inspect the data frames and look for inconsistencies

str(q1_2020)
## spc_tbl_ [426,887 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
##  $ rideable_type     : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:426887], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
##  $ ended_at          : POSIXct[1:426887], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
##  $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
##  $ start_station_id  : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
##  $ end_station_name  : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
##  $ end_station_id    : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
##  $ start_lat         : num [1:426887] 42 42 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:426887] -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:426887] 42 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:426887] -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:426887] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(q4_2019)
## spc_tbl_ [704,054 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...
##  $ started_at        : POSIXct[1:704054], format: "2019-10-01 00:01:39" "2019-10-01 00:02:16" ...
##  $ ended_at          : POSIXct[1:704054], format: "2019-10-01 00:17:20" "2019-10-01 00:06:34" ...
##  $ rideable_type     : num [1:704054] 2215 6328 3003 3275 5294 ...
##  $ tripduration      : num [1:704054] 940 258 850 2350 1867 ...
##  $ start_station_id  : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...
##  $ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...
##  $ end_station_name  : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
##  $ member_casual     : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender            : chr [1:704054] "Male" "Male" "Female" "Male" ...
##  $ birthyear         : num [1:704054] 1987 1998 1991 1990 1987 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(q3_2019)
## spc_tbl_ [1,640,718 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...
##  $ started_at        : POSIXct[1:1640718], format: "2019-07-01 00:00:27" "2019-07-01 00:01:16" ...
##  $ ended_at          : POSIXct[1:1640718], format: "2019-07-01 00:20:41" "2019-07-01 00:18:44" ...
##  $ rideable_type     : num [1:1640718] 3591 5353 6180 5540 6014 ...
##  $ tripduration      : num [1:1640718] 1214 1048 1554 1503 1213 ...
##  $ start_station_id  : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...
##  $ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...
##  $ end_station_name  : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
##  $ member_casual     : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...
##  $ gender            : chr [1:1640718] "Male" NA NA NA ...
##  $ birthyear         : num [1:1640718] 1992 NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(q2_2019)
## spc_tbl_ [1,108,163 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id                                         : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...
##  $ started_at                                      : POSIXct[1:1108163], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at                                        : POSIXct[1:1108163], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type                                   : num [1:1108163] 6251 6226 5649 4151 3270 ...
##  $ 01 - Rental Details Duration In Seconds Uncapped: num [1:1108163] 446 1048 252 357 1007 ...
##  $ start_station_id                                : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name                              : chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id                                  : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name                                : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual                                   : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ Member Gender                                   : chr [1:1108163] "Male" "Female" "Male" "Male" ...
##  $ 05 - Member Details Member Birthday Year        : num [1:1108163] 1975 1984 1990 1993 1992 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `01 - Rental Details Rental ID` = col_double(),
##   ..   `01 - Rental Details Local Start Time` = col_datetime(format = ""),
##   ..   `01 - Rental Details Local End Time` = col_datetime(format = ""),
##   ..   `01 - Rental Details Bike ID` = col_double(),
##   ..   `01 - Rental Details Duration In Seconds Uncapped` = col_number(),
##   ..   `03 - Rental Start Station ID` = col_double(),
##   ..   `03 - Rental Start Station Name` = col_character(),
##   ..   `02 - Rental End Station ID` = col_double(),
##   ..   `02 - Rental End Station Name` = col_character(),
##   ..   `User Type` = col_character(),
##   ..   `Member Gender` = col_character(),
##   ..   `05 - Member Details Member Birthday Year` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Convert ride_id and rideable_type to character so that they can stack correctly

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 

Stack individual quarter’s data frames into one big data frame

all_trips <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020)

Remove lat, long, birth year, and gender fields as this data was dropped beginning in 2020

all_trips <- all_trips %>%  
  select(-c(birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
Inspect the new table that has been created

colnames(all_trips)  
##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "start_station_id"   "start_station_name"
##  [7] "end_station_id"     "end_station_name"   "member_casual"     
## [10] "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"
nrow(all_trips)
## [1] 3879822
dim(all_trips)  
## [1] 3879822      13
head(all_trips)  
## # A tibble: 6 × 13
##   ride_id started_at          ended_at            rideable_type start_station_id
##   <chr>   <dttm>              <dttm>              <chr>                    <dbl>
## 1 221785… 2019-04-01 00:02:22 2019-04-01 00:09:48 6251                        81
## 2 221785… 2019-04-01 00:03:02 2019-04-01 00:20:30 6226                       317
## 3 221785… 2019-04-01 00:11:07 2019-04-01 00:15:19 5649                       283
## 4 221785… 2019-04-01 00:13:01 2019-04-01 00:18:58 4151                        26
## 5 221785… 2019-04-01 00:19:26 2019-04-01 00:36:13 3270                       202
## 6 221785… 2019-04-01 00:19:39 2019-04-01 00:23:56 3123                       420
## # ℹ 8 more variables: start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>
str(all_trips)  
## tibble [3,879,822 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3879822] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ start_lat         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
summary(all_trips)  
##    ride_id            started_at                    
##  Length:3879822     Min.   :2019-04-01 00:02:22.00  
##  Class :character   1st Qu.:2019-06-23 07:49:09.25  
##  Mode  :character   Median :2019-08-14 17:43:38.00  
##                     Mean   :2019-08-26 00:49:59.38  
##                     3rd Qu.:2019-10-12 12:10:21.00  
##                     Max.   :2020-03-31 23:51:34.00  
##                                                     
##     ended_at                      rideable_type      start_station_id
##  Min.   :2019-04-01 00:09:48.00   Length:3879822     Min.   :  1.0   
##  1st Qu.:2019-06-23 08:20:27.75   Class :character   1st Qu.: 77.0   
##  Median :2019-08-14 18:02:04.00   Mode  :character   Median :174.0   
##  Mean   :2019-08-26 01:14:37.06                      Mean   :202.9   
##  3rd Qu.:2019-10-12 12:36:16.75                      3rd Qu.:291.0   
##  Max.   :2020-05-19 20:10:34.00                      Max.   :675.0   
##                                                                      
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:3879822     Min.   :  1.0   Length:3879822     Length:3879822    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :203.8                                        
##                     3rd Qu.:291.0                                        
##                     Max.   :675.0                                        
##                     NA's   :1                                            
##    start_lat         start_lng          end_lat           end_lng       
##  Min.   :42        Min.   :-88       Min.   :42        Min.   :-88      
##  1st Qu.:42        1st Qu.:-88       1st Qu.:42        1st Qu.:-88      
##  Median :42        Median :-88       Median :42        Median :-88      
##  Mean   :42        Mean   :-88       Mean   :42        Mean   :-88      
##  3rd Qu.:42        3rd Qu.:-88       3rd Qu.:42        3rd Qu.:-88      
##  Max.   :42        Max.   :-88       Max.   :42        Max.   :-88      
##  NA's   :3452935   NA's   :3452935   NA's   :3452936   NA's   :3452936

There are a few problems we will need to fix:

(1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.
(2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.
(3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “trip duration” column. We will add “ride_length” to the entire data frame for consistency.
(4) There are some rides where trip duration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.
In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual”
Before 2020, Divvy used different labels for these two types of riders … we will want to make our dataframe consistent with their current nomenclature
N.B.: “Level” is a special property of a column that is retained even if a subset does not contain any values from a specific level
Begin by seeing how many observations fall under each user type
table(all_trips$member_casual)
## 
##     casual   Customer     member Subscriber 
##      48480     857474     378407    2595461

Reassign to the desired values (we will go with the current 2020 labels)

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))

Check to make sure the proper number of observations were reassigned

table(all_trips$member_casual)
## 
##  casual  member 
##  905954 2973868

Add columns that list the date, month, day, and year of each ride

This will allow us to aggregate ride data for each month, day, or year … before completing these operations we could only aggregate at the ride level
all_trips$date <- as.Date(all_trips$started_at) 
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$time_start <- format(as.POSIXct(all_trips$started_at,format="%H:%M:%S"),"%H:%M")
all_trips$time_end <- format(as.POSIXct(all_trips$ended_at,format="%H:%M:%S"),"%H:%M")

Add a “ride_length” calculation to all_trips (in seconds)

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Inspect the structure of the columns

str(all_trips)
## tibble [3,879,822 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3879822] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3879822] "member" "member" "member" "member" ...
##  $ start_lat         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ date              : Date[1:3879822], format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr [1:3879822] "04" "04" "04" "04" ...
##  $ day               : chr [1:3879822] "01" "01" "01" "01" ...
##  $ year              : chr [1:3879822] "2019" "2019" "2019" "2019" ...
##  $ day_of_week       : chr [1:3879822] "Monday" "Monday" "Monday" "Monday" ...
##  $ time_start        : chr [1:3879822] "00:02" "00:03" "00:11" "00:13" ...
##  $ time_end          : chr [1:3879822] "00:09" "00:20" "00:15" "00:18" ...
##  $ ride_length       : 'difftime' num [1:3879822] 446 1048 252 357 ...
##   ..- attr(*, "units")= chr "secs"

Convert “ride_length” from Factor to numeric so we can run calculations on the data

is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE

Remove “bad” data

The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

We will create a new version of the data frame (v2) since data is being removed

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
str(all_trips_v2)
## tibble [3,876,042 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3876042] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3876042], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3876042], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3876042] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3876042] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3876042] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3876042] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3876042] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3876042] "member" "member" "member" "member" ...
##  $ start_lat         : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ date              : Date[1:3876042], format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr [1:3876042] "04" "04" "04" "04" ...
##  $ day               : chr [1:3876042] "01" "01" "01" "01" ...
##  $ year              : chr [1:3876042] "2019" "2019" "2019" "2019" ...
##  $ day_of_week       : chr [1:3876042] "Monday" "Monday" "Monday" "Monday" ...
##  $ time_start        : chr [1:3876042] "00:02" "00:03" "00:11" "00:13" ...
##  $ time_end          : chr [1:3876042] "00:09" "00:20" "00:15" "00:18" ...
##  $ ride_length       : num [1:3876042] 446 1048 252 357 1007 ...

STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

Descriptive analysis on ride_length (all figures in seconds)
Mean(all_trips_v2 ride_length) straight average (total ride length / rides)
Median(all_trips_v2 ride_length) midpoint number in the ascending array of ride lengths
Max(all_trips_v2 ride_length) longest ride
Min(all_trips_v2 ride_length) shortest ride
You can condense the four lines above to one line using summary() on the specific attribute
summary(all_trips_v2$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     412     712    1479    1289 9387024

Compare members and casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                3552.7502
## 2                     member                 850.0662
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                     1546
## 2                     member                      589
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                  9387024
## 2                     member                  9056634
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                        2
## 2                     member                        1

See the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Friday                3773.8351
## 2                      member                   Friday                 824.5305
## 3                      casual                   Monday                3372.2869
## 4                      member                   Monday                 842.5726
## 5                      casual                 Saturday                3331.9138
## 6                      member                 Saturday                 968.9337
## 7                      casual                   Sunday                3581.4054
## 8                      member                   Sunday                 919.9746
## 9                      casual                 Thursday                3682.9847
## 10                     member                 Thursday                 823.9278
## 11                     casual                  Tuesday                3596.3599
## 12                     member                  Tuesday                 826.1427
## 13                     casual                Wednesday                3718.6619
## 14                     member                Wednesday                 823.9996

Notice that the days of the week are out of order. Let’s fix that.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Now, let’s run the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Sunday                3581.4054
## 2                      member                   Sunday                 919.9746
## 3                      casual                   Monday                3372.2869
## 4                      member                   Monday                 842.5726
## 5                      casual                  Tuesday                3596.3599
## 6                      member                  Tuesday                 826.1427
## 7                      casual                Wednesday                3718.6619
## 8                      member                Wednesday                 823.9996
## 9                      casual                 Thursday                3682.9847
## 10                     member                 Thursday                 823.9278
## 11                     casual                   Friday                3773.8351
## 12                     member                   Friday                 824.5305
## 13                     casual                 Saturday                3331.9138
## 14                     member                 Saturday                 968.9337
analyze ridership data by type and weekday
creates weekday field using wday()
groups by usertype and weekday
calculates the number of rides and average duration
calculates the average duration
sorts
all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>%  
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>           <int>           <int>            <dbl>
##  1 casual              1          181293            3581.
##  2 casual              2          103296            3372.
##  3 casual              3           90510            3596.
##  4 casual              4           92457            3719.
##  5 casual              5          102679            3683.
##  6 casual              6          122404            3774.
##  7 casual              7          209543            3332.
##  8 member              1          267965             920.
##  9 member              2          472196             843.
## 10 member              3          508445             826.
## 11 member              4          500329             824.
## 12 member              5          484177             824.
## 13 member              6          452790             825.
## 14 member              7          287958             969.

Let’s visualize the ratio of Members To Casuals

membership <- count(all_trips_v2,member_casual , name = "count")

ggplot(membership, aes(x = member_casual, y = count)) +
  geom_col(fill="lightblue",width = 0.5)+
  geom_text(aes(label=count), vjust = 2, fontface = "bold") +
  scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))

Let’s visualize the number of rides by rider type

all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.5))+
  scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Let’s create a visualization for average duration

all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.5))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Let’s create a visualization for the top 5 departure Stations for members

departure_station_members <- all_trips_v2 %>% 
     filter(member_casual == "member" ) %>%
     group_by(start_station_name,.drop = FALSE, ) %>% 
     count(sort = TRUE , name = "count") %>%
    head(5)
     
ggplot(departure_station_members, aes(x = count, y = start_station_name)) +
  geom_col(fill="blue",width = 0.5)

Let’s create a visualization for the top 5 departure Stations for casuals

departure_station_casual <- all_trips_v2 %>% 
  filter(member_casual == "casual" ) %>%
  group_by(start_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5) 
  
ggplot(departure_station_casual, aes(x = count, y = start_station_name)) +
  geom_col(fill="red",width = 0.5)

Let’s create a visualization for the top 5 Arrival Stations for members

arrival_station_members <- all_trips_v2 %>% 
  filter(member_casual == "member" ) %>%
  group_by(end_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5)

ggplot(arrival_station_members, aes(x = count, y = end_station_name)) +
  geom_col(fill="blue",width = 0.5)

Let’s create a visualization for the top 5 Arrival Stations for casuals

arrival_station_casual <- all_trips_v2 %>% 
  filter(member_casual == "casual" ) %>%
  group_by(end_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5)
ggplot(arrival_station_casual, aes(x = count, y = end_station_name)) +
  geom_col(fill="red",width = 0.5)

Let’s create a visualization for the Starting time for Casuals And members by each day of the week

  ggplot(data = all_trips) +
    geom_bar(mapping = aes(x = time_start, fill = member_casual)) +
    facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))

Let’s create a visualization for the ending time for Casual And members by day of the week

  ggplot(data = all_trips) +
    geom_bar(mapping = aes(x = time_end, fill = member_casual)) +
    facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))

lets create a table of all starting stations with coordination to check later on Google Maps

start_stations_location <- q1_2020 %>%     
add_count(start_station_name, sort = TRUE) %>%    
  distinct(start_station_name, .keep_all = TRUE) %>%         
  select(c(start_station_name, n, start_lat,start_lng))       

kable(head(start_stations_location,10),format = "html", align = "lcll") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
start_station_name n start_lat start_lng
Canal St & Adams St 7813 41.8793 -87.6399
Clinton St & Madison St 6797 41.8822 -87.6411
Clinton St & Washington Blvd 5941 41.8834 -87.6412
Kingsbury St & Kinzie St 4626 41.8892 -87.6385
Columbus Dr & Randolph St 4425 41.8847 -87.6195
HQ QR 3767 41.8899 -87.6803
Franklin St & Monroe St 3711 41.8803 -87.6352
Canal St & Madison St 3637 41.8821 -87.6398
Clinton St & Lake St 3580 41.8856 -87.6418
Larrabee St & Kingsbury St 3550 41.8978 -87.6429

lets create a table of all ending stations with coordination to check later on Google Maps

end_stations_location <- q1_2020 %>%      
  add_count(end_station_name, sort = TRUE) %>%        
  distinct(end_station_name, .keep_all = TRUE) %>%     
  select(c(end_station_name, n, end_lat,end_lng))       

 
   kable(head(end_stations_location,10),format = "html", align = "lcll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name n end_lat end_lng
Canal St & Adams St 8323 41.8793 -87.6399
Clinton St & Washington Blvd 7166 41.8834 -87.6412
Clinton St & Madison St 6854 41.8822 -87.6411
Kingsbury St & Kinzie St 4615 41.8892 -87.6385
Michigan Ave & Washington St 4227 41.8840 -87.6247
HQ QR 3766 41.8899 -87.6803
Franklin St & Monroe St 3748 41.8803 -87.6352
Canal St & Madison St 3706 41.8821 -87.6398
Clinton St & Lake St 3646 41.8856 -87.6418
St. Clair St & Erie St 3618 41.8944 -87.6227

lets visualize the top 5 departure stations for members on Google Maps (red markers)

Top5_departure_station_members <-left_join(departure_station_members,start_stations_location,by = "start_station_name") %>%   
select(c(start_station_name, count, start_lat,start_lng))  

kable(Top5_departure_station_members, format = "html", align = "lcll") %>%   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))    
start_station_name count start_lat start_lng
Canal St & Adams St 51948 41.8793 -87.6399
Clinton St & Madison St 46191 41.8822 -87.6411
Clinton St & Washington Blvd 43590 41.8834 -87.6412
Columbus Dr & Randolph St 31053 41.8847 -87.6195
Franklin St & Monroe St 30982 41.8803 -87.6352

Click on the image to view in Google Maps

Click to see on Google Maps

lets visualize the top 5 departure stations for casuals on Google Maps (red markers)

Top5_departure_station_casual <-left_join(departure_station_casual,start_stations_location,by = "start_station_name") %>%   
select(c(start_station_name, count, start_lat,start_lng))  

kable(Top5_departure_station_casual, format = "html", align = "lcll") %>%   
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
start_station_name count start_lat start_lng
Streeter Dr & Grand Ave 53415 41.8923 -87.6120
Lake Shore Dr & Monroe St 39686 41.8810 -87.6167
Millennium Park 21901 41.8810 -87.6241
Michigan Ave & Oak St 21633 41.9010 -87.6238
Shedd Aquarium 20781 41.8672 -87.6154

Click on the image to view in Google Maps

click to see on google maps

lets visualize the top 5 arrival stations for members on Google Maps (red markers)

Top5_arrival_station_members <-left_join(arrival_station_members, end_stations_location,by = "end_station_name") %>% 
  select(c(end_station_name, count, end_lat,end_lng))  

kable(Top5_arrival_station_members, format = "html", align = "lcll") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name count end_lat end_lng
Canal St & Adams St 48839 41.8793 -87.6399
Clinton St & Washington Blvd 47633 41.8834 -87.6412
Clinton St & Madison St 44285 41.8822 -87.6411
Daley Center Plaza 30845 41.8842 -87.6296
Kingsbury St & Kinzie St 30404 41.8892 -87.6385

Click on the image to view in Google Maps

click to see on google maps

lets visualize the top 5 arrival stations for casuals on Google Maps (red markers)

Top5_arrival_station_casual <-left_join(arrival_station_casual, end_stations_location,by = "end_station_name") %>% 
  select(c(end_station_name, count, end_lat,end_lng))  

kable(Top5_arrival_station_casual, format = "html", align = "lcll") %>%   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
end_station_name count end_lat end_lng
Streeter Dr & Grand Ave 67507 41.8923 -87.6120
Lake Shore Dr & Monroe St 31051 41.8810 -87.6167
Millennium Park 25509 41.8810 -87.6241
Michigan Ave & Oak St 23982 41.9010 -87.6238
Lake Shore Dr & North Blvd 23477 41.9117 -87.6268

Click on the image to view in Google Maps

click to see on google maps

Our observations and analysts

After Cleaning, organizing the data and analyzing each graph we had the following observations:

First number of casual customers is 905954 while number of members is 2973868

Let’s visualize the ratio of Members To Casuals

Second number of rides are far more on the members side than casuals.

Let’s visualize the number of rides by rider type

Third casuals duration time is far more than members duration time

Let’s create a visualization for average duration

Forth visualizing the number of rides, it is clear that members take more rides than casuals, every day of the week.

Let’s visualize the number of rides by rider type

Fifth visualizing the average duration we see that Casuals Spend much more time per trip than members

Let’s create a visualization for average duration

Sixth visualizing the Starting time for Casuals compared to members by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off peak times mostly in the afternoons and late at night. while on week ends casuals out number members most of the day.

Let’s create a visualization for the Starting time for Casuals And mambers by each day of the week

Seventh visualizing the end time for members compared to Casuals by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off-peak times mostly again in the afternoons and late at night. while on week ends casuals out number members most of the day.

Let’s create a visualization for the ending time for Casual And members by day of the week

Eighth visualizing the top 5 departure Stations for members we check their locations on Google maps where it shows random type of departure points

lets visualize the top 5 departure stations for members on Google Maps (red markers)

Ninth visualizing the top 5 departure Stations for casuals. we check their locations on Google maps

where it shows random type of departure points.

lets visualize the top 5 departure stations for casuals on Google Maps (red markers)

Tenth visualizing the top 5 arrival Stations for members. we check their locations on Google maps

where it shows business type of arrival points

lets visualize the top 5 arrival stations for members on Google Maps (red markers)

Eleventh visualizing the top 5 arrival Stations for casuals. we check their locations on Google maps

where it shows entertainment type of arrival points.

Based on all the previous analyses and the Eleven observations, to answer our key Question: <br>

“In what ways do members and casual riders use Divvy bikes differently?”

we can summarize that:
most members, while they use their bikes much more frequently, they tend to use it for less duration time they use their bikes departing at peak working hours and arriving back at peak working hours departing from random locations probably their homes to business location then back. indicating that they mostly commute between work and home.
while casuals though they use their bikes much less number of times but for much longer duration, they tend to use their bikes mostly off peak times and late after-work from random location to entertainment locations as we can induce from Google maps therefore casuals are mostly using bicycles for entertainment.

Members

Casuals

Count

2973868 905954

Count %

77% 23%

Frequency of use

much higher

Duration of use

much higher

weekdays Departure Time

mostly peak mostly off peak

weekdays Departure location

random random

weekdays arrival Time

mostly peak mostly of peak

weekdays arrival location

mostly Business mostly entertainment

weekend departure time

mostly early mostly early

weekend departure location

random random

weekend arrival time

mostly late mostly late

weekend arrival location

entertainment entertainment

indicating that they mostly commute bet ween work and home
indicating that they are mostly using bicycles for entertainment.

Finally based on the above how can we convert casuals to members ?

firstly, we need to convince casuals that they can use these bikes not only for leisure but also for going to work by passing a massage like

““why don’t you extend your leisure time?””

OR

““Workout to work by a Healthy, Green, Inexpensive commute .””
these massages can be sent directly to our customers, excluding members by contact information, such as emails and/or SMS. also could be placed on Google adds based on the top 5 locations and also on billboards at those same locations.
we could also give a special discount for a limited period of time.
In addition since these customers are heading for entertainments we could also arrange for packages with the most desired locations. like a discount or a free drink or a day pass or a free child pass etc.
However based on my experience in sales and marketing i think those two categories members and casuals are so much different that i would recommend for better CPC “cost per customer rates” to spend our advertising budget on getting more customers in general rather than trying to convert our casuals.
Also since we have no pricing structure to compare, we do not know if converting casuals to members wont in fact reduce our profitability buy reducing income while adding more time consumption on our vehicles since casuals if converted would use bikes more frequently as indicated by our analyses while they already use it for much more extended times.