Cyclistic

Cyclistic_Exercise_Full_Year_Analysis

This analysis is for case study 1 from the Google Data Analytics Certificate (Cyclistic). It’s originally based on the case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study).

We will be using the Divvy data set for the case study. The purpose of this script is to consolidate downloaded Divvy data into a single data frame and then conduct simple analysis to help answer the key question:
“In what ways do members and casual riders use Divvy bikes differently?”

using the following script https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Cyclistic case study
Install required packages
tidyverse for data import and wrangling
libridate for date functions
ggplot for visualization
Markdown for creating HTML report

`install packages`

install.packages("dplyr")

install.packages("lubridate")

install.packages("ggplot2")

install.packages("markdown")

install.packages("data.table")

install.packages("kableExtra")

install.packages("knitr")

install.packages("tidyverse")

load Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)  
library(ggplot2)  
library(markdown)
library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(knitr)
library(dplyr)

Displays working directory and set new directory

getwd()

## [1] "C:/Users/tsheef/Desktop/case study Cyclistic"

setwd("/Users/tsheef/Desktop/case study Cyclistic")

STEP 1: COLLECT DATA

Upload Divvy data sets (csv files)

https://divvy-tripdata.s3.amazonaws.com/index.html

https://www.divvybikes.com/data-license-agreement

q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")

## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num  (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")

## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")

## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")

## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE
Compare column names each of the files

While the names don’t have to be in the same order

They DO need to match perfectly before we can use a

Command to join them into one file

colnames(q3_2019)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(q4_2019)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(q2_2019)

##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"

colnames(q1_2020)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

## # A tibble: 704,054 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20          2215          940
##  2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34          6328          258
##  3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43          3003          850
##  4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43          3275         2350
##  5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42          5294         1867
##  6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51          1891          373
##  7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45          1061         1072
##  8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16          1274         1458
##  9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18          6011         1437
## 10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46          2957         8306
## # ℹ 704,044 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## #   gender <chr>, birthyear <dbl>

(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

## # A tibble: 1,640,718 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41          3591         1214
##  2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44          5353         1048
##  3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42          6180         1554
##  4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10          5540         1503
##  5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26          6014         1213
##  6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31          4941          310
##  7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12          3770         1248
##  8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16          5442         1550
##  9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57          2957         1583
## 10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14          6091         1589
## # ℹ 1,640,708 more rows
## # ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## #   gender <chr>, birthyear <dbl>

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))

## # A tibble: 1,108,163 × 12
##     ride_id started_at          ended_at            rideable_type
##       <dbl> <dttm>              <dttm>                      <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534
## # ℹ 1,108,153 more rows
## # ℹ 8 more variables: `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## #   start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## #   `05 - Member Details Member Birthday Year` <dbl>

Inspect the data frames and look for inconsistencies

str(q1_2020)

## spc_tbl_ [426,887 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
##  $ rideable_type     : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:426887], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
##  $ ended_at          : POSIXct[1:426887], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
##  $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
##  $ start_station_id  : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
##  $ end_station_name  : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
##  $ end_station_id    : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
##  $ start_lat         : num [1:426887] 42 42 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:426887] -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:426887] 42 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:426887] -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:426887] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(q4_2019)

## spc_tbl_ [704,054 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...
##  $ started_at        : POSIXct[1:704054], format: "2019-10-01 00:01:39" "2019-10-01 00:02:16" ...
##  $ ended_at          : POSIXct[1:704054], format: "2019-10-01 00:17:20" "2019-10-01 00:06:34" ...
##  $ rideable_type     : num [1:704054] 2215 6328 3003 3275 5294 ...
##  $ tripduration      : num [1:704054] 940 258 850 2350 1867 ...
##  $ start_station_id  : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...
##  $ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...
##  $ end_station_name  : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
##  $ member_casual     : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender            : chr [1:704054] "Male" "Male" "Female" "Male" ...
##  $ birthyear         : num [1:704054] 1987 1998 1991 1990 1987 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(q3_2019)

## spc_tbl_ [1,640,718 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...
##  $ started_at        : POSIXct[1:1640718], format: "2019-07-01 00:00:27" "2019-07-01 00:01:16" ...
##  $ ended_at          : POSIXct[1:1640718], format: "2019-07-01 00:20:41" "2019-07-01 00:18:44" ...
##  $ rideable_type     : num [1:1640718] 3591 5353 6180 5540 6014 ...
##  $ tripduration      : num [1:1640718] 1214 1048 1554 1503 1213 ...
##  $ start_station_id  : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...
##  $ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...
##  $ end_station_name  : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
##  $ member_casual     : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...
##  $ gender            : chr [1:1640718] "Male" NA NA NA ...
##  $ birthyear         : num [1:1640718] 1992 NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(q2_2019)

## spc_tbl_ [1,108,163 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id                                         : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...
##  $ started_at                                      : POSIXct[1:1108163], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at                                        : POSIXct[1:1108163], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type                                   : num [1:1108163] 6251 6226 5649 4151 3270 ...
##  $ 01 - Rental Details Duration In Seconds Uncapped: num [1:1108163] 446 1048 252 357 1007 ...
##  $ start_station_id                                : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name                              : chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id                                  : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name                                : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual                                   : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ Member Gender                                   : chr [1:1108163] "Male" "Female" "Male" "Male" ...
##  $ 05 - Member Details Member Birthday Year        : num [1:1108163] 1975 1984 1990 1993 1992 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `01 - Rental Details Rental ID` = col_double(),
##   ..   `01 - Rental Details Local Start Time` = col_datetime(format = ""),
##   ..   `01 - Rental Details Local End Time` = col_datetime(format = ""),
##   ..   `01 - Rental Details Bike ID` = col_double(),
##   ..   `01 - Rental Details Duration In Seconds Uncapped` = col_number(),
##   ..   `03 - Rental Start Station ID` = col_double(),
##   ..   `03 - Rental Start Station Name` = col_character(),
##   ..   `02 - Rental End Station ID` = col_double(),
##   ..   `02 - Rental End Station Name` = col_character(),
##   ..   `User Type` = col_character(),
##   ..   `Member Gender` = col_character(),
##   ..   `05 - Member Details Member Birthday Year` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Convert ride_id and rideable_type to character so that they can stack correctly

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))

Stack individual quarter’s data frames into one big data frame

all_trips <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020)

Remove lat, long, birth year, and gender fields as this data was dropped beginning in 2020

all_trips <- all_trips %>%  
  select(-c(birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
Inspect the new table that has been created

colnames(all_trips)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "start_station_id"   "start_station_name"
##  [7] "end_station_id"     "end_station_name"   "member_casual"     
## [10] "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"

nrow(all_trips)

## [1] 3879822

dim(all_trips)

## [1] 3879822      13

head(all_trips)

## # A tibble: 6 × 13
##   ride_id started_at          ended_at            rideable_type start_station_id
##   <chr>   <dttm>              <dttm>              <chr>                    <dbl>
## 1 221785… 2019-04-01 00:02:22 2019-04-01 00:09:48 6251                        81
## 2 221785… 2019-04-01 00:03:02 2019-04-01 00:20:30 6226                       317
## 3 221785… 2019-04-01 00:11:07 2019-04-01 00:15:19 5649                       283
## 4 221785… 2019-04-01 00:13:01 2019-04-01 00:18:58 4151                        26
## 5 221785… 2019-04-01 00:19:26 2019-04-01 00:36:13 3270                       202
## 6 221785… 2019-04-01 00:19:39 2019-04-01 00:23:56 3123                       420
## # ℹ 8 more variables: start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>

str(all_trips)

## tibble [3,879,822 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3879822] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ start_lat         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...

summary(all_trips)

##    ride_id            started_at                    
##  Length:3879822     Min.   :2019-04-01 00:02:22.00  
##  Class :character   1st Qu.:2019-06-23 07:49:09.25  
##  Mode  :character   Median :2019-08-14 17:43:38.00  
##                     Mean   :2019-08-26 00:49:59.38  
##                     3rd Qu.:2019-10-12 12:10:21.00  
##                     Max.   :2020-03-31 23:51:34.00  
##                                                     
##     ended_at                      rideable_type      start_station_id
##  Min.   :2019-04-01 00:09:48.00   Length:3879822     Min.   :  1.0   
##  1st Qu.:2019-06-23 08:20:27.75   Class :character   1st Qu.: 77.0   
##  Median :2019-08-14 18:02:04.00   Mode  :character   Median :174.0   
##  Mean   :2019-08-26 01:14:37.06                      Mean   :202.9   
##  3rd Qu.:2019-10-12 12:36:16.75                      3rd Qu.:291.0   
##  Max.   :2020-05-19 20:10:34.00                      Max.   :675.0   
##                                                                      
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:3879822     Min.   :  1.0   Length:3879822     Length:3879822    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :203.8                                        
##                     3rd Qu.:291.0                                        
##                     Max.   :675.0                                        
##                     NA's   :1                                            
##    start_lat         start_lng          end_lat           end_lng       
##  Min.   :42        Min.   :-88       Min.   :42        Min.   :-88      
##  1st Qu.:42        1st Qu.:-88       1st Qu.:42        1st Qu.:-88      
##  Median :42        Median :-88       Median :42        Median :-88      
##  Mean   :42        Mean   :-88       Mean   :42        Mean   :-88      
##  3rd Qu.:42        3rd Qu.:-88       3rd Qu.:42        3rd Qu.:-88      
##  Max.   :42        Max.   :-88       Max.   :42        Max.   :-88      
##  NA's   :3452935   NA's   :3452935   NA's   :3452936   NA's   :3452936

There are a few problems we will need to fix:

(1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.

(2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.

(3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “trip duration” column. We will add “ride_length” to the entire data frame for consistency.

(4) There are some rides where trip duration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual”

Before 2020, Divvy used different labels for these two types of riders … we will want to make our dataframe consistent with their current nomenclature

N.B.: “Level” is a special property of a column that is retained even if a subset does not contain any values from a specific level

Begin by seeing how many observations fall under each user type

table(all_trips$member_casual)

## 
##     casual   Customer     member Subscriber 
##      48480     857474     378407    2595461

Reassign to the desired values (we will go with the current 2020 labels)

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))

Check to make sure the proper number of observations were reassigned

table(all_trips$member_casual)

## 
##  casual  member 
##  905954 2973868

Add columns that list the date, month, day, and year of each ride

This will allow us to aggregate ride data for each month, day, or year … before completing these operations we could only aggregate at the ride level

all_trips$date <- as.Date(all_trips$started_at) 
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$time_start <- format(as.POSIXct(all_trips$started_at,format="%H:%M:%S"),"%H:%M")
all_trips$time_end <- format(as.POSIXct(all_trips$ended_at,format="%H:%M:%S"),"%H:%M")

Add a “ride_length” calculation to all_trips (in seconds)

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Inspect the structure of the columns

str(all_trips)

## tibble [3,879,822 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3879822] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3879822] "member" "member" "member" "member" ...
##  $ start_lat         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ date              : Date[1:3879822], format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr [1:3879822] "04" "04" "04" "04" ...
##  $ day               : chr [1:3879822] "01" "01" "01" "01" ...
##  $ year              : chr [1:3879822] "2019" "2019" "2019" "2019" ...
##  $ day_of_week       : chr [1:3879822] "Monday" "Monday" "Monday" "Monday" ...
##  $ time_start        : chr [1:3879822] "00:02" "00:03" "00:11" "00:13" ...
##  $ time_end          : chr [1:3879822] "00:09" "00:20" "00:15" "00:18" ...
##  $ ride_length       : 'difftime' num [1:3879822] 446 1048 252 357 ...
##   ..- attr(*, "units")= chr "secs"

Convert “ride_length” from Factor to numeric so we can run calculations on the data

is.factor(all_trips$ride_length)

## [1] FALSE

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

## [1] TRUE

Remove “bad” data

The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

We will create a new version of the data frame (v2) since data is being removed

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

str(all_trips_v2)

## tibble [3,876,042 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3876042] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : POSIXct[1:3876042], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at          : POSIXct[1:3876042], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type     : chr [1:3876042] "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : num [1:3876042] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr [1:3876042] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : num [1:3876042] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr [1:3876042] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr [1:3876042] "member" "member" "member" "member" ...
##  $ start_lat         : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng         : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat           : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng           : num [1:3876042] NA NA NA NA NA NA NA NA NA NA ...
##  $ date              : Date[1:3876042], format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr [1:3876042] "04" "04" "04" "04" ...
##  $ day               : chr [1:3876042] "01" "01" "01" "01" ...
##  $ year              : chr [1:3876042] "2019" "2019" "2019" "2019" ...
##  $ day_of_week       : chr [1:3876042] "Monday" "Monday" "Monday" "Monday" ...
##  $ time_start        : chr [1:3876042] "00:02" "00:03" "00:11" "00:13" ...
##  $ time_end          : chr [1:3876042] "00:09" "00:20" "00:15" "00:18" ...
##  $ ride_length       : num [1:3876042] 446 1048 252 357 1007 ...

STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

Descriptive analysis on ride_length (all figures in seconds)
Mean(all_trips_v2 ride_length) straight average (total ride length / rides)
Median(all_trips_v2 ride_length) midpoint number in the ascending array of ride lengths
Max(all_trips_v2 ride_length) longest ride
Min(all_trips_v2 ride_length) shortest ride
You can condense the four lines above to one line using summary() on the specific attribute

summary(all_trips_v2$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     412     712    1479    1289 9387024

Compare members and casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)

##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                3552.7502
## 2                     member                 850.0662

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                     1546
## 2                     member                      589

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                  9387024
## 2                     member                  9056634

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                        2
## 2                     member                        1

See the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Friday                3773.8351
## 2                      member                   Friday                 824.5305
## 3                      casual                   Monday                3372.2869
## 4                      member                   Monday                 842.5726
## 5                      casual                 Saturday                3331.9138
## 6                      member                 Saturday                 968.9337
## 7                      casual                   Sunday                3581.4054
## 8                      member                   Sunday                 919.9746
## 9                      casual                 Thursday                3682.9847
## 10                     member                 Thursday                 823.9278
## 11                     casual                  Tuesday                3596.3599
## 12                     member                  Tuesday                 826.1427
## 13                     casual                Wednesday                3718.6619
## 14                     member                Wednesday                 823.9996

Notice that the days of the week are out of order. Let’s fix that.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Now, let’s run the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Sunday                3581.4054
## 2                      member                   Sunday                 919.9746
## 3                      casual                   Monday                3372.2869
## 4                      member                   Monday                 842.5726
## 5                      casual                  Tuesday                3596.3599
## 6                      member                  Tuesday                 826.1427
## 7                      casual                Wednesday                3718.6619
## 8                      member                Wednesday                 823.9996
## 9                      casual                 Thursday                3682.9847
## 10                     member                 Thursday                 823.9278
## 11                     casual                   Friday                3773.8351
## 12                     member                   Friday                 824.5305
## 13                     casual                 Saturday                3331.9138
## 14                     member                 Saturday                 968.9337

analyze ridership data by type and weekday

creates weekday field using wday()

groups by usertype and weekday

calculates the number of rides and average duration

calculates the average duration

sorts

all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>%  
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>           <int>           <int>            <dbl>
##  1 casual              1          181293            3581.
##  2 casual              2          103296            3372.
##  3 casual              3           90510            3596.
##  4 casual              4           92457            3719.
##  5 casual              5          102679            3683.
##  6 casual              6          122404            3774.
##  7 casual              7          209543            3332.
##  8 member              1          267965             920.
##  9 member              2          472196             843.
## 10 member              3          508445             826.
## 11 member              4          500329             824.
## 12 member              5          484177             824.
## 13 member              6          452790             825.
## 14 member              7          287958             969.

Let’s visualize the ratio of Members To Casuals

membership <- count(all_trips_v2,member_casual , name = "count")

ggplot(membership, aes(x = member_casual, y = count)) +
  geom_col(fill="lightblue",width = 0.5)+
  geom_text(aes(label=count), vjust = 2, fontface = "bold") +
  scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))

Let’s visualize the number of rides by rider type

all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.5))+
  scale_y_continuous(labels =function(y) format(y,big.mark =",",scientific = FALSE))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Let’s create a visualization for average duration

all_trips_v2 %>% 
  mutate(weekday = wday(started_at)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.5))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Let’s create a visualization for the top 5 departure Stations for members

departure_station_members <- all_trips_v2 %>% 
     filter(member_casual == "member" ) %>%
     group_by(start_station_name,.drop = FALSE, ) %>% 
     count(sort = TRUE , name = "count") %>%
    head(5)
     
ggplot(departure_station_members, aes(x = count, y = start_station_name)) +
  geom_col(fill="blue",width = 0.5)

Let’s create a visualization for the top 5 departure Stations for casuals

departure_station_casual <- all_trips_v2 %>% 
  filter(member_casual == "casual" ) %>%
  group_by(start_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5) 
  
ggplot(departure_station_casual, aes(x = count, y = start_station_name)) +
  geom_col(fill="red",width = 0.5)

Let’s create a visualization for the top 5 Arrival Stations for members

arrival_station_members <- all_trips_v2 %>% 
  filter(member_casual == "member" ) %>%
  group_by(end_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5)

ggplot(arrival_station_members, aes(x = count, y = end_station_name)) +
  geom_col(fill="blue",width = 0.5)

Let’s create a visualization for the top 5 Arrival Stations for casuals

arrival_station_casual <- all_trips_v2 %>% 
  filter(member_casual == "casual" ) %>%
  group_by(end_station_name,.drop = FALSE, ) %>% 
  count(sort = TRUE, name = "count") %>%
  head(5)
ggplot(arrival_station_casual, aes(x = count, y = end_station_name)) +
  geom_col(fill="red",width = 0.5)

Let’s create a visualization for the Starting time for Casuals And members by each day of the week

  ggplot(data = all_trips) +
    geom_bar(mapping = aes(x = time_start, fill = member_casual)) +
    facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))

Let’s create a visualization for the ending time for Casual And members by day of the week

  ggplot(data = all_trips) +
    geom_bar(mapping = aes(x = time_end, fill = member_casual)) +
    facet_grid(~factor(day_of_week, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')))

lets create a table of all starting stations with coordination to check later on Google Maps

start_stations_location <- q1_2020 %>%     
add_count(start_station_name, sort = TRUE) %>%    
  distinct(start_station_name, .keep_all = TRUE) %>%         
  select(c(start_station_name, n, start_lat,start_lng))       

kable(head(start_stations_location,10),format = "html", align = "lcll") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

start_station_name	n	start_lat	start_lng
Canal St & Adams St	7813	41.8793	-87.6399
Clinton St & Madison St	6797	41.8822	-87.6411
Clinton St & Washington Blvd	5941	41.8834	-87.6412
Kingsbury St & Kinzie St	4626	41.8892	-87.6385
Columbus Dr & Randolph St	4425	41.8847	-87.6195
HQ QR	3767	41.8899	-87.6803
Franklin St & Monroe St	3711	41.8803	-87.6352
Canal St & Madison St	3637	41.8821	-87.6398
Clinton St & Lake St	3580	41.8856	-87.6418
Larrabee St & Kingsbury St	3550	41.8978	-87.6429

lets create a table of all ending stations with coordination to check later on Google Maps

end_stations_location <- q1_2020 %>%      
  add_count(end_station_name, sort = TRUE) %>%        
  distinct(end_station_name, .keep_all = TRUE) %>%     
  select(c(end_station_name, n, end_lat,end_lng))       

 
   kable(head(end_stations_location,10),format = "html", align = "lcll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

end_station_name	n	end_lat	end_lng
Canal St & Adams St	8323	41.8793	-87.6399
Clinton St & Washington Blvd	7166	41.8834	-87.6412
Clinton St & Madison St	6854	41.8822	-87.6411
Kingsbury St & Kinzie St	4615	41.8892	-87.6385
Michigan Ave & Washington St	4227	41.8840	-87.6247
HQ QR	3766	41.8899	-87.6803
Franklin St & Monroe St	3748	41.8803	-87.6352
Canal St & Madison St	3706	41.8821	-87.6398
Clinton St & Lake St	3646	41.8856	-87.6418
St. Clair St & Erie St	3618	41.8944	-87.6227

lets visualize the top 5 departure stations for members on Google Maps (red markers)

Top5_departure_station_members <-left_join(departure_station_members,start_stations_location,by = "start_station_name") %>%   
select(c(start_station_name, count, start_lat,start_lng))  

kable(Top5_departure_station_members, format = "html", align = "lcll") %>%   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

start_station_name	count	start_lat	start_lng
Canal St & Adams St	51948	41.8793	-87.6399
Clinton St & Madison St	46191	41.8822	-87.6411
Clinton St & Washington Blvd	43590	41.8834	-87.6412
Columbus Dr & Randolph St	31053	41.8847	-87.6195
Franklin St & Monroe St	30982	41.8803	-87.6352

Click on the image to view in Google Maps

lets visualize the top 5 departure stations for casuals on Google Maps (red markers)

Top5_departure_station_casual <-left_join(departure_station_casual,start_stations_location,by = "start_station_name") %>%   
select(c(start_station_name, count, start_lat,start_lng))  

kable(Top5_departure_station_casual, format = "html", align = "lcll") %>%   
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

start_station_name	count	start_lat	start_lng
Streeter Dr & Grand Ave	53415	41.8923	-87.6120
Lake Shore Dr & Monroe St	39686	41.8810	-87.6167
Millennium Park	21901	41.8810	-87.6241
Michigan Ave & Oak St	21633	41.9010	-87.6238
Shedd Aquarium	20781	41.8672	-87.6154

Click on the image to view in Google Maps

$click to see on google maps$

lets visualize the top 5 arrival stations for members on Google Maps (red markers)

Top5_arrival_station_members <-left_join(arrival_station_members, end_stations_location,by = "end_station_name") %>% 
  select(c(end_station_name, count, end_lat,end_lng))  

kable(Top5_arrival_station_members, format = "html", align = "lcll") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

end_station_name	count	end_lat	end_lng
Canal St & Adams St	48839	41.8793	-87.6399
Clinton St & Washington Blvd	47633	41.8834	-87.6412
Clinton St & Madison St	44285	41.8822	-87.6411
Daley Center Plaza	30845	41.8842	-87.6296
Kingsbury St & Kinzie St	30404	41.8892	-87.6385

Click on the image to view in Google Maps

lets visualize the top 5 arrival stations for casuals on Google Maps (red markers)

Top5_arrival_station_casual <-left_join(arrival_station_casual, end_stations_location,by = "end_station_name") %>% 
  select(c(end_station_name, count, end_lat,end_lng))  

kable(Top5_arrival_station_casual, format = "html", align = "lcll") %>%   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

end_station_name	count	end_lat	end_lng
Streeter Dr & Grand Ave	67507	41.8923	-87.6120
Lake Shore Dr & Monroe St	31051	41.8810	-87.6167
Millennium Park	25509	41.8810	-87.6241
Michigan Ave & Oak St	23982	41.9010	-87.6238
Lake Shore Dr & North Blvd	23477	41.9117	-87.6268

Click on the image to view in Google Maps

$click to see on google maps$

Our observations and analysts

After Cleaning, organizing the data and analyzing each graph we had the following observations:

First number of casual customers is 905954 while number of members is 2973868

Let’s visualize the ratio of Members To Casuals

Second number of rides are far more on the members side than casuals.

Let’s visualize the number of rides by rider type

Third casuals duration time is far more than members duration time

Let’s create a visualization for average duration

Forth visualizing the number of rides, it is clear that members take more rides than casuals, every day of the week.

Let’s visualize the number of rides by rider type

Fifth visualizing the average duration we see that Casuals Spend much more time per trip than members

Let’s create a visualization for average duration

Sixth visualizing the Starting time for Casuals compared to members by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off peak times mostly in the afternoons and late at night. while on week ends casuals out number members most of the day.

Let’s create a visualization for the Starting time for Casuals And mambers by each day of the week

Seventh visualizing the end time for members compared to Casuals by each day of the week, we can see that members tend to peak at normal working hours, mostly. while casuals start off-peak times mostly again in the afternoons and late at night. while on week ends casuals out number members most of the day.

Based on all the previous analyses and the Eleven observations, to answer our key Question: <br>

“In what ways do members and casual riders use Divvy bikes differently?”

we can summarize that:

most members, while they use their bikes much more frequently, they tend to use it for less duration time they use their bikes departing at peak working hours and arriving back at peak working hours departing from random locations probably their homes to business location then back. indicating that they mostly commute between work and home.

while casuals though they use their bikes much less number of times but for much longer duration, they tend to use their bikes mostly off peak times and late after-work from random location to entertainment locations as we can induce from Google maps therefore casuals are mostly using bicycles for entertainment.

	Members	Casuals
Count	2973868	905954
Count %	77%	23%
Frequency of use	much higher
Duration of use		much higher
weekdays Departure Time	mostly peak	mostly off peak
weekdays Departure location	random	random
weekdays arrival Time	mostly peak	mostly of peak
weekdays arrival location	mostly Business	mostly entertainment

weekend departure time	mostly early	mostly early
weekend departure location	random	random
weekend arrival time	mostly late	mostly late
weekend arrival location	entertainment	entertainment
	indicating that they mostly commute bet ween work and home	indicating that they are mostly using bicycles for entertainment.

Finally based on the above how can we convert casuals to members ?

firstly, we need to convince casuals that they can use these bikes not only for leisure but also for going to work by passing a massage like

““why don’t you extend your leisure time?””

““Workout to work by a Healthy, Green, Inexpensive commute .””

these massages can be sent directly to our customers, excluding members by contact information, such as emails and/or SMS. also could be placed on Google adds based on the top 5 locations and also on billboards at those same locations.

we could also give a special discount for a limited period of time.

In addition since these customers are heading for entertainments we could also arrange for packages with the most desired locations. like a discount or a free drink or a day pass or a free child pass etc.

Also since we have no pricing structure to compare, we do not know if converting casuals to members wont in fact reduce our profitability buy reducing income while adding more time consumption on our vehicles since casuals if converted would use bikes more frequently as indicated by our analyses while they already use it for much more extended times.

Cyclistic

Mohd J.

2023-11-17

Cyclistic_Exercise_Full_Year_Analysis

install packages

load Libraries

Displays working directory and set new directory

STEP 1: COLLECT DATA

Upload Divvy data sets (csv files)

STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE Compare column names each of the files

While the names don’t have to be in the same order

They DO need to match perfectly before we can use a

Command to join them into one file

Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)

Inspect the data frames and look for inconsistencies

Convert ride_id and rideable_type to character so that they can stack correctly

Stack individual quarter’s data frames into one big data frame

Remove lat, long, birth year, and gender fields as this data was dropped beginning in 2020

STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS Inspect the new table that has been created

There are a few problems we will need to fix:

(1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.

(2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.

(3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “trip duration” column. We will add “ride_length” to the entire data frame for consistency.

(4) There are some rides where trip duration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual”

Before 2020, Divvy used different labels for these two types of riders … we will want to make our dataframe consistent with their current nomenclature

N.B.: “Level” is a special property of a column that is retained even if a subset does not contain any values from a specific level

Begin by seeing how many observations fall under each user type

Reassign to the desired values (we will go with the current 2020 labels)

Check to make sure the proper number of observations were reassigned

Add columns that list the date, month, day, and year of each ride

This will allow us to aggregate ride data for each month, day, or year … before completing these operations we could only aggregate at the ride level

Add a “ride_length” calculation to all_trips (in seconds)

Inspect the structure of the columns

Convert “ride_length” from Factor to numeric so we can run calculations on the data

Remove “bad” data

The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

We will create a new version of the data frame (v2) since data is being removed

STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

Compare members and casual users

See the average ride time by each day for members vs casual users

Notice that the days of the week are out of order. Let’s fix that.

Now, let’s run the average ride time by each day for members vs casual users

analyze ridership data by type and weekday

creates weekday field using wday()

groups by usertype and weekday

calculates the number of rides and average duration

calculates the average duration

sorts

Let’s visualize the ratio of Members To Casuals

Let’s visualize the number of rides by rider type

Let’s create a visualization for average duration

Let’s create a visualization for the top 5 departure Stations for members

Let’s create a visualization for the top 5 departure Stations for casuals

Let’s create a visualization for the top 5 Arrival Stations for members

Let’s create a visualization for the top 5 Arrival Stations for casuals

Let’s create a visualization for the Starting time for Casuals And members by each day of the week

Let’s create a visualization for the ending time for Casual And members by day of the week

lets create a table of all starting stations with coordination to check later on Google Maps

lets create a table of all ending stations with coordination to check later on Google Maps

lets visualize the top 5 departure stations for members on Google Maps (red markers)

Click on the image to view in Google Maps

lets visualize the top 5 departure stations for casuals on Google Maps (red markers)

Click on the image to view in Google Maps

lets visualize the top 5 arrival stations for members on Google Maps (red markers)

Click on the image to view in Google Maps

lets visualize the top 5 arrival stations for casuals on Google Maps (red markers)

Click on the image to view in Google Maps

Our observations and analysts

Based on all the previous analyses and the Eleven observations, to answer our key Question: <br>

“In what ways do members and casual riders use Divvy bikes differently?”

we can summarize that:

Members

Count

Count %

Frequency of use

Duration of use

weekdays Departure Time

weekdays Departure location

weekdays arrival Time

`install packages`

STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE
Compare column names each of the files

STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
Inspect the new table that has been created