Bicycle-sharing services are becoming increasingly popular across the globe. They allow people in metropolitan areas to rent a bicycle by unlocking it at one station and returning it at any other authorized station in the network. This is very convenient for short trips and especially well suited to one-way trips.
Ford GoBike, which was launched in 2013, is one such service on the US West Coast. The bikes are available 24 hours a day, 7 days a week, 365 days a year. It has served hundreds of thousands of people and has collected millions of trip records since 2017.
The dataset for this project covers 23 months of data, from 2017-06 to 2019-04, with nearly 3 million records.
The data are originally stored on s3.amazonaws.com. They consist of 17 CSV-formatted files (note: originally compressed as .zip files), named year + -fordgobike-tripdata.csv or year-month + -fordgobike-tripdata.csv. Before analysis and visualization, I carried out some data wrangling.
The data wrangling process went through three main steps:
I. Data Gathering.
II. Data Assessing.
III. Data Cleaning.
I. Data Gathering. I downloaded these 17 files from s3.amazonaws.com programmatically. Then I used the zipfile package to unzip them all to .csv format. Finally, I merged the individual files into one unified dataset.
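A minimal sketch of how the gathering step could look. The function names are hypothetical and not taken from the notebook, and the archive URLs are left to the caller because the exact bucket path is not given here.

```python
import io
import zipfile

import pandas as pd
import requests


def download_zip(url: str) -> bytes:
    """Fetch one archive and return its raw bytes (url is supplied by the caller)."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content


def read_csvs_from_zip(zip_bytes: bytes) -> list:
    """Read every .csv member of an in-memory zip archive into a DataFrame."""
    frames = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                with archive.open(name) as handle:
                    frames.append(pd.read_csv(handle))
    return frames


def merge_frames(frames: list) -> pd.DataFrame:
    """Stack the per-file frames into one dataset with a fresh index."""
    return pd.concat(frames, ignore_index=True, sort=False)
```

Calling download_zip for each archive, passing the bytes to read_csvs_from_zip and handing the collected frames to merge_frames would reproduce the three sub-steps described above.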
II. Data Assessing. I conducted several evaluation procedures, such as a Feature Consistency Check, Data Type Check, Missing Value Evaluation and Duplicated Value Check. The following data quality issues were revealed:
- Columns have missing values: start_station_id (nullity: 0.38%), start_station_name (nullity: 0.38%), end_station_id (nullity: 0.38%), end_station_name (nullity: 0.38%), member_birth_year (nullity: 6.7%), member_gender (nullity: 6.7%), bike_share_for_all_trip (nullity: 15.97%).
- member_birth_year could be int instead of float64.
- member_gender and bike_share_for_all_trip could be CategoricalDtype instead of object (string).
- start_station_id, end_station_id and bike_id could be object (string) instead of float64 or int.
- start_time and end_time could be a timestamp data type instead of object (string).
- Some new features need to be derived:
  - member_age from member_birth_year
  - start_month from start_time
  - start_date from start_time
  - start_day_of_week from start_time (see Remark 1)
  - start_hour from start_time
  - trip_distance from start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude (see Remark 2)
Remark 1: After start_day_of_week is created, its values can be put in their real-world order by using pd.api.types.CategoricalDtype.
Remark 2: This can be achieved by using the distance() function from the geopy.distance module.
- Abnormal value in member_birth_year: 1878.
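The missing-value and data type checks above could be scripted roughly as follows. This is a sketch and not the notebook's code: the function names are hypothetical, and the nullable Int64 dtype is my assumption for how member_birth_year can become an integer while it still contains missing values.

```python
import pandas as pd


def nullity_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return df.isna().mean().mul(100).round(2).sort_values(ascending=False)


def convert_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the data types proposed in the assessment."""
    out = df.copy()
    # Assumption: nullable Int64 keeps missing birth years as <NA>.
    out["member_birth_year"] = out["member_birth_year"].astype("Int64")
    for col in ("member_gender", "bike_share_for_all_trip"):
        out[col] = out[col].astype("category")
    # Identifiers are labels, not quantities: store them as strings
    # and leave missing entries missing.
    for col in ("start_station_id", "end_station_id", "bike_id"):
        out[col] = out[col].map(lambda v: str(int(v)), na_action="ignore")
    for col in ("start_time", "end_time"):
        out[col] = pd.to_datetime(out[col])
    return out
```

nullity_percent returns the same kind of per-column percentages as those listed above, and convert_types leaves the input frame untouched so the raw data can still be inspected afterwards.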
III. Data Cleaning. The data quality issues found in the Data Assessing process have all been resolved one by one and recorded in detail in the Jupyter notebook:
- Dealing with Missing Values.
- Conventional Data Type Conversion.
- datetime Data Type Conversion.
- Feature Extraction, including Time-Based Feature Extraction and Distance-Based Feature Extraction.
- Dealing with outliers or anomalous points in the distance-related and age-related features.
- As an auxiliary feature for analysis, I created the age_range feature by cutting the age values in member_age into several bins.
- Finally, dropping those features that would not be helpful for this project.
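The feature extraction and binning steps could be sketched as below. Several things here are assumptions and not taken from the notebook: the function names, computing member_age as the trip's start year minus the birth year, and the bin edges, which are only inferred from the age groups quoted in the findings. The project uses distance() from geopy.distance for trip distance; this sketch substitutes a plain haversine formula so that it runs with numpy and pandas alone.

```python
import numpy as np
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
AGE_BINS = [10, 20, 30, 40, 50, 60, np.inf]  # assumed bin edges


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (a stand-in for geopy's distance())."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * np.arcsin(np.sqrt(a))


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with time, age and distance features added."""
    out = df.copy()
    start = pd.to_datetime(out["start_time"])
    out["start_month"] = start.dt.month
    out["start_date"] = start.dt.date
    out["start_hour"] = start.dt.hour
    # Ordered categorical so that Monday sorts before Tuesday, and so on.
    dow_type = pd.api.types.CategoricalDtype(DAY_ORDER, ordered=True)
    out["start_day_of_week"] = start.dt.day_name().astype(dow_type)
    # Assumption: age is measured in the year the trip started.
    out["member_age"] = start.dt.year - out["member_birth_year"]
    out["age_range"] = pd.cut(out["member_age"], bins=AGE_BINS)
    out["trip_distance"] = haversine_km(
        out["start_station_latitude"], out["start_station_longitude"],
        out["end_station_latitude"], out["end_station_longitude"],
    )
    return out
```

Rows with a missing birth year simply get a missing member_age and age_range, so the function can be applied before or after the missing-value step.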
- Univariate Exploration
- Bivariate Exploration
- Multi-variate Exploration
Numeric Features
- duration_min
- trip_distance_km
- member_age
Categorical Features
- user_type
- member_gender
- bike_share_for_all_trip
- start_time_year_month
- start_time_month
- start_time_date
- start_time_dow
- start_time_hour
- age_range
- For duration_min, 99% of the duration values are less than 64 minutes, while extreme values (< 1%) can be as large as 1438 minutes. Half of the ride durations fall between 5.75 min (Q1) and 13.9 min (Q3), with an average of 12.9 min. On a log scale, the distribution is approximately normal.
- For trip_distance_km, the log-scaled distribution also has a bell-like shape with tails on both ends. Almost 99.99% of the data is within 9 km, and distances of about 1 km and 2 km are the most common.
- For member_age, the average over all users is 33 y/o, with a minimum age of 18 and a maximum age of 66. The histogram is right-skewed, with a longer tail up to 66 years old. Most users are between 27 and 42 years old, while the ages at the two ends, e.g. 10-20 and 60+, have the fewest rentals.
- For member_gender, there are three categories: Male, Female and Other. Male users dominate, with almost 3 times as many as Female users. Users in the Other category are the fewest, covering only 1.6% of the total.
- For user_type, there are two types, Customer and Subscriber. Subscribers are the majority, with nearly 8 times as many as Customers. This may be because being a subscriber brings more benefits, such as lower pricing, additional coupons and better service.
- For the bike_share_for_all_trip program, the number of people who choose it is, unexpectedly, far smaller than the number who do not.
- In terms of time series, the number of rentals climbs to about 220 thousand, almost 100 times the figure at the beginning. In detail:
  - The curve of number of rentals against year-month shows a seasonal pattern: more rentals happen during spring and summer, while obvious decreases are always seen in winter or between October and December.
  - On the day-of-week scale, the bikes are more popular on weekdays than at the weekends, where a clear drop is seen. Numerically, the number of rentals at the weekend is only about half of that on weekdays.
  - Across the hours of a day, the rush hours (7:00 - 9:00 in the morning and 16:00 - 19:00 in the evening) see more rentals than other periods, which may imply that the dominant users are working people.
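The percentile and log-scale statements about duration_min can be checked with a couple of helpers like the ones below. This is only a sketch with hypothetical function names. It returns the histogram counts and bin edges without drawing anything, so the plotting library is left open.

```python
import numpy as np
import pandas as pd


def summarize_duration(duration_min: pd.Series) -> pd.Series:
    """Quartiles plus the 95th and 99th percentiles of ride duration."""
    return duration_min.quantile([0.25, 0.5, 0.75, 0.95, 0.99])


def log10_histogram(values: pd.Series, bins: int = 30):
    """Histogram with bins that are equally wide on a log10 scale.

    Returns the counts and the bin edges converted back to the original unit.
    """
    positive = values[values > 0]  # log10 is undefined for zero and below
    counts, log_edges = np.histogram(np.log10(positive), bins=bins)
    return counts, 10 ** log_edges
```

Plotting the counts against the returned edges on a logarithmic x-axis gives the bell-like shape described above, if the data are close to log-normal.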
- As the majority of users are Subscribers, when the population is divided by user_type, the trend of rentals over year-month for Subscribers is very similar to the trend observed in the univariate exploration. It also shows a seasonal pattern: when winter comes and temperatures are low, rentals go down, while in pleasant spring and summer weather, rentals go up. Customers, however, do not show as strong a pattern as Subscribers.
- When gender and the bike share for all trip program are taken into consideration, it is clear that Males and those who do not take part in the program lead the number of rentals over time and exhibit a seasonal trend similar to that of total rentals. This is because Males and non-members of the program make up the majority of the user population.
- When considering the age of users, rentals in the (10, 20] and (60, inf) groups do not change much with time, while the other groups again show the seasonal trend to some extent. The pattern is obvious in the (20, 30], (30, 40] and (50, 60] groups and especially strong in the (20, 30] and (30, 40] groups.
- For the distribution of duration and trip distance per ride across the days of the week, the median and interquartile range of trip distance are almost the same every day, while the median and interquartile range of duration are slightly larger and more varied at the weekends than on weekdays.
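The rentals-over-time comparisons above boil down to counting trips per month within each group. A minimal sketch follows, with a hypothetical function name and assuming the start_time and user_type columns from the raw data:

```python
import pandas as pd


def monthly_rentals_by(df: pd.DataFrame, group_col: str = "user_type") -> pd.DataFrame:
    """Count rentals per year-month (rows) and per group (columns)."""
    year_month = (
        pd.to_datetime(df["start_time"]).dt.strftime("%Y-%m").rename("year_month")
    )
    return pd.crosstab(year_month, df[group_col])
```

Passing member_gender, bike_share_for_all_trip or age_range as group_col gives the other breakdowns discussed in this part.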
- Time (months in a year, days in a week, hours in a day), age and user type (Customer or Subscriber) seem to have little effect on the average duration and trip distance of a ride, but gender does. Males always tend to have the shortest average duration and distance.
- Users of different user types have different rental patterns at the weekends compared with weekdays.
  - From Monday to Friday, rentals mostly happen during commute hours, 7:00 - 9:00 in the morning and 16:00 - 18:00 in the afternoon, regardless of user type.
  - On Saturday and Sunday, the two user types differ: Customers tend to rent bikes more often at weekends than on weekdays, especially from 10:00 in the morning to 16:00 in the afternoon. Subscribers, however, seem to disappear during that period at the weekends; maybe they prefer taking a tour without bicycles or just relaxing at home.
- Adults aged 20 - 40 and middle-aged users aged 50 - 60 have very close average duration and trip distance values and show a similar pattern across the days of the week.
- Surprisingly, for the 10 - 20 y/o group, regardless of which other categorical feature (such as gender or user type) it is broken down by, the average values generally show a gap from the other age groups and are more widely dispersed.
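The weekday versus weekend comparison between user types rests on an hour-by-day table of rental counts for each user type. One way to build such a table is sketched below. The function name is hypothetical and the notebook may structure this differently.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def hourly_profile(df: pd.DataFrame, user_type: str) -> pd.DataFrame:
    """Rental counts for one user type: hours 0-23 as rows, weekdays as columns."""
    start = pd.to_datetime(df["start_time"])
    tagged = df.assign(day=start.dt.day_name(), hour=start.dt.hour)
    subset = tagged[tagged["user_type"] == user_type]
    table = pd.crosstab(subset["hour"], subset["day"])
    # Make sure every hour and every day is present, in calendar order.
    return table.reindex(index=range(24), columns=DAY_ORDER, fill_value=0)
```

Drawing the Customer and Subscriber tables side by side, for example as heatmaps, makes the 10:00 - 16:00 weekend difference easy to see.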
The following explorations are extracted for presentation:
In univariate exploration:
- With respect to duration_min, it shows a normal distribution on a log-scale histogram, with an average of 12.9 min; 95% of the data fall within 27.3 minutes and 99% within 63.95 minutes.
- With respect to member_age, it shows a right-skewed distribution on the histogram; the interquartile range is (27, 42] y/o; ages between 10-20 and 60+ have the fewest users.
- With respect to member_gender, there are three gender categories. Male users dominate; Female users cover less than 1/4 of the total.
- With respect to user_type, there are two user types. There are almost 8 times as many Subscribers as Customers.
- With respect to the year-month time scale, the general trend for bike rentals is increasing over the years. There seems to be a seasonal pattern in the trend: the increases are concentrated in the warm months, the peaks arrive around October, and rentals begin to decline as winter comes.
- With respect to the days in a week time scale, weekday rentals are far higher than weekend rentals, almost 2 times as large, especially from Tuesday to Thursday.
- With respect to the hours in a day time scale, peak rentals occur during rush hours, i.e. 7:00 - 9:00 in the morning and 16:00 - 19:00 in the evening.
In bivariate exploration:
- Subscribers show the same seasonal trend as the general trend found earlier in the univariate exploration, while Customers do not show much change with time.
- The (10, 20] and (60, inf) groups do not change much with time, while the other groups again show the seasonal trend to some extent. The pattern is obvious in the (20, 30], (30, 40] and (50, 60] groups and especially strong in the (20, 30] and (30, 40] y/o groups.
In multi-variate exploration:
- Time (months in a year, days in a week, hours in a day), age and user type (Customer or Subscriber) seem to have little effect on the average duration and trip distance of a ride, but gender does. Males always tend to have the shortest average duration and distance.
- Users of different user types have different rental patterns at the weekends compared with weekdays. Both Customers and Subscribers are likely to rent bikes at similar times of day from Monday to Friday. The difference appears at the weekends: Customers are more active from 10:00 in the morning to 16:00 in the afternoon, while Subscribers seem to disappear during that period; maybe they prefer taking a tour without bicycles or just relaxing at home.
This project was run on Python 3.7.9 and depends on the following packages:
- pandas 1.1.3
- numpy 1.19.2
- matplotlib 3.3.2
- seaborn 0.11.0
- missingno 0.4.2
- requests 2.24.0
- geopy 2.1.0
- Visualize missing data using missingno package.
- Calculate distance between coordinates using Geopy package.
- Average speed for a bicyclist.
- What is the difference within customer, subscriber and bike-share-for-all-trip.
- How to Manually add legend Items Python matplotlib