Citibike Part 3: Modeling
This is the third and final post in a series on a project to analyze NYC Citibike data. The first post focused on collecting and preparing the data, including a strategy for dealing with a very large dataset. In the second post, I did some exploratory data analysis (EDA). In this final post, I'll focus on modeling, with the aim of predicting the number of Citibike rides taken on a given day.
From the EDA, I identified several variables that are correlated with the number of rides and should probably be included in my model:
- Day of week
- Number of stations
- Temperature
- Precipitation
- Wind gust
You can also view the Jupyter notebook behind this analysis.
Import libraries and set some defaults for nice plots.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# make plots look nice
plt.rcParams['font.size'] = 14
plt.rcParams['axes.labelsize'] = 'large'
plt.rcParams['xtick.labelsize'] = 'large'
plt.rcParams['ytick.labelsize'] = 'large'
plt.rcParams['lines.linewidth'] = 3
Load the full combined dataset, which was made in the notebook citibike_regression_make_data.
df_comb = pd.read_csv('data/data_comb.csv')
df_comb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176 entries, 0 to 1175
Data columns (total 15 columns):
yday 1176 non-null int64
Nrides 1176 non-null int64
date 1176 non-null object
Tmean 1176 non-null int64
precip_In 1176 non-null float64
max_gust_mph 1173 non-null float64
cloud_cover 1176 non-null int64
N_stations 1176 non-null int64
wkday_1 1176 non-null int64
wkday_2 1176 non-null int64
wkday_3 1176 non-null int64
wkday_4 1176 non-null int64
wkday_5 1176 non-null int64
wkday_6 1176 non-null int64
public 1176 non-null float64
dtypes: float64(3), int64(11), object(1)
memory usage: 137.9+ KB
Split the data into predictor (X) and target (y) arrays in preparation for modeling.
I drop 'Nrides', which is the target variable, along with 'yday' and 'date', which I will not be using as predictors.
X = df_comb.drop(columns=['date', 'Nrides', 'yday'])
X.head()
| | Tmean | precip_In | max_gust_mph | cloud_cover | N_stations | wkday_1 | wkday_2 | wkday_3 | wkday_4 | wkday_5 | wkday_6 | public |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 76 | 0.73 | 26.0 | 8 | 326 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 1 | 78 | 0.06 | 23.0 | 7 | 327 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 2 | 80 | 0.96 | 23.0 | 7 | 326 | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 |
| 3 | 84 | 0.00 | 24.0 | 4 | 324 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 |
| 4 | 85 | 0.00 | 23.0 | 1 | 325 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 |
y = df_comb['Nrides']
y.head()
0 16650
1 22745
2 21864
3 22326
4 21842
Name: Nrides, dtype: int64
Next I split the data into training and test sets for modeling.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=39)
Model 1: Simple Linear Regression
- The first baseline model I will try is the simplest: a multiple linear regression.
- First I need to impute some missing values in the dataset (max_gust_mph has a few). I will fill them in with the column mean, using SimpleImputer from sklearn.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
imp.fit(X_train)  # learn column means from the training set only
X_train = imp.transform(X_train)
X_test = imp.transform(X_test)
X_all = imp.transform(X)  # imputed version of the full dataset, used later
Now it's time to fit the model.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
reg = LinearRegression() # instantiate model object
reg.fit(X_train,y_train) # fit model
reg.score(X_train, y_train) # score on training data
0.8285467376003368
Using cross_val_score gives a better estimate of how the model will perform on unseen data.
cv_score_linreg = np.mean(cross_val_score(reg, X_train, y_train))
print(cv_score_linreg)
0.8180402624574454
We can check the score on the test set also.
# score on test set
lin_reg_test_score = reg.score(X_test, y_test)
lin_reg_test_score
0.8135628727158369
And compute the RMSE for comparison with the other models.
from sklearn.metrics import mean_squared_error
RMSE_linreg = mean_squared_error(y_test, reg.predict(X_test))**0.5
RMSE_linreg
5924.665844564357
Next I make a scatterplot of actual vs. predicted values from the test set, and plot the timeseries of actual and predicted values from the linear regression.
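A minimal sketch of how these plots might be generated, assuming the objects defined above:
# scatterplot: actual vs. predicted rides on the test set
y_pred = reg.predict(X_test)
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(y_test, y_pred, alpha=0.5)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'k--', label='1:1 line')
ax.set_xlabel('Actual Nrides')
ax.set_ylabel('Predicted Nrides')
ax.legend()
# timeseries: actual and predicted rides over the full dataset
dates = pd.to_datetime(df_comb['date'])
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(dates, y, label='actual')
ax.plot(dates, reg.predict(X_all), label='predicted', alpha=0.7)
ax.set_ylabel('Nrides')
ax.legend()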
Model 2: Random Forest Regression
Next I'll try a random forest regression.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train,y_train)
Cross-validated score estimate on training set:
np.mean(cross_val_score(rf,X_train,y_train))
0.8492788229709785
Score on test set:
rf.score(X_test,y_test)
0.8831218780786286
RMSE:
RMSE_rf = mean_squared_error(y_test, rf.predict(X_test))**0.5
RMSE_rf
4690.987458246692
Model 3: Random Forest Optimized with GridSearchCV
- Let's see if we can improve the performance of the random forest model with some hyperparameter tuning.
# optimize RF
from sklearn.model_selection import GridSearchCV
# grid of params to tune over
params = {'n_estimators': [10, 50, 100, 150],
          'min_samples_split': [2, 5, 10]}
rf2 = RandomForestRegressor()
cv = GridSearchCV(rf2, params)
cv.fit(X_train, y_train)
rf2 = cv.best_estimator_
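The winning hyperparameter combination isn't printed above, but GridSearchCV stores it; it can be inspected with:
# inspect the winning hyperparameter combination and its cross-validated score
print(cv.best_params_)
print(cv.best_score_)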
Cross-validated score estimate on training set:
np.mean( cross_val_score(rf2, X_train, y_train) )
0.8679546096300257
Score on test set:
rf2.score(X_test, y_test)
0.8885318835967604
RMSE:
RMSE_rf2 = mean_squared_error(y_test, rf2.predict(X_test))**0.5
RMSE_rf2
4581.133953398355
Scatterplot of predicted vs. actual values for the tuned random forest:
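As with the linear regression, a minimal sketch of the plot:
# scatterplot: actual vs. predicted rides for the tuned random forest
y_pred_rf2 = rf2.predict(X_test)
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(y_test, y_pred_rf2, alpha=0.5)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'k--', label='1:1 line')
ax.set_xlabel('Actual Nrides')
ax.set_ylabel('Predicted Nrides')
ax.legend()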
Compare the RMSE from the three models
The optimized random forest regressor has the lowest RMSE.
# Compare RMSE
print('Lin Reg. : ' + str(RMSE_linreg))
print('Rand Forest : ' + str(RMSE_rf))
print('Rand Forest opti. : ' + str(RMSE_rf2))
Lin Reg. : 5924.665844564357
Rand Forest : 4690.987458246692
Rand Forest opti. : 4581.133953398355
Variable importance
Examining the variable importances from the model shows that the three most important variables are temperature, the number of stations, and precipitation.
rf_imp = pd.DataFrame({'vars':X.columns,'imp':rf.feature_importances_})
rf_imp.sort_values('imp',ascending=False)
| | vars | imp |
| --- | --- | --- |
| 0 | Tmean | 0.531921 |
| 4 | N_stations | 0.278861 |
| 1 | precip_In | 0.105064 |
| 3 | cloud_cover | 0.021947 |
| 2 | max_gust_mph | 0.018876 |
| 10 | wkday_6 | 0.018181 |
| 9 | wkday_5 | 0.011947 |
| 6 | wkday_2 | 0.004160 |
| 11 | public | 0.002814 |
| 5 | wkday_1 | 0.002503 |
| 7 | wkday_3 | 0.002008 |
| 8 | wkday_4 | 0.001717 |
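A horizontal bar chart makes this ranking easier to scan; a minimal sketch using the rf_imp DataFrame above:
# bar chart of feature importances, most important at the top
rf_imp_sorted = rf_imp.sort_values('imp')
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(rf_imp_sorted['vars'], rf_imp_sorted['imp'])
ax.set_xlabel('importance')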
Plot model residuals for both the linear regression and the random forest
- The plot below shows that the residuals are significantly reduced by the random forest model.
- However, a small number of large residuals remain.
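A minimal sketch of how the residual comparison might be plotted, computing residuals over the full dataset as in the analysis below:
# residual timeseries for both models over the full dataset
dates = pd.to_datetime(df_comb['date'])
resid_lin = y - reg.predict(X_all)
resid_rf = y - rf2.predict(X_all)
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(dates, resid_lin, label='linear regression', alpha=0.7)
ax.plot(dates, resid_rf, label='random forest', alpha=0.7)
ax.axhline(0, color='k', lw=1)
ax.set_ylabel('residual (actual - predicted)')
ax.legend()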
Which dates do the largest residuals fall on?
Sorting by residual, we can see that many of the largest negative residuals (where the model overpredicts the actual value) occur near holidays (July 4th, Christmas, Thanksgiving). We did include holidays in the model, but the day or two before or after a holiday is also strongly affected. This makes sense; people often take off multiple days or a long weekend around big holidays. If we really wanted to capture this behavior, we might create a 'near holiday' variable that is 1 on the day or two before/after a holiday (a sketch of this appears just before the conclusions below).
X2 = df_comb.copy()
# add residuals from the linear regression, computed on the full dataset
X2['resid'] = y - reg.predict(X_all)
X2[['resid','date']].sort_values('resid').head(10)
| | resid | date |
| --- | --- | --- |
| 4 | -23457.483704 | 2013-07-05 |
| 1094 | -23002.179137 | 2016-12-25 |
| 15 | -20979.056450 | 2013-07-18 |
| 3 | -20503.650893 | 2013-07-04 |
| 1099 | -20413.004310 | 2017-01-02 |
| 16 | -19863.673724 | 2013-07-19 |
| 779 | -19483.244629 | 2015-12-24 |
| 5 | -19107.903351 | 2013-07-06 |
| 7 | -17724.623738 | 2013-07-08 |
| 1065 | -17219.932118 | 2016-11-25 |
Looking at the positive residuals (where the model underestimates the actual values), there isn't any obvious pattern that jumps out at me. These could be days when there were special events (races, concerts, etc.) that disrupted the usual transportation patterns.
X2[['resid','date']].sort_values('resid').tail(10)
| | resid | date |
| --- | --- | --- |
| 912 | 13251.521710 | 2016-06-02 |
| 1004 | 13604.826915 | 2016-09-16 |
| 1005 | 14157.219264 | 2016-09-17 |
| 1043 | 14443.317593 | 2016-11-02 |
| 697 | 14449.835545 | 2015-09-25 |
| 1014 | 14912.695199 | 2016-09-26 |
| 1031 | 15114.680414 | 2016-10-20 |
| 1003 | 15120.889805 | 2016-09-15 |
| 740 | 16168.785091 | 2015-11-11 |
| 260 | 19579.541596 | 2014-04-30 |
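As a sketch of the 'near holiday' idea: assuming the public column is the holiday indicator and the rows are consecutive days, the two days on either side of each holiday could be flagged like this (near_holiday is a hypothetical new column):
# flag the two days before/after a public holiday (excluding the holiday itself)
is_holiday = df_comb['public'] > 0
near = is_holiday.copy()
for shift in [-2, -1, 1, 2]:
    near = near | is_holiday.shift(shift, fill_value=False)
df_comb['near_holiday'] = (near & ~is_holiday).astype(int)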
Conclusions
- We built several models to predict the number of NYC Citibike rides taken on a given day.
- The baseline model, a simple linear regression, performed fairly well, with a cross-validated R^2 estimate of 0.82.
- A random forest regressor had a cross-validated R^2 estimate of 0.85.
- Hyperparameter tuning of the random forest improved the cross-validated R^2 estimate to 0.86.
- This model had an R^2 of 0.89 on the test set.
- Examining the residuals suggests that the model significantly overpredicts on days surrounding holidays; accounting for this could be an area for future improvement.