Citibike Part 3: Modeling

This is the third and final post in a series I’m writing on a project to analyze NYC Citibike data. The first post focused on collecting and preparing the data, and on a strategy for dealing with a very large data set. In the second post, I did some exploratory data analysis (EDA) of the data. In this final post, I’ll focus on modeling, with the aim of predicting the number of Citibike rides taken on a particular day.

From the EDA, I identified several variables that are correlated with the number of rides and should probably be included in my model:

  • Day of week
  • Number of stations
  • Temperature
  • Precipitation
  • Wind gust

You can also view the Jupyter notebook behind this analysis.

Import libraries and set some defaults for nice plots.

# %load /Users/Andy/jupyter_imports.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# make plots look nice
plt.rcParams['font.size'] = 14
plt.rcParams['axes.labelsize'] = 'large'
plt.rcParams['xtick.labelsize'] = 'large'
plt.rcParams['ytick.labelsize'] = 'large'
plt.rcParams['lines.linewidth'] = 3

Load the full combined dataset, which was made in the notebook citibike_regression_make_data.

df_comb = pd.read_csv('data/data_comb.csv')
df_comb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176 entries, 0 to 1175
Data columns (total 15 columns):
yday            1176 non-null int64
Nrides          1176 non-null int64
date            1176 non-null object
Tmean           1176 non-null int64
precip_In       1176 non-null float64
max_gust_mph    1173 non-null float64
cloud_cover     1176 non-null int64
N_stations      1176 non-null int64
wkday_1         1176 non-null int64
wkday_2         1176 non-null int64
wkday_3         1176 non-null int64
wkday_4         1176 non-null int64
wkday_5         1176 non-null int64
wkday_6         1176 non-null int64
public          1176 non-null float64
dtypes: float64(3), int64(11), object(1)
memory usage: 137.9+ KB

Split the data into predictor (X) and target (y) arrays in preparation for modeling.

I drop ‘Nrides’, which is the target variable, and ‘yday’ and ‘date’, which I will not be using as predictors.

X = df_comb.drop(columns=['date', 'Nrides', 'yday'])
X.head()
Tmean precip_In max_gust_mph cloud_cover N_stations wkday_1 wkday_2 wkday_3 wkday_4 wkday_5 wkday_6 public
0 76 0.73 26.0 8 326 0 0 0 0 0 0 0.0
1 78 0.06 23.0 7 327 1 0 0 0 0 0 0.0
2 80 0.96 23.0 7 326 0 1 0 0 0 0 0.0
3 84 0.00 24.0 4 324 0 0 1 0 0 0 0.0
4 85 0.00 23.0 1 325 0 0 0 1 0 0 0.0
y = df_comb['Nrides']
y.head()
0    16650
1    22745
2    21864
3    22326
4    21842
Name: Nrides, dtype: int64

Next I split the data into training and test sets for modeling.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=39)

Model 1: Simple Linear Regression

  • The first baseline model I will try is the simplest: a multiple linear regression.
  • First I need to impute a few missing values in the dataset (max_gust_mph has three NaNs). I will fill them in with the column mean, using SimpleImputer from sklearn (this replaces the deprecated Imputer from sklearn.preprocessing that was current when this post was written).
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)
X_train = imp.transform(X_train)
X_test  = imp.transform(X_test)
X_all   = imp.transform(X)

Now it’s time to fit the model.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()  # instantiate model object
reg.fit(X_train,y_train)  # fit model
reg.score(X_train, y_train) # score on training data
0.8285467376003368

Using cross_val_score gives a better estimate of how the model will perform on unseen data; by default it performs k-fold cross-validation on the training set and reports the regressor’s default score, R^2.

cv_score_linreg = np.mean(cross_val_score(reg, X_train, y_train))
print(cv_score_linreg)
0.8180402624574454

We can also check the score on the test set.

# score on test set
lin_reg_test_score = reg.score(X_test, y_test)
lin_reg_test_score
0.8135628727158369

And compute the RMSE for comparison with the other models.

from sklearn.metrics import mean_squared_error
RMSE_linreg = mean_squared_error(y_test, reg.predict(X_test))**0.5
RMSE_linreg
5924.665844564357

Scatterplot of actual vs. predicted values from the test set:

Plot of the time series of actual and predicted values from the linear regression:
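
The plotting code isn’t shown in the post; a minimal matplotlib sketch along these lines could reproduce both figures, using the reg, X_test, y_test, X_all, and df_comb objects defined above:

# sketch: scatter of actual vs. predicted, plus time series of both
y_pred_test = reg.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# left panel: actual vs. predicted on the test set, with a 1:1 reference line
ax1.scatter(y_test, y_pred_test, alpha=0.5)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
ax1.set_xlabel('Actual rides')
ax1.set_ylabel('Predicted rides')

# right panel: actual and predicted ride counts over the full record
dates = pd.to_datetime(df_comb['date'])
ax2.plot(dates, y, label='Actual')
ax2.plot(dates, reg.predict(X_all), label='Predicted', alpha=0.7)
ax2.set_ylabel('Rides per day')
ax2.legend()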

Model 2: Random Forest Regression

Next I’ll try a random forest regression.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train,y_train)

Cross-validated score estimate on training set:

np.mean(cross_val_score(rf,X_train,y_train))
0.8492788229709785

Score on test set:

rf.score(X_test,y_test)
0.8831218780786286

RMSE:

RMSE_rf = mean_squared_error(y_test, rf.predict(X_test))**0.5
RMSE_rf
4690.987458246692

Model 3: Random Forest Optimized with GridSearchCV

  • Let’s see if we can improve the performance of the random forest model by doing some hyperparameter tuning.
# optimize RF
from sklearn.model_selection import GridSearchCV

# grid of params to tune over
params = {'n_estimators':[10,50,100,150],
          'min_samples_split':[2,5,10] }

rf2 = RandomForestRegressor()
cv = GridSearchCV(rf2, params)
cv.fit(X_train, y_train)
rf2 = cv.best_estimator_
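
As a quick check, we can also inspect which hyperparameter combination the grid search selected (output not shown; best_params_ and best_score_ are standard GridSearchCV attributes):

# winning hyperparameter combination and its cross-validated score
print(cv.best_params_)
print(cv.best_score_)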

Cross-validated score estimate on training set:

np.mean( cross_val_score(rf2, X_train, y_train) )
0.8679546096300257

Score on test set:

rf2.score(X_test, y_test)
0.8885318835967604

RMSE:

RMSE_rf2 = mean_squared_error(y_test, rf2.predict(X_test))**0.5
RMSE_rf2
4581.133953398355

Scatterplot of predicted vs. actual values for the tuned model:
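
The plotting code would look much like the earlier linear regression sketch, for example:

# sketch: predicted vs. actual for the tuned random forest on the test set
plt.scatter(y_test, rf2.predict(X_test), alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')  # 1:1 line
plt.xlabel('Actual rides')
plt.ylabel('Predicted rides')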

Compare the RMSE from the three models

The optimized random forest regressor has the lowest RMSE.

# Compare RMSE
print('Lin Reg.         : ' + str(RMSE_linreg))
print('Rand Forest      : ' + str(RMSE_rf))
print('Rand Forest opt. : ' + str(RMSE_rf2))

Lin Reg.         : 5924.665844564357
Rand Forest      : 4690.987458246692
Rand Forest opt. : 4581.133953398355

Variable importance

Examining the variable importances from the model shows that the three most important variables are temperature (Tmean), the number of stations (N_stations), and precipitation (precip_In).

rf_imp = pd.DataFrame({'vars': X.columns, 'imp': rf.feature_importances_})
rf_imp.sort_values('imp', ascending=False)
vars imp
0 Tmean 0.531921
4 N_stations 0.278861
1 precip_In 0.105064
3 cloud_cover 0.021947
2 max_gust_mph 0.018876
10 wkday_6 0.018181
9 wkday_5 0.011947
6 wkday_2 0.004160
11 public 0.002814
5 wkday_1 0.002503
7 wkday_3 0.002008
8 wkday_4 0.001717

Plot of the model residuals for both the linear regression and the random forest:

  • The plot below shows that the residuals are significantly reduced by the random forest model.
  • However, a small number of large residuals remain.
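
The figure itself isn’t reproduced here, but here is a sketch of how the comparison could be drawn (I’m assuming the tuned rf2 model; the post doesn’t say which forest was used):

# sketch: residual time series for the two models over the full dataset
dates = pd.to_datetime(df_comb['date'])
resid_lin = y - reg.predict(X_all)
resid_rf = y - rf2.predict(X_all)  # assumption: the tuned forest

plt.figure(figsize=(14, 5))
plt.plot(dates, resid_lin, label='Linear regression', alpha=0.7)
plt.plot(dates, resid_rf, label='Random forest', alpha=0.7)
plt.axhline(0, color='k', lw=1)
plt.ylabel('Residual (rides)')
plt.legend()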

What dates do the largest residuals fall on?

Sorting by residual, we can see that many of the largest negative residuals (where the model overpredicts the actual value) occur near holidays (July 4th, Christmas, Thanksgiving). We did include public holidays in the model, but the day or two before or after a holiday is also strongly affected. This makes sense; people often take off multiple days or a long weekend around big holidays. If we really wanted to capture this behavior, we might create a ‘near holiday’ variable that is 1 on the day or two before/after a holiday (a sketch follows the table below).

X2 = df_comb.copy()
# add residuals
X2['resid'] = y - reg.predict(X_all)
X2[['resid','date']].sort_values('resid').head(10)
resid date
4 -23457.483704 2013-07-05
1094 -23002.179137 2016-12-25
15 -20979.056450 2013-07-18
3 -20503.650893 2013-07-04
1099 -20413.004310 2017-01-02
16 -19863.673724 2013-07-19
779 -19483.244629 2015-12-24
5 -19107.903351 2013-07-06
7 -17724.623738 2013-07-08
1065 -17219.932118 2016-11-25
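
As a sketch of the ‘near holiday’ idea, a feature like this could be built with pandas. USFederalHolidayCalendar is just one convenient source of holiday dates here, not necessarily what was used for the existing public variable:

# sketch: flag days within +/- 2 days of a US federal holiday
from pandas.tseries.holiday import USFederalHolidayCalendar

dates = pd.to_datetime(df_comb['date'])
holidays = USFederalHolidayCalendar().holidays(start=dates.min(), end=dates.max())

# a day is 'near' a holiday if any holiday falls within 2 days of it
near_holiday = dates.apply(lambda d: (np.abs((holidays - d).days) <= 2).any())
df_comb['near_holiday'] = near_holiday.astype(int)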

Looking at the positive residuals (where the model underestimates the actual values), there isn’t any obvious pattern that jumps out at me. These could be days when there were special events (races, concerts, etc.) that disrupted the usual transportation patterns.

X2[['resid','date']].sort_values('resid').tail(10)
resid date
912 13251.521710 2016-06-02
1004 13604.826915 2016-09-16
1005 14157.219264 2016-09-17
1043 14443.317593 2016-11-02
697 14449.835545 2015-09-25
1014 14912.695199 2016-09-26
1031 15114.680414 2016-10-20
1003 15120.889805 2016-09-15
740 16168.785091 2015-11-11
260 19579.541596 2014-04-30

Conclusions

  • We built several models to predict the number of NYC Citibike rides taken on a given day.
  • The baseline model, a simple linear regression, performed fairly well, with a cross-validated R^2 estimate of 0.82.
  • A random forest regressor had a cross-validated R^2 estimate of 0.85.
  • Hyperparameter tuning of the random forest improved the cross-validated R^2 estimate to 0.86.
  • This model had an R^2 of 0.89 on the test set.
  • Examining the residuals suggests that the model significantly overpredicts on days surrounding holidays; this could be an area for future improvement.
Written on October 26, 2018