
Final Project - Bike Share Trends

Project Task

I looked at data from two cities, Washington DC and London, England, over two-year time spans, to try to find weather-related usage patterns in the cities' bikeshare systems.

Process

Datasets


Found the Washington DC Capital bike share dataset on the UC Irvine website via a list of interesting datasets, then found a second dataset on Kaggle, tracking the same variables for London, when I checked for other data I could use. It was the only other dataset I found with data similar enough to be usable in a short time frame.

Confirmed the same basic data was available in both datasets. The London data required more processing to get to the same columns at the end, but aside from the DC set having an additional breakdown between regular and guest users, the data available was the same.

The DC data starts on January 1, 2011, and runs for two years, ending on December 31, 2012. The London data misses the New Year and instead starts on January 4; it also runs for two years, from 2015 until January 3, 2017.

Both datasets include hourly weather data (temperature, humidity, windspeed and a 'weather condition' code), as well as whether or not a day is a holiday, the season, and of course the number of rentals.


EDA


Looking at the target variable, the number of bikes being used (called cnt by default in both tables), there is a significant difference in the range of values, and therefore total usage, between the two cities.

Target feature comparisons

London has much, much higher numbers than DC does, and both target variables have exponential distributions.


Given that the data covered entire years, there were certain features that logically should have been uniformly distributed, or at least close to it: season, year, weekday, month and hour. The season feature was essentially uniform in both datasets, but the London data initially had no time-value columns to look at; the year, weekday, month and hour columns had to be extracted from the timestamps. As such, the charts below are actually from partway through feature engineering.

DC features with uniform distribution

London features with uniform distribution

Some minor variation in most of these values makes sense, given that not all months are the same length and the year does not always start at the beginning of the week. Additionally, 2012 was a leap year, so it has an extra day. London's year distribution was always going to be a little weird, given that the data technically crosses into a third year. However, the fact that the hours were not completely uniform presented a problem; every day has 24 hours, so there was clearly missing data that needed to be filled in. It turned out that there were no rows with 0 rentals in the DC set, and only one in the London set, which led me to conclude that the missing rows were the ones where there was no rental activity. The missing rows are dealt with during Feature Engineering.


Although they have different names, the temp in the DC dataset and t1 in the London dataset both refer to the actual temperature, while atemp for DC and t2 for London both refer to the apparent temperature (per the documentation), making them directly comparable.

DC features with normal distribution

London features with normal distribution

All of these features are approximately normally distributed.


There are also some columns that were likely to have unpredictable or skewed distributions. First, for the DC dataset, there are the holiday, workingday and weather columns.

DC features with skew

All of these make sense: most days are not holidays, most days are working days, and less severe weather is more common than severe weather.

The London set also has holiday, but has is_weekend instead of workingday, and different value options for weather.

London features with skew

The holiday pattern matches the one for DC, and the weekend pattern is roughly the inverse of the one for workingday. Used together, those can produce a workingday column for the London dataset.
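As a rough illustration of that derivation, here is a minimal sketch, assuming 0/1 integer columns named holiday and is_weekend as in the London data; the project's actual code may differ.

```python
import pandas as pd

def add_workingday(df: pd.DataFrame) -> pd.DataFrame:
    """A day is a working day when it is neither a weekend nor a holiday."""
    df = df.copy()
    df["workingday"] = ((df["holiday"] == 0) & (df["is_weekend"] == 0)).astype(int)
    return df
```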

This weather column does not have as straightforward a pattern as the one for the DC dataset. Its values are not sequential, resulting in the stretched histogram above. There is also one value, 94, that is never used.


Looking at PairPlots (which are omitted due to size and computing time), there are a few features that immediately appear to correlate with the number of bikes being rented.

DC set apparent correlations

London set apparent correlations

These best-fit lines behave the way I expected; for both temperature features, an increase in temperature corresponds to an increase in the rental count. Additionally, an increase in humidity, which in the climates of the cities I am looking at most often means less pleasant weather, corresponds to a decrease in the rental count.

In the DC set, where there is an hour column to look at, the best-fit line loosely trends upwards as the hour increases towards midnight, and looking at the actual scatterplot, it's clear that the dips are the overnight and pre-dawn hours, which (loosely) have the lowest counts.

| Feature Pair | Coefficient |
| --- | --- |
| **DC set** | |
| Count + Hour | 0.394 |
| Count + Temp | 0.405 |
| Count + Atemp | 0.400 |
| Count + Hum | -0.323 |
| **London set** | |
| Count + T1 | 0.389 |
| Count + T2 | 0.369 |
| Count + Hum | -0.463 |

On their own, the correlations are not particularly strong, but it is possible they will have a significant effect as part of a larger model.


Feature Engineering


Performed feature engineering; all values related to time had to be extracted into columns for the London dataset. I also filled in gaps in the data where an hour had been skipped, on the assumption that those times had been omitted because no bikes were rented, as for my purposes 'no rentals occurred' is a useful data point. In the London set, there were some days that had no values at all, in any of the columns, which meant there was no way to fill most of them, so those entire days were dropped to avoid creating skew. Otherwise, different fill methods were used depending on the column: some columns hold static values throughout the day, while others move continuously. In the first case, the value from elsewhere in the day was simply copied in, while in the second case interpolation was used to find the average between the values on either side of the missing value.
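A minimal sketch of that gap-filling logic, assuming a pandas DataFrame indexed by hourly timestamps; the column names here (cnt, season, temp) are illustrative stand-ins, not the project's exact code.

```python
import pandas as pd

def fill_missing_hours(df: pd.DataFrame) -> pd.DataFrame:
    # Reindex onto a complete hourly range so every skipped hour gets a row.
    full_range = pd.date_range(df.index.min(), df.index.max(), freq="h")
    df = df.reindex(full_range)
    df["cnt"] = df["cnt"].fillna(0)          # missing hour assumed to mean no rentals
    df["season"] = df["season"].ffill()      # static within a day: copy the existing value
    df["temp"] = df["temp"].interpolate()    # continuous values: average of the neighbours
    return df
```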

For the DC dataset, a secondary table counting by day rather than by hour exists, so the weather values for a given day on that were used to fill missing weather values. No such second table existed for the London dataset, but exploration showed that it was reasonably common for the weather code on either side of a missing value to be the same, so simple forward fill was used to fill the missing values.

The London dataset had different values in the season, weather, and weekday categories, so those were remapped to match the DC values. I am running them through separate models, but for consistency and so I don't confuse myself later, it made sense to have them match. In the case of the London weather encoding, it was significantly more granular than the DC encoding, so it had to be reformatted to be properly comparable. Both datasets have descriptions of what qualifies for each code, so I used those to group the London codes and match them to the DC ones. I also created a new precip column, based on those descriptions, which is a categorical 'is there precipitation or not' 0/1 column.
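A minimal sketch of that remapping, using the groupings from the legend and mapping tables below; the dictionary literals are illustrative rather than the project's literal code, and df is assumed to be the London DataFrame.

```python
# London weather codes -> DC-style codes (per the mapping section below).
london_to_dc_weather = {1: 1, 2: 1, 3: 2, 4: 2, 7: 3, 10: 4, 26: 4, 94: 4}
# DC-style weather code -> precipitation flag (per the precip mapping table below).
weather_to_precip = {1: 0, 2: 0, 3: 1, 4: 1}

df["weather"] = df["weather"].map(london_to_dc_weather)
df["precip"] = df["weather"].map(weather_to_precip)
```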

(Note: filling in the missing weather values in the London dataset happened after remapping the weather codes, and since the new values are less granular than the old ones, this may have contributed to the values before and after a gap being the same.)

The temp, atemp (apparent temp), humidity and windspeed columns of the DC dataset came pre-normalized, so the corresponding columns in the London dataset were normalized as well, using the MinMaxScaler, as that matched the behavior of the data as I found it. This was done before splitting to be consistent with the DC set.
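For reference, a minimal sketch of that normalization step, assuming these (illustrative) column names on the London DataFrame.

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each weather column to the 0-1 range, matching the pre-normalized DC data.
weather_cols = ["temp", "atemp", "humidity", "windspeed"]
df[weather_cols] = MinMaxScaler().fit_transform(df[weather_cols])
```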

Feature code legends:

| Code | Season |
| --- | --- |
| 1 | Spring |
| 2 | Summer |
| 3 | Fall |
| 4 | Winter |

| Year | Dataset | Code |
| --- | --- | --- |
| 2011 | DC | 0 |
| 2012 | DC | 1 |
| 2015 | London | 0 |
| 2016 | London | 1 |
| 2017 | London | 2 |

| Code | Weather Description |
| --- | --- |
| 1 | Clear, Few clouds, Partly cloudy, Partly cloudy |
| 2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| 3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| 4 | Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog |

Feature Mappings


London Weather

New code 1: Clear, Few clouds, Partly cloudy, Partly cloudy
Old values:
- 1 - Clear; mostly clear but some values with haze / fog / patches of fog / fog in vicinity
- 2 - Scattered clouds / few clouds

New code 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
Old values:
- 3 - Broken clouds
- 4 - Cloudy

New code 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
Old value:
- 7 - Rain / light rain shower / light rain

New code 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
Old values:
- 10 - Rain with thunderstorm
- 26 - Snowfall
- 94 - Freezing Fog


Mappings for precip column

| Weather code | Description | precip code |
| --- | --- | --- |
| 1 | Clear, Few clouds, Partly cloudy, Partly cloudy | 0 |
| 2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist | 0 |
| 3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds | 1 |
| 4 | Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog | 1 |


Basic Models


I chose four regression models to try: LinearRegression, SVR and RandomForestRegressor from scikit-learn, and the XGBRegressor from XGBoost.
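A minimal sketch of the four candidates; the random_state values are assumptions for reproducibility, not necessarily the settings used in the project.

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "XGBRegressor": XGBRegressor(random_state=42),
}
```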

Because my datasets are time-based, I did my train/test split without shuffling.
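A minimal sketch of that split, assuming X and y hold the features and the count target; the 80/20 ratio is an assumption.

```python
from sklearn.model_selection import train_test_split

# shuffle=False keeps chronological order, so the test set is the final stretch of time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
```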

Linear Regression

DC dataset


Train set:
Root Mean Squared Error: 129.47650158667653
R-squared: 0.40023509636432963


Test set:
Root Mean Squared Error: 182.47521094734591
R-squared: 0.3154146362592527


London dataset


Train set:
Root Mean Squared Error: 887.6725962693262
R-squared: 0.31692830959135165


Test set:
Root Mean Squared Error: 978.3105148178599
R-squared: 0.24876167392851933


SVR

DC dataset


Train set:
Root Mean Squared Error: 143.88640899563978
R-squared: 0.25930624378822176


Test set:
Root Mean Squared Error: 210.58087348202932
R-squared: 0.08828792531790608


London dataset


Train set:
Root Mean Squared Error: 1021.256075397555
R-squared: 0.09587189980290356


Test set:
Root Mean Squared Error: 1086.1229944892814
R-squared: 0.07406114159332133


Random Forest Regressor

DC dataset


Train set:
Root Mean Squared Error: 14.152513563206748
R-squared: 0.9928341733910035


Test set:
Root Mean Squared Error: 72.87654665983804
R-squared: 0.8908068409097215


London dataset


Train set:
Root Mean Squared Error: 85.32546889684085
R-squared: 0.9936887115704169


Test set:
Root Mean Squared Error: 289.163573168845
R-squared: 0.9343686316506019


XGBRegressor

DC dataset


Train set:
Root Mean Squared Error: 24.081214852442457
R-squared: 0.9792529940605164


Test set:
Root Mean Squared Error: 69.92294989890084
R-squared: 0.8994784355163574


London dataset


Train set:
Root Mean Squared Error: 127.9868120875735
R-squared: 0.9857999086380005


Test set:
Root Mean Squared Error: 332.20786503370334
R-squared: 0.9133748412132263

The two models that performed the best, based on both R-squared and root mean squared error, were the RandomForest and XGBoost models, so I chose those two to continue tuning in the hyperparameters section.


Feature Selection


I used three methods of feature selection. The first was running a simple Lasso regression and keeping all of the columns that did not return a coefficient of 0. This was the crudest of the methods, as the Lasso regression was completely untuned, and as a result it removed the most features.
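A minimal sketch of that step, assuming X_train is a DataFrame so the surviving column names can be read off the coefficients.

```python
from sklearn.linear_model import Lasso

lasso = Lasso().fit(X_train, y_train)                        # completely untuned, default alpha
lasso_features = X_train.columns[lasso.coef_ != 0].tolist()  # keep columns with non-zero coefficients
```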

The second method was Forward Selection, where features are added, step by step, to the model, starting with the most significant.

My third method was Backward Selection: starting with all features and optimizing by attempting step-by-step removal of features.
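A minimal sketch of the stepwise selection, using scikit-learn's SequentialFeatureSelector around a simple estimator; the estimator and settings here are assumptions, not necessarily what the project used.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# By default SequentialFeatureSelector keeps roughly half the features;
# n_features_to_select would need adjusting to keep more, as the results below do.
sfs_fw = SequentialFeatureSelector(LinearRegression(), direction="forward").fit(X_train, y_train)
sfs_bw = SequentialFeatureSelector(LinearRegression(), direction="backward").fit(X_train, y_train)
forward_features = X_train.columns[sfs_fw.get_support()].tolist()
backward_features = X_train.columns[sfs_bw.get_support()].tolist()
```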

For both datasets, using Lasso selection resulted in the removal of multiple features. Forward and backward selection, however, each dropped only a few features, and their results largely agreed; for the London set they produced identical lists. Below are the selected and dropped features for each dataset, according to each method.

(Note: the count column is not involved here, as it is the dependent variable.)

Using Lasso:


DC dataset

Selected features: season, year, month, day, hour, weekday, precip, temp, humidity
Dropped features: holiday, workingday, weather, atemp, windspeed


London dataset

Selected features: season, year, month, day, hour, weekday, workingday, weather, precip, temp, humidity, windspeed
Dropped features: holiday, atemp


Using Forward Select:


DC dataset

Selected features: season, year, month, hour, weekday, holiday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: day, workingday


London dataset

Selected features: season, month, day, hour, workingday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: year, weekday, holiday


Using Backward Select:


DC dataset

Selected features: season, year, month, hour, weekday, holiday, weather, precip, atemp, humidity, windspeed
Dropped features: day, workingday, temp


London dataset

Selected features: season, month, day, hour, workingday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: year, weekday, holiday

I created new versions of X_train and X_test based on each selection list, as I wanted to test how well the model could do with feature inputs that had, theoretically, been optimized.

Because forward and backward select on the London dataset produced the same feature list, I didn't bother creating separate _fw and _bw train/test pairs; it would have been redundant. I simply created a _fwbw pair instead.
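Creating those pairs is just a column subset of the existing split; a minimal sketch, reusing the illustrative feature lists from the sketches above.

```python
# For the London set, the forward and backward lists are identical, so a single _fwbw pair suffices.
X_train_lasso, X_test_lasso = X_train[lasso_features], X_test[lasso_features]
X_train_fw, X_test_fw = X_train[forward_features], X_test[forward_features]
X_train_bw, X_test_bw = X_train[backward_features], X_test[backward_features]
```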


Hyperparameter Tuning


To start with, I tuned both models on the regular train/test set that had all of the features. My initial attempts to use r2 as my scoring method produced very low scores, so I ran secondary versions scoring on neg_root_mean_squared_error, and found that while that method did not always produce improvement, it usually at least matched the untuned model's baseline.
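A minimal sketch of one tuning run for the XGBoost model; the use of RandomizedSearchCV, the parameter ranges (loosely based on the values reported below), and the cross-validation settings are all assumptions rather than the project's exact setup.

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_dist = {
    "reg_alpha": [0.01, 0.1, 0.75, 1.0, 2.0],
    "reg_lambda": [10, 50, 80, 100, 125, 130],
    "learning_rate": [0.1, 0.2, 0.3],
    "max_depth": [6, 7],
    "n_estimators": list(range(50, 400, 10)),
}
search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions=param_dist,
    n_iter=50,
    scoring="neg_root_mean_squared_error",  # r2 scoring produced very low scores
    cv=5,
)
search.fit(X_train, y_train)
tuned_xgb = search.best_estimator_
```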

Random Forest:

DC set:
Base model r2: 0.8908068409097215
Tuned model r2: 0.6895321510957303


Base model rmse: 72.87654665983804
Tuned model rmse: 71.84172119174976



London set:
Base model r2: 0.9343686316506019
Tuned model r2: 0.9028084805222928


Base model rmse: 289.163573168845
Tuned model rmse: 312.8454875514148


XGBoost:

DC set:
Base model r2: 0.8994784355163574
Tuned model r2: 0.7963137030601501


Base model rmse: 69.92294989890084
Tuned model rmse: 56.490411500995606



London set:
Base model r2: 0.9133748412132263
Tuned model r2: 0.919543719291687


Base model rmse: 332.20786503370334
Tuned model rmse: 293.8955040541115


XGBRegressor performed much better than RandomForestRegressor did, so for the feature selection models, I focused on that one.

XGBR, Root Mean Squared Error

DC set:
Base model: 69.92294989890084
All features: 56.490411500995606
Fw features: 56.5705918807984
Bw features: 55.76142506506576
Lasso features: 60.833681953480436



London set:
Base model: 332.20786503370334
All features: 293.8955040541115
FwBw features: 287.72959089223957
Lasso features: 287.6278104393208


| Model | alpha | lambda | learning_rate | max_depth | n_estimators |
| --- | --- | --- | --- | --- | --- |
| dc_allfeat | 0.10 | 100 | 0.3 | 6 | 180 |
| dc_fw | 2.00 | 50 | 0.2 | 6 | 370 |
| dc_bw | 0.70 | 80 | 0.2 | 6 | 280 |
| dc_lasso | 1.00 | 130 | 0.2 | 6 | 190 |
| lond_allfeat | 0.75 | 125 | 0.3 | 7 | 70 |
| lond_fwbw | 0.01 | 10 | 0.1 | 7 | 140 |
| lond_lasso | 0.01 | 10 | 0.1 | 6 | 160 |


The DC set was all over the place for alpha and lambda values, and had the highest values for the number of estimators, in one case going more than twice as high as the highest estimator value any London model ended up with.

The London set had higher alpha and lambda values on the all-features version, but these dropped for the versions that had undergone feature selection. The number of estimators did the opposite, jumping up where feature selection had occurred.

Both models had the learning rate decrease for versions that had undergone feature selection, with a larger decrease on the London set.

The DC dataset always used a max depth of 6, but the London dataset slightly preferred 7, with no apparent pattern.


Tuned Models


I instantiated and ran each model tuned in the hyperparameters notebook so I could fully evaluate its performance.

During my evaluation and comparison, I compared the tuned models for each set to the untuned one, and the overall behavior of each dataset to the other. I looked at the R-squared and Root Mean Squared Error (R2 and RMSE) of both the training and test sets, the changes between them, and the RMSE as a percentage of each dataset's maximum real value, in order to get an idea of how much error there actually was; an error of +/-50 is not a big deal when the values in question are routinely in the thousands, but it would be if the values were dozens at most.
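A minimal sketch of those metrics for one model; reading 'RMSE as % of range' as the RMSE divided by the dataset's maximum count value is my interpretation of the description above, so treat it as an assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, X_train, X_test, y_train, y_test, y_full):
    """Report R2 and RMSE for train and test, plus test RMSE as a % of the target's maximum value."""
    results = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        preds = model.predict(X)
        results[name] = {"r2": r2_score(y, preds),
                         "rmse": np.sqrt(mean_squared_error(y, preds))}
    # Since the counts bottom out at 0, the maximum value is effectively the full range.
    results["rmse_pct_of_range"] = 100 * results["test"]["rmse"] / y_full.max()
    return results
```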

DC evaluation outputs

| Model | Train R2 | Test R2 | R2 Decrease | Train RMSE | Test RMSE | RMSE as % of range | RMSE Increase | RMSE Increase as % of range |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| basic model | 0.979 | 0.899 | 0.080 | 24.081 | 69.923 | 7.16 | 35.832 | 3.67 |
| all features | 0.973 | 0.903 | 0.070 | 27.238 | 68.692 | 7.03 | 41.454 | 4.24 |
| fw select | 0.975 | 0.908 | 0.066 | 26.683 | 66.804 | 6.84 | 40.120 | 4.11 |
| bw select | 0.967 | 0.898 | 0.069 | 30.417 | 70.396 | 7.21 | 39.979 | 4.09 |
| lasso | 0.954 | 0.889 | 0.065 | 35.664 | 73.389 | 7.51 | 37.725 | 3.86 |


All the test R2 values are very consistent, regardless of model tuning or feature set, while the drop in R2 between the training and test sets is smaller for the tuned models, and even smaller with feature selection.

The test RMSEs are also all very similar, both as actual values and as percentages of the maximum possible value. While the actual values of the error increases look quite large, approximately doubling across the board, they are actually fairly small amounts in proportion to the overall range.

There is not much difference in the RMSE as % of range values (a spread of only 0.67 percentage points) or in the RMSE Increase as % values, even though there was some improvement in how much the R2 was dropping. This could mean the model is not generalizing very well on this data, even after tuning and feature selection.

For this data, there wasn't a consistent pattern to the performance in the raw scores, only in the amount of change from the training to test sets.


London evaluation outputs

| Model | Train R2 | Test R2 | R2 Decrease | Train RMSE | Test RMSE | RMSE as % of range | RMSE Increase | RMSE Increase as % of range |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| basic model | 0.986 | 0.913 | 0.072 | 127.987 | 332.208 | 4.23 | 203.221 | 2.59 |
| all features | 0.968 | 0.929 | 0.038 | 193.013 | 299.935 | 3.82 | 106.922 | 1.36 |
| fwbw select | 0.964 | 0.928 | 0.036 | 203.815 | 302.494 | 3.85 | 98.679 | 1.26 |
| lasso | 0.969 | 0.936 | 0.033 | 189.722 | 286.016 | 3.64 | 96.293 | 1.23 |


The R2 values for the test set are, again, very consistent. This time, however, there is a much smaller drop between training and test, showing that the model generalizes better and is not overfitting. There is essentially no difference between the version with all features included and the train/test sets produced by feature selection.

Here there is more variation in RMSE, with improvement in all of the tuned models, and looking at it as a percentage of the maximum possible value makes the magnitude of the changes easier to see. The model with its features selected by Lasso had the largest decrease, 0.59 percentage points, with the other two at around 0.4.

The increase in error between training and test was not similar for this model. Tuning cut the increase down dramatically for all versions: the models with feature selection had less than half the untuned model's RMSE increase, and the 'all features' version was only just above the 50% mark. So even though the actual test RMSE values were fairly close together, most of that error was not a result of the transition from training to test, but part of the models' attempts to account for the data. Combined with the high R2 scores, this is a good sign for the models' ability to account for the data without overfitting.

Further confirmation that the models are doing well can be seen in comparing the error on the variable being predicted, count, to its standard deviation in the original data. For the DC dataset it is 181.5, while the highest test RMSE is ~73. For London those numbers are 1085.4 and ~332, respectively. In both cases, the RMSE is less than 1/3 of the standard deviation; well within "reasonable" for the data.


Model Comparisons

I ranked the models for each set using the Test R2 and R2 Decrease columns, as well as the Test RMSE and RMSE Increase columns, ignoring the training columns (they are not the target) and the other two RMSE columns (they are redundant). In each column I assigned ranks starting at 1 for the best (up to 5 for DC and 4 for London), then summed the ranks for each model, with the lowest total being the best overall performer.

| DC | Test R2 | R2 Decrease | Test RMSE | RMSE Increase | Total |
| --- | --- | --- | --- | --- | --- |
| basic model | 3 | 5 | 3 | 1 | 12 |
| all features | 2 | 4 | 2 | 5 | 13 |
| fw select | 1 | 2 | 1 | 4 | 8 |
| bw select | 4 | 3 | 4 | 3 | 14 |
| lasso | 5 | 1 | 5 | 2 | 13 |

Here, Forward Select had the best overall results, with all the others being pretty similar.

| London | Test R2 | R2 Decrease | Test RMSE | RMSE Increase | Total |
| --- | --- | --- | --- | --- | --- |
| basic model | 4 | 4 | 4 | 4 | 16 |
| all features | 2 | 3 | 2 | 3 | 10 |
| fwbw select | 3 | 2 | 3 | 2 | 10 |
| lasso | 1 | 1 | 1 | 1 | 4 |

Here, the basic model, with no tuning, always performed the worst, while the model using Lasso-selected features always performed the best, albeit by very small margins.

Despite having identical features to work with, the model does slightly better with the London dataset, especially after tuning. I am not sure if this is down to noise, or if there is a stronger pattern in the London set. Given that there is a more prevalent bike-riding culture in Europe generally, this is the opposite of what I would have expected.


Feature Impacts on Models

In these charts, features nearer the top were more important; the charts also show the direction of each feature's effect and whether high or low values of that feature are driving it.

DC Results

DC model, all features

DC model, fw features

DC model, all features

DC model, fw features


London Results

London model, all features

London model, fw features

London model, fw features


To summarize, higher temperatures and apparent temperatures result in the model output, the count of usage, going up, while those values going down results in it going down. Higher humidity and windspeed do the opposite, with higher values causing the model output to decrease, although low values only cause a small increase. High weather values, which correspond to more severe weather, result in decreases, and when the precipitation category has an effect, a 1 for active precipitation results in a decrease as well. In both of these cases, there is only a small positive increase associated with the opposite values.

All of those behaviors make sense and are what I expected. Unexpectedly, the DC models show an increase for the high-value seasons (3 - fall, 4 - winter) and a decrease for the low (1 - spring, 2 - summer). This is contrary to all of the other behavior in the model. The London models do not do this, and show the expected trend of high values causing a decrease and low values causing an increase. This is the only complete opposition between the models; otherwise there are only differences in how much a feature affects the output.


Pipelines


As an exercise, I built pipelines, tweaked for each specific dataset, to cover the entire process in the Feature Engineering notebook. This meant converting each step into a custom transformer. Some steps were too specific to their dataset and couldn't be generalized to work for both, such as the fill_hours functions.
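For illustration, here is the general shape of one of those custom transformers. GetHour is a real step from the table below, but this body is a guess at an implementation, and it assumes the datetime column is named date.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class GetHour(BaseEstimator, TransformerMixin):
    """Extract the hour from the datetime column into its own feature."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["hour"] = X["date"].dt.hour
        return X
```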

| Transformer Name | Function |
| --- | --- |
| RenameColDC | Rename columns to standardized names |
| RenameColLond | Rename columns to standardized names |
| DateTimeConverter | Convert date column to datetime object |
| FillHoursDC | Find missing hours and create rows for them |
| GetHour | Get hour out of date |
| FillHoursLond | Find missing hours and create rows for them |
| FillWMeans | Fill values that are the same for 24 hour periods |
| FillWInterpolate | Fill values that move over 24 hour periods |
| MergeDataDC | Merge dc_hour table to fill missing weather values |
| FillWeatherValueDC | Fill missing weather values for DC dataset |
| ForwardFillWeatherLond | Fill missing weather values for London dataset |
| PrecipMappingDC | Use mapping to create precip column for DC dataset |
| MakeWorkingDayLond | Use holiday and is_weekend to create workingday for London dataset |
| GetDay | Get day out of date |
| GetYr_Mn_Wkdy | Get year, month and weekday out of date |
| DropNaSubsetLond | Drop days in the London dataset that couldn't be filled with means |
| ValueMapTransformerLond | Apply mappings for season, weather, weekday and precip to London dataset; precip column is created |
| SetIntReorder | Set category values to INT, reorder columns |
| SetIndexDate | Set the date column as the index |


These were all run within the notebook, but I also saved them into a .py file.

Once assembled in order, the DC pipeline was

RenameColDC,
FillHoursDC,
FillWMeans,
FillWInterpolate,
DateTimeConverter,
MergeDataDC,
FillWeatherValueDC,
PrecipMappingDC,
GetDay,
SetIntReorder,
SetIndexDate

and the London pipeline was

RenameColLond,
MakeWorkingDayLond,
DateTimeConverter,
GetHour,
FillHoursLond,
GetDay,
GetYr_Mn_Wkdy,
FillWMeans,
DropNaSubsetLond,
FillWInterpolate,
ForwardFillWeatherLond,
ValueMapTransformerLond,
SetIntReorder,
SetIndexDate
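Assembling either pipeline is then just a matter of wrapping those transformers in an sklearn Pipeline in the order listed; a minimal sketch for the DC version, where dc_raw stands in for the raw hourly DataFrame.

```python
from sklearn.pipeline import Pipeline

dc_pipeline = Pipeline([
    ("rename", RenameColDC()),
    ("fill_hours", FillHoursDC()),
    ("fill_means", FillWMeans()),
    ("fill_interp", FillWInterpolate()),
    ("to_datetime", DateTimeConverter()),
    ("merge_daily", MergeDataDC()),
    ("fill_weather", FillWeatherValueDC()),
    ("precip", PrecipMappingDC()),
    ("get_day", GetDay()),
    ("int_reorder", SetIntReorder()),
    ("set_index", SetIndexDate()),
])
dc_clean = dc_pipeline.fit_transform(dc_raw)
```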
