I looked at data from two cities, Washington DC and London, England, over two-year spans, to try to find weather-related usage patterns in the cities' bikeshare systems.
I found the Washington DC Capital Bikeshare dataset on the UC Irvine website via a list of interesting datasets, and, when I checked for other data I could use, a second dataset on Kaggle tracking the same variables for London. It was the only other dataset I found with data similar enough to be usable in a short time frame.
I confirmed that the same basic data was available in both datasets. The London data required more processing to end up with the same columns, but aside from the DC set having an additional breakdown between regular and guest users, the available data was the same.
The DC data starts on January 1, 2011, and runs for two years, ending on December 31, 2012. The London data misses the New Year and instead starts on January 4, and also runs for two years, but from 2015 until January 3, 2017.
Both datasets include hourly weather data (temperature, humidity, windspeed and 'weather condition'), as well as whether or not a day is a holiday, the season, and of course the number of rentals.
Looking at the target variable, the number of bikes being used (called `cnt` by default in both tables), there is a significant difference in the range of values, and therefore total usage, between the two cities. London has much, much higher numbers than DC does, and both target variables have exponential distributions.
The `season` feature was essentially uniform in both datasets, but there were no time-value columns initially present in the London data to look at. The `year`, `weekday`, `month` and `hour` columns had to be extracted from the timestamps. As such, the charts below are actually from partway through feature engineering.
Given that the data covered entire years, there were certain features that logically should have been uniformly distributed, or at least close to it. These were `season`, `year`, `weekday`, `month` and `hour`.
Some minor variation in most of these values makes sense, given that not all months are the same length and the year does not always start at the beginning of the week. Additionally, 2012 was a leap year, so it has an extra day. London's year distribution was always going to be a little odd, given that the data technically crosses into a third year. However, the fact that the hours showed as not completely uniform presented a problem; every day has 24 hours, so there was clearly missing data that needed to be filled in. It turned out that there were no rows with 0 rentals in the DC set, and only one in the London set, which led me to conclude that the missing rows were the ones where there was no rental activity. The missing rows are dealt with during Feature Engineering.
Although they have different names, `temp` in the DC dataset and `t1` in the London dataset both refer to the actual temperature, while `atemp` for DC and `t2` for London both refer to the apparent temperature (per the documentation), making them directly comparable.
All of these features are approximately normally distributed.
There are also some columns that were likely to have unpredictable or skewed distributions. First, for the DC dataset, there are the `holiday`, `workingday` and `weather` columns. All of these make sense: there are more days that are not holidays than holidays, more working days than non-working days, and less severe weather is more common than severe weather.
The London set also has `holiday`, but has `is_weekend` instead of `workingday`, and different value options for `weather`.
The holiday pattern matches the one for DC, and the weekend pattern is roughly the inverse of the workingday pattern. Used together, those can produce a `workingday` column for the London dataset, as sketched below.
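A minimal sketch of that derivation, using a toy frame in place of the real London data (hypothetical values):

```python
import pandas as pd

# Toy frame standing in for the London data (hypothetical values).
london = pd.DataFrame({"is_weekend": [0, 1, 0], "holiday": [0, 0, 1]})

# A working day is any day that is neither a weekend day nor a holiday.
london["workingday"] = ((london["is_weekend"] == 0) & (london["holiday"] == 0)).astype(int)
print(london)
```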
This weather column does not have as straightforward a pattern as the one for the DC dataset. Its values are not sequential, resulting in the stretched histogram above. There is also one value, 94, that is never used.
Looking at PairPlots (which are omitted due to size and computing time), there are a few features that immediately appear to correlate with the number of bikes being rented.
These best fit lines behave the way I expected: for both temperature features, an increase in temperature corresponds to an increase in the rental count. Additionally, an increase in humidity, which in the climates of the cities I am looking at will most often mean less pleasant weather, results in a decrease in the rental count.
In the DC set, where there is an hour column to look at, the best fit line loosely trends upwards as the hour increases towards midnight, and looking at the actual scatterplot, it's clear that the dips are the overnight and predawn hours, which (loosely) have the lowest counts.
Feature Pair | Coefficient |
---|---|
DC set | |
Count + Hour | 0.394 |
Count + Temp | 0.405 |
Count + Atemp | 0.400 |
Count + Hum | -0.323 |
London set | |
Count + T1 | 0.389 |
Count + T2 | 0.369 |
Count + Hum | -0.463 |
On their own, the correlations are not particularly strong, but it is possible they will have a significant effect as part of a larger model.
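These coefficients are plain Pearson correlations; a quick sketch of how numbers like these could be pulled with pandas, using a toy frame in place of the real hourly data:

```python
import pandas as pd

# Toy frame standing in for the hourly DC data (hypothetical values).
df = pd.DataFrame({
    "cnt":  [16, 40, 32, 13, 1, 1],
    "hr":   [0, 1, 2, 3, 4, 5],
    "temp": [0.24, 0.22, 0.22, 0.24, 0.24, 0.24],
    "hum":  [0.81, 0.80, 0.80, 0.75, 0.75, 0.75],
})

# Pearson correlation of each feature against the rental count.
print(df[["hr", "temp", "hum"]].corrwith(df["cnt"]))
```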
I performed feature engineering; all values related to time had to be extracted into columns for the London dataset. I also filled in gaps in the data where an hour had been skipped, on the assumption that those times had been omitted because no bikes were rented, as for my purposes 'no rentals occurred' is a useful data point. In the London set, there were some days that had no values at all in any of the columns, which meant there was no way to fill most of them, so those entire days were dropped to avoid creating skew. Otherwise, different fill methods were used depending on the column: some columns have static values throughout the day, while others move continuously. In the first case, the value from elsewhere in the day was simply copied in, while in the second case interpolation was used to take the average of the values on either side of the missing value.
For the DC dataset, a secondary table counting by day rather than by hour exists, so the weather values for a given day on that were used to fill missing weather values. No such second table existed for the London dataset, but exploration showed that it was reasonably common for the weather code on either side of a missing value to be the same, so simple forward fill was used to fill the missing values.
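A sketch of this gap-filling approach on a toy hourly frame (hypothetical values; the real fills distinguish per-day static columns from continuous ones, as described above):

```python
import pandas as pd

# Toy hourly frame with a missing hour at 02:00 (hypothetical data).
idx = pd.to_datetime(["2015-01-04 00:00", "2015-01-04 01:00", "2015-01-04 03:00"])
df = pd.DataFrame({"cnt": [182, 138, 72],
                   "t1": [3.0, 3.0, 2.0],
                   "season": [3, 3, 3],
                   "weather": [3, 3, 1]}, index=idx)

# Reindex to every hour in the span; missing hours appear as NaN rows.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="h")
df = df.reindex(full_range)

df["cnt"] = df["cnt"].fillna(0)        # missing hour == no rentals
df["season"] = df["season"].ffill()    # static within a day: copy from elsewhere in the day
df["t1"] = df["t1"].interpolate()      # continuous: average of the neighbouring values
df["weather"] = df["weather"].ffill()  # weather code: simple forward fill
print(df)
```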
The London dataset had different values in the `season`, `weather`, and `weekday` categories, so those were remapped to match the DC values. I am running them through separate models, but for consistency and so I don't confuse myself later, it made sense to have them match. The London weather encoding was significantly more granular than the DC encoding, so it had to be reformatted to be properly comparable. Both datasets have descriptions of what qualifies for each code, so I used those to group the London codes and match them to the DC ones. I also created a new `precip` column, based on those descriptions, which is a categorical 'is there precipitation or not' 0/1 column.
(Note: filling in the missing weather values in the London dataset happened after remapping the weather codes, and since the new codes are less granular than the old ones, this may have increased how often the values before and after a gap matched.)
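A sketch of the remapping and the `precip` column, with the mapping dicts taken from the legend tables below (toy weather values):

```python
import pandas as pd

# London weather codes -> DC-style codes, per the groupings described above.
weather_map = {1: 1, 2: 1, 3: 2, 4: 2, 7: 3, 10: 4, 26: 4, 94: 4}
# DC-style weather code -> 'is there precipitation' flag.
precip_map = {1: 0, 2: 0, 3: 1, 4: 1}

london = pd.DataFrame({"weather": [1, 2, 7, 26]})  # toy values
london["weather"] = london["weather"].map(weather_map)
london["precip"] = london["weather"].map(precip_map)
print(london)
```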
The `temp`, `atemp` (apparent temp), `humidity` and `windspeed` columns of the DC dataset came pre-normalized, so the corresponding columns in the London dataset were normalized as well, using the MinMaxScaler, as that matched the behavior of the data as I found it. This was done before splitting, to be consistent with the DC set.
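A sketch of that scaling step, on a toy stand-in for the London columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the London data (hypothetical values).
london = pd.DataFrame({"temp": [2.0, 9.5, 28.0],
                       "atemp": [0.5, 9.0, 31.0],
                       "humidity": [93.0, 70.0, 40.0],
                       "windspeed": [6.0, 15.0, 27.0]})

cols = ["temp", "atemp", "humidity", "windspeed"]
# Scale each column to the 0-1 range, matching the pre-normalized DC columns.
london[cols] = MinMaxScaler().fit_transform(london[cols])
print(london)
```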
Feature code legends:
Code | Season |
---|---|
1 | Spring |
2 | Summer |
3 | Fall |
4 | Winter |
Year | Dataset | Code |
---|---|---|
2011 | DC | 0 |
2012 | DC | 1 |
2015 | London | 0 |
2016 | London | 1 |
2017 | London | 2 |
Code | Weather Description |
---|---|
1 | Clear, Few clouds, Partly cloudy, Partly cloudy |
2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
4 | Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog |
1: Clear, Few clouds, Partly cloudy, Partly cloudy
Old values:
1 - Clear ; mostly clear but have some values with haze/fog/patches of fog/ fog in vicinity
2 - scattered clouds / few clouds
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
Old values:
3 - Broken clouds
4 - Cloudy
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
Old value:
7 = Rain/ light Rain shower/ Light rain
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
Old values:
10 = rain with thunderstorm
26 = snowfall
94 = Freezing Fog
Mappings for the `precip` column:
Weather code | Description | precip code |
---|---|---|
1 | Clear, Few clouds, Partly cloudy, Partly cloudy | 0 |
2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist | 0 |
3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds | 1 |
4 | Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog | 1 |
I chose four regression models to try: `LinearRegression`, `SVR` and `RandomForestRegressor` from Sklearn, and `XGBRegressor` from XGBoost.
Because my datasets are time-based, I did my train/test split without shuffling.
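A minimal sketch of that baseline setup, on toy time-ordered data (the real X and y come from the prepared datasets); the point is the `shuffle=False` split and the four untuned regressors:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Toy time-ordered data standing in for one of the prepared datasets.
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X @ np.array([300.0, -200.0, 50.0, 10.0, 5.0]) + rng.normal(0, 10, 500)

# No shuffling: keep the chronological order, so the test set is the later period.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

for model in (LinearRegression(), SVR(), RandomForestRegressor(random_state=0),
              XGBRegressor(random_state=0)):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = mean_squared_error(y_test, preds) ** 0.5
    print(f"{type(model).__name__}: RMSE={rmse:.3f}, R2={r2_score(y_test, preds):.3f}")
```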
LinearRegression results

DC dataset
Train set:
Root Mean Squared Error: 129.47650158667653
R-squared: 0.40023509636432963
Test set:
Root Mean Squared Error: 182.47521094734591
R-squared: 0.3154146362592527
London dataset
Train set:
Root Mean Squared Error: 887.6725962693262
R-squared: 0.31692830959135165
Test set:
Root Mean Squared Error: 978.3105148178599
R-squared: 0.24876167392851933
SVR results

DC dataset
Train set:
Root Mean Squared Error: 143.88640899563978
R-squared: 0.25930624378822176
Test set:
Root Mean Squared Error: 210.58087348202932
R-squared: 0.08828792531790608
London dataset
Train set:
Root Mean Squared Error: 1021.256075397555
R-squared: 0.09587189980290356
Test set:
Root Mean Squared Error: 1086.1229944892814
R-squared: 0.07406114159332133
RandomForestRegressor results

DC dataset
Train set:
Root Mean Squared Error: 14.152513563206748
R-squared: 0.9928341733910035
Test set:
Root Mean Squared Error: 72.87654665983804
R-squared: 0.8908068409097215
London dataset
Train set:
Root Mean Squared Error: 85.32546889684085
R-squared: 0.9936887115704169
Test set:
Root Mean Squared Error: 289.163573168845
R-squared: 0.9343686316506019
XGBRegressor results

DC dataset
Train set:
Root Mean Squared Error: 24.081214852442457
R-squared: 0.9792529940605164
Test set:
Root Mean Squared Error: 69.92294989890084
R-squared: 0.8994784355163574
London dataset
Train set:
Root Mean Squared Error: 127.9868120875735
R-squared: 0.9857999086380005
Test set:
Root Mean Squared Error: 332.20786503370334
R-squared: 0.9133748412132263
The two models that performed the best, based on both `R-squared` and `root mean squared error`, were the `RandomForest` and `XGBoost` models, so I chose those two models to continue tuning in the hyperparameters section.
I used three methods of feature selection. The first was running a simple `Lasso` regression and retrieving all of the columns that did not return a coefficient of 0. This was the crudest of the methods, as the Lasso regression was completely untuned, and as a result it removed the most features.
The second method was `Forward Selection`, where features are added to the model step by step, starting with the most significant.
My third method was `Backward Selection`: starting with all features and optimizing by attempting step-by-step removal of features. A sketch of all three is below.
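A sketch of the three approaches on toy data, assuming scikit-learn's `Lasso` and `SequentialFeatureSelector`; which estimator drives the forward/backward selection here (`XGBRegressor`) is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SequentialFeatureSelector
from xgboost import XGBRegressor

# Toy feature frame standing in for one of the prepared train sets.
rng = np.random.default_rng(1)
X_train = pd.DataFrame(rng.random((200, 4)), columns=["hour", "temp", "hum", "windspeed"])
y_train = 300 * X_train["temp"] - 200 * X_train["hum"] + rng.normal(0, 10, 200)

# 1) Untuned Lasso: keep every feature that gets a non-zero coefficient.
lasso = Lasso().fit(X_train, y_train)
print("Lasso kept:", X_train.columns[lasso.coef_ != 0].tolist())

# 2) Forward selection: add features to the model one step at a time.
fw = SequentialFeatureSelector(XGBRegressor(random_state=0), direction="forward",
                               n_features_to_select=2).fit(X_train, y_train)
print("Forward kept:", list(fw.get_feature_names_out()))

# 3) Backward selection: start with everything and try removing features step by step.
bw = SequentialFeatureSelector(XGBRegressor(random_state=0), direction="backward",
                               n_features_to_select=2).fit(X_train, y_train)
print("Backward kept:", list(bw.get_feature_names_out()))
```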
For both datasets, using Lasso selection resulted in the removal of multiple features. Forward and backward selection, however, dropped far fewer: for the London dataset they dropped the identical set of features, and for the DC dataset they differed only in that backward selection also dropped `temp`. Below are the selected and dropped features for each dataset, according to each method.
(Note: the `count` column is not involved here, as it is the dependent variable.)
Using Lasso:
DC dataset
Selected features: season, year, month, day, hour, weekday, precip, temp, humidity
Dropped features: holiday, workingday, weather, atemp, windspeed
London dataset
Selected features: season, year, month, day, hour, weekday, workingday, weather, precip, temp, humidity, windspeed
Dropped features: holiday, atemp
Using Forward Select:
DC dataset
Selected features: season, year, month, hour, weekday, holiday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: day, workingday
London dataset
Selected features: season, month, day, hour, workingday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: year, weekday, holiday
Using Backward Select:
DC dataset
Selected features: season, year, month, hour, weekday, holiday, weather, precip, atemp, humidity, windspeed
Dropped features: day, workingday, temp
London dataset
Selected features: season, month, day, hour, workingday, weather, precip, temp, atemp, humidity, windspeed
Dropped features: year, weekday, holiday
I created new versions of X_train and X_test based on each selection list, as I wanted to test how well the model could do with feature inputs that had, theoretically, been optimized.
Because forward and backward selection on the London dataset produced the same feature list, I didn't bother to create separate `_fw` and `_bw` train/test pairs; that would have been redundant. I simply created a `_fwbw` pair instead.
To start with, I tuned both models on the regular train/test set that had all of the features. My initial attempts to use `r2` as my scoring method produced very low scores, so I ran secondary versions scoring on `neg_root_mean_squared_error`, and found that while that method did not always produce improvement, it usually at least matched the untuned model's baseline.
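A sketch of that tuning setup on toy data; `GridSearchCV` with a `TimeSeriesSplit` is an assumption about how such a search could be run, while the `neg_root_mean_squared_error` scoring is the part taken from above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

# Toy data standing in for one of the prepared train sets.
rng = np.random.default_rng(2)
X_train = rng.random((300, 5))
y_train = X_train @ np.array([300.0, -200.0, 50.0, 10.0, 5.0]) + rng.normal(0, 10, 300)

# Hypothetical grid; the hyperparameter table further below lists the values
# the real searches settled on.
param_grid = {"learning_rate": [0.1, 0.2, 0.3],
              "max_depth": [6, 7],
              "n_estimators": [100, 200, 300]}

search = GridSearchCV(XGBRegressor(random_state=0), param_grid,
                      scoring="neg_root_mean_squared_error",  # r2 scoring gave very low scores
                      cv=TimeSeriesSplit(n_splits=3))          # keeps the time ordering in CV
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```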
RandomForestRegressor tuning results

DC set:
Base model r2: 0.8908068409097215
Tuned model r2: 0.6895321510957303
Base model rmse: 72.87654665983804
Tuned model rmse: 71.84172119174976
London set:
Base model r2: 0.9343686316506019
Tuned model r2: 0.9028084805222928
Base model rmse: 289.163573168845
Tuned model rmse: 312.8454875514148
XGBRegressor tuning results

DC set:
Base model r2: 0.8994784355163574
Tuned model r2: 0.7963137030601501
Base model rmse: 69.92294989890084
Tuned model rmse: 56.490411500995606
London set:
Base model r2: 0.9133748412132263
Tuned model r2: 0.919543719291687
Base model rmse: 332.20786503370334
Tuned model rmse: 293.8955040541115
`XGBRegressor` performed much better than `RandomForestRegressor` did, so for the feature selection models, I focused on that one.
DC set:
Base model: 69.92294989890084
All features: 56.490411500995606
Fw features: 56.5705918807984
Bw features: 55.76142506506576
Lasso features: 60.833681953480436
London set:
Base model: 332.20786503370334
All features: 293.8955040541115
FwBw features: 287.72959089223957
Lasso features: 287.6278104393208
Model | alpha | lambda | learning_rate | max_depth | n_estimators |
---|---|---|---|---|---|
dc_allfeat | 0.10 | 100 | 0.3 | 6 | 180 |
dc_fw | 2.00 | 50 | 0.2 | 6 | 370 |
dc_bw | 0.70 | 80 | 0.2 | 6 | 280 |
dc_lasso | 1.00 | 130 | 0.2 | 6 | 190 |
lond_allfeat | 0.75 | 125 | 0.3 | 7 | 70 |
lond_fwbw | 0.01 | 10 | 0.1 | 7 | 140 |
lond_lasso | 0.01 | 10 | 0.1 | 6 | 160 |
The DC set was all over the place for alpha and lambda values, and had the highest values for the number of estimators, in one case going twice as high as the highest estimator value any London model ended up with.
The London set had higher alpha and lambda values on the all-features version, but they dropped for the versions that had undergone feature selection. The number of estimators did the opposite, jumping up where feature selection had occurred.
Both models had the learning rate decrease for versions that had undergone feature selection, with a larger decrease on the London set.
The DC dataset always used a max depth of 6, but the London dataset slightly preferred 7, with no apparent pattern.
I initialized and ran each model tuned in the hyperparameters notebook so I could fully evaluate their performance.
During my evaluation and comparison, I compared the tuned models for each set to the untuned one, and the overall behavior of each dataset to the other. I looked at the `R-squared` and `Root Mean Squared Error` (R2 and RMSE) of both the training and test sets, as well as the changes between them, and the RMSE as a percentage of each dataset's maximum real value, in order to get an idea of how much error there actually was; an error of +/-50 would not be a big deal when the values in question are routinely in the thousands, but would be a big deal if the values were dozens at most.
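The 'RMSE as % of range' columns in the tables below come from that idea; a one-line sketch with hypothetical numbers:

```python
# Express a test RMSE as a percentage of the target's observed range, to give
# the raw error a sense of scale (the numbers below are hypothetical).
def rmse_as_pct_of_range(rmse: float, y_max: float, y_min: float = 0.0) -> float:
    return 100 * rmse / (y_max - y_min)

print(round(rmse_as_pct_of_range(rmse=70.0, y_max=1000.0), 2))  # -> 7.0
```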
DC evaluation outputs
Model | Train R2 | Test R2 | R2 Decrease | Train RMSE | Test RMSE | RMSE as % of range | RMSE Increase | RMSE Increase as % of range |
---|---|---|---|---|---|---|---|---|
basic model | 0.979 | 0.899 | 0.080 | 24.081 | 69.923 | 7.16 | 35.832 | 3.67 |
all features | 0.973 | 0.903 | 0.070 | 27.238 | 68.692 | 7.03 | 41.454 | 4.24 |
fw select | 0.975 | 0.908 | 0.066 | 26.683 | 66.804 | 6.84 | 40.120 | 4.11 |
bw select | 0.967 | 0.898 | 0.069 | 30.417 | 70.396 | 7.21 | 39.979 | 4.09 |
lasso | 0.954 | 0.889 | 0.065 | 35.664 | 73.389 | 7.51 | 37.725 | 3.86 |
All the test R2 values are very consistent, regardless of model tuning or feature set, while the drop in R2 between the training and test sets is smaller for the tuned models, and even smaller with feature selection.
The test RMSEs are also all very similar, both as actual values and as percentages of the maximum possible value. While the actual values of the error increases look quite large, approximately doubling across the board, they are fairly small amounts in proportion to the overall range.
There is not much difference in the `RMSE as %` values (a spread of 0.67) or in the `RMSE Increase as %` values, even though there was some improvement in how much the `R2` was dropping. This could mean the model is not generalizing very well on this data, even after tuning and feature selection.
For this data, there wasn't a consistent pattern to the performance in the raw scores, only in the amount of change from the training to test sets.
London evaluation outputs
Model | Train R2 | Test R2 | R2 Decrease | Train RMSE | Test RMSE | RMSE as % of range | RMSE Increase | RMSE Increase as % of range |
---|---|---|---|---|---|---|---|---|
basic model | 0.986 | 0.913 | 0.072 | 127.987 | 332.208 | 4.23 | 203.221 | 2.59 |
all features | 0.968 | 0.929 | 0.038 | 193.013 | 299.935 | 3.82 | 106.922 | 1.36 |
fwbw select | 0.964 | 0.928 | 0.036 | 203.815 | 302.494 | 3.85 | 98.679 | 1.26 |
lasso | 0.969 | 0.936 | 0.033 | 189.722 | 286.016 | 3.64 | 96.293 | 1.23 |
The R2 values for the test set are, again, very consistent. This time, however, there is a much smaller drop between the training and the test sets, showing that the model is better at generalizing and is not overfitting. There is essentially no difference between the version with all features included and the train/test sets produced by feature selection.
Here there is more variation in RMSE, with improvement in all of the tuned models, and looking at it as a percentage of the maximum possible value makes it easier to see the magnitude of the changes. The model with its features selected by `Lasso` had the largest decrease, 0.59, with the other two having around 0.4.
Unlike for DC, the increase in error between training and test was not similar across versions. Tuning cut the increase down dramatically: the models with feature selection had less than half the untuned model's RMSE increase, and the 'all features' version was only just above the 50% mark. So, even though the actual test RMSE values were fairly close together, most of that error wasn't a result of the transition from training to test, but part of the models' attempts to account for the data. This, combined with the high R2 scores, is a good sign for the models' ability to fit the data without overfitting.
Further confirmation that the models are doing well can be seen by comparing the error on the variable being predicted, `count`, to its standard deviation in the original data. For the DC dataset the standard deviation is 181.5, while the highest test RMSE is ~73. For London those numbers are 1085.4 and ~332, respectively. In both cases, the RMSE is well under half of the standard deviation; well within "reasonable" for the data.
I ranked the models for each set by looking at the `Test R2` and `R2 Decrease` columns, as well as the `Test RMSE` and `RMSE Increase` columns, ignoring the training columns as they are not the target, and the other two RMSE columns as they were redundant. I put values starting at 1 for the best and so on (up to 5 for DC and 4 for London) in each column, and then added up the total for each model, with the lowest value being the best overall performer. (A small code sketch of this ranking follows the tables below.)
DC | Test R2 | R2 Decrease | Test RMSE | RMSE Increase | Total |
---|---|---|---|---|---|
basic model | 3 | 5 | 3 | 1 | 12 |
all features | 2 | 4 | 2 | 5 | 13 |
fw select | 1 | 2 | 1 | 4 | 8 |
bw select | 4 | 3 | 4 | 3 | 14 |
lasso | 5 | 1 | 5 | 2 | 13 |
Here, `Forward Select` had the best overall results, with all the others being pretty similar.
London | Test R2 | R2 Decrease | Test RMSE | RMSE Increase | Total |
---|---|---|---|---|---|
basic model | 4 | 4 | 4 | 4 | 16 |
all features | 2 | 3 | 2 | 3 | 10 |
fwbw select | 3 | 2 | 3 | 2 | 10 |
lasso | 1 | 1 | 1 | 1 | 4 |
Here, the basic model, with no tuning, always performed the worst, while the model using `Lasso`-selected features always performed the best, albeit by very small margins.
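As referenced above, a minimal pandas sketch of this rank-and-sum comparison, using the first three rows of the DC evaluation table as example values (hypothetical frame name `eval_df`; since only three models are included, the ranks differ from the full table):

```python
import pandas as pd

# Toy evaluation table (first three rows of the DC table above).
eval_df = pd.DataFrame(
    {"Test R2": [0.899, 0.903, 0.908], "R2 Decrease": [0.080, 0.070, 0.066],
     "Test RMSE": [69.9, 68.7, 66.8], "RMSE Increase": [35.8, 41.5, 40.1]},
    index=["basic model", "all features", "fw select"])

# Higher is better for Test R2; lower is better for the other three columns.
ranks = pd.DataFrame({
    "Test R2": eval_df["Test R2"].rank(ascending=False),
    "R2 Decrease": eval_df["R2 Decrease"].rank(),
    "Test RMSE": eval_df["Test RMSE"].rank(),
    "RMSE Increase": eval_df["RMSE Increase"].rank(),
})
ranks["Total"] = ranks.sum(axis=1)  # lowest total = best overall performer
print(ranks.sort_values("Total"))
```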
Despite having identical features to work with, the model does slightly better with the London dataset, especially after tuning. I am not sure if this is down to noise, or if there is a stronger pattern in the London set. Given that there is a more prevalent bike riding culture in Europe generally, this is the opposite of what I would have expected.
In these charts, the features nearer the top were more important, and the charts also show the directionality of each effect: whether high or low values for a feature push the prediction up or down.
DC Results
London Results
To summarize, higher temperatures and apparent temperatures result in the model output, the count of usage, going up, while those values going down results in it going down. Higher humidity and windspeed do the opposite, with higher values causing the model output to decrease, although low values only cause a small increase. High weather values, which correspond to more severe weather, result in decreases, and when the precipitation category has an effect, a 1 for active precipitation results in a decrease as well. In both of these cases, there is only a small positive increase associated with the opposite values.
All of those behaviors make sense and are what I expected. Unexpectedly, the DC models show an increase for the high-value seasons (3 - fall, 4 - winter) and a decrease for the low ones (1 - spring, 2 - summer). This is contrary to all of the other behavior in the model. The London models do not do this, and show the expected trend of high values causing a decrease and low values causing an increase. This is the only complete opposition between the models; otherwise there are only differences in how much a feature affects the output.
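Charts of this type can be generated with the SHAP library's summary plot; a sketch for an XGBoost model on toy data (SHAP is named here for illustration, not as a record of the exact tool used):

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

# Toy data standing in for one of the prepared train/test sets.
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.random((300, 4)), columns=["temp", "hum", "windspeed", "weather"])
y = 500 * X["temp"] - 300 * X["hum"] + rng.normal(0, 20, 300)

model = XGBRegressor(random_state=0).fit(X, y)

# TreeExplainer works directly on tree ensembles like XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm summary: features ordered by importance, coloured by feature value,
# with points to the right pushing the predicted count up.
shap.summary_plot(shap_values, X)
```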
As an exercise, I built pipelines, tweaked for each specific dataset, to cover the entire process in the `Feature Engineering` notebook. This meant converting each step into a custom transformer. Some steps were too specific to their dataset and couldn't be generalized to work for both, such as the fill_hours functions.
Transformer Name | Function |
---|---|
RenameColDC | Rename columns to standardized names |
RenameColLond | Rename columns to standardized names |
DateTimeConverter | Convert date column to datetime object |
FillHoursDC | Find missing hours and create rows for them |
GetHour | Get hour out of date |
FillHoursLond | Find missing hours and create rows for them |
FillWMeans | Fill values that are the same for 24 hour periods |
FillWInterpolate | Fill values that move over 24 hour periods |
MergeDataDC | Merge dc_hour table to fill missing weather values |
FillWeatherValueDC | Fill missing weather values for DC dataset |
ForwardFillWeatherLond | Fill missing weather values for London dataset |
PrecipMappingDC | Use mapping to create precip column for DC dataset |
MakeWorkingDayLond | Use holiday and is_weekend to create workingday for London dataset |
GetDay | Get day out of date |
GetYr_Mn_Wkdy | Get year, month and weekday out of date |
DropNaSubsetLond | Drop days in the London dataset that couldn't be filled with means |
ValueMapTransformerLond | Apply mappings for season, weather, weekday and precip to London dataset; the precip column is created here |
SetIntReorder | Set category values to INT, reorder columns |
SetIndexDate | Set the date column as the index |
These were all run within the notebook, but I also saved them into a .py file.
Once assembled in order, the DC pipeline was:
RenameColDC,
FillHoursDC,
FillWMeans,
FillWInterpolate,
DateTimeConverter,
MergeDataDC,
FillWeatherValueDC,
PrecipMappingDC,
GetDay,
SetIntReorder,
SetIndexDate
and the London pipeline was:
RenameColLond,
MakeWorkingDayLond,
DateTimeConverter,
GetHour,
FillHoursLond,
GetDay,
GetYr_Mn_Wkdy,
FillWMeans,
DropNaSubsetLond,
FillWInterpolate,
ForwardFillWeatherLond,
ValueMapTransformerLond,
SetIntReorder,
SetIndexDate
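For illustration, a minimal sketch of the custom-transformer pattern and the pipeline assembly; the two transformers below are sketched from their one-line descriptions in the table above (with a hypothetical `date` column name), and the real implementations live in the notebook and the saved .py file:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DateTimeConverter(BaseEstimator, TransformerMixin):
    """Convert the date column to a datetime object (sketched from the table above)."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X["date"] = pd.to_datetime(X["date"])
        return X

class GetHour(BaseEstimator, TransformerMixin):
    """Get the hour out of the date column (sketched from the table above)."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X["hour"] = X["date"].dt.hour
        return X

# Assembling steps in order, in the same spirit as the DC and London pipelines above.
pipe = Pipeline([("to_datetime", DateTimeConverter()), ("get_hour", GetHour())])

demo = pd.DataFrame({"date": ["2015-01-04 00:00", "2015-01-04 01:00"], "cnt": [182, 138]})
print(pipe.fit_transform(demo))
```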