Crowd AnalytiX (CAX) - McKinsey Big Data Hackathon to predict the probability of an offer being accepted by a certain driver.
- Objective
- Approach
- Study Dataset Creation
- Feature Analysis
- Exploratory Data Analysis
- Supervised Model Building
- XGBoost Model Results
- Model Comparison Chart
- Learnings
- Acknowledgement
The objective of this contest is to predict the probability of an offer being accepted by a certain driver.
- Study Dataset Creation
- Raw data analysis and Feature Analysis
- Multivariate analysis - Using Tableau
- Exploratory Data Analysis - Data Cleaning
- Apply Supervised Machine Learning Models on the Dataste
- Performance Measurement - Confusionmatrix and AUC
In this type of data preparation, we created new features distance_speed and distance_driver_origin.
Where, duration_speed = duration_km / duration_min and distance_driver_origin = Ecludian distance on driver_latitude, driver_longitude,origin_order_latitude, origin_order_longitude.
We also checked for outliers, missing values and completed ExploratoryData Analysis for the original dataset. The R code is written in - CAX_McK_Clean1.R file. And we store the data in [CAX_Mck_Train_Clean1.csv] file.
We build Machine Learning models and evaluated model performance.
For Feature analysis we decided to do multivariant analysis in Tableau. Tableau is powerful tool to analysis data in visual forms. As out target feature is driver_responce, our analysis includes particular variables the most.
In the above-mentioned chart as we can see 1421 driver code got the maximum offers, however, driver accepted offer is only 90. Similarly, 3371 driver code, got around 500 offers, but driver accepted offer was 382. Hence, this driver did maximum business by accepting the offers.
This patter is also shown in following maps,
As we can see in the map, based on the longitude and latitude the data is from the city Moscow, Russia. As we can see in the map, near to Moskva River, drivers getting more offers and accepting offers are also high compare to outskirts of the main city. Observant points, driver ID with 3371, 3665 and 6580 as shown in the map got good count on offer acceptance.
Hence, we can say that in the centre of the city drivers are able to get more acceptance due to short distance.
As, we can see in the following chart Red market cluster are in the drivers in the centre of the city, and size of the circle shown average distance in km. However, as we can see in the blue clusters the distance outside the city rides are longer as size of the circles are bigger.
This, normally happens in the city, where in the centre of the city due to short distance, drivers get available earlier and hence are also available to accept the next nearest ride.
Here, 0 – is Sunday and ascending weekdays 6 is Saturday. As we can see in the flowing bar chart Driver’s acceptance ratio is lower on Sunday and Friday, which is 70% and 71.1% out of all offers receiver by the drivers on the particular weekday.
However, the best acceptance rate we can see is for Wednesday. And Thursday.
Low acceptance ratios on Monday, Friday and Saturday, also means the availability of the drivers? Dose these days we observe overcrowd of passengers? However, we can’ totally deny that drivers were sitting ideal and not accepting the offers on Monday or Saturday. However, for Sunday the case could be different as it is a only week holiday.
Similarly, as we observer the bar chat for a week, we have also observer patterns for an hour.
As sown in the following bar chart, Hight of the bar lines are the total counts response of the drivers and Green share indicated that acceptance rate of the driver.
Hence, as we can see in the bar chart driver’s acceptance rate is high from 8 am in morning till 3 pm in the afternoon.
However, the offer count is highest between evening 7 pm to 9-9.30 pm I at night. But mat be due to again availability of the drivers, more than 25% of the offers are not accepted.
This is the time frame, where company need to work to turn bars into the green shade.
Offer Class – With Respect Avg. Distance and Driver Response (Accept) To Overall size of the Offer Class
As we can see in the following bar chart, passengers prefer to book XL category car for long distance, and hence the average distance for the class group XL is the maximum, with around 20 km.
We have also observed that, driver’s response for accepting the ride is high for VIP+ and VIP category. However, distance in km are less for VIP+ class, mainly due to premium charges.
On the other side, Economy and Standard class category group of drivers are one of the least groups for accepting the offer from the passengers with respect to both the class shares highest market share on Pie chart. Again, this might be mainly because of high rush in particular class groups due to low fairs and hence, drivers are not available to accept the incoming offer.
Speed is a feature we created based on the distance and duration time. In the following multivariant analysis we took average speed during the hour and checked which day in a week has low and high traffic. We assumed low speed means high traffic.
Based on the chart, in the earlier hours of the day avg. speed is above average and as business hours begins speed decreases below average line. However, for the Sunday as we can see – mentioned in Orange colour – speed is above average. That indicates, due to less traffic cabs are able to achieve above avg. speed on Sunday.
We can also see few outliers in the graph, from midnight to early morning hours.
Based on the Avg. Speed During Week based on Hour Key analysis, we can nor relate the driver’s response based on the traffic.
Here, 0 means Driver not accepted the offer and 1 means – Driver accepted the offer. Driver’s are not accepting offers when the Avg. distance in km is high, and Avg. Speed is low.
Also, in the bar graph we can observe that for the Monday, offers from the passengers are above average but drivers acceptance rate is lower than the average. And for the Saturday, it’s reversed.
Based on the following Pie chart Standard and Economy class groups share the biggest market share, which can easily understand from the number of offers gets to these two class – as shown in in Bar charts also.
Based on the Red and Greed divergence – we can identify that Economy class drivers are accepting less offers on all the weekdays. And for Standard class we can see that the ratio of accepting is less only on Sunday (may be due to less drivers and a holiday).
For EDA we started with missing value identification and feature summary.
## Data Summary
dim(CAX_McK)
summary(CAX_McK)
str(CAX_McK)
colSums(is.na(CAX_McK))
attach(CAX_McK)
For Target variable : driver_respove we also checked the balance between class 0 and 1.
## Driver Respond Ratio
as.matrix(prop.table(table(driver_response)))
# [,1]
# 0 0.2597694
# 1 0.7402306
For ourliers detection, under uni-variant alaysis
## Uni-variant Analysis
> summary(CAX_McK[, c(10,11,15)])
distance_km duration_min distance_driver_origin
Min. : -1.000 Min. : -1.00 Min. : 0.00000
1st Qu.: -1.000 1st Qu.: -1.00 1st Qu.: 0.00562
Median : 6.923 Median : 15.38 Median : 0.01053
Mean : 13.474 Mean : 18.93 Mean : 0.24131
3rd Qu.: 18.199 3rd Qu.: 28.93 3rd Qu.: 0.01924
Max. :9137.747 Max. :6752.48 Max. :69.39688
par(mfrow = c(2,2))
boxplot(distance_km, main = 'Distance in KM', col = 'darkolivegreen2')
boxplot(duration_min, main = 'Duration in Min', col = 'darkorchid2')
boxplot(duration_speed, main = 'Speed in KM/Min', col = 'coral')
boxplot(distance_driver_origin, main = 'Distance Order Pickup and Driver', col = 'cornflowerblue')
As we can see the outliers in the box plot for particular features, we decided to remove the outlires. Also, we decided to remove -1 observations when destination is not set. This gives skewness the dataset.
## Outliers Remove with 97% percentile
## Remove -1 outliers
CAX_McK = CAX_McK[which(duration_min >= 0 &
distance_km >= 0 &
duration_speed >= 0 &
driver_latitude >= 0.1 &
driver_longitude >= 0.1 &
origin_order_latitude >= 0 &
origin_order_longitude >= 0),]
And hence, decided to check the quantiles,
## At 95% quantile - run one more Outlier test
> quantile(distance_driver_origin, 0.95)
95%
0.03919821
> quantile(duration_min, 0.95)
95%
57.667
> quantile(distance_km, 0.95)
95%
53.406
CAX_McK = CAX_McK[which(distance_driver_origin <= 0.03919 &
duration_min <= 57.667 &
distance_km <= 53.406), ]
After removing the outlires - summary and box plot.
> ## Mean After Outlier Adjustment
> summary(CAX_McK[, c(10,11,15)])
distance_km duration_min distance_driver_origin
Min. : 0.000 Min. : 0.00 Min. :0.000000
1st Qu.: 5.504 1st Qu.:13.22 1st Qu.:0.005593
Median :11.266 Median :21.13 Median :0.010292
Mean :14.976 Mean :23.35 Mean :0.012424
3rd Qu.:20.903 3rd Qu.:31.80 3rd Qu.:0.017664
Max. :53.406 Max. :57.67 Max. :0.039189
For Bi-variant analysis we did correlation study between features.
Continious Variables - correlation analysis We decised to not to incluse, order_gk, driver_gk, offer_gk due to unique ids We also decided to not to include, Discrit variables - hour_key, weekday_key as represents categorical variables.
corrplot(cor(CAX_McK[, c(6,7,8,9,10,11,12,15,16)]), type = 'upper', order = 'hclust',
col = brewer.pal(n = 7, name = 'YlGnBu'))
Remove High Correlatated features - Multicollinearity - duration_km and distance_km. Also, Remove, Origin_Order : latitude and longitude (High corr). We decided to keep duration_speed as it is created from km and min variables.
To remove non-significant variables in continious/numbers features, we used correlation. But, to identify non-significant variables in categorical features we used chi square test. However, we also performed t-test for continious variables.
To check the significant impact on target variable we put threashold of 95%.
## For Categorical Variables
chisq.test(as.factor(offer_gk), as.factor(driver_response)) ## NOT SIGNIFICANT : p-value: 0.499
chisq.test(as.factor(weekday_key), as.factor(driver_response)) ## SIGNIFICANT
chisq.test(as.factor(hour_key), as.factor(driver_response)) ## SIGNIFICANT
chisq.test(as.factor(driver_gk), as.factor(driver_response)) ## SIGNIFICANT
chisq.test(as.factor(order_gk), as.factor(driver_response)) ## NOT SIGNIFICANT : p-value: 1
chisq.test(as.factor(offer_class_group), as.factor(driver_response)) ## SIGNIFICANT
chisq.test(as.factor(ride_type_desc), as.factor(driver_response)) ## SIGNIFICANT
## For Continious Variables
t.test(distance_km, driver_response) ## SIGNIFICANT
t.test(distance_driver_origin, driver_response) ## SIGNIFICANT
t.test(duration_min, driver_response) ## SIGNIFICANT
t.test(duration_speed, driver_response) ## SIGNIFICANT
An hence, we removed non-significant feature from the dataset.
For Normalization of variables, we ploted the histogram and data distribution.
As we can see on the plot Speed variable has slight skewness and Distance Order variable is positive skewed distribution.
For normalization, we used boxcox.lambda test on both the variables, with the use of forecast library.
## Boxcox Lambda Test
library(moments)
library(forecast)
# duration_speed
BoxCox.lambda(CAX_McK$duration_speed)
CAX_McK$duration_speed = sqrt(CAX_McK$duration_speed)
# distance_driver_origin
BoxCox.lambda(CAX_McK$distance_driver_origin)
CAX_McK$distance_driver_origin = log(CAX_McK$distance_driver_origin)
CAX_McK$distance_driver_origin = (CAX_McK$distance_driver_origin)^2
CAX_McK$distance_driver_origin = 1 / CAX_McK$distance_driver_origin
After normalization, we stored the resultes in new file - CAX_McK_Train_Clean1.csv for supervised model building.
Before model building we converted converted offer_class_group and ride_type_desc factors into dummy variables.
Due to the size of the datse set we decided to build Logistic Regression model for class prediction and probability to predict whether particular driver accepts the offer or not.
We also tried building Random FOrest, and K-nearest neighbour models, but due to size of the data, system took too much time for model completion. Hence, we build model using logit technique.
## Convert FActors into Dummy vars
offer_class = model.matrix( ~ offer_class_group - 1, data = CAX)
CAX = data.frame(CAX, offer_class)
ride_type = model.matrix( ~ ride_type_desc - 1, data = CAX)
CAX = data.frame(CAX, ride_type)
To validate the logistic regression model we created two sets from the originl datset one, for traning and second for validation.
In validation dataset we created sub-dataset where we removed target variable driver_responce to predict the variables and to validate with the actual results.
Here, development datset is train1, and validation is test22.
# Make Ratio of 30% and 70% for test and train dataset
ind = sample(2, nrow(CAX), replace = TRUE, prob = c(0.7,0.3))
train1 = CAX[ind == 1,]
test1 = CAX[ind == 2,]
test22 = test1[, -c(8)] ## Remove Targer Var for Test
We trained logit model for all the features. And then based on the significant importance, we removed few variables from the model to improve accuracy.
## Build Logit Model
CAX_logit = glm(train1$driver_response ~ . , data = train1, family = binomial())
summary(CAX_logit)
train1 = train1[, -c(1,3,7,12,14,15,16,17,18,19,20)]
Summary of the model,
> summary(CAX_logit)
Call:
glm(formula = train1$driver_response ~ ., family = binomial(),
data = train1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0278 -1.0333 0.6420 0.7844 1.5852
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 65.8515313 2.7365527 24.064 < 2e-16 ***
hour_key 0.0018727 0.0005738 3.264 0.0011 **
driver_latitude -1.9185328 0.0383713 -49.999 < 2e-16 ***
driver_longitude 1.1886665 0.0310260 38.312 < 2e-16 ***
duration_speed -2.7140158 0.0280469 -96.767 < 2e-16 ***
offer_class_groupDelivery -0.3466691 0.0382926 -9.053 < 2e-16 ***
offer_class_groupEconomy -0.5951427 0.0226766 -26.245 < 2e-16 ***
offer_class_groupKids 0.3604382 0.0543878 6.627 3.42e-11 ***
offer_class_groupStandard -0.1903428 0.0231063 -8.238 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 453283 on 399423 degrees of freedom
Residual deviance: 429901 on 399415 degrees of freedom
AIC: 429919
Number of Fisher Scoring iterations: 4
For model performance we used confusion matrix and AUC (Area Under Curve)
## Prediction Test
test22$predict = predict.glm(CAX_logit, test22, type = 'response')
test22$predict_class = round(test22$predict)
confusionMatrix(as.factor(test22$predict_class), as.factor(test1$driver_response))
Here, we created two prediction, 1. test22$predict for probability (continious) and 2. test22$predict_class for class (1 and 0).
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4125 1481
1 39200 126241
Accuracy : 0.7622
95% CI : (0.7601, 0.7642)
No Information Rate : 0.7467
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.1174
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.09521
Specificity : 0.98840
Pos Pred Value : 0.73582
Neg Pred Value : 0.76306
Prevalence : 0.25329
Detection Rate : 0.02412
Detection Prevalence : 0.03277
Balanced Accuracy : 0.54181
'Positive' Class : 0
As we can see the accuracy of the model to predict correct - driver accept the offer (class 1) or not (class 0) is 76.2%.
- Sensitivity means : True positive rate 9.5%: cases correctly identified as driver would not accept the offer. [Need to be higher, we wnat to know more about specific driver that is not accepting the rides]
- Specificity meand : True Negative rate 98.8%: cases correctly identified as driver would accept the offer.
In this case if we calculate Precison and Recall it would be, 0.735 and 0.095 respectively. Bur, please note, this will give us for drivers not accepting ride. NOw, if I reverce the situtaion, then for drivers accepting the offer would be changed.
In that particulat case (driver accept the offer) my Precision would be 0.763 and Recall would be 0.988. Hence, we can say that 76.3% times we got correct predicting driver accepted the offer and got accepted, and 98.8% we got correct predicting that driver accepts the offer.
- False Positive Rate (FPR): 1 - Specificity = 1.2%
- False Negative Rate (FNR): 1 - Sensitivity = 90.4%
False positive rate means, actually driver accepted the offer, but our model shows driver denied the offer. And FAlse negative rate means driver denied the offer, but our model predicts drived accepted the ride. [Needs to be small!]
Now, from our perspective FNR should be as small as possible for cab ride company. Because, if model predicts drivers acepts the offer and actually there is no driver to accept the offer / or driver denied the ride offer, means loses of revenue for the cab ride company. However, on the otherside cab ride company migh be in big misconception, that driver woudl accept the offer, but actually driver denied the offer. And this might lead to mismanagement for company while matching cab demands according to areas.
In the following graph, we have shown two ROC (Receiver Operating Characteristic Curve) and AUC are 0.7494.
After the Logistic Regression, we build XG Boost model. Here we have presented direct results from the model.
For 500 iteration we plot, mlogloss rate for train in blue dots and test in red line.
For XGBOOST model we performed confusion martix an ROC curve.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 15470 3568
1 28109 124139
Accuracy : 0.8151
95% CI : (0.8132, 0.8169)
No Information Rate : 0.7456
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4015
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.35499
Specificity : 0.97206
Pos Pred Value : 0.81259
Neg Pred Value : 0.81537
Prevalence : 0.25442
Detection Rate : 0.09032
Detection Prevalence : 0.11115
Balanced Accuracy : 0.66352
'Positive' Class : 0
Here as we can see, model accuracy is higher than Logistic regression model. ALso, our focus is on drivers accepting the ride offer, in that case we have specificity at 97%.
Over all model performed better on the sae dataset compared to logistic regression model.
It's not only limited to accurcy, but on AUC also we improved with 0.814. Followinf is the ROC curve,
- For Feature Analysis we used Tableau tool.
- Studied around 1 million observations, and hence, we were forced to used Logistic Regression - due to system limitation.
- Planning to build one more model Light GBM.
- Initially, we try to built Random Forest and Knn models on the dataset, but due to system process limitation, need to drop the idea.
- Knn we tried with initial k = 5 and hence, it's out of question that model analysis over half a million cases 5 times and build Knn.
- In case, if regression (Logistic) is not used. May be we woudl not remove highly correlated predictor variables. (Need to check). Becauses, multicollinerity adds sensitivity to minor changes in the model.
- Due to okay accuarcy, we got slight poor AUC. Need to correct!
- For correctiuon, performed XGBOOST - and improved overall results on the same dataset.
Crowd AnalytiX (CAX) - McKinsey Big Data Hackathon