title | author | date | output |
---|---|---|---|
README | Dusan Grubjesic email: [email protected] | August 11, 2015 | html_document |
This is a click-rate prediction algorithm built on Apache Spark and written with PySpark, Spark's Python API.
The data comes from Criteo Labs and is a sample of the Kaggle Display Advertising Challenge Dataset. It can be downloaded after you accept the agreement at http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/.
It is structured as lines of observations: the first field of each line is the label (1 for click, 0 for no click) and the remaining fields are features.
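As a rough illustration of that layout, here is a minimal sketch of splitting one observation into its label and features. The function name `parse_line` and the tab separator are assumptions made for this example; check the downloaded sample for the actual delimiter, and see ClickRate.py for how the project really parses the data.

```python
def parse_line(line, sep="\t"):
    """Split one observation into (label, features).

    Assumes the first field is the click label (1 or 0) and that
    fields are separated by `sep` (tab here, which is an assumption).
    """
    fields = line.rstrip("\n").split(sep)
    label = int(fields[0])
    # Features are kept as raw strings; empty strings mark missing values.
    features = fields[1:]
    return label, features


# Hypothetical line, not taken from the real dataset.
label, features = parse_line("1\t5\t\t68fd1e64\n")
```

In PySpark, a function like this would typically be passed to `map` on the RDD returned by `sc.textFile(...)`.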
You must have Apache Spark and Python installed. You also have to change the location of the sample in ClickRate.py to where you downloaded it, and change the Spark context if you want to switch from local mode to a cluster. The .sh file is only there to make starting simpler; if you want to use it, you have to adjust it to your own settings.
I used Apache Spark pre-built for Hadoop 2.6, Python 3.4, and the NumPy package.
- The sample is first parsed and loaded into the Spark context.
- It is then transformed so it can be used in logistic regression.
- A model is created from the training data.
- A set of log loss validations is run.
- Logistic regression is repeated over several iterations to find the best hyperparameters.

Additional explanations are in the code.
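
For readers unfamiliar with the log loss used in the validation step, the following is a minimal pure-Python sketch of the standard definition. The names `log_loss` and `mean_log_loss` and the clipping constant `eps` are assumptions for this example and are not necessarily what ClickRate.py uses.

```python
import math


def log_loss(p, y, eps=1e-11):
    """Log loss of one predicted click probability p against label y (1 or 0)."""
    # Clip p away from 0 and 1 so the logarithm stays finite.
    p = min(max(p, eps), 1 - eps)
    return -math.log(p) if y == 1 else -math.log(1 - p)


def mean_log_loss(preds, labels):
    """Average log loss over paired predictions and labels."""
    return sum(log_loss(p, y) for p, y in zip(preds, labels)) / len(labels)
```

A lower mean log loss on held-out data indicates better-calibrated click probabilities, so a hyperparameter search would keep the setting with the lowest validation value.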