GitHub - nlim/spark-test: Example joining, grouping and aggregating using DataFrame API

You are productizing a batch machine learning pipeline that makes product recommendations. Transaction data from all customer stores is available as an input in the following format:

customer_uuid     date         store_id product_sku  item_count usd_total
34a1a589-cc39-... 2018-01-01.. 18       12421        1          15.25

In addition, product catalog data is also available:

product_sku  product_category
12421-345    Food
10343-124    Latte
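
As a rough illustration of how the two inputs relate, the sketch below left-joins transactions to the catalog and flags each row as food or non-food. It uses plain Python dictionaries rather than the DataFrame API, and it rests on several assumptions that the task does not state: the function name `label_transactions` and the `is_food` field are hypothetical, only the category `Food` counts as food, SKUs missing from the catalog are treated as non-food, and SKUs are assumed to match exactly between the two inputs (the sample rows above show `12421` and `12421-345`, so a real pipeline may need a normalization step).

```python
def label_transactions(txns, catalog, food_category="Food"):
    """Attach product_category and an is_food flag to every transaction.

    txns:    list of dicts with at least a "product_sku" key (hypothetical shape)
    catalog: dict mapping product_sku -> product_category
    """
    labeled = []
    for txn in txns:
        # Left join: a SKU that is absent from the catalog yields None.
        category = catalog.get(txn["product_sku"])
        labeled.append({
            **txn,
            "product_category": category,
            # Assumption: only the "Food" category is food; unknown SKUs are non-food.
            "is_food": category == food_category,
        })
    return labeled
```

Keeping unmatched transactions (instead of dropping them with an inner join) means their spend would still count toward weekly_spend later on.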

Design a batch data processing pipeline that runs nightly and computes features for the ML model to train on. Specifically, for every customer, calculate the following:

weekly_food_purchases, weekly_nonfood_purchases: for the past 8-week period (56 days), how many food and non-food items each customer has purchased per week (on average)
weekly_spend: for the past 8-week period (56 days), how much money each customer has spent per week (on average)
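
The arithmetic behind these features can be sketched in plain Python, as a stand-in for the grouping and aggregation a DataFrame job would do. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name `weekly_features` is hypothetical, each input row is assumed to already carry a boolean `is_food` field and a `datetime.date` in `date`, the window is taken as the 56 days ending the day before the nightly run, and "per week (on average)" is read as the 56-day total divided by 8.

```python
from collections import defaultdict
from datetime import timedelta

WINDOW_DAYS = 56
WEEKS = WINDOW_DAYS // 7  # 8 weeks


def weekly_features(txns, run_date):
    """Return {customer_uuid: {feature_name: value}} for the trailing 56 days."""
    # Assumption: the window is [run_date - 56 days, run_date), excluding run_date.
    start = run_date - timedelta(days=WINDOW_DAYS)
    totals = defaultdict(lambda: {"food": 0, "nonfood": 0, "spend": 0.0})

    for txn in txns:
        if not (start <= txn["date"] < run_date):
            continue  # outside the 8-week window
        acc = totals[txn["customer_uuid"]]
        acc["food" if txn["is_food"] else "nonfood"] += txn["item_count"]
        acc["spend"] += txn["usd_total"]

    # Assumption: average over all 8 weeks, including weeks with no purchases.
    return {
        customer: {
            "weekly_food_purchases": acc["food"] / WEEKS,
            "weekly_nonfood_purchases": acc["nonfood"] / WEEKS,
            "weekly_spend": acc["spend"] / WEEKS,
        }
        for customer, acc in totals.items()
    }
```

Customers with no transactions in the window do not appear in the result; whether they should instead get zero-valued features is a design choice the task leaves open.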

For a bonus task, compute weekly_visits: how many times each customer has visited the store per week (on average).
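
The task does not define what counts as a visit, so the sketch below picks one interpretation and should be read as an assumption: a visit is a distinct (store_id, calendar day) pair for a customer, so several items bought at the same store on the same day count once. The function name `weekly_visits` is hypothetical, `date` is assumed to be a `datetime.date`, and the same 56-day window ending the day before the run is divided by 8 weeks.

```python
from collections import defaultdict
from datetime import timedelta

WINDOW_DAYS = 56
WEEKS = WINDOW_DAYS // 7  # 8 weeks


def weekly_visits(txns, run_date):
    """Return {customer_uuid: average visits per week} for the trailing 56 days."""
    start = run_date - timedelta(days=WINDOW_DAYS)
    visits = defaultdict(set)

    for txn in txns:
        if start <= txn["date"] < run_date:
            # Assumption: one visit per distinct (store, day) pair.
            visits[txn["customer_uuid"]].add((txn["store_id"], txn["date"]))

    return {customer: len(pairs) / WEEKS for customer, pairs in visits.items()}
```

If the transaction timestamp carries a time of day, a finer-grained definition (for example, grouping purchases that are close together in time) would be possible, at the cost of a more involved aggregation.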

