Skip to content

Example joining, grouping and aggregating using DataFrame API

Notifications You must be signed in to change notification settings

nlim/spark-test

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

You are productizing a batch machine learning pipeline that makes product recommendations. Transaction data from all customer stores is available as an input in the following format:

customer_uuid     date         store_id product_sku  item_count usd_total
34a1a589-cc39-... 2018-01-01.. 18       12421        1          15.25

In addition, product catalog data is also available:

product_sku  product_category
12421-345                Food
10343-124               Latte

Design a batch data processing pipeline that runs nightly and computes features for the ML model to train on. Specifically, for every customer, calculate the following:

  1. weekly_food_purchases, weekly_nonfood_purchases: for the past 8 week period (56 days), how many food and non-food items has each customer purchased per week (on average)

  2. weekly_spend: for the past 8 week period (56 days), how much money has each customer spend per week (on average)

For a bonus task, compute weekly_visits, how many times has each customer visited the store per week (on average).

About

Example joining, grouping and aggregating using DataFrame API

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages