This project has been created for the new startup called Sparkify.
The purpose of the project is understanding what songs users are listening and analyzing the data based on songs and user activity on the app.
Milestones of this project:
- Build an ETL pipeline using Python
- Define fact and dimension tables for a star schema
- Create an ETL pipeline to transfer data from files to tables
- songs: fact data of the songs in the app
- artists: fact data of the artists in the app
- users: fact data of the app users
- time: timestamp of user activity data
- songplays: user listening activity data
- Install postgres and psycopg2-binary
- Point postgres to the host port number and host name
- Run create_tables.py to create the schema.
- This will run sql_queries.py script to create & drop tables.
- Run etl.py to create to populate the tables.
- This will populate the tables with sql_queries.py
- Data
- log_data
- song_data
- README.md
- create_tables.py
- etl.ipynb
- etl.py
- sql_queries.py
- test.ipynb
EX Query:
SELECT Count(*) agent_count, user_agent FROM SongPlays
GROUP BY user_agent