
MyAnimeList Data Insights

Description

This capstone project for the DTC Data Engineering Zoomcamp 2024 cohort seeks to derive insights and build dashboards from anime data on MyAnimeList. According to its own description, MyAnimeList is "the world's most active online anime and manga community and database".

Dataset

The project uses a Kaggle dataset. The following files from the dataset are used:

  • anime-dataset-2023.csv - "This dataset contains comprehensive details of 24,905 anime entries."
  • users-details-2023.csv - "This dataset comprises information on 731,290 users registered on the MyAnimeList platform. It is worth noting that while a significant portion of these users are genuine anime enthusiasts, there may be instances of bots, inactive accounts, and alternate profiles present within the dataset."
  • users-score-2023.csv - "This dataset comprises anime scores provided by 270,033 users, resulting in a total of 24,325,191 rows or samples."

Problem Description

This project aims to answer the following questions regarding the dataset:

  • What are the most common genres in anime?
  • Which licensors have the most anime licensed/co-licensed?
  • Which producers or co-producers are most prevalent in the anime industry?
  • How do average anime ratings vary between seasons?

Pipeline

(Pipeline diagram)

Installation

The following programs need to be installed if they are not already present on your machine: Docker Desktop, Git, and Terraform.

Tip

On a Windows machine, these programs can also be installed from a command-line interface (CLI), assuming winget is already installed.

winget install --id=Docker.DockerDesktop  -e
winget install --id=Git.Git  -e
winget install --id=Hashicorp.Terraform  -e

GCP

  1. Create a Google Cloud Project via console.
  2. Set up Application Default Credentials using service account keys by creating a JSON key via console (a gcloud CLI alternative for steps 2 and 3 is sketched after this list).
  3. Grant the following roles to the service account:
    • Viewer
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin
    • Dataproc Administrator
  4. Check if billing is enabled on the newly created project.
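
If you prefer the CLI over the console, steps 2 and 3 can be done roughly as sketched below. This is a minimal sketch, not taken from this repository: the service account name, key path, and YOUR_PROJECT_ID placeholder are assumptions.

# Create a service account (the name "mage-zoomcamp" is a placeholder).
gcloud iam service-accounts create mage-zoomcamp \
  --display-name="Mage capstone service account"

# Grant the roles listed above to the service account.
for role in roles/viewer roles/bigquery.admin roles/storage.admin \
            roles/storage.objectAdmin roles/dataproc.admin; do
  gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:mage-zoomcamp@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="$role"
done

# Create the JSON key used for Application Default Credentials.
gcloud iam service-accounts keys create ~/keys/mage-zoomcamp.json \
  --iam-account="mage-zoomcamp@YOUR_PROJECT_ID.iam.gserviceaccount.com"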

Git

  1. Clone the Git repository:
git clone https://github.com/nishiikata/de-zoomcamp-2024-mage-capstone.git

Terraform

  1. An alternative to configuring the Google provider credentials inside the terraform block is to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON key file.
  2. Change directory into the terraform folder and configure the necessary variables needed to create a terraform plan.

Note

The dev.tfvars file can be used as a reference for which Terraform variables need to be configured.

Save the updated file as terraform.tfvars or any file name ending in .auto.tfvars so that Terraform will automatically load the configuration by default.

  3. Create a Terraform plan (see the shell sketch after the caution below).

Caution

It is recommended to let Terraform use the default values set in locals.data_lake_bucket and var.bq_dataset_name. Otherwise, the corresponding values will also need to be changed in the Mage orchestrator setup.
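
Putting the Terraform steps together, the workflow might look like the following sketch. The key path and the exact folder and file locations are assumptions; terraform.tfvars is created from dev.tfvars as described in the note above.

# Point the Google provider at the service account key
# (skip if GOOGLE_APPLICATION_CREDENTIALS is already set system-wide).
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/mage-zoomcamp.json"

cd terraform
cp dev.tfvars terraform.tfvars   # then edit the values as needed

terraform init
terraform plan                   # terraform.tfvars is loaded automatically
terraform apply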

Docker

  1. Set the necessary environment variables required for Docker Compose in a .env file.

Note

The dev.env file can be used as a reference for which environment variables need to be configured.

The Google-related environment variables provisioned by Terraform may look like this:

# GOOGLE_APPLICATION_CREDENTIALS does not have to be set inside the .env file
# if it was already set on the current operating system
# during the terraform configuration process
GCS_BUCKET="myanimelist_data_lake_[YOUR PROJECT]"
GCLOUD_PROJECT=[YOUR PROJECT]
  2. Run the gcs_connector.sh script to download a JAR file that enables the PySpark app to connect to Google Cloud Storage (GCS).
  3. Start the Docker containers via docker compose up (see the sketch below).
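
A minimal sketch of the Docker steps above, assuming the dev.env file and the gcs_connector.sh script live in the repository root (the exact paths are assumptions):

cp dev.env .env          # then fill in GCS_BUCKET, GCLOUD_PROJECT, etc.
bash gcs_connector.sh    # downloads the GCS connector JAR used by PySpark
docker compose up -d     # -d runs the containers in the background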

Usage

Starting the Mage Pipeline

Only the initial_pipeline pipeline needs to be manually started. The remaining pipelines are set to be automatically triggered from a block. The initial_pipeline can be triggered through any of the following:

  • Using Mage UI (Schedule with Run exactly once)
  • Executing the provided start_initial_pipeline.sh script.

Important

The jq command-line tool is required by the start_initial_pipeline.sh script, which uses it to process JSON objects on the CLI. If you can't install jq on your machine, you can run the initial_pipeline manually instead.

On Windows you can also install jq via winget (this is the method I used on my personal Windows machine):

winget install --id=jqlang.jq  -e

For other operating systems such as Linux and macOS, you can use the official download page.
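
Alternatively, jq is typically available from the system package manager. The commands below are the standard install commands for those platforms and are not specific to this project.

# Debian/Ubuntu
sudo apt-get install -y jq

# macOS (Homebrew)
brew install jq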

Run Mage in Cloud VM

Refer to the Mage AI documentation for instructions on how to deploy a Mage AI instance to cloud services such as GCP Cloud Run.

Dashboard

(Dashboard screenshot)

The Looker Studio dashboard can be found here.
