The project aims to derive insights and build dashboards from MyAnimeList anime data. According to its website description, MyAnimeList is "the world's most active online anime and manga community and database".
The project uses a Kaggle dataset, from which the following files are used:
- `anime-dataset-2023.csv`: "This dataset contains comprehensive details of 24,905 anime entries."
- `users-details-2023.csv`: "This dataset comprises information on 731,290 users registered on the MyAnimeList platform. It is worth noting that while a significant portion of these users are genuine anime enthusiasts, there may be instances of bots, inactive accounts, and alternate profiles present within the dataset."
- `users-score-2023.csv`: "This dataset comprises anime scores provided by 270,033 users, resulting in a total of 24,325,191 rows or samples."
This project aims to answer the following questions regarding the dataset:
- What are the most common genres in anime?
- Which licensors have the most anime licensed/co-licensed?
- Which producers or co-producers are prevalent among the anime industry?
- How do average anime ratings vary between seasons?
The following programs need to be installed if they are not already present on the user's computer: Docker Desktop, Git, and Terraform.
Tip
On a Windows machine, these programs can also be installed from a command-line interface (CLI), assuming winget is already installed:

```sh
winget install --id=Docker.DockerDesktop -e
winget install --id=Git.Git -e
winget install --id=Hashicorp.Terraform -e
```
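To confirm the installations, each tool can report its version:

```sh
docker --version
git --version
terraform -version
```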
- Create a Google Cloud Project via console.
- Set up Application Default Credentials using service account keys by creating a JSON key via console.
- Grant the following roles:
- Viewer
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- Dataproc Administrator
- Check if billing is enabled on the newly created project.
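If you prefer the CLI, the service account, roles, and JSON key from the steps above can also be provisioned with gcloud; a minimal sketch, assuming the gcloud CLI is installed and authenticated (the service account name and project ID are placeholders):

```sh
# Placeholder service account name; substitute your own project ID
gcloud iam service-accounts create mage-capstone-sa --project=<YOUR_PROJECT>

# Grant one role per command; repeat for roles/viewer, roles/storage.admin,
# roles/storage.objectAdmin, and roles/dataproc.admin
gcloud projects add-iam-policy-binding <YOUR_PROJECT> \
  --member="serviceAccount:mage-capstone-sa@<YOUR_PROJECT>.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

# Create the JSON key used for Application Default Credentials
gcloud iam service-accounts keys create key.json \
  --iam-account="mage-capstone-sa@<YOUR_PROJECT>.iam.gserviceaccount.com"
```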
- Clone the git repository:

```sh
git clone https://github.com/nishiikata/de-zoomcamp-2024-mage-capstone.git
```
- An alternative to configuring the Google provider's credentials inside the terraform block is to set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of the JSON key file.
- Change directory into the terraform folder and configure the variables needed to create a Terraform plan.
Note
The `dev.tfvars` file can be used as a reference for which Terraform variables need to be configured. Save the updated file as `terraform.tfvars`, or under any file name ending in `.auto.tfvars`, so that Terraform automatically loads the configuration by default.
Caution
It is recommended to let Terraform use the default values set in `locals.data_lake_bucket` and `var.bq_dataset_name`. Otherwise, values may need to be changed inside the Mage orchestrator setup.
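Putting the Terraform steps together, the workflow might look like the following sketch (the key path is a placeholder):

```sh
cd terraform
# Alternative to configuring credentials inside the terraform block
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
terraform init
terraform plan    # terraform.tfvars / *.auto.tfvars are loaded automatically
terraform apply
```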
- Set the necessary environment variables required for Docker Compose in a `.env` file.
Note
The `dev.env` file can be used as a reference for which environment variables need to be configured.
The Google-related environment variables provisioned by Terraform may look like this:

```sh
# GOOGLE_APPLICATION_CREDENTIALS does not have to be set inside the .env file
# if it was already set on the current operating system
# during the terraform configuration process
GCS_BUCKET="myanimelist_data_lake_[YOUR PROJECT]"
GCLOUD_PROJECT=[YOUR PROJECT]
```
- Run the `gcs_connector.sh` script to download a jar file that enables the PySpark app to connect to Google Cloud Storage (GCS); see the sketch after these steps.
- Start the Docker containers via `docker compose up`.
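For reference, a PySpark session typically wires in the GCS connector jar along these lines; a minimal sketch, where the jar path, key path, and bucket name are assumptions rather than the project's actual configuration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("myanimelist")
    # Jar downloaded by gcs_connector.sh (assumed location)
    .config("spark.jars", "/opt/spark/jars/gcs-connector-hadoop3-latest.jar")
    # Register the gs:// filesystem implementation
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # Authenticate with the service account JSON key (assumed path)
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")
    .getOrCreate()
)

# Read a dataset file straight from the data lake bucket
df = spark.read.csv(
    "gs://myanimelist_data_lake_<YOUR_PROJECT>/anime-dataset-2023.csv",
    header=True,
    inferSchema=True,
)
df.show(5)
```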
Only the `initial_pipeline` pipeline needs to be started manually; the remaining pipelines are set to be triggered automatically from a block. The `initial_pipeline` can be triggered through any of the following:
- Using the Mage UI (schedule with "Run exactly once")
- Executing the provided `start_initial_pipeline.sh` script
Important
The jq command-line tool is required to use the `start_initial_pipeline.sh` script, since processing JSON objects on a CLI without it is painful 🙃😭. You can run the `initial_pipeline` manually if you can't install jq on your machine.
On Windows, jq can also be installed via winget (this is the method I used on my personal Windows machine):

```sh
winget install --id=jqlang.jq -e
```
For other operating systems, such as Linux and macOS, use the official download page.
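In a script like `start_initial_pipeline.sh`, jq's job is typically to pull fields out of JSON returned by the Mage API. A purely hypothetical example (the endpoint and field names are illustrative, not the actual script contents):

```sh
# Hypothetical: list pipeline schedules from a local Mage instance
# and extract the id of the first one with jq
curl -s http://localhost:6789/api/pipeline_schedules \
  | jq '.pipeline_schedules[0].id'
```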
Refer to the Mage AI documentation for instructions on how to deploy a Mage AI instance to cloud services such as GCP Cloud Run.
The Looker Studio dashboard can be found here.