
MyAnimeList Data Insights

Description

This capstone project for the DTC Data Engineering Zoomcamp 2024 cohort seeks to derive insights and build dashboards from anime data on MyAnimeList. According to its own description, MyAnimeList is "the world's most active online anime and manga community and database".

Dataset

The project uses a Kaggle dataset. The following files from the dataset are used:

  • anime-dataset-2023.csv - "This dataset contains comprehensive details of 24,905 anime entries."
  • users-details-2023.csv - "This dataset comprises information on 731,290 users registered on the MyAnimeList platform. It is worth noting that while a significant portion of these users are genuine anime enthusiasts, there may be instances of bots, inactive accounts, and alternate profiles present within the dataset."
  • users-score-2023.csv - "This dataset comprises anime scores provided by 270,033 users, resulting in a total of 24,325,191 rows or samples."

Problem Description

This project aims to answer the following questions regarding the dataset:

  • What are the most common genres in anime?
  • Which licensors have the most anime licensed/co-licensed?
  • Which producers or co-producers are most prevalent in the anime industry?
  • How do average anime ratings vary between seasons?

Pipeline

(Pipeline diagram)

Installation

The following programs need to be installed if they are not already present on your machine: Docker Desktop, Git, and Terraform.

Tip

On a Windows machine, these programs can also be installed from a command-line interface (CLI), assuming winget is already installed.

winget install --id=Docker.DockerDesktop  -e
winget install --id=Git.Git  -e
winget install --id=Hashicorp.Terraform  -e

GCP

  1. Create a Google Cloud Project via console.
  2. Set up Application Default Credentials using service account keys by creating a JSON key via console (a gcloud CLI alternative for steps 2 and 3 is sketched after this list).
  3. Grant the following roles to the service account:
    • Viewer
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin
    • Dataproc Administrator
  4. Check if billing is enabled on the newly created project.
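
If you prefer the CLI over the console, steps 2 and 3 can be done roughly as sketched below. This is a minimal sketch, not taken from this repository: the service account name, key path, and YOUR_PROJECT_ID placeholder are assumptions.

# Create a service account (the name "mage-zoomcamp" is a placeholder).
gcloud iam service-accounts create mage-zoomcamp \
  --display-name="Mage capstone service account"

# Grant the roles listed above to the service account.
for role in roles/viewer roles/bigquery.admin roles/storage.admin \
            roles/storage.objectAdmin roles/dataproc.admin; do
  gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:mage-zoomcamp@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="$role"
done

# Create the JSON key used for Application Default Credentials.
gcloud iam service-accounts keys create ~/keys/mage-zoomcamp.json \
  --iam-account="mage-zoomcamp@YOUR_PROJECT_ID.iam.gserviceaccount.com"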

Git

  1. Clone the Git repository:
git clone https://github.com/nishiikata/de-zoomcamp-2024-mage-capstone.git

Terraform

  1. An alternative to configuring the Google provider credentials inside the terraform block is to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON key file.
  2. Change directory into the terraform folder and configure the necessary variables needed to create a terraform plan.

Note

The dev.tfvars file can be used as a reference for which Terraform variables need to be configured.

Save the updated file as terraform.tfvars or any file name ending in .auto.tfvars so that Terraform will automatically load the configuration by default.

  3. Create a Terraform plan (see the shell sketch after the caution below).

Caution

It is recommended to let Terraform use the default values set in locals.data_lake_bucket and var.bq_dataset_name. Otherwise, the corresponding values will also need to be changed in the Mage orchestrator setup.
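
Putting the Terraform steps together, the workflow might look like the following sketch. The key path and the exact folder and file locations are assumptions; terraform.tfvars is created from dev.tfvars as described in the note above.

# Point the Google provider at the service account key
# (skip if GOOGLE_APPLICATION_CREDENTIALS is already set system-wide).
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/mage-zoomcamp.json"

cd terraform
cp dev.tfvars terraform.tfvars   # then edit the values as needed

terraform init
terraform plan                   # terraform.tfvars is loaded automatically
terraform apply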

Docker

  1. Set the necessary environment variables required for Docker Compose in a .env file.

Note

The dev.env file can be used as a reference for which environment variables need to be configured.

The Google-related environment variables provisioned by Terraform may look like this:

# GOOGLE_APPLICATION_CREDENTIALS does not have to be set inside the .env file
# if it was already set on the current operating system
# during the terraform configuration process
GCS_BUCKET="myanimelist_data_lake_[YOUR PROJECT]"
GCLOUD_PROJECT=[YOUR PROJECT]
  2. Run the gcs_connector.sh script to download a JAR file that enables the PySpark app to connect to Google Cloud Storage (GCS).
  3. Start the Docker containers via docker compose up (see the sketch below).
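
A minimal sketch of the Docker steps above, assuming the dev.env file and the gcs_connector.sh script live in the repository root (the exact paths are assumptions):

cp dev.env .env          # then fill in GCS_BUCKET, GCLOUD_PROJECT, etc.
bash gcs_connector.sh    # downloads the GCS connector JAR used by PySpark
docker compose up -d     # -d runs the containers in the background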

Usage

Starting the Mage Pipeline

Only the initial_pipeline pipeline needs to be manually started. The remaining pipelines are set to be automatically triggered from a block. The initial_pipeline can be triggered through any of the following:

  • Using Mage UI (Schedule with Run exactly once)
  • Executing the provided start_initial_pipeline.sh script.

Important

The jq command-line tool is required by the start_initial_pipeline.sh script, which uses it to process JSON objects on the CLI. If you can't install jq on your machine, you can run the initial_pipeline manually instead.

On Windows you can also install jq via winget (this is the method I used on my personal Windows machine):

winget install --id=jqlang.jq  -e

For other operating systems such as Linux and macOS, you can use the official download page.
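
Alternatively, jq is typically available from the system package manager. The commands below are the standard install commands for those platforms and are not specific to this project.

# Debian/Ubuntu
sudo apt-get install -y jq

# macOS (Homebrew)
brew install jq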

Run Mage in Cloud VM

Refer to the Mage AI documentation for instructions on how to deploy a Mage AI instance to cloud services such as GCP Cloud Run.

Dashboard

(Dashboard screenshot)

The Looker Studio dashboard can be found here.
