Adds architecture documentation
manasaV3 committed Dec 20, 2024
1 parent 408f934 commit 2864e7f
Showing 4 changed files with 125 additions and 0 deletions.
Binary file added docs/CryoET_Architecture_Diagram.png
Binary file added docs/CryoET_Backend_Workflow.png
63 changes: 63 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,63 @@
# Architecture

This document provides an overview of the architecture of the cryoET applications. The backend stack supports data ingestion and the maintenance of the GraphQL API.


## Account Setup

The cryoET backend is distributed across three AWS accounts, each serving a different purpose:

### dev account
- Hosts all the source data.
- All data standardization happens in this account.
- Includes the dev and staging environments.
### public account
- Hosts the publicly accessible S3 bucket and the standardized production data.
### prod account
- Hosts the production environment for the frontend and the GraphQL API.


## Architecture diagram

<img width="1500" alt="CryoET Architecture Diagram V1 0" src="./CryoET_Architecture_Diagram.png">

The architecture diagram is also available on Figma [here](https://www.figma.com/board/detNDlCcfgIWldTIbNU74X/CryoET-Architecture-Diagram-V1.0?node-id=0-1&t=Z9gE7KA8ouCu2blp-1).


## Components


### Batch Job Orchestrator

This component orchestrates the batch jobs that execute the workflows. This is done using [swipe](https://github.com/chanzuckerberg/swipe), which relies on the following AWS managed services:

#### AWS Batch

This is used for executing the workflow jobs. You can read more about AWS Batch [here](https://aws.amazon.com/batch/).
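
As a rough illustration, a workflow job could be submitted to AWS Batch with boto3 as sketched below; the job queue, job definition, and command are placeholders rather than the actual swipe-managed resources.

```python
import boto3

batch = boto3.client("batch", region_name="us-west-2")

# Hypothetical queue/definition names -- the real ones are created by swipe's infrastructure.
response = batch.submit_job(
    jobName="ingest-dataset-10001",
    jobQueue="cryoet-ingestion-spot-queue",
    jobDefinition="cryoet-ingestion-job:1",
    containerOverrides={
        "command": ["python", "run_ingestion.py", "--dataset", "10001"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```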

#### AWS Step Functions

This is used for orchestrating the workflow execution. Swipe first tries to execute a job on a spot instance and, if the instance is terminated by AWS, retries the job on an on-demand instance. You can read more about AWS Step Functions [here](https://aws.amazon.com/step-functions/).
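
In practice swipe drives this through its own state machines; the sketch below only shows how an execution might be started with boto3, with a hypothetical state machine ARN and input shape.

```python
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-west-2")

# Hypothetical ARN and input; swipe defines the real state machine and its input schema,
# including the spot -> on-demand fallback behaviour.
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-west-2:123456789012:stateMachine:swipe-cryoet-ingestion",
    name="ingest-dataset-10001",
    input=json.dumps({"dataset_id": "10001"}),
)
print("Started execution:", execution["executionArn"])
```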


### Data Storage - AWS S3
S3 is used for storing the source data from data generators, the processed data from ingestion, and the production data that is shared with the public. You can read more about AWS S3 [here](https://aws.amazon.com/s3/).

In the dev account, the source data is stored in a private bucket.

The standardized data can be of two types:
1. Data that is ready for public access. This is stored in the staging bucket in the dev account, which allows limited public access.
2. Data that needs to be embargoed from public access. This is stored in a private bucket in the dev account until it is ready for public access.

The production data is stored in a publicly accessible `cryoet-data-portal-public` bucket.
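
Because the production bucket is public, its contents can be read without AWS credentials. Below is a minimal sketch using boto3 with unsigned requests; the key layout implied by the listing is only indicative.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: no AWS credentials are needed for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the top-level prefixes in the public production bucket.
response = s3.list_objects_v2(
    Bucket="cryoet-data-portal-public",
    Delimiter="/",
    MaxKeys=25,
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```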

### Database - AWS Aurora
The metadata for the data is stored in a Postgres database. AWS Aurora is used for running the managed database. You can read more about AWS Aurora [here](https://aws.amazon.com/rds/Aurora).
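
A minimal sketch of connecting to the Postgres database with SQLAlchemy is shown below; the endpoint, credentials, and table name are placeholders, and the real applications manage their own connection configuration.

```python
from sqlalchemy import create_engine, text

# Hypothetical connection details; the real Aurora endpoint and credentials live in AWS.
engine = create_engine(
    "postgresql+psycopg2://cryoet:<password>@cryoet-aurora.cluster-xxxx.us-west-2.rds.amazonaws.com:5432/cryoet"
)

with engine.connect() as conn:
    # Example metadata query; the table name here is illustrative only.
    count = conn.execute(text("SELECT COUNT(*) FROM datasets")).scalar_one()
    print("datasets:", count)
```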


### Compute - Elastic Kubernetes Service (EKS)
The servers for running the graphql API and frontend applications are managed through AWS Elastic Kubernetes Service (EKS). You can read more about AWS EKS [here](https://aws.amazon.com/eks/).


### Content Delivery Network (CDN) - AWS CloudFront
CloudFront is used for serving the data stored in the S3 bucket over HTTP. It is also used as a proxy to the API servers. You can read more about AWS CloudFront [here](https://aws.amazon.com/cloudfront/).
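
As a rough sketch, a client could fetch an object over HTTP through the CDN as below; the domain and key are placeholders, since the real distribution simply fronts the public S3 bucket and the API servers.

```python
import requests

# Hypothetical CloudFront domain and object key.
url = "https://<cloudfront-domain>/10001/dataset_metadata.json"

response = requests.get(url, timeout=30)
response.raise_for_status()
# CloudFront adds an x-cache header, e.g. "Hit from cloudfront" when served from an edge cache.
print(response.headers.get("x-cache"))
print(len(response.content), "bytes")
```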
62 changes: 62 additions & 0 deletions docs/data_ingestion.md
@@ -0,0 +1,62 @@
# Data Ingestion

## Overview
The data ingestion process involves transforming input data from various sources into a standardized output format.


### Steps
1. The raw data is first brought into an AWS environment that is accessible to our ingestion workflows.
2. Once the data is available, an ingestion config is created with the metadata and file paths for the data.
3. The source data is transformed into a standardized format that is uniform across datasets.
4. The transformed data is validated against a set of rules to ensure that it is in the correct format.
5. The validated data is then ingested into the database to be surfaced by the GraphQL API.
6. The data is synced to the production environment.
7. The database in the production environment is updated.


<img width="1500" alt="CryoET Backend Workflow" src="./CryoET_Backend_Workflow.png">


## Fetching Source Data

Data can originate from various researchers/data generators who share the data with the platform. Each source may have its own format and structure.

### The data is deposited in the S3 bucket by the data creator

The data creator works with the team and is granted the necessary permissions to deposit the data in the S3 bucket.
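
For illustration, a deposit might look like the boto3 upload below; the bucket name and key prefix are hypothetical and would be agreed with the team when access is set up.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical private source bucket and prefix for a deposition.
s3.upload_file(
    Filename="tiltseries_001.mrc",
    Bucket="cryoet-source-data-private",
    Key="depositions/my_lab/2024-12/tiltseries_001.mrc",
)
```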

### The data is fetched from EMPIAR

To be expanded.

## Ingesting Data into S3

The data is fetched from the source and transformed into a standardized format. This data is then deposited into the staging S3 bucket.

Several transformations are applied during this process. You can learn more about how to run the ingestion workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-the-s3-ingestion).


## Validating the Ingested Data

The data that is ingested into the staging bucket is validated against a set of rules. The source code for these validations can be found [here](../ingestion_tools/scripts/data_validation).
You can learn more about how to run this workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-validation-tests).
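
The actual rules live in the validation suite linked above; the snippet below is only an illustrative example of the kind of metadata check performed, with hypothetical required keys and file layout.

```python
import json
from pathlib import Path

# Illustrative required keys only -- the real validation rules are in the linked test suite.
REQUIRED_DATASET_KEYS = {"dataset_identifier", "dataset_title", "authors"}

def validate_dataset_metadata(path: Path) -> list[str]:
    """Return a list of human-readable problems found in one dataset metadata file."""
    metadata = json.loads(path.read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_DATASET_KEYS - metadata.keys())]
    if not metadata.get("authors"):
        problems.append("authors list is empty")
    return problems

if __name__ == "__main__":
    for metadata_file in Path("staging").glob("**/dataset_metadata.json"):
        for problem in validate_dataset_metadata(metadata_file):
            print(f"{metadata_file}: {problem}")
```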


## Ingesting Data into the Database

The validated data needs to be ingested into the database for it to be surfaced by the GraphQL API. This process involves reading all the metadata files created during the ingestion process and writing them to the database.
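
Conceptually, the import walks the metadata files and upserts rows into the database, as in the rough sketch below; the connection string, table, and column names are hypothetical, and the real logic lives in the importer scripts referenced in the following subsections.

```python
import json
from pathlib import Path

from sqlalchemy import create_engine, text

# Hypothetical connection string and schema.
engine = create_engine("postgresql+psycopg2://cryoet:<password>@localhost:5432/cryoet")

with engine.begin() as conn:
    for metadata_file in Path("staging").glob("**/dataset_metadata.json"):
        metadata = json.loads(metadata_file.read_text())
        # Upsert so re-running the import on the same dataset is safe.
        conn.execute(
            text(
                "INSERT INTO datasets (id, title) VALUES (:id, :title) "
                "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title"
            ),
            {"id": metadata["dataset_identifier"], "title": metadata["dataset_title"]},
        )
```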

Currently, we maintain two versions of the API, so the data needs to be ingested into both databases. You can learn more about how to run this workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-the-db-ingestion).

### V1 Database Ingestion
The V1 API is powered by Hasura. The source code for this import can be found [here](../ingestion_tools/scripts/importers/db_import.py).


### V2 Database Ingestion
The V2 API is powered by Platformics. You can find the source code for this [here](../apiv2/db_import/).


## Updating the Data in the Production Environment
Once the data has been validated, it can be copied from the staging bucket to the production bucket. This is done to optimize data processing and to ensure that the production environment contains only validated data.

You can learn more about how to run this workflow [here](../ingestion_tools/docs/enqueue_runs.md#s3-file-sync-sync-subcommand).
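
Conceptually, the sync copies every validated object from the staging bucket to the public production bucket, as in the boto3 sketch below; in practice it is run through the tooling linked above, and the staging bucket name and prefix shown here are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical staging bucket and dataset prefix; the production bucket is the public one.
staging_bucket = "cryoet-staging"
production_bucket = "cryoet-data-portal-public"
prefix = "10001/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=staging_bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy(
            CopySource={"Bucket": staging_bucket, "Key": obj["Key"]},
            Bucket=production_bucket,
            Key=obj["Key"],
        )
        print("copied", obj["Key"])
```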
