Adds architecture documentation
manasaV3 committed Dec 20, 2024
1 parent 408f934 commit 2864e7f
Showing 4 changed files with 125 additions and 0 deletions.
Binary file added docs/CryoET_Architecture_Diagram.png
Binary file added docs/CryoET_Backend_Workflow.png
63 changes: 63 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,63 @@
# Architecture

This document provides an overview of the architecture of the cryoET applications. The backend stack supports data ingestion and the maintenance of the GraphQL API.


## Account Setup

The cryoET backend is distributed across three AWS accounts, each serving a different purpose:

### dev account
- Hosts all the source data.
- All data standardization happens in this account.
- Includes the dev and staging environments.
### public account
- Hosts the publicly accessible S3 bucket and the standardized production data.
### prod account
- Hosts the production environment for the frontend and the GraphQL API.


## Architecture diagram

<img width="1500" alt="CryoET Architecture Diagram V1 0" src="./CryoET_Architecture_Diagram.png">

The architecture diagram is also available on Figma [here](https://www.figma.com/board/detNDlCcfgIWldTIbNU74X/CryoET-Architecture-Diagram-V1.0?node-id=0-1&t=Z9gE7KA8ouCu2blp-1).


## Components


### Batch Job Orchestrator

This component orchestrates the batch jobs that execute the workflows. This is done using [swipe](https://github.com/chanzuckerberg/swipe), which relies on the following AWS managed services:

#### AWS Batch

This is used for executing the workflow jobs. You can read more about AWS Batch [here](https://aws.amazon.com/batch/).
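
As a rough illustration, a workflow job could be submitted to AWS Batch with boto3 as sketched below; the job queue, job definition, and command are placeholders rather than the actual swipe-managed resources.

```python
import boto3

batch = boto3.client("batch", region_name="us-west-2")

# Hypothetical queue/definition names -- the real ones are created by swipe's infrastructure.
response = batch.submit_job(
    jobName="ingest-dataset-10001",
    jobQueue="cryoet-ingestion-spot-queue",
    jobDefinition="cryoet-ingestion-job:1",
    containerOverrides={
        "command": ["python", "run_ingestion.py", "--dataset", "10001"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```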

#### AWS Step Functions

This is used for orchestrating the workflow execution. Swipe first tries to execute a job on a spot instance and, if the instance is terminated by AWS, retries the job on an on-demand instance. You can read more about AWS Step Functions [here](https://aws.amazon.com/step-functions/).
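
In practice swipe drives this through its own state machines; the sketch below only shows how an execution might be started with boto3, with a hypothetical state machine ARN and input shape.

```python
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-west-2")

# Hypothetical ARN and input; swipe defines the real state machine and its input schema,
# including the spot -> on-demand fallback behaviour.
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-west-2:123456789012:stateMachine:swipe-cryoet-ingestion",
    name="ingest-dataset-10001",
    input=json.dumps({"dataset_id": "10001"}),
)
print("Started execution:", execution["executionArn"])
```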


### Data Storage - AWS S3
S3 is used for storing the source data from data generators, the processed data from ingestion, and the production data that is shared with the public. You can read more about AWS S3 [here](https://aws.amazon.com/s3/).

In the dev account, the source data is stored in a private bucket.

The standardized data can be of two types:
1. Data that is ready for public access. This is stored in the staging bucket in the dev account, which allows limited public access.
2. Data that needs to be embargoed from public access. This is stored in a private bucket in the dev account until it is ready for public access.

The production data is stored in a publicly accessible `cryoet-data-portal-public` bucket.
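
Because the production bucket is public, its contents can be read without AWS credentials. Below is a minimal sketch using boto3 with unsigned requests; the key layout implied by the listing is only indicative.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: no AWS credentials are needed for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the top-level prefixes in the public production bucket.
response = s3.list_objects_v2(
    Bucket="cryoet-data-portal-public",
    Delimiter="/",
    MaxKeys=25,
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```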

### Database - AWS Aurora
The metadata for the data is stored in a Postgres database. AWS Aurora is used for running the managed database. You can read more about AWS Aurora [here](https://aws.amazon.com/rds/Aurora).
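
A minimal sketch of connecting to the Postgres database with SQLAlchemy is shown below; the endpoint, credentials, and table name are placeholders, and the real applications manage their own connection configuration.

```python
from sqlalchemy import create_engine, text

# Hypothetical connection details; the real Aurora endpoint and credentials live in AWS.
engine = create_engine(
    "postgresql+psycopg2://cryoet:<password>@cryoet-aurora.cluster-xxxx.us-west-2.rds.amazonaws.com:5432/cryoet"
)

with engine.connect() as conn:
    # Example metadata query; the table name here is illustrative only.
    count = conn.execute(text("SELECT COUNT(*) FROM datasets")).scalar_one()
    print("datasets:", count)
```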


### Compute - Elastic Kubernetes Service (EKS)
The servers for running the graphql API and frontend applications are managed through AWS Elastic Kubernetes Service (EKS). You can read more about AWS EKS [here](https://aws.amazon.com/eks/).


### Content Delivery Network (CDN) - AWS CloudFront
CloudFront is used for serving the data stored in the S3 bucket over HTTP. It is also used as a proxy to the API servers. You can read more about AWS CloudFront [here](https://aws.amazon.com/cloudfront/).
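
As a rough sketch, a client could fetch an object over HTTP through the CDN as below; the domain and key are placeholders, since the real distribution simply fronts the public S3 bucket and the API servers.

```python
import requests

# Hypothetical CloudFront domain and object key.
url = "https://<cloudfront-domain>/10001/dataset_metadata.json"

response = requests.get(url, timeout=30)
response.raise_for_status()
# CloudFront adds an x-cache header, e.g. "Hit from cloudfront" when served from an edge cache.
print(response.headers.get("x-cache"))
print(len(response.content), "bytes")
```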
62 changes: 62 additions & 0 deletions docs/data_ingestion.md
@@ -0,0 +1,62 @@
# Data Ingestion

## Overview
The data ingestion process involves transforming input data from various sources into a standardized output format.


### Steps
1. The raw data is first brought into an AWS environment that is accessible to our ingestion workflows.
2. Once the data is available, an ingestion config is created with the metadata and file paths for the data.
3. The source data is transformed into a standardized format that is uniform across datasets.
4. The transformed data is validated against a set of rules to ensure that it is in the correct format.
5. The validated data is then ingested into the database to be surfaced by the GraphQL API.
6. The data is synced to the production environment.
7. The database in the production environment is updated.


<img width="1500" alt="CryoET Backend Workflow" src="./CryoET_Backend_Workflow.png">


## Fetching Source Data

Data can originate from various researchers/data generators who share the data with the platform. Each source may have its own format and structure.

### The data is deposited in the S3 bucket by the data creator

The data creator works with the team and is granted the necessary permissions to deposit the data in the S3 bucket.
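
For illustration, a deposit might look like the boto3 upload below; the bucket name and key prefix are hypothetical and would be agreed with the team when access is set up.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical private source bucket and prefix for a deposition.
s3.upload_file(
    Filename="tiltseries_001.mrc",
    Bucket="cryoet-source-data-private",
    Key="depositions/my_lab/2024-12/tiltseries_001.mrc",
)
```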

### The data is fetched from EMPIAR

To be expanded.

## Ingesting Data into S3

The data is fetched from the source and transformed into a standardized format. This data is then deposited into the staging S3 bucket.

Several transformations are applied during this process. You can learn more about how to run the ingestion workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-the-s3-ingestion).


## Validating the Ingested Data

The data that is ingested into the staging bucket is validated against a set of rules. The source code for these validations can be found [here](../ingestion_tools/scripts/data_validation).
You can learn more about how to run this workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-validation-tests).
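
The actual rules live in the validation suite linked above; the snippet below is only an illustrative example of the kind of metadata check performed, with hypothetical required keys and file layout.

```python
import json
from pathlib import Path

# Illustrative required keys only -- the real validation rules are in the linked test suite.
REQUIRED_DATASET_KEYS = {"dataset_identifier", "dataset_title", "authors"}

def validate_dataset_metadata(path: Path) -> list[str]:
    """Return a list of human-readable problems found in one dataset metadata file."""
    metadata = json.loads(path.read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_DATASET_KEYS - metadata.keys())]
    if not metadata.get("authors"):
        problems.append("authors list is empty")
    return problems

if __name__ == "__main__":
    for metadata_file in Path("staging").glob("**/dataset_metadata.json"):
        for problem in validate_dataset_metadata(metadata_file):
            print(f"{metadata_file}: {problem}")
```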


## Ingesting Data into the Database

The validated data needs to be ingested into the database for it to be surfaced by the GraphQL API. This process involves reading all the metadata files created during the ingestion process and writing them to the database.
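
Conceptually, the import walks the metadata files and upserts rows into the database, as in the rough sketch below; the connection string, table, and column names are hypothetical, and the real logic lives in the importer scripts referenced in the following subsections.

```python
import json
from pathlib import Path

from sqlalchemy import create_engine, text

# Hypothetical connection string and schema.
engine = create_engine("postgresql+psycopg2://cryoet:<password>@localhost:5432/cryoet")

with engine.begin() as conn:
    for metadata_file in Path("staging").glob("**/dataset_metadata.json"):
        metadata = json.loads(metadata_file.read_text())
        # Upsert so re-running the import on the same dataset is safe.
        conn.execute(
            text(
                "INSERT INTO datasets (id, title) VALUES (:id, :title) "
                "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title"
            ),
            {"id": metadata["dataset_identifier"], "title": metadata["dataset_title"]},
        )
```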

Currently, we maintain two versions of the API, so the data needs to be ingested into both databases. You can learn more about how to run this workflow [here](../ingestion_tools/docs/running_data_ingestion.md#running-the-db-ingestion).

### V1 Database Ingestion
The V1 API is powered by Hasura. The source code for this import can be found [here](../ingestion_tools/scripts/importers/db_import.py).


### V2 Database Ingestion
The V2 API is powered by Platformics. You can find the source code for this [here](../apiv2/db_import/).


## Updating the Data in the Production Environment
Once the data has been validated, it can be copied from the staging bucket to the production bucket. This is done to optimize data processing and to ensure that the production environment contains only validated data.

You can learn more about how to run this workflow [here](../ingestion_tools/docs/enqueue_runs.md#s3-file-sync-sync-subcommand).
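
Conceptually, the sync copies every validated object from the staging bucket to the public production bucket, as in the boto3 sketch below; in practice it is run through the tooling linked above, and the staging bucket name and prefix shown here are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical staging bucket and dataset prefix; the production bucket is the public one.
staging_bucket = "cryoet-staging"
production_bucket = "cryoet-data-portal-public"
prefix = "10001/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=staging_bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy(
            CopySource={"Bucket": staging_bucket, "Key": obj["Key"]},
            Bucket=production_bucket,
            Key=obj["Key"],
        )
        print("copied", obj["Key"])
```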
