ARCH is a package designed for constructing a large-scale knowledge graph (KG) based on clinical feature co-occurrence matrices. The steps include data preprocessing, calculating co-occurrence matrices, variance estimation, and knowledge graph construction. The package applies methods introduced in the paper:
Gan, Ziming, et al. "ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis." medRxiv (2023).
Requirements:
- Python 3.x
- R (version X.X.X or later)
- Required Python libraries:
  - numpy
  - scipy
  - pandas
- Required R libraries:
  - Matrix
  - RSpectra
The process of constructing the knowledge graph involves several steps:
- Data Preprocessing
  Use the pre.py script to preprocess the raw dataset. This step extracts essential information such as:
  - Total number of patients
  - Average number of features recorded per patient per day
  Command to run:
  python pre.py
- Co-occurrence Matrix Calculation
  After preprocessing, use the pre.R script to calculate the co-occurrence matrix. This includes:
  - Calculating the SVD-SPPMI (Singular Value Decomposition of Shifted Positive Pointwise Mutual Information) matrix
  - Estimating parameters needed to calculate the variance of the SVD-SPPMI matrix
  Command to run in R:
  Rscript pre.R
- Variance Estimation
  To estimate the variance of the calculated SVD-SPPMI matrix, use the cosine_test.py script:
  python cosine_test.py
- Knowledge Graph Construction
  Once the variance has been estimated, use the construct_KG.R script to construct the knowledge graph based on the processed data:
  Rscript construct_KG.R
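To make the first two steps above more concrete, here is a minimal, dependency-free Python sketch of the quantities they mention: the patient count, the average number of features per patient per day, a co-occurrence count, and the SPPMI transform. It is an illustration under stated assumptions, not the code in pre.py or pre.R. The record layout, the function names, and the toy data are all hypothetical, and co-occurrence is simplified to "two codes recorded for the same patient on the same day"; the window that pre.R actually uses is not stated in this README.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: (patient_id, day, feature_code).
records = [
    ("p1", "2020-01-01", "A"),
    ("p1", "2020-01-01", "B"),
    ("p1", "2020-01-02", "A"),
    ("p2", "2020-01-01", "A"),
    ("p2", "2020-01-01", "B"),
    ("p2", "2020-01-01", "C"),
]


def group_by_patient_day(records):
    """Map each (patient, day) to the set of distinct codes recorded that day."""
    per_day = defaultdict(set)
    for patient, day, code in records:
        per_day[(patient, day)].add(code)
    return per_day


def summarize(records):
    """Return (number of patients, mean distinct codes per patient-day)."""
    per_day = group_by_patient_day(records)
    n_patients = len({patient for patient, _day in per_day})
    avg_codes = sum(len(codes) for codes in per_day.values()) / len(per_day)
    return n_patients, avg_codes


def cooccurrence(records):
    """Count unordered pairs of distinct codes seen on the same patient-day.

    Assumption: same-day co-occurrence, chosen only to keep the sketch short.
    """
    counts = Counter()
    for codes in group_by_patient_day(records).values():
        for a, b in combinations(sorted(codes), 2):
            counts[(a, b)] += 1
    return counts


def sppmi(counts, shift=1.0):
    """SPPMI(a, b) = max(0, log(C_ab * T / (C_a * C_b)) - log(shift)).

    C_a is the row sum of the symmetric co-occurrence matrix and T is the
    sum of all its entries (each unordered pair counted twice).
    """
    marginal = Counter()
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    total = sum(marginal.values())
    scores = {}
    for (a, b), c in counts.items():
        pmi = math.log(c * total / (marginal[a] * marginal[b]))
        scores[(a, b)] = max(0.0, pmi - math.log(shift))
    return scores


n_patients, avg_codes = summarize(records)
sppmi_scores = sppmi(cooccurrence(records))
```

In the actual pipeline the SPPMI matrix is then factorized with a truncated SVD to obtain the SVD-SPPMI matrix; that step is left out here because it needs a numerical library.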
Below is an example of how to use the ARCH package to create a knowledge graph (KG) from electronic health record (EHR) data. This example assumes that you have already preprocessed your data and calculated the necessary co-occurrence matrices.
Ensure you have all necessary libraries loaded and data files ready before running the script.
# Load required packages
library(Matrix)
library(RSpectra)
# Load your preprocessed data (replace 'data_path' with the actual path to your data)
cooccurrence_matrix <- readRDS('data_path/cooccurrence_matrix.rds')
svd_sppmi_matrix <- readRDS('data_path/svd_sppmi_matrix.rds')
Use Singular Value Decomposition (SVD) to decompose the co-occurrence matrix. This will help reduce the dimensionality of the data and prepare it for knowledge graph construction.
# Perform SVD on the co-occurrence matrix
svd_result <- svds(cooccurrence_matrix, k = 50) # k is the number of dimensions
# Extract the U, D, and V matrices from the SVD result
U <- svd_result$u
D <- diag(svd_result$d)
V <- svd_result$v
Now construct the knowledge graph from the SVD results. The graph is represented as a set of relationships between clinical features, scored here by the entries of the rank-k reconstruction of the co-occurrence matrix.
# Build the rank-k reconstruction U D V^T and use it as the similarity matrix
# (the result is dense, so memory grows with the square of the number of features)
similarity_matrix <- U %*% D %*% t(V)
# Set a threshold for determining strong relationships (adjust this based on your analysis)
threshold <- 0.8
knowledge_graph <- similarity_matrix > threshold
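A thresholded matrix is often easier to work with as an explicit edge list. The Python sketch below shows one way to get there: it scores pairs of features by the cosine similarity of their embedding vectors and keeps the pairs above a threshold. This is an assumption-laden illustration, not what construct_KG.R does. Cosine similarity is one common choice and is not necessarily the score used by the package, and the function names, feature names, and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def edges_above_threshold(embeddings, threshold):
    """Return (name_i, name_j, score) for each unordered pair scoring above threshold."""
    names = sorted(embeddings)
    edges = []
    for i, name_i in enumerate(names):
        for name_j in names[i + 1:]:
            score = cosine(embeddings[name_i], embeddings[name_j])
            if score > threshold:
                edges.append((name_i, name_j, score))
    return edges


# Hypothetical 2-dimensional embeddings for three features.
embeddings = {
    "feature_a": [3.0, 4.0],
    "feature_b": [4.0, 3.0],
    "feature_c": [-4.0, 3.0],
}

edges = edges_above_threshold(embeddings, threshold=0.8)
for source, target, score in edges:
    print(f"{source} -- {target}: {score:.2f}")
```

Each unordered pair is visited once, so the output has no self-loops or duplicate edges. As with the R example, the threshold should be chosen to match the scale of the scores being compared.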