Getting Started

This section describes in detail how to use Euler and TensorFlow to train a GraphSage model. GraphSage is an inductive graph learning method proposed by Stanford. Its performance is comparable to that of the GCN model, and it can be applied to large-scale graphs with billions of nodes.

Getting started

1. Preparing data

First, convert the graph data into a format that the Euler engine can read. Here we use the PPI (Protein-Protein Interactions) data set as an example and provide a preprocessing script:

apt-get update && apt-get install -y curl
curl -k -O https://raw.githubusercontent.com/alibaba/euler/master/examples/ppi_data.py
pip install networkx==1.11 scikit-learn
python ppi_data.py

The above commands generate a ppi directory in the current directory containing the constructed PPI graph data.

2. Training model

To train a semi-supervised GraphSage model on the training data set:

python -m tf_euler \
  --data_dir ppi \
  --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
  --model graphsage_supervised --mode train

The above command will generate a ckpt directory in the current directory containing the trained TensorFlow model.
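
For scripted runs, the same command can be assembled from Python. The sketch below only builds the argument list and leaves execution to the caller. The helper name tf_euler_command and the PPI_FLAGS dictionary are illustrative and not part of Euler; the flag names and values are copied from the training command above.

```python
import sys

# Flag values copied from the PPI training command above.
PPI_FLAGS = {
    "data_dir": "ppi",
    "max_id": 56944,
    "feature_idx": 1,
    "feature_dim": 50,
    "label_idx": 0,
    "label_dim": 121,
    "model": "graphsage_supervised",
}


def tf_euler_command(mode, **overrides):
    """Return the argv list for `python -m tf_euler` in the given mode.

    Hypothetical helper: it only formats flags and does not validate them.
    """
    flags = dict(PPI_FLAGS, mode=mode, **overrides)
    argv = [sys.executable, "-m", "tf_euler"]
    for name, value in flags.items():
        argv += ["--" + name, str(value)]
    return argv


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(tf_euler_command("train")))
```

Extra flags can be passed as keyword arguments, for example tf_euler_command("evaluate", id_file="ppi/ppi_test.id").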

3. Evaluating model

To evaluate the performance of the model on the test set:

python -m tf_euler \
  --data_dir ppi --id_file ppi/ppi_test.id \
  --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
  --model graphsage_supervised --mode evaluate

The micro-F1 score of Euler's built-in GraphSage with default hyper-parameters should be around 0.6 on the test set.
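
Micro-F1 pools true positives, false positives, and false negatives over all nodes and all labels before computing a single F1 value. The following is a minimal, dependency-free sketch of that definition for 0/1 label matrices. It is not Euler's evaluation code, and the function name micro_f1 is illustrative.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label 0/1 matrices given as lists of rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positive labels and no positive predictions F1 is undefined;
    # this sketch returns 0.0 in that case.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 1]]
    # TP = 2, FP = 1, FN = 1, so F1 = 4 / 6.
    print(round(micro_f1(y_true, y_pred), 4))
```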

4. Exporting embeddings

To export the node embeddings:

python -m tf_euler \
  --data_dir ppi \
  --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
  --model graphsage_supervised --mode save_embedding

The above command will generate an embedding.npy file and an id.txt file in the ckpt directory under the current directory, containing the embeddings and the corresponding ids of all the nodes in the graph.
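
The two files can be paired up so that an embedding row is looked up by node id. This is a sketch under two assumptions that the text above does not confirm: that id.txt holds one integer id per line, and that its line order matches the row order of embedding.npy. The helper names read_ids and load_embeddings are illustrative.

```python
import os


def read_ids(path):
    """Read node ids, assuming one integer id per non-empty line."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


def load_embeddings(ckpt_dir="ckpt"):
    """Return (ids, embeddings) after checking that the two files line up."""
    import numpy as np  # imported here so read_ids works without NumPy

    ids = read_ids(os.path.join(ckpt_dir, "id.txt"))
    embeddings = np.load(os.path.join(ckpt_dir, "embedding.npy"))
    if len(ids) != embeddings.shape[0]:
        raise ValueError("id.txt and embedding.npy have different lengths")
    return ids, embeddings


if __name__ == "__main__" and os.path.isdir("ckpt"):
    ids, embeddings = load_embeddings()
    # Map each node id to its row in the embedding matrix.
    row_of = {node_id: row for row, node_id in enumerate(ids)}
    print(len(ids), embeddings.shape)
```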

5. Importing embeddings into Faiss for fast retrieval (optional)

The embeddings generated by Euler can be used in downstream applications as needed. Here is an example of using Faiss for similarity search:

import faiss
import numpy as np

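# Load the exported embeddings; Faiss expects contiguous float32 arrays.
embedding = np.ascontiguousarray(np.load('ckpt/embedding.npy'), dtype=np.float32)
# Exact inner-product index sized to the embedding dimension (256 in this example).
index = faiss.IndexFlatIP(embedding.shape[1])
index.add(embedding)
# Print the scores and indices of the 4 nearest neighbours of the first 5 nodes.
print(index.search(embedding[:5], 4))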
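
IndexFlatIP performs an exact, brute-force search by inner product, and search returns a pair of arrays: the scores and the row indices of the best matches. The dependency-free sketch below reproduces that computation on toy vectors to make the result easier to interpret. top_k_inner_product is an illustrative name, not a Faiss function.

```python
import heapq


def top_k_inner_product(queries, database, k):
    """For each query, return the k database rows with the largest inner product.

    Returns (scores, indices), both lists of length-k lists, best match first.
    """
    all_scores, all_indices = [], []
    for q in queries:
        scored = [
            (sum(a * b for a, b in zip(q, row)), i)
            for i, row in enumerate(database)
        ]
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        all_scores.append([score for score, _ in best])
        all_indices.append([i for _, i in best])
    return all_scores, all_indices


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    # Search for the 2 best matches of the first vector.
    scores, indices = top_k_inner_product(vectors[:1], vectors, k=2)
    print(scores, indices)
```

Because the inner product is not normalized, vectors with larger norms tend to score higher. L2-normalizing the embeddings before indexing turns the score into cosine similarity.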

Distributed training

Euler supports distributed model training. To specify the distributed configuration, add four parameters to the original training command: --ps_hosts, --worker_hosts, --job_name, and --task_index. Note that for distributed training, the data must be partitioned and placed on HDFS. The following example script starts two parameter servers (ps) and two workers on local ports 1998-2001 for distributed training:

bash tf_euler/scripts/dist_tf_euler.sh \
  --data_dir hdfs://host:port/data \
  --euler_zk_addr zk.host.com:port --euler_zk_path /path/for/euler \
  --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
  --model graphsage_supervised --mode train

The above command writes its logs to the /tmp/log.{worker,ps}.{0,1} files.
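
To make the four distributed flags concrete, the sketch below lists the arguments each of the four processes would receive for two parameter servers and two workers on local ports 1998-2001. It assumes the comma-separated host:port list format used by TensorFlow cluster specifications, and the split of ports between ps and workers is a guess rather than something taken from dist_tf_euler.sh. The function name cluster_flags is illustrative.

```python
def cluster_flags(ps_ports, worker_ports, host="localhost"):
    """Return one flag list per process in a local ps/worker cluster."""
    ps_hosts = ",".join("%s:%d" % (host, p) for p in ps_ports)
    worker_hosts = ",".join("%s:%d" % (host, p) for p in worker_ports)
    jobs = [("ps", i) for i in range(len(ps_ports))]
    jobs += [("worker", i) for i in range(len(worker_ports))]
    return [
        ["--ps_hosts", ps_hosts, "--worker_hosts", worker_hosts,
         "--job_name", job, "--task_index", str(index)]
        for job, index in jobs
    ]


if __name__ == "__main__":
    # Assumed port assignment: 1998-1999 for ps, 2000-2001 for workers.
    for flags in cluster_flags([1998, 1999], [2000, 2001]):
        print(" ".join(flags))
```

Every process receives the same --ps_hosts and --worker_hosts values; only --job_name and --task_index differ between them.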