Getting Started
In this section, we introduce in detail how to use Euler with TensorFlow to train a GraphSage model. GraphSage is an inductive graph learning method proposed at Stanford. It achieves performance comparable to the GCN model and can be applied to large-scale graphs with billions of nodes.
First, we need to convert the graph data into a format that the Euler engine can read. Here we use the PPI (Protein-Protein Interactions) data set as an example and provide a pre-processing script:
apt-get update && apt-get install -y curl
curl -k -O https://raw.githubusercontent.com/alibaba/euler/1.0/examples/ppi_data.py
pip install networkx==1.11 sklearn
python ppi_data.py
The above commands will generate a ppi directory in the current directory containing the constructed PPI graph data.
To train a semi-supervised GraphSage model on the training data set:
python -m tf_euler \
--data_dir ppi \
--max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
--model graphsage_supervised --mode train
The above command will generate a ckpt directory in the current directory containing the trained TensorFlow model.
To evaluate the performance of the model on the test set:
python -m tf_euler \
--data_dir ppi --id_file ppi/ppi_test.id \
--max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
--model graphsage_supervised --mode evaluate
The micro-F1 score of Euler's built-in GraphSage with default hyper-parameters should be around 0.6 on the test set.
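Micro-F1 pools true positives, false positives, and false negatives across all 121 label dimensions before computing a single F1 value, which is why it suits this multi-label setting. A minimal NumPy sketch of the metric (the function name `micro_f1` is illustrative, and the sketch assumes at least one positive prediction and one positive label):

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy multi-label example: 2 nodes, 3 labels each
y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(micro_f1(y_true, y_pred))  # ≈ 0.857
```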
To export the node embeddings:
python -m tf_euler \
--data_dir ppi \
--max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
--model graphsage_supervised --mode save_embedding
The above command will generate an embedding.npy file and an id.txt file in the ckpt directory under the current directory, representing the embeddings and the corresponding ids of all nodes in the graph.
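The two exported files can be joined back together to look up a node's vector by id. The sketch below assumes id.txt contains one node id per line, aligned row-by-row with embedding.npy (the helper name `load_embeddings` is illustrative, not part of Euler's API):

```python
import numpy as np

def load_embeddings(emb_path='ckpt/embedding.npy', id_path='ckpt/id.txt'):
    """Map each node id to its embedding row, assuming id.txt lines are
    aligned row-wise with the rows of embedding.npy."""
    embeddings = np.load(emb_path)
    with open(id_path) as f:
        ids = [line.strip() for line in f if line.strip()]
    assert len(ids) == len(embeddings), 'id.txt and embedding.npy misaligned'
    return dict(zip(ids, embeddings))
```

With the table in hand, `table['42']` (for example) returns the embedding vector of node 42.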
The embeddings generated by Euler can be used in downstream applications according to the actual needs of users. Here is an example of using Faiss for similarity search:
import faiss
import numpy as np

# Load the node embeddings exported by Euler.
embedding = np.load('ckpt/embedding.npy')
# Build an exact (flat) inner-product index; 256 is the embedding dimension.
index = faiss.IndexFlatIP(256)
index.add(embedding)
# For the first 5 nodes, retrieve the 4 most similar nodes and their scores.
print(index.search(embedding[:5], 4))
Euler supports distributed model training. Users need to add four flags to the original training command: --ps_hosts, --worker_hosts, --job_name, and --task_index, to specify the distributed configuration. Note that for distributed training, the data must be partitioned and placed on HDFS. Here is an example script that starts two parameter servers and two workers on local ports 1998-2001 for distributed training.
bash tf_euler/scripts/dist_tf_euler.sh \
--data_dir hdfs://host:port/data \
--euler_zk_addr zk.host.com:port --euler_zk_path /path/for/euler \
--max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 \
--model graphsage_supervised --mode train
The above command will write its logs to the /tmp/log.{worker,ps}.{0,1} files.
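To make the role of the four distributed flags concrete, the sketch below builds one launch command per process for the two-ps/two-worker setup on local ports 1998-2001 described above (the helper `dist_commands` is illustrative, and the remaining model flags are omitted for brevity):

```python
def dist_commands(ps_hosts, worker_hosts):
    """Build one `python -m tf_euler` command line per ps/worker process,
    filling in the four distributed flags for each role and index."""
    ps_list = ','.join(ps_hosts)
    worker_list = ','.join(worker_hosts)
    cmds = []
    for job, hosts in (('ps', ps_hosts), ('worker', worker_hosts)):
        for i in range(len(hosts)):
            cmds.append('python -m tf_euler'
                        ' --ps_hosts %s --worker_hosts %s'
                        ' --job_name %s --task_index %d'
                        % (ps_list, worker_list, job, i))
    return cmds

for cmd in dist_commands(['localhost:1998', 'localhost:1999'],
                         ['localhost:2000', 'localhost:2001']):
    print(cmd)
```

Every process receives the same --ps_hosts and --worker_hosts lists; only --job_name and --task_index differ, telling each process which slot in the cluster it occupies.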