Integration test for AllReduce (sql-machine-learning#2256)
* Integration test for AllReduce

* Retry to get valid rank
workingloong authored Aug 21, 2020
1 parent 65a5b8e commit 8186cb6
Showing 4 changed files with 37 additions and 3 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -89,6 +89,7 @@ jobs:
   JOB_TYPES=(
       odps
       train
+      allreduce
       #evaluate
       #predict
   )
11 changes: 10 additions & 1 deletion elasticdl/python/worker/allreduce_trainer.py
@@ -92,7 +92,16 @@ def training_process_with_fault_tolerance(self, features, labels):
         self.init_horovod_if_needed()

     def init_horovod_if_needed(self):
-        rank_response = self._master_client.get_comm_rank()
+        for _ in range(DEFAULT_MAX_ALLREDUCE_RETRY_NUM):
+            rank_response = self._master_client.get_comm_rank()
+            if rank_response.rank_id < 0:
+                logger.warning(
+                    "The master has not added the worker host into "
+                    "rendezvous yet. Retrying to get rank"
+                )
+                time.sleep(5)
+            else:
+                break

         # If the rendezvous from master is unequal to self._rendezvous_id,
         # the worker should rebuild the communication because the master
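Note on the retry loop in init_horovod_if_needed: if every attempt returns a negative rank_id, the loop ends without a break and rank_response still holds the last invalid reply; what the code after the loop does with it is not visible in this hunk. The sketch below is one possible way to make that case explicit by raising once the retries run out. It is not the ElasticDL implementation: RankResponse, wait_for_valid_rank, MAX_RETRIES and the injectable sleep parameter are hypothetical names used only for illustration, and get_comm_rank is passed in as a plain callable.

```python
import time
from dataclasses import dataclass

# Hypothetical value; stands in for DEFAULT_MAX_ALLREDUCE_RETRY_NUM.
MAX_RETRIES = 5


@dataclass
class RankResponse:
    """Hypothetical stand-in for the master's reply; only rank_id is used."""

    rank_id: int


def wait_for_valid_rank(get_comm_rank, max_retries=MAX_RETRIES,
                        delay_secs=5.0, sleep=time.sleep):
    """Poll get_comm_rank() until rank_id >= 0, else raise after max_retries."""
    for attempt in range(max_retries):
        response = get_comm_rank()
        if response.rank_id >= 0:
            return response
        # Skip the delay after the final attempt; nothing is left to wait for.
        if attempt < max_retries - 1:
            sleep(delay_secs)
    raise RuntimeError("No valid rank after %d attempts" % max_retries)


# Fake master: the worker is not in the rendezvous for the first two calls.
replies = iter([RankResponse(-1), RankResponse(-1), RankResponse(1)])
sleeps = []  # records requested delays instead of actually sleeping
result = wait_for_valid_rank(lambda: next(replies), sleep=sleeps.append)
```

Passing sleep as a parameter keeps the example testable without real delays; a caller that prefers the behaviour in the diff could catch the RuntimeError and continue.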
19 changes: 19 additions & 0 deletions scripts/client_test.sh
@@ -119,6 +119,25 @@ elif [[ "$JOB_TYPE" == "odps" ]]; then
       --log_level=INFO \
       --image_pull_policy=Never \
       --output=model_output
+elif [[ "$JOB_TYPE" == "allreduce" ]]; then
+    elasticdl train \
+      --image_name=elasticdl:ci \
+      --model_zoo=model_zoo \
+      --model_def=mnist.mnist_functional_api.custom_model \
+      --training_data=/data/mnist/train \
+      --num_epochs=1 \
+      --master_resource_request="cpu=0.3,memory=1024Mi" \
+      --master_resource_limit="cpu=1,memory=2048Mi" \
+      --worker_resource_request="cpu=0.4,memory=2048Mi" \
+      --worker_resource_limit="cpu=1,memory=3072Mi" \
+      --minibatch_size=64 \
+      --num_minibatches_per_task=2 \
+      --num_workers="$WORKER_NUM" \
+      --distribution_strategy=AllreduceStrategy \
+      --job_name=test-allreduce \
+      --log_level=INFO \
+      --image_pull_policy=Never \
+      --volume="host_path=${DATA_PATH},mount_path=/data"
 else
     echo "Unsupported job type specified: $JOB_TYPE"
     exit 1
9 changes: 7 additions & 2 deletions scripts/travis/run_job.sh
@@ -29,8 +29,13 @@ else
     export MAXCOMPUTE_TABLE
     bash scripts/travis/create_odps_table.sh
 fi
-PS_NUM=2
-WORKER_NUM=1
+if [[ "$JOB_TYPE" == "allreduce" ]]; then
+    PS_NUM=0
+    WORKER_NUM=2
+else
+    PS_NUM=2
+    WORKER_NUM=1
+fi
 docker run --rm -it --net=host \
     -e MAXCOMPUTE_TABLE="$MAXCOMPUTE_TABLE" \
     -e MAXCOMPUTE_PROJECT="$MAXCOMPUTE_PROJECT" \
