From d0caea414d58100bb8d014599656b1c2d73f4d79 Mon Sep 17 00:00:00 2001
From: Sukrit Kalra
Date: Wed, 24 Jan 2024 17:42:16 -0800
Subject: [PATCH] Revamp the main README

---
 README.md | 115 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/README.md b/README.md
index 8e5a119b..571adabf 100644
--- a/README.md
+++ b/README.md
@@ -1,77 +1,88 @@
 # ERDOS Simulator

+This repository provides a discrete-event simulator for an execution system for DAG-based jobs. The focus of the simulator is to provide an easy playing ground for different scheduling strategies for the execution system. Some features that users can choose to play around with:
+
+- Implement and evaluate DAG and Deadline-Aware scheduling policies that focus on a wide range of commonly used metrics such as maximizing goodput, minimizing average job completion time, minimizing makespan, minimizing placement delay etc.
+- Evaluate the effects of real-world perturbations to Task and TaskGraph metrics on the effectiveness of the scheduling strategy. The users can choose to enable the Tasks to randomly perturb their execution time while they are running to test the effectiveness of their constructed schedules.
+- Implement and evaluate heterogeneous and multi-dimensional resource-focused scheduling strategies. The simulator provides the ability to define a cluster with different types and instances of resources, and provide various execution strategies for the Tasks such that a preference order over their choices can be enforced by the scheduler.
+
 ## Installation

-To install the repository and set up the required paths, install Python 3.7+,
-and run
+To get the repository up and running, install Python 3.7+ and set up a virtual environment. Inside the virtual environment,
+run

 ```console
-python3 setup.py develop
+pip install -r requirements.txt
 ```

-## Executing Experiments
+If you aim to use the C++-based scheduler defined in `schedulers/tetrisched`, refer to its README for installation instructions and ensure that the package is available in the correct virtual environment.

-Run the following command in order to run experiments:
+## Terminology

-```console
-python3 main.py --flagfile=configs/default.conf
-```
+- *Job*: A Job is a static instance of a piece of computation that is not invokable, and is used to capture some metadata about its execution. For example, a Job captures the resource requirements of the computation, and its expected runtime.
+- *JobGraph*: A JobGraph is a static entity that captures the relations amongst Jobs, such that if there is an edge from Job A to Job B, then Job A must finish before Job B can execute.
+- *Task*: A dynamic instantiation of the static *Job*, a Task provides additional runtime information such as its deadline, its release time, the time at which it started executing, the time at which it finished execution and the state that the Task is currently in.
+- *TaskGraph*: A dynamic instantiation of the static *JobGraph*, a TaskGraph provides runtime information such as the progress of the tasks in the TaskGraph, the end-to-end deadline to which the TaskGraph is being enforced, and other helper functions that can be used to easily query information about the particular invocation of the TaskGraph.
+- *ExecutionStrategy*: An ExecutionStrategy object defines the resource requirements and runtime of a particular viable strategy for the Task to execute with. The strategy defines that if the Task is provided with the given resource requirements, then it will run within the given runtime.
+- *WorkProfile*: Each Task in a TaskGraph is associated with a *WorkProfile* that summarizes all the different strategies with which a Task can execute.
+- *Workload*: A Workload object provides the simulator with the information about the Tasks and TaskGraphs that are to be executed in this given run. Users must define specific Workloads, or implement their own data loaders that can generate the Workload object.
+- *Worker*: A Worker is a collection of (possibly) heterogeneous resources that forms the boundary of scheduling. Unless specified, a Task cannot use resources from multiple Workers concurrently.
+- *WorkerPool*: A collection of *Worker*s that is used to represent the state of the cluster to the scheduler.

-Next, run the following command in order to plot graphs (e.g., resource
-utilization, task placement delay, task slack) from the logs):
+## Basic Scheduling Example

-```console
-python3 analyze.py \
-  --csv_files={PATH_TO_CSV_LOG_FILE} \
-  --csv_labels={SCHEDULER_NAME} \
-  --inter_task_time \
-  --task_placement \
-  --task_placement_delay \
-  --task_slack \
-  --resource_utilization
-  --plot
-```
+An easy way to get started is to define a simple Workload and a WorkerPool and pass it to the simulator to execute under a specific scheduling policy. For our purposes, we specify a simplistic version of the Autonomous Vehicle (AV) pipeline shown in Gog et al. ([EuroSys '22](https://dl.acm.org/doi/pdf/10.1145/3492321.3519576)). Simple workloads are defined in a JSON / YAML specification, and the representation of the AV pipeline can be seen in [simple_av_workload.yaml](./profiles/workload/simple_av_workload.yaml). The set of TaskGraphs to be released in the system is defined using the graphs parameter as shown below:

-or plot all the graphs using:
+```yaml
+graphs:
+  - name: AutonomousVehicle
+    graph: ...
+    release_policy: fixed
+    period: 100 # In microseconds.
+    invocations: 10
+    deadline_variance: [50, 100]
+```

-```console
-python3 analyze.py
-  --csv_files={PATH_TO_CSV_LOG_FILE} \
-  --csv_labels={SCHEDULER_NAME}
-  --all
-  --plot
+The above example defines a graph named "AutonomousVehicle" that is released at a fixed interval of 100 microseconds for 10 invocations, each of which is randomly assigned a deadline slack equivalent to somewhere between 50% and 100% of the runtime of the critical path of the pipeline.
+
+The graph definition is given as a series of nodes, each of which defines the WorkProfile with which it can run, and the set of children that become available once it finishes its execution. For example, the `Perception` task below can be run with the `PerceptionProfile` by using 1 GPU and finishing within 200 microseconds. Once the `Perception` task is finished, the `Prediction` task becomes available for execution.
+
+```yaml
+  graph: ...
+    - name: Perception
+      work_profile: PerceptionProfile
+      children: ["Prediction"]
+  profiles: ...
+    - name: PerceptionProfile
+      execution_strategies:
+        - batch_size: 1
+          runtime: 200
+          resource_requirements:
+            GPU:any: 1
 ```

-To just output detailed statistics for all graphs, do
+Similarly, a WorkerPool is defined as a collection of resources available in the cluster. An example can be seen in [worker_1_machine_1_gpu_profile.yaml](./profiles/workers/worker_1_machine_1_gpu_profile.yaml), which defines a WorkerPool with 1 GPU as follows:

-```console
-python3 analyze.py
-  --csv_files={PATH_TO_CSV_LOG_FILE} \
-  --csv_labels={SCHEDULER_NAME}
-  --all
+```yaml
+- name: WorkerPool_1
+  workers:
+    - name: Worker_1_1
+      resources:
+        - name: GPU
+          quantity: 1
 ```

-and to convert the given CSV files into Chrome traces (to be visualized in chrome://tracing), do
+## Running the Example

-```console
-python3 analyze.py
-  --csv_files={PATH_TO_CSV_LOG_FILE} \
-  --csv_labels={SCHEDULER_NAME}
-  --chrome_trace=task
-```
+The easiest way to run an example is to define a configuration file that supplies the flag values to `main.py`. For running the above example with an AV pipeline, a sample configuration has been provided in [simple_av_workload.conf](./configs/simple_av_workload.conf), which defines the names of the log and the CSV files, along with the scheduler that is to be used to place tasks on the WorkerPool.

-The `scripts` directory provides helper scripts to spawn the execution of a large number of
-experiments. To execute the experiments, change the exploration space in `scripts/run_experiments.sh`,
-and then do
+To run this example, simply run

-```console
-export ERDOS_SIMULATOR_DIR=/path/to/cloned/repository
-./scripts/run_experiments.sh /path/to/store/results
+```bash
+python main.py --flagfile=configs/simple_av_workload.conf
 ```

-To check on the status of the experiments periodically, run
-```console
-watch -c -n 10 ./scripts/check_experiment_status.sh /results/path
-```
-where `/results/path` is the path specified while invoking `run_experiments.sh`
+## Questions / Comments?
+
+Please feel free to raise issues / PRs for bugs that you encounter or enhancements that you would like to see!