# ERDOS Simulator

This repository provides a discrete-event simulator for an execution system for DAG-based jobs. The simulator is intended as an easy playground for experimenting with different scheduling strategies for the execution system. Among other things, users can:

- Implement and evaluate DAG- and deadline-aware scheduling policies that target a wide range of commonly used metrics, such as maximizing goodput, minimizing average job completion time, minimizing makespan, and minimizing placement delay.
- Evaluate the effects of real-world perturbations of Task and TaskGraph metrics on the effectiveness of a scheduling strategy. Users can allow Tasks to randomly perturb their execution time while they are running, to test the robustness of the constructed schedules.
- Implement and evaluate heterogeneous, multi-dimensional resource-focused scheduling strategies. The simulator provides the ability to define a cluster with different types and instances of resources, and to give Tasks various execution strategies such that the scheduler can enforce a preference order over their choices.

## Installation

To get the repository up and running, install Python 3.7+ and set up a virtual environment.
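
For example, a minimal setup using Python's built-in `venv` module (any virtual-environment tool works) looks like:

```console
python3 -m venv .venv
source .venv/bin/activate
```

Then, inside the virtual environment, install the dependencies: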

```console
pip install -r requirements.txt
```

If you aim to use the C++-based scheduler defined in `schedulers/tetrisched`, refer to its README for installation instructions and ensure that the package is available in the correct virtual environment.

## Terminology

- *Job*: A Job is a static representation of a piece of computation. It is not itself invokable; instead, it captures metadata about the computation's execution, such as its resource requirements and expected runtime.
- *JobGraph*: A JobGraph is a static entity that captures the relations amongst Jobs, such that if there is an edge from Job A to Job B, then Job A must finish before Job B can execute.
- *Task*: A dynamic instantiation of the static *Job*, a Task provides additional runtime information, such as its deadline, its release time, the time at which it started executing, the time at which it finished executing, and its current state.
- *TaskGraph*: A dynamic instantiation of the static *JobGraph*, a TaskGraph provides runtime information such as the progress of its Tasks and the end-to-end deadline that the TaskGraph must meet, along with helper functions for querying information about the particular invocation of the TaskGraph.
- *ExecutionStrategy*: An ExecutionStrategy defines the resource requirements and runtime of one viable strategy with which a Task can execute: if the Task is provided with the given resources, it will finish within the given runtime.
- *WorkProfile*: Each Task in a TaskGraph is associated with a *WorkProfile* that collects all the different strategies with which the Task can execute (see the sketch after this list).
- *Workload*: A Workload object provides the simulator with the Tasks and TaskGraphs that are to be executed in a given run. Users can define specific Workloads, or implement their own data loaders that generate the Workload object.
- *Worker*: A Worker is a collection of (possibly) heterogeneous resources that forms the boundary of scheduling. Unless specified otherwise, a Task cannot use resources from multiple Workers concurrently.
- *WorkerPool*: A collection of *Worker*s that is used to represent the state of the cluster to the scheduler.
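
To make these relationships concrete, here is a minimal Python sketch of how an *ExecutionStrategy* and a *WorkProfile* relate. The class and field names are hypothetical stand-ins for illustration; the simulator's actual definitions live in this repository and may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified stand-ins for the simulator's abstractions.
@dataclass
class ExecutionStrategy:
    resource_requirements: Dict[str, int]  # e.g., {"GPU:any": 1}
    runtime: int                           # expected runtime, in microseconds

@dataclass
class WorkProfile:
    name: str
    # All the viable strategies with which a Task may execute.
    execution_strategies: List[ExecutionStrategy] = field(default_factory=list)

# A profile that lets a Task run on 1 GPU and finish within 200 microseconds.
perception_profile = WorkProfile(
    name="PerceptionProfile",
    execution_strategies=[ExecutionStrategy({"GPU:any": 1}, runtime=200)],
)
```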

## Basic Scheduling Example

An easy way to get started is to define a simple Workload and a WorkerPool, and pass them to the simulator to execute under a specific scheduling policy. For our purposes, we specify a simplistic version of the Autonomous Vehicle (AV) pipeline shown in Gog et al. ([EuroSys '22](https://dl.acm.org/doi/pdf/10.1145/3492321.3519576)). Simple workloads are defined in a JSON / YAML specification; the representation of the AV pipeline can be seen in [simple_av_workload.yaml](./profiles/workload/simple_av_workload.yaml). The set of TaskGraphs to be released into the system is defined using the `graphs` parameter, as shown below:

```yaml
graphs:
  - name: AutonomousVehicle
    graph: ...
    release_policy: fixed
    period: 100 # In microseconds.
    invocations: 10
    deadline_variance: [50, 100]
```

The above example defines a graph named "AutonomousVehicle" that is released at a fixed interval of 100 microseconds for 10 invocations, each of which is randomly assigned a deadline slack between 50% and 100% of the runtime of the pipeline's critical path.
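
For intuition, the following sketch shows one plausible way such a deadline could be derived; the interpretation of `deadline_variance` as a percentage of the critical-path runtime follows the description above, and all values are illustrative:

```python
import random

release_time = 0              # microseconds; illustrative
critical_path_runtime = 1000  # microseconds; illustrative
low, high = 50, 100           # deadline_variance bounds from the YAML above

# Slack is drawn uniformly between 50% and 100% of the critical-path runtime.
slack = critical_path_runtime * random.uniform(low, high) / 100.0
deadline = release_time + critical_path_runtime + slack
```
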
The graph definition is given as a series of nodes, each of which specifies the WorkProfile with which it can run and the set of children that become available once it finishes its execution. For example, the `Perception` task below can run with the `PerceptionProfile`, which requires 1 GPU and finishes within 200 microseconds. Once the `Perception` task finishes, the `Prediction` task becomes available for execution.

```yaml
graph:
  ...
  - name: Perception
    work_profile: PerceptionProfile
    children: ["Prediction"]
profiles:
  ...
  - name: PerceptionProfile
    execution_strategies:
      - batch_size: 1
        runtime: 200
        resource_requirements:
          GPU:any: 1
```

Similarly, a WorkerPool is defined as a collection of resources available in the cluster. An example can be seen in [worker_1_machine_1_gpu_profile.yaml](./profiles/workers/worker_1_machine_1_gpu_profile.yaml), which defines a WorkerPool with 1 GPU as follows:

```yaml
- name: WorkerPool_1
  workers:
    - name: Worker_1_1
      resources:
        - name: GPU
          quantity: 1
```
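
Since Workers can hold heterogeneous resources, the same format extends to richer clusters. The following is an illustrative extension rather than a profile that ships with the repository, and it assumes `CPU` is a valid resource name:

```yaml
- name: WorkerPool_1
  workers:
    - name: Worker_1_1
      resources:
        - name: GPU
          quantity: 1
        - name: CPU
          quantity: 8
```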

## Running the Example

The easiest way to run an example is to define a configuration file with the flag values for `main.py`. For the AV pipeline above, a sample configuration is provided in [simple_av_workload.conf](./configs/simple_av_workload.conf); it defines the names of the log and CSV files, along with the scheduler that is to be used to place tasks on the WorkerPool.
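
For reference, a flagfile is just a newline-separated list of command-line flags. The snippet below is a hypothetical illustration of its shape; the flag names here are assumptions, and the actual flags are defined in `main.py` and [simple_av_workload.conf](./configs/simple_av_workload.conf):

```conf
# Hypothetical flag names for illustration; consult main.py for the real ones.
--log_file_name=simple_av_workload.log
--csv_file_name=simple_av_workload.csv
--scheduler=EDF
```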

To run this example, simply run

```bash
python main.py --flagfile=configs/simple_av_workload.conf
```

## Questions / Comments?

Please feel free to raise issues / PRs for bugs that you encounter or enhancements that you would like to see!
