Load Alibaba Trace in Batches in Simulator #67

Merged
merged 38 commits into from
Nov 12, 2023

Conversation

ruizehung

@ruizehung ruizehung commented Nov 9, 2023

Summary

This PR updates AlibabaLoader and Simulator so that Simulator can load job graphs in batches and gradually add their tasks to the event queue.

  • Create a JobGraphLoader class as a generalized Job Graph loader that can dynamically load job graphs in simulator.
  • Update AlibabaLoader to implement JobGraphLoader
  • Refactor Simulator.dry_run and Simulator.simulate to accommodate the use of JobGraphLoader.
  • Add Workload.add_job_graphs
  • Add batch_size_job_loading flag in main.py to allow config files to specify loading workload in batches.
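For readers skimming the PR, the batching idea in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the method name `get_next_jobs`, the `ListLoader` class, and `drain_in_batches` are hypothetical, job graphs are reduced to `(release_time, name)` tuples, and a new batch is pulled whenever the queue runs empty (the PR itself schedules a LOAD_NEW_JOBS event instead). It also assumes the jobs are globally sorted by release time.

```python
import heapq
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple

Job = Tuple[int, str]  # (release_time, name); a stand-in for a real job graph


class JobGraphLoader(ABC):
    """Hypothetical interface: hands out job graphs one batch at a time."""

    @abstractmethod
    def get_next_jobs(self, batch_size: int) -> Sequence[Job]:
        """Return up to batch_size jobs, or an empty sequence when exhausted."""


class ListLoader(JobGraphLoader):
    """Toy loader backed by an in-memory list (assumption, for illustration)."""

    def __init__(self, jobs: Sequence[Job]) -> None:
        self._jobs = sorted(jobs)  # release-time order
        self._cursor = 0

    def get_next_jobs(self, batch_size: int) -> Sequence[Job]:
        batch = self._jobs[self._cursor:self._cursor + batch_size]
        self._cursor += len(batch)
        return batch


def drain_in_batches(loader: JobGraphLoader, batch_size: int) -> List[Job]:
    """Pop jobs in time order, loading the next batch only when needed."""
    queue: List[Job] = []
    released: List[Job] = []
    while True:
        if not queue:
            batch = loader.get_next_jobs(batch_size)
            if not batch:
                break  # loader exhausted and queue empty: simulation done
            for job in batch:
                heapq.heappush(queue, job)
        released.append(heapq.heappop(queue))
    return released
```

With `batch_size=2`, at most two jobs sit in the queue at any time, which is the memory benefit that batched loading is after.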

Test Plan

Unit test

Two tests are failing:

================================================================ short test summary info ================================================================
FAILED tests/test_simulator.py::test_simulator_handle_event - AssertionError: Incorrect length of EventQueue.
FAILED tests/test_tetrisched_scheduler.py::test_tetrisched_task_graph_strl_generation_simple - AssertionError: Incorrect number of STRL expressions.
================================================= 2 failed, 206 passed, 3 skipped, 2 warnings in 3.75s ==================================================

https://app.warp.dev/block/AIe92hXlVj2eDh2ZOsRB8L

tests/test_tetrisched_scheduler.py::test_tetrisched_task_graph_strl_generation_simple is already failing on main.

I haven't yet figured out why tests/test_simulator.py::test_simulator_handle_event fails.

Dry run

configs/alibaba_trace.conf:

# Output configs.
--log_file_name=./alibaba_trace_replay.log
--csv_file_name=./alibaba_trace_replay.csv
--log_level=debug

# Workload configs.
--execution_mode=replay
--replay_trace=alibaba
--workload_profile_path=./traces/alibaba-cluster-trace-v2018/alibaba_random_50_dags.pkl
--batch_size_job_loading=25
--override_num_invocations=1
--override_arrival_period=10
--randomize_start_time_max=100

# Worker configs.
--worker_profile_path=./profiles/workers/alibaba_cluster.yaml

# Scheduler configs.
# --scheduler=EDF
#--scheduler=TetriSched_Gurobi
--scheduler=TetriSched
--scheduler_runtime=0
--enforce_deadlines
--drop_skipped_tasks
--release_taskgraphs
--scheduler_log_times=2
--scheduler_time_discretization=1
--dry_run=true

Output:

python3 main.py --flagfile=configs/alibaba_trace.conf && cat alibaba_trace_replay.log
Set parameter Username
Academic license - for non-commercial use only - expires 2024-10-07
Warning: Gurobi version mismatch between C++ 10.0.3 and C library 10.0.0
Set parameter Threads to value 96
Set parameter Cuts to value 3
Set parameter Presolve to value 1
Set parameter MIPFocus to value 1
2023-11-09,19:08:59.665 __main__ INFO: Starting the execution of the simulator loop.
2023-11-09,19:08:59.665 __main__ INFO: Workload File: ./traces/alibaba-cluster-trace-v2018/alibaba_random_50_dags.pkl
2023-11-09,19:08:59.665 __main__ INFO: Workers File: ./profiles/workers/alibaba_cluster.yaml
2023-11-09,19:08:59.665 __main__ INFO: Profile File: ./profiles/workload/pylot_profile.json
2023-11-09,19:08:59.674 WorkerLoader DEBUG: Loaded 1 worker pools from the file located at: ./profiles/workers/alibaba_cluster.yaml
2023-11-09,19:08:59.676 Simulator INFO: The Worker Pools are:
2023-11-09,19:08:59.676 Simulator INFO: WorkerPool(name=WorkerPool_1, id=17fc695a-07a0-4a6e-8822-e8f36c031199)
2023-11-09,19:08:59.677 Simulator INFO: 	Worker(name=Worker_1_1, id=972a8469-1641-4f82-8b9d-2434e465e150, resources=Resources(defaultdict(<class 'int'>, {Resource(name=Slot, id=bdd640fb-0667-4ad1-9c80-317fa3b1799d): 10})))
2023-11-09,19:08:59.677 Simulator INFO: [0] Added Event(time=0s, type=EventType.SIMULATOR_START) to the event queue.
2023-11-09,19:08:59.677 Simulator INFO: [0] Added Event(time=0s, type=EventType.SCHEDULER_START) to the event queue.
2023-11-09,19:08:59.689 Simulator INFO: [5s] The TaskGraph j_3130129@0 will be released with deadline 21s
2023-11-09,19:08:59.689 Simulator INFO: [9s] The TaskGraph j_2818427@0 will be released with deadline 38s
2023-11-09,19:08:59.690 Simulator INFO: [12s] The TaskGraph j_557089@0 will be released with deadline 128s
2023-11-09,19:08:59.690 Simulator INFO: [13s] The TaskGraph j_2991989@0 will be released with deadline 15s
2023-11-09,19:08:59.690 Simulator INFO: [20s] The TaskGraph j_1889356@0 will be released with deadline 233s
2023-11-09,19:08:59.690 Simulator INFO: [20s] The TaskGraph j_2027162@0 will be released with deadline 22s
2023-11-09,19:08:59.690 Simulator INFO: [21s] The TaskGraph j_1035670@0 will be released with deadline 22s
2023-11-09,19:08:59.690 Simulator INFO: [26s] The TaskGraph j_1629748@0 will be released with deadline 27s
2023-11-09,19:08:59.690 Simulator INFO: [27s] The TaskGraph j_2894565@0 will be released with deadline 29s
2023-11-09,19:08:59.690 Simulator INFO: [29s] The TaskGraph j_3637946@0 will be released with deadline 147s
2023-11-09,19:08:59.690 Simulator INFO: [31s] The TaskGraph j_1116129@0 will be released with deadline 39s
2023-11-09,19:08:59.690 Simulator INFO: [38s] The TaskGraph j_147660@0 will be released with deadline 40s
2023-11-09,19:08:59.691 Simulator INFO: [40s] The TaskGraph j_2962091@0 will be released with deadline 78s
2023-11-09,19:08:59.691 Simulator INFO: [42s] The TaskGraph j_2524953@0 will be released with deadline 168s
2023-11-09,19:08:59.691 Simulator INFO: [45s] The TaskGraph j_2852399@0 will be released with deadline 46s
2023-11-09,19:08:59.691 Simulator INFO: [56s] The TaskGraph j_2982359@0 will be released with deadline 57s
2023-11-09,19:08:59.691 Simulator INFO: [62s] The TaskGraph j_101519@0 will be released with deadline 88s
2023-11-09,19:08:59.691 Simulator INFO: [71s] The TaskGraph j_371746@0 will be released with deadline 84s
2023-11-09,19:08:59.691 Simulator INFO: [72s] The TaskGraph j_1254181@0 will be released with deadline 75s
2023-11-09,19:08:59.692 Simulator INFO: [72s] The TaskGraph j_722464@0 will be released with deadline 101s
2023-11-09,19:08:59.692 Simulator INFO: [81s] The TaskGraph j_1833847@0 will be released with deadline 91s
2023-11-09,19:08:59.692 Simulator INFO: [90s] The TaskGraph j_3606265@0 will be released with deadline 91s
2023-11-09,19:08:59.693 Simulator INFO: [93s] The TaskGraph j_1015056@0 will be released with deadline 1361s
2023-11-09,19:08:59.693 Simulator INFO: [97s] The TaskGraph j_639289@0 will be released with deadline 112s
2023-11-09,19:08:59.693 Simulator INFO: [97s] The TaskGraph j_1359659@0 will be released with deadline 114s
2023-11-09,19:08:59.706 Simulator INFO: [1s] The TaskGraph j_1345171@0 will be released with deadline 159s
2023-11-09,19:08:59.706 Simulator INFO: [5s] The TaskGraph j_3130129@0 will be released with deadline 21s
2023-11-09,19:08:59.707 Simulator INFO: [8s] The TaskGraph j_2032943@0 will be released with deadline 405s
2023-11-09,19:08:59.707 Simulator INFO: [9s] The TaskGraph j_2818427@0 will be released with deadline 38s
2023-11-09,19:08:59.707 Simulator INFO: [12s] The TaskGraph j_557089@0 will be released with deadline 128s
2023-11-09,19:08:59.707 Simulator INFO: [13s] The TaskGraph j_2991989@0 will be released with deadline 15s
2023-11-09,19:08:59.707 Simulator INFO: [13s] The TaskGraph j_2925815@0 will be released with deadline 58s
2023-11-09,19:08:59.708 Simulator INFO: [17s] The TaskGraph j_780541@0 will be released with deadline 19s
2023-11-09,19:08:59.708 Simulator INFO: [20s] The TaskGraph j_1889356@0 will be released with deadline 233s
2023-11-09,19:08:59.708 Simulator INFO: [20s] The TaskGraph j_2027162@0 will be released with deadline 22s
2023-11-09,19:08:59.708 Simulator INFO: [21s] The TaskGraph j_1035670@0 will be released with deadline 22s
2023-11-09,19:08:59.708 Simulator INFO: [21s] The TaskGraph j_2198546@0 will be released with deadline 22s
2023-11-09,19:08:59.708 Simulator INFO: [26s] The TaskGraph j_1629748@0 will be released with deadline 27s
2023-11-09,19:08:59.708 Simulator INFO: [27s] The TaskGraph j_2894565@0 will be released with deadline 29s
2023-11-09,19:08:59.708 Simulator INFO: [27s] The TaskGraph j_3641743@0 will be released with deadline 90s
2023-11-09,19:08:59.708 Simulator INFO: [29s] The TaskGraph j_3637946@0 will be released with deadline 147s
2023-11-09,19:08:59.709 Simulator INFO: [31s] The TaskGraph j_1116129@0 will be released with deadline 39s
2023-11-09,19:08:59.709 Simulator INFO: [32s] The TaskGraph j_3720649@0 will be released with deadline 58s
2023-11-09,19:08:59.709 Simulator INFO: [36s] The TaskGraph j_443298@0 will be released with deadline 57s
2023-11-09,19:08:59.709 Simulator INFO: [36s] The TaskGraph j_511911@0 will be released with deadline 317s
2023-11-09,19:08:59.709 Simulator INFO: [38s] The TaskGraph j_147660@0 will be released with deadline 40s
2023-11-09,19:08:59.709 Simulator INFO: [39s] The TaskGraph j_2017589@0 will be released with deadline 46s
2023-11-09,19:08:59.709 Simulator INFO: [40s] The TaskGraph j_2962091@0 will be released with deadline 78s
2023-11-09,19:08:59.709 Simulator INFO: [42s] The TaskGraph j_2524953@0 will be released with deadline 168s
2023-11-09,19:08:59.710 Simulator INFO: [45s] The TaskGraph j_2852399@0 will be released with deadline 46s
2023-11-09,19:08:59.710 Simulator INFO: [53s] The TaskGraph j_1153996@0 will be released with deadline 54s
2023-11-09,19:08:59.710 Simulator INFO: [54s] The TaskGraph j_787818@0 will be released with deadline 93s
2023-11-09,19:08:59.710 Simulator INFO: [56s] The TaskGraph j_2982359@0 will be released with deadline 57s
2023-11-09,19:08:59.710 Simulator INFO: [56s] The TaskGraph j_2422067@0 will be released with deadline 57s
2023-11-09,19:08:59.710 Simulator INFO: [59s] The TaskGraph j_2109343@0 will be released with deadline 63s
2023-11-09,19:08:59.710 Simulator INFO: [62s] The TaskGraph j_101519@0 will be released with deadline 88s
2023-11-09,19:08:59.710 Simulator INFO: [63s] The TaskGraph j_1008006@0 will be released with deadline 65s
2023-11-09,19:08:59.710 Simulator INFO: [65s] The TaskGraph j_3197913@0 will be released with deadline 134s
2023-11-09,19:08:59.710 Simulator INFO: [68s] The TaskGraph j_3500316@0 will be released with deadline 88s
2023-11-09,19:08:59.710 Simulator INFO: [69s] The TaskGraph j_3309303@0 will be released with deadline 185s
2023-11-09,19:08:59.710 Simulator INFO: [71s] The TaskGraph j_371746@0 will be released with deadline 84s
2023-11-09,19:08:59.711 Simulator INFO: [71s] The TaskGraph j_353101@0 will be released with deadline 1515s
2023-11-09,19:08:59.711 Simulator INFO: [72s] The TaskGraph j_1254181@0 will be released with deadline 75s
2023-11-09,19:08:59.711 Simulator INFO: [72s] The TaskGraph j_722464@0 will be released with deadline 101s
2023-11-09,19:08:59.711 Simulator INFO: [75s] The TaskGraph j_3445733@0 will be released with deadline 84s
2023-11-09,19:08:59.711 Simulator INFO: [80s] The TaskGraph j_3191542@0 will be released with deadline 85s
2023-11-09,19:08:59.711 Simulator INFO: [81s] The TaskGraph j_1833847@0 will be released with deadline 91s
2023-11-09,19:08:59.712 Simulator INFO: [81s] The TaskGraph j_1899282@0 will be released with deadline 221s
2023-11-09,19:08:59.712 Simulator INFO: [90s] The TaskGraph j_3606265@0 will be released with deadline 91s
2023-11-09,19:08:59.712 Simulator INFO: [92s] The TaskGraph j_905043@0 will be released with deadline 105s
2023-11-09,19:08:59.712 Simulator INFO: [93s] The TaskGraph j_1015056@0 will be released with deadline 1361s
2023-11-09,19:08:59.712 Simulator INFO: [97s] The TaskGraph j_639289@0 will be released with deadline 112s
2023-11-09,19:08:59.712 Simulator INFO: [97s] The TaskGraph j_1359659@0 will be released with deadline 114s
2023-11-09,19:08:59.713 Simulator INFO: [97s] The TaskGraph j_242256@0 will be released with deadline 126s
2023-11-09,19:08:59.713 Simulator INFO: [100s] The TaskGraph j_350557@0 will be released with deadline 102s

Simulate Alibaba Trace

python3 main.py --flagfile=configs/alibaba_trace.conf
alibaba_trace_replay.log: https://gist.github.com/ruizehung/0b58f26ad07cd53dc5e1ac51ef886fa6

@ruizehung ruizehung changed the title Initial attempt to load alibaba trace in chunks Load Alibaba Trace in Batches in Simulator Nov 9, 2023
@ruizehung ruizehung changed the title Load Alibaba Trace in Batches in Simulator Load Alibaba Trace in Batches in Simulator (In Progress) Nov 9, 2023
@ruizehung ruizehung requested a review from sukritkalra November 9, 2023 04:00
ruizehung and others added 17 commits November 9, 2023 15:06
…nction to workload class to make workload evolvable
- Add __assert_task_has_not_been_added_to_event_queue_before in simulator to help debugging
- Add some more log statements in __get_next_jobs
- Attempt to use self._randomize_start_time_max * self._job_graph_batch as the time at which we add the LOAD_NEW_JOBS event
- Give AlibabaLoader its own random instance to ensure reproducibility
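
The last commit note above, on giving the loader its own random instance, can be illustrated with a small sketch. The class and parameter names here (`SeededLoader`, `seed`, `start_time_max`) are hypothetical and are not taken from AlibabaLoader; the point is only that a private `random.Random` is unaffected by other code that draws from the module-level generator.

```python
import random


class SeededLoader:
    """Toy loader that draws start times from a private RNG (illustrative)."""

    def __init__(self, seed: int, start_time_max: int) -> None:
        # A dedicated generator: calls to random.random() elsewhere in the
        # program do not advance or reseed this stream.
        self._rng = random.Random(seed)
        self._start_time_max = start_time_max

    def next_start_time(self) -> int:
        # randint is inclusive on both ends.
        return self._rng.randint(0, self._start_time_max)
```

Two loaders built with the same seed yield the same sequence of start times, regardless of how the global `random` module is used in between.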
@sukritkalra sukritkalra left a comment


Seems like the issues are fixed now, but we'll keep a close eye on any Simulator abnormalities.

The tests are failing but that's not because of the changes on this PR, they're failing on main too.

@sukritkalra sukritkalra merged commit e6cc178 into main Nov 12, 2023
1 check passed
@sukritkalra sukritkalra deleted the load-alibaba-trace-in-chunks branch November 13, 2023 03:24
@ruizehung ruizehung changed the title Load Alibaba Trace in Batches in Simulator (In Progress) Load Alibaba Trace in Batches in Simulator Nov 17, 2023
3 participants