-
Another idea is to make warcli "fatter" and rpc-0 "thinner," i.e., move more logic into warcli and only use the central server when coordination is absolutely necessary. Otherwise, we should lean on Kubernetes-native primitives as much as possible (like Jobs and Custom Resource Definitions). The goal is to let Kubernetes handle coordination and resource management wherever it can, because it will likely do a better job than we would trying to model all of that in a central server. cc @m3dwards
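To make that a bit more concrete, here is a minimal sketch of the "fatter warcli" direction, assuming the official `kubernetes` Python client; the function name, image, and args are placeholders, not actual warnet code:

```python
# Sketch only: a "fatter" warcli launching a scenario as a Kubernetes Job
# instead of routing it through the central RPC server. Image, args, and
# naming are hypothetical placeholders.
from kubernetes import client, config


def run_scenario_as_job(scenario: str, namespace: str = "warnet") -> None:
    """Create a Job for the scenario and let Kubernetes own retries/cleanup."""
    config.load_kube_config()  # use the caller's kubeconfig, as warcli would

    container = client.V1Container(
        name="scenario",
        image="example/warnet-scenario:latest",  # placeholder image
        args=["--scenario", scenario],           # placeholder args
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"scenario-{scenario}"),
        spec=client.V1JobSpec(
            backoff_limit=3,  # Kubernetes retries the scenario on failure
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[container],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```

The point is that the Job's own backoff and restart handling would replace state the RPC server currently has to track itself.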
-
Jotting down some thoughts on scaling and resilience for warnet simulations. Starting with a discussion, with the expectation that we'll spin off issues for concrete deliverables.
Today, if I want to launch a warnet cluster, I need to know the size of my simulation in advance so I can provision the cluster correctly, e.g., choose the right resources per node and the right number of nodes per cluster. If I under-specify these resources, my cluster will deploy, but it will be left in an unusable state and I will need to destroy the warnet namespace and start over.
Ideally, if I deploy a network with too few resources, the cluster should be able to autoscale (add more nodes) and the network/scenario should eventually be able to recover.
On autoscaling: this is something that should just work via cluster autoscaling, but it doesn't seem to. Generally, autoscaling kicks in when you have unschedulable pods. What I noticed, though, was that all of the pods did get scheduled and were in a running state, but when the RPC server tried to dispatch commands, things started to fall over. Ideally, the cluster should know how much CPU and RAM a pod might need once a scenario is launched, and refuse to schedule it if there are not enough nodes to meet that capacity, i.e., maybe we should set explicit resource requests on the pods?
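To illustrate the explicit-requests idea (the numbers are made up, and this uses the kubernetes Python client rather than whatever deployment path warnet actually uses today):

```python
# Illustrative only: explicit requests/limits on a node container so the
# scheduler (and cluster autoscaler) can account for scenario load up front.
# The values are guesses, not measured warnet requirements.
from kubernetes import client

bitcoin_container = client.V1Container(
    name="bitcoind",
    image="example/bitcoind:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},  # reserved for the pod
        limits={"cpu": "1", "memory": "1Gi"},         # hard ceiling per pod
    ),
)
```

With requests set, a network that doesn't fit stays Pending instead of limping along, and Pending pods are exactly the signal the cluster autoscaler acts on.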
On the recovery aspect, I think things are a bit fragile with the RPC server controlling every aspect of activity. It might make more sense to do something like:
The goal here is that if the cluster is ever swamped and starts killing pods, things should ideally keep running until more resources can be added (either manually or via autoscaling). As more resources become available, the simulation should gracefully resume, or at least be restartable.
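One small building block for that kind of resilience (a sketch, not existing warnet code) is to make scenario-side RPC calls tolerate transient failures with backoff instead of aborting the whole run:

```python
# Sketch: retry transient RPC failures with exponential backoff so a scenario
# survives brief periods where pods are being evicted or rescheduled.
# `call` is any zero-argument function that talks to a node's RPC endpoint.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], attempts: int = 6, base_delay: float = 1.0) -> T:
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller (or Kubernetes) decide
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")


# Usage (hypothetical helper name):
# block_count = with_backoff(lambda: node_rpc("getblockcount"))
```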
Lastly, another thing to consider is what information gets persisted so that at any point a pod can be knocked over and restart basically where it left off.
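As one possible shape for that, purely as a sketch with an invented path and fields, a scenario could checkpoint its progress to a persistent volume and reload it on startup:

```python
# Sketch: checkpoint scenario progress to a file on a PersistentVolume so a
# restarted pod can resume instead of starting from scratch. The path and the
# checkpoint fields are invented for illustration.
import json
from pathlib import Path

CHECKPOINT = Path("/state/scenario-checkpoint.json")  # assumes a PVC mounted at /state


def save_progress(step: int, extra: dict) -> None:
    CHECKPOINT.write_text(json.dumps({"step": step, **extra}))


def load_progress() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"step": 0}  # fresh start if no checkpoint was persisted
```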