-
Another idea is to make warcli "fatter" and rpc-0 "thinner," i.e., move more logic into warcli and only use the central server when coordination is absolutely necessary. Otherwise, we should lean on Kubernetes-native primitives as much as possible (like Jobs and Custom Resource Definitions). The goal is to let Kubernetes handle coordination and resource management wherever it can, because it will likely do a better job than we would trying to model all of that in a central server. cc @m3dwards
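To make that a bit more concrete, here is a minimal sketch of the "fatter warcli" direction, assuming the official `kubernetes` Python client; the function name, image, and args are placeholders, not actual warnet code:

```python
# Sketch only: a "fatter" warcli launching a scenario as a Kubernetes Job
# instead of routing it through the central RPC server. Image, args, and
# naming are hypothetical placeholders.
from kubernetes import client, config


def run_scenario_as_job(scenario: str, namespace: str = "warnet") -> None:
    """Create a Job for the scenario and let Kubernetes own retries/cleanup."""
    config.load_kube_config()  # use the caller's kubeconfig, as warcli would

    container = client.V1Container(
        name="scenario",
        image="example/warnet-scenario:latest",  # placeholder image
        args=["--scenario", scenario],           # placeholder args
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"scenario-{scenario}"),
        spec=client.V1JobSpec(
            backoff_limit=3,  # Kubernetes retries the scenario on failure
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[container],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```

The point is that the Job's own backoff and restart handling would replace state the RPC server currently has to track itself.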
-
Jotting down some thoughts on scaling and resilience for warnet simulations. Starting with a discussion, with the expectation that we'll spin off issues for concrete deliverables.
Today, if I want to launch a warnet cluster, I need to know the size of my simulation in advance so I can provision the cluster correctly, e.g., choose the right resources per node and the right number of nodes per cluster. If I under-specify these resources, my cluster will deploy, but it will be left in an unusable state and I will need to destroy the warnet namespace and start over.
Ideally, if I deploy a network with too few resources, the cluster should be able to autoscale (add more nodes) and the network/scenario should eventually be able to recover.
On autoscaling: this is something that should just work via cluster autoscaling, but it doesn't seem to. Generally, autoscaling kicks in when you have unschedulable pods. What I noticed, though, was that all of the pods did get scheduled and were in a running state, but when the RPC server tried to dispatch commands, things started to fall over. Ideally, the cluster should know how much CPU and RAM a pod might need once a scenario is launched, and refuse to schedule it if there are not enough nodes to meet that capacity, i.e., maybe we should set explicit resource requests on the pods?
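To illustrate the explicit-requests idea (the numbers are made up, and this uses the kubernetes Python client rather than whatever deployment path warnet actually uses today):

```python
# Illustrative only: explicit requests/limits on a node container so the
# scheduler (and cluster autoscaler) can account for scenario load up front.
# The values are guesses, not measured warnet requirements.
from kubernetes import client

bitcoin_container = client.V1Container(
    name="bitcoind",
    image="example/bitcoind:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},  # reserved for the pod
        limits={"cpu": "1", "memory": "1Gi"},         # hard ceiling per pod
    ),
)
```

With requests set, a network that doesn't fit stays Pending instead of limping along, and Pending pods are exactly the signal the cluster autoscaler acts on.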
On the recovery aspect, I think things are a bit fragile with the RPC server controlling every aspect of activity. It might make more sense to do something like:
The goal here is that if the cluster is ever swamped and starts killing pods, things should ideally keep running until more resources can be added (either manually or via autoscaling). As more resources become available, the simulation should gracefully resume, or at least be restartable.
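One small building block for that kind of resilience (a sketch, not existing warnet code) is to make scenario-side RPC calls tolerate transient failures with backoff instead of aborting the whole run:

```python
# Sketch: retry transient RPC failures with exponential backoff so a scenario
# survives brief periods where pods are being evicted or rescheduled.
# `call` is any zero-argument function that talks to a node's RPC endpoint.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], attempts: int = 6, base_delay: float = 1.0) -> T:
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller (or Kubernetes) decide
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")


# Usage (hypothetical helper name):
# block_count = with_backoff(lambda: node_rpc("getblockcount"))
```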
Lastly, another thing to consider is what information gets persisted so that at any point a pod can be knocked over and restart basically where it left off.
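As one possible shape for that, purely as a sketch with an invented path and fields, a scenario could checkpoint its progress to a persistent volume and reload it on startup:

```python
# Sketch: checkpoint scenario progress to a file on a PersistentVolume so a
# restarted pod can resume instead of starting from scratch. The path and the
# checkpoint fields are invented for illustration.
import json
from pathlib import Path

CHECKPOINT = Path("/state/scenario-checkpoint.json")  # assumes a PVC mounted at /state


def save_progress(step: int, extra: dict) -> None:
    CHECKPOINT.write_text(json.dumps({"step": step, **extra}))


def load_progress() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"step": 0}  # fresh start if no checkpoint was persisted
```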