Reducing Boot Time in SONiC by Replacing Process manager #1922

Open
wants to merge 9 commits into
base: master

@vidhya-rajan vidhya-rajan commented Feb 10, 2025

Reducing Boot Time in SONiC by Replacing Process manager

What we did:
Replaced the current process manager in SONiC (supervisord) with runit, a more efficient alternative.

Why we did it:
To improve startup speed, this design optimizes service initialization by replacing the existing process manager with a higher-performance alternative. This is particularly important for switches that run SONiC on the ASIC's internal CPU.

Support added:
Replaced the supervisord process manager with runit, which monitors all processes as supervisord does.

@mssonicbld
Collaborator

/azp run

No pipelines are associated with this pull request.

Initialization performance analysis showed that supervisord and supervisorctl contribute significantly to boot time, consuming roughly 20% of the total initialization period. This suggests that migrating away from these Python-based tools could improve performance, since Python applications generally exhibit slower startup times in this kind of scenario.

![alt text](perf-supervisorctl.png)
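
To put a rough number on the interpreter startup cost mentioned above, one could time a no-op Python process on the target CPU. This is only a minimal sketch and not the methodology behind the perf profile; the helper name `measure_startup` and the run count are illustrative assumptions.

```python
import subprocess
import sys
import time


def measure_startup(cmd, runs=5):
    """Return the average wall-clock seconds needed to spawn `cmd` and wait for it."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    # A Python process that does nothing: its cost is pure interpreter startup.
    avg = measure_startup([sys.executable, "-c", "pass"])
    print(f"average interpreter startup: {avg * 1000:.1f} ms")
```

Running the same helper against a small compiled binary on the same device would give a baseline for comparison.
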
Contributor

How long was perf sampling the system during boot? Could you please share the testing methodology?

Side note: supervisorctl shouldn't be invoked during startup, as its usage was replaced by the supervisord-dependent-startup plugin.


### 4.5 Supervisord to Runit config translation

The option we have chosen is a Python script that automates the conversion of existing supervisord configurations into the runit format. The script runs as part of the Docker entrypoint and transforms the provided supervisord configuration into runit service directories. This eases migration for Docker applications that use Jinja2-templated configuration files alongside traditional supervisord.conf files. Other options, such as static sv scripts, are also possible.
Contributor

The script generating configs from j2 templates at boot will itself be heavy on CPU. How do you plan to mitigate that? Can the configs be generated at build time?

A Supervisord configuration defines a program named orchagent. This program, /usr/bin/orchagent.sh, depends on the portsyncd service. The conversion script translates this dependency into a runit run script for the orchagent service. The generated /etc/service/orchagent/run script waits for portsyncd to reach a running state before executing /usr/bin/orchagent.sh. This ensures the dependency is met before the orchagent process starts.
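
The translation described above can be sketched roughly as follows. This is not the conversion script from the PR: the sample configuration, the `dependent_startup_wait_for` key (borrowed from the supervisord-dependent-startup plugin convention), and the helper names `render_run_script` and `convert` are assumptions for illustration. The generated script relies on `sv check`, which waits a few seconds for the dependency to reach its requested state; it is assumed here that a non-zero exit lets runsv retry the run script.

```python
import configparser

# Hypothetical supervisord snippet; the dependent_startup_* keys follow the
# supervisord-dependent-startup plugin convention and are an assumption here.
SUPERVISORD_CONF = """
[program:orchagent]
command=/usr/bin/orchagent.sh
dependent_startup=true
dependent_startup_wait_for=portsyncd:running
"""


def render_run_script(command, wait_for):
    """Render a runit ./run script that waits for dependencies, then execs."""
    lines = ["#!/bin/sh"]
    for dep in wait_for:
        # `sv check` waits for the service to reach its requested state;
        # on timeout the script exits so that runsv can retry ./run later.
        lines.append(f"sv check /etc/service/{dep} >/dev/null || exit 1")
    lines.append(f"exec {command}")
    return "\n".join(lines) + "\n"


def convert(conf_text):
    """Map each [program:x] section to the text of its runit run script."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    scripts = {}
    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        name = section.split(":", 1)[1]
        raw = parser[section].get("dependent_startup_wait_for", fallback="")
        # Entries look like "portsyncd:running"; keep only the service name.
        deps = [item.split(":", 1)[0] for item in raw.split()]
        scripts[name] = render_run_script(parser[section]["command"], deps)
    return scripts


if __name__ == "__main__":
    print(convert(SUPERVISORD_CONF)["orchagent"])
```

Note that `sv check` only confirms the dependency is up unless the dependency's service directory also provides a `check` script, so this sketch inherits the same readiness caveat as supervisord's RUNNING state.
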
Contributor

How do you determine a running state with runit? Please check https://supervisord.org/subprocess.html#process-states to ensure the same behaviour, including the startsecs= parameter.
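
One way the startsecs= behaviour could be approximated on top of runit is to treat a service as RUNNING only once `sv status` reports it up for at least that many seconds. The sketch below parses a status line of the form `run: <service>: (pid <pid>) <n>s`. The function name `is_running` is an assumption, the default of 1 second mirrors supervisord's documented startsecs default, and none of this has been checked against the PR's implementation.

```python
import re

# Matches the leading part of an `sv status` line for a running service,
# e.g. "run: /etc/service/portsyncd: (pid 1234) 17s".
_RUN_LINE = re.compile(r"^run: (?P<service>\S+): \(pid (?P<pid>\d+)\) (?P<secs>\d+)s")


def is_running(status_line, startsecs=1):
    """Return True if the service has been up for at least `startsecs` seconds."""
    match = _RUN_LINE.match(status_line)
    if match is None:
        # "down: ..." or any unexpected output counts as not running.
        return False
    return int(match.group("secs")) >= startsecs
```

A freshly restarted service reports `0s` and is therefore not yet considered running, which is the closest analogue to supervisord's STARTING state in this sketch.
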

Contributor

Side note: did you investigate why orchagent has a dependency on portsyncd? The whole idea of synchronizing processes using a supervisord process state like RUNNING does not seem to guarantee any synchronization.

Similarly, a common pattern across multiple containers is to start all processes after rsyslogd reaches the RUNNING state in supervisord, which does not actually guarantee that rsyslogd is capable of receiving syslog messages.

Does runit (or any of the alternatives) have something like systemd's sd_notify?

Contributor

Also, please check sonic-net/sonic-buildimage#13765 for more info on the delays caused by how supervisor works.


### 4.6 Enabling Runit as the process manager

To enable runit as the process manager, create an empty file named /etc/runit-manager and then trigger a configuration reload. This can be achieved by executing the command config reload or by reloading the device.
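
The marker-file switch described above amounts to a simple existence check at container start. A minimal sketch, assuming a hypothetical helper named `select_process_manager` (the PR's entrypoint logic may differ):

```python
import os

RUNIT_MARKER = "/etc/runit-manager"  # marker file named in the text above


def select_process_manager(marker_path=RUNIT_MARKER):
    """Return "runit" when the marker file exists, otherwise "supervisord"."""
    return "runit" if os.path.exists(marker_path) else "supervisord"
```

Because the check happens only at startup, creating or removing the file takes effect on the next config reload, which matches the procedure in the paragraph above.
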
Contributor

Why support both at the same time?

## 7. Warmboot and Fastboot Design Impact
N/A
Contributor

This has an impact on warm and fast boot; please include reboot and upgrade timing.

Any open issues or action items will be tracked here. This may include tasks like benchmarking different process managers, developing the configuration conversion tool, and updating the init process.

Currently, runit offers no equivalent of supervisord's supervisor-proc-exit-listener for syslog alerting based on process states. This is a functionality gap we need to address.
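
One possible direction for closing this gap is runit's per-service `finish` script, which runsv runs after `run` exits and which receives the exit code and the low byte of the wait status as arguments. The sketch below is an assumption-laden illustration, not part of the PR: the syslog ident `runit-exit-listener`, the message wording, and the calling convention (service name passed as the first argument by a hypothetical `finish` wrapper) are all made up for this example.

```python
import sys
import syslog


def format_exit_message(service, exit_code, wait_status):
    """Build the alert text for a service whose ./run script has exited."""
    return (f"Process '{service}' exited unexpectedly "
            f"(exit code {exit_code}, wait status {wait_status})")


def main(argv):
    # Expected arguments: service name, then the two values runsv passes to
    # ./finish (exit code of ./run, and the low byte of its wait status).
    service, exit_code, wait_status = argv[1], int(argv[2]), int(argv[3])
    syslog.openlog(ident="runit-exit-listener")  # ident is a made-up name
    syslog.syslog(syslog.LOG_ERR, format_exit_message(service, exit_code, wait_status))


if __name__ == "__main__":
    main(sys.argv)
```

Since a Python hook reintroduces interpreter startup on every process exit, a shell or compiled equivalent may fit the goals of this design better; the logic would be the same.
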
Contributor

How do supervisord process states map to runit?

{
"PROCESS_MANAGER": {
"runit": {
"enabled": "true"
Contributor

How is this related to the file-based configuration described in 4.6?

Restarting the Docker container based on the auto-restart attribute in the Feature table is not currently handled and will be addressed later.
Contributor

Container lifetime is not controlled by the init process inside the container. Could you please clarify?


Potential replacement process managers will be evaluated based on criteria such as speed, resource consumption, and ease of integration with SONiC:
Contributor

Has systemd been evaluated? I realize that it is heavier in terms of code size and feature set (and possible impact to disk space), but I'm hoping that the fact that it is compiled code (compared to Python-based supervisord) still shows an improvement while preserving features.

4 participants