DLPX-72326 Use OOMD instead of the kernel's OOM killer #297

Status: Draft. Wants to merge 1 commit into master.
Conversation


@sdimitro commented Apr 29, 2021

Context:

Link: https://facebookmicrosites.github.io/oomd/

OOMD only works with 5.0+ kernels because it needs the PSI kernel interface (we should be good in terms of that) and a cgroups v2 hierarchy (I believe we are currently on v1, so we may have to look into that). The motivation for adopting it is to overcome difficulties we have with the current in-kernel OOM killer. Currently, when a process is killed by the OOM killer:

  • It is hard to tell whose fault it was. Was it the application asking for too much memory, or was it that the process had an unfortunate OOM score and was being starved by one or more other processes?
  • If it was the process's fault, a SIGKILL is sent, which is not very useful. We currently have custom logic for the appstack process to generate a core dump so we can analyze where the memory went. For any other process we don't have that luxury.
  • If it was some kind of systemic failure, e.g. many small processes starving the appstack or similar, we have no way of telling whether, how, or why such a thing happened.

OOMD is very flexible and configurable. It can address all of the above: it can be configured to send SIGABRT (and generate crash dumps) when a process takes too much memory, take a system-wide memory snapshot before killing a process, and enforce per-process memory limits.
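As a concrete illustration of that flexibility, an oomd configuration is a JSON file of rulesets, each pairing detector plugins with action plugins. The sketch below is illustrative only: the plugin names pressure_rising_beyond and kill_by_memory_size_or_growth exist upstream, but the cgroup names, thresholds, and ruleset name here are placeholders, not what we would actually ship:

```json
{
    "rulesets": [
        {
            "name": "protect against sustained memory pressure",
            "detectors": [
                [
                    "memory pressure rising on system.slice",
                    {
                        "name": "pressure_rising_beyond",
                        "args": {
                            "cgroup": "system.slice",
                            "resource": "memory",
                            "threshold": "60",
                            "duration": "30"
                        }
                    }
                ]
            ],
            "actions": [
                {
                    "name": "kill_by_memory_size_or_growth",
                    "args": {
                        "cgroup": "system.slice/*"
                    }
                }
            ]
        }
    ]
}
```

The exact schema and available plugins are documented in the oomd repository linked above.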

This Patch:

Enables cgroups v2 for the appstack by setting a kernel parameter during boot through GRUB.
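The patch itself isn't shown in this excerpt; on Ubuntu the usual way to do this is to append systemd's unified-hierarchy parameter to the kernel command line (the exact parameter this patch uses is my assumption):

```shell
# /etc/default/grub -- sketch; append to the existing value rather than
# replacing it. Assumes systemd's unified_cgroup_hierarchy knob.
GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=1"

# Then regenerate the GRUB config and reboot:
#   sudo update-grub
#   sudo reboot
```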

Testing:

I created a VM and passed it that kernel parameter. Then I manually checked that we had switched to cgroups v2 by looking at the cgroup directory under sysfs.

= Before the patch:

$ ls /sys/fs/cgroup/memory
cgroup.clone_children           memory.kmem.tcp.limit_in_bytes      memory.stat
cgroup.event_control            memory.kmem.tcp.max_usage_in_bytes  memory.swappiness
cgroup.procs                    memory.kmem.tcp.usage_in_bytes      memory.usage_in_bytes
cgroup.sane_behavior            memory.kmem.usage_in_bytes          memory.use_hierarchy
memory.failcnt                  memory.limit_in_bytes               notify_on_release
memory.force_empty              memory.max_usage_in_bytes           release_agent
memory.kmem.failcnt             memory.move_charge_at_immigrate     system.slice
memory.kmem.limit_in_bytes      memory.numa_stat                    tasks
memory.kmem.max_usage_in_bytes  memory.oom_control                  user.slice
memory.kmem.slabinfo            memory.pressure_level
memory.kmem.tcp.failcnt         memory.soft_limit_in_bytes

= After the patch (memory directory is gone as it is unified under the top-level cgroup directory in v2):

$ ls /sys/fs/cgroup/memory
ls: cannot access '/sys/fs/cgroup/memory': No such file or directory
$ ls /sys/fs/cgroup/
cgroup.controllers      cgroup.stat             cpuset.cpus.effective  io.cost.qos      user.slice
cgroup.max.depth        cgroup.subtree_control  cpuset.mems.effective  io.pressure
cgroup.max.descendants  cgroup.threads          init.scope             memory.pressure
cgroup.procs            cpu.pressure            io.cost.model          system.slice
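For tooling that may need to behave differently under v1 vs. v2 (e.g. the deferred-upgrade case discussed below), the active hierarchy can also be checked programmatically from /proc/mounts rather than by eyeballing sysfs. A minimal sketch; the helper name is mine, not part of this patch:

```python
def cgroup_version(mounts_text):
    """Classify the active cgroup setup from the contents of /proc/mounts.

    Returns 2 when the unified (v2) hierarchy is mounted at /sys/fs/cgroup,
    1 when only v1 controller hierarchies are mounted, and None when no
    cgroup filesystem is mounted at all.
    """
    version = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        if fstype == "cgroup2" and mountpoint == "/sys/fs/cgroup":
            return 2  # pure unified hierarchy, as after this patch
        if fstype == "cgroup":
            version = 1  # a v1 controller hierarchy, e.g. .../memory
    return version

# Usage on a live system:
#   with open("/proc/mounts") as f:
#       print(cgroup_version(f.read()))
```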

The only service that had a problem starting was our existing OOM service, which we wrote assuming cgroups v1:

$ sudo systemctl list-units --failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● delphix-oom-handler.service loaded failed failed OOM handler for delphix-mgmt.service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

So I disabled it and ran dx-test on that VM:

$ sudo systemctl stop delphix-oom-handler
$ sudo systemctl disable delphix-oom-handler
Removed /etc/systemd/system/delphix-mgmt.service.wants/delphix-oom-handler.service.
$ sudo systemctl daemon-reload
$ sudo systemctl reset-failed

dx-test:
http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/dx-integration-tests/23953/

The BB section of the above failed because I specified my VM as sd-bpf instead of sd-bpf.dcol2, so I re-ran the BB part:
http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/blackbox-self-service/89831/

Future Work

I still need to:

  • Ensure that cgroups v2 does not break the containers used in upgrade or docker used by the vSDK project (cc: @nhlien93)
  • Add OOMD to our list of packages (cc: @pzakha)
  • Test and ensure that no weird edge cases come up for deferred upgrade. For example, if OOMD was installed but we haven't rebooted yet, so cgroups v2 is not system-wide, would we hit any errors? (cc: @prakashsurya)
  • Create a commit that replaces our current OOM service with OOMD (cc: @johngallagher-dlpx)


@prakashsurya left a comment


The code change looks good, but I can't say one way or the other w.r.t. whether this is safe to do. Assuming this works with other parts of our software, LGTM.

@prakashsurya

Out of curiosity, how do you plan to handle deferred upgrade w.r.t. this change (or perhaps more accurately, w.r.t. changes that'll depend on cgroups v2)? E.g. I'm sure it requires a reboot to take advantage of cgroups v2.. so, will any changes we incorporate that require v2 be backwards compatible, such that it works with both v1 or v2, based on what's enabled on the system (e.g. if the customer rebooted after upgrade or not)?

@sdimitro

> Out of curiosity, how do you plan to handle deferred upgrade w.r.t. this change (or perhaps more accurately, w.r.t. changes that'll depend on cgroups v2)? E.g. I'm sure it requires a reboot to take advantage of cgroups v2.. so, will any changes we incorporate that require v2 be backwards compatible, such that it works with both v1 or v2, based on what's enabled on the system (e.g. if the customer rebooted after upgrade or not)?

That's a great question! I believe we wouldn't encounter any failures, as OOMD only checks that cgroups v2 is supported in order to run (so we'll get no errors from it), BUT it won't actually detect anything as no resource on our kernel will be using cgroups v2. I'll need to figure this out.

BTW I opened this as a draft to save my work and notes. This is kind of a side-project that's been slow-going.

Keep in mind that we'd need to deal with the cgroups v2 change regardless, as it will be the default in Ubuntu 21.04 and it will break our current OOM handler code. I'm not sure if we have a JIRA issue out yet for moving away from Ubuntu 18.04, but it would be nice if we could cross-reference that JIRA issue with this one.


@prakashsurya commented Apr 29, 2021

> BUT it won't actually detect anything as no resource on our kernel will be using cgroups v2.

Yea, so it'd probably be "bad" if we switched entirely to using OOMD and then had systems running v1 due to a deferred upgrade, with a silently nonfunctional OOMD and without our custom OOM handler (e.g. because we rip it out at the same time we add OOMD support).

> Keep in mind that we'd need to deal with the cgroups v2 change regardless as it will be the default in Ubuntu 21.04 and it will break our current OOM handler code.

Yea, understood.. and in that case, I'm not sure if we'd have to force a reboot or not.. but I'd be surprised if all of Ubuntu's software properly handled that v1 to v2 transition gracefully, in a deferred upgrade fashion.

> I'm not sure if we have a JIRA issue out yet for moving away from Ubuntu 18.04 but it would be nice if we could cross-reference that JIRA issue with this one.

IIRC, we have a current project to move from 18.04 to 20.04, and if that transition will require a reboot upgrade, it'd probably be good to piggyback off it to migrate to cgroups v2, as that could alleviate some of the pain of later moving to 21.04 or future versions.
