
feat(app): revised configure_torch_cuda_allocator() & testing strategy #7733

Merged
merged 7 commits into main from psyche/feat/app/cuda-alloc-util
Mar 5, 2025

Conversation

psychedelicious
Collaborator

Summary

configure_torch_cuda_allocator changes

  • Log warning instead of raising when PYTORCH_CUDA_ALLOC_CONF is set to a different value than is configured in invokeai.yaml.
  • Log info instead of raising when PYTORCH_CUDA_ALLOC_CONF is set to the same value that is configured in invokeai.yaml.
  • Add an explicit check in the util for torch having already been imported. Note that the check depends on the name of the module: if torch was imported under an alias like import torch as foo, the check won't work. But this should cover 99.9% of cases. A sketch of the resulting behaviour follows this list.
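
For reference, a minimal sketch of the behaviour described above. It assumes a sys.modules-based import check; the actual function signature, log messages, and the reaction when torch is already imported are illustrative and may differ from the real implementation:

import logging
import os
import sys

logger = logging.getLogger(__name__)


def configure_torch_cuda_allocator(pytorch_cuda_alloc_conf: str) -> None:
    # Must run before torch is imported, because torch reads the env var
    # once at import time.
    if "torch" in sys.modules:
        # Hypothetical reaction; the real util may raise or log here.
        raise RuntimeError("configure_torch_cuda_allocator() must be called before torch is imported")

    existing = os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
    if existing is not None:
        if existing == pytorch_cuda_alloc_conf:
            # Env var already matches the invokeai.yaml setting: nothing to do.
            logger.info("PYTORCH_CUDA_ALLOC_CONF already set to '%s'; skipping", existing)
        else:
            # Mismatch: warn instead of raising so the app never fails to start.
            logger.warning(
                "PYTORCH_CUDA_ALLOC_CONF is '%s', overriding the invokeai.yaml value '%s'",
                existing,
                pytorch_cuda_alloc_conf,
            )
        return

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = pytorch_cuda_alloc_conf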

Testing changes

I was running into an issue where I had the PYTORCH_CUDA_ALLOC_CONF env var set on my system and a test for configure_torch_cuda_allocator() failed. The function must be called before torch is imported, and the test is intended to confirm that behaviour.

When the env var is set before running the test, configure_torch_cuda_allocator() raises before reaching the code that the test intends to exercise.

To reproduce on main:

  • Run the test: pytest tests/app/util/test_torch_cuda_allocator.py
  • (test passes)
  • Set the env var to anything: export PYTORCH_CUDA_ALLOC_CONF="backend:native"
  • Re-run the test: pytest tests/app/util/test_torch_cuda_allocator.py
  • (test fails)

I fixed this by changing that first raise to a log instead. Then I added explicit tests to ensure the function behaves correctly when torch is imported before calling it.

But this doesn't work. The tests rely on torch's implicit import-time behaviour: torch configures its allocator from the PYTORCH_CUDA_ALLOC_CONF env var when it is imported. However, our test suite imports torch in many places, including this very test file. By the time we run the tests, torch is already imported. Catch-22!

So the test is unable to test the interaction of configure_torch_cuda_allocator() and torch's import behaviour as it should.

I could not find a simple, cross-platform way to have the tests run in isolated environments. Apparently del sys.modules['module_name'] essentially un-imports a module, but it doesn't work for torch (maybe related to its native code execution?).
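
For reference, the un-import trick looks like this (a sketch; json is just a stand-in for a pure-Python module):

import sys

import json  # the trick works for pure-Python modules like this one

del sys.modules["json"]  # "un-import": the next import re-executes the module
import json  # fresh import, module top-level code runs again

# torch is different: its C extensions stay loaded in the process, so
# deleting it from sys.modules does not reset its allocator state.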

So, I brute-forced it and added a utility dangerously_run_function_in_subprocess() to run a Python function in a subprocess. The utility takes a function without any args, runs it in a subprocess, and returns its stdout/stderr/exit code. This allows a test to run in total isolation! The function to be run cannot have any closures, of course.
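
A minimal sketch of the function-to-script idea (the utility's name matches, but the internals here are illustrative; the real implementation presumably handles error reporting and platform quirks more carefully):

import inspect
import subprocess
import sys
import textwrap


def dangerously_run_function_in_subprocess(func) -> tuple[str, str, int]:
    # Grab the function's source and append a call to it, producing a
    # standalone script. This is why the function must be fully
    # self-contained: closures and captured variables won't survive.
    source = textwrap.dedent(inspect.getsource(func))
    script = f"{source}\n\n{func.__name__}()\n"

    # Run the script with a fresh interpreter - a totally isolated process
    # in which torch has not yet been imported.
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.stdout, result.stderr, result.returncode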

The utility is tested separately and used to test the configure_torch_cuda_allocator() utility in an isolated environment.
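
Usage in a test might look like this (hypothetical test body and module path; the real tests differ in detail):

def test_configure_torch_cuda_allocator_runs_in_isolation():
    def isolated() -> None:
        # Executed in a fresh interpreter where torch has never been imported.
        from invokeai.app.util.torch_cuda_allocator import configure_torch_cuda_allocator

        configure_torch_cuda_allocator("backend:cudaMallocAsync")
        import torch  # torch reads PYTORCH_CUDA_ALLOC_CONF at import time

    _stdout, stderr, exit_code = dangerously_run_function_in_subprocess(isolated)
    assert exit_code == 0, stderr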

Related Issues / Discussions

Closes #7731

QA Instructions

Try combinations of "backend:native" and "backend:cudaMallocAsync" for the env var and for the setting in invokeai.yaml.

The app should:

  • Log at WARNING level when the values are mismatched
  • Log at INFO level when the values are matched
  • Never fail to start
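
Concretely, the expected matrix is:

  PYTORCH_CUDA_ALLOC_CONF   invokeai.yaml             expected log
  backend:native            backend:native            INFO (matched)
  backend:native            backend:cudaMallocAsync   WARNING (mismatched)
  backend:cudaMallocAsync   backend:native            WARNING (mismatched)
  backend:cudaMallocAsync   backend:cudaMallocAsync   INFO (matched)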

Merge Plan

n/a

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions bot added python PRs that change python files python-tests PRs that change python tests labels Mar 5, 2025
@psychedelicious psychedelicious force-pushed the psyche/feat/app/cuda-alloc-util branch from 372d3e8 to 9867b32 Compare March 5, 2025 02:25
@jazzhaiku
Collaborator

jazzhaiku commented Mar 5, 2025

@psychedelicious
No worries if you prefer the current implementation, but I thought this might be useful:

import sys
import traceback
from multiprocessing import Process, Queue


def subprocess_target(queue, func, *args, **kwargs):
    from io import StringIO

    # Capture the child's stdout/stderr so the parent can inspect them.
    stdout_capture = StringIO()
    stderr_capture = StringIO()

    sys.stdout = stdout_capture
    sys.stderr = stderr_capture

    try:
        func(*args, **kwargs)
    except Exception:
        # Route tracebacks to the captured stderr. SystemExit is not an
        # Exception, so sys.exit() still sets the process exit code.
        print(traceback.format_exc(), file=sys.stderr)
    finally:
        stdout = stdout_capture.getvalue().rstrip("\n")
        stderr = stderr_capture.getvalue().rstrip("\n")
        queue.put((stdout, stderr))


def run_function_in_subprocess(func, *args, **kwargs):
    queue = Queue()
    p = Process(target=subprocess_target, args=(queue, func, *args), kwargs=kwargs)
    p.start()

    # Drain the queue before joining: join() can deadlock if the child is
    # blocked writing to a queue whose buffer is full.
    stdout, stderr = queue.get()
    p.join()

    exit_code = p.exitcode
    return stdout, stderr, exit_code


def example():
    print("hi spencer")
    print("oh no", file=sys.stderr)
    sys.exit(42)


if __name__ == "__main__":
    stdout, stderr, exit_code = run_function_in_subprocess(example)
    assert stdout == "hi spencer"
    assert stderr == "oh no"
    assert exit_code == 42

@jazzhaiku
Collaborator

Great work figuring out a very subtle bug!

@jazzhaiku jazzhaiku self-requested a review March 5, 2025 05:32
@psychedelicious
Collaborator Author

Thanks! That's a nicer way to run the isolated code, but it is a bit awkward with arbitrary functions. If I understand correctly, you have to manually catch errors, reroute messages to stderr, and manually exit. The hacky function-to-script approach kinda gives you that for free. I'll keep the current implementation but keep this more Pythonic method in mind.

@psychedelicious psychedelicious enabled auto-merge (rebase) March 5, 2025 20:37
@psychedelicious psychedelicious force-pushed the psyche/feat/app/cuda-alloc-util branch from 9867b32 to bb0cd59 Compare March 5, 2025 20:39
@psychedelicious psychedelicious merged commit d2f8db9 into main Mar 5, 2025
15 checks passed
@psychedelicious psychedelicious deleted the psyche/feat/app/cuda-alloc-util branch March 5, 2025 20:49
@jazzhaiku
Collaborator

Hey @psychedelicious, just to clarify - in this example, the arbitrary function being executed is:

def example():
    print("hi spencer")
    print("oh no", file=sys.stderr)
    sys.exit(42)

The caller of run_function_in_subprocess doesn’t need to manually catch errors or redirect I/O - this is done within subprocess_target, which acts as a private helper function from the user’s perspective.
