
feat(app): revised configure_torch_cuda_allocator() & testing strategy #7733

Merged
merged 7 commits into main from psyche/feat/app/cuda-alloc-util
Mar 5, 2025

Conversation

psychedelicious
Collaborator

Summary

configure_torch_cuda_allocator changes

  • Log warning instead of raising when PYTORCH_CUDA_ALLOC_CONF is set to a different value than is configured in invokeai.yaml.
  • Log info instead of raising when PYTORCH_CUDA_ALLOC_CONF is set to the same value that is configured in invokeai.yaml.
  • Add an explicit check in the util for torch having already been imported. Note that the check depends on the name of the module: if torch was imported under an alias like import torch as foo, the check won't work. But this should cover 99.9% of cases. A sketch of the resulting behaviour follows this list.
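
For reference, a minimal sketch of the behaviour described above. It assumes a sys.modules-based import check; the actual function signature, log messages, and the reaction when torch is already imported are illustrative and may differ from the real implementation:

import logging
import os
import sys

logger = logging.getLogger(__name__)


def configure_torch_cuda_allocator(pytorch_cuda_alloc_conf: str) -> None:
    # Must run before torch is imported, because torch reads the env var
    # once at import time.
    if "torch" in sys.modules:
        # Hypothetical reaction; the real util may raise or log here.
        raise RuntimeError("configure_torch_cuda_allocator() must be called before torch is imported")

    existing = os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
    if existing is not None:
        if existing == pytorch_cuda_alloc_conf:
            # Env var already matches the invokeai.yaml setting: nothing to do.
            logger.info("PYTORCH_CUDA_ALLOC_CONF already set to '%s'; skipping", existing)
        else:
            # Mismatch: warn instead of raising so the app never fails to start.
            logger.warning(
                "PYTORCH_CUDA_ALLOC_CONF is '%s', overriding the invokeai.yaml value '%s'",
                existing,
                pytorch_cuda_alloc_conf,
            )
        return

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = pytorch_cuda_alloc_conf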

Testing changes

I was running into an issue where I had the PYTORCH_CUDA_ALLOC_CONF env var set on my system and a test for configure_torch_cuda_allocator() failed. The function must be called before torch is imported, and the test is intended to confirm that behaviour.

When the env var is set before running the test, configure_torch_cuda_allocator() raises before reaching the code that the test intends to exercise.

To reproduce on main:

  • Run the test: pytest tests/app/util/test_torch_cuda_allocator.py
  • (test passes)
  • Set the env var to anything: export PYTORCH_CUDA_ALLOC_CONF="backend:native"
  • Re-run the test: pytest tests/app/util/test_torch_cuda_allocator.py
  • (test fails)

I fixed this by changing that first raise to a log instead. Then I added explicit tests to ensure the function behaves correctly when torch is imported before calling it.

But this doesn't work. The tests rely on torch's implicit import-time behaviour: torch configures its allocator from the PYTORCH_CUDA_ALLOC_CONF env var when it is imported. However, our test suite imports torch in many places, including this very test file. By the time we run the tests, torch is already imported. Catch-22!

So the test is unable to test the interaction of configure_torch_cuda_allocator() and torch's import behaviour as it should.

I could not find a simple, cross-platform way to have the tests run in isolated environments. Apparently del sys.modules['module_name'] essentially un-imports a module, but it doesn't work for torch (maybe related to its native code execution?).
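
For reference, the un-import trick looks like this (a sketch; json is just a stand-in for a pure-Python module):

import sys

import json  # the trick works for pure-Python modules like this one

del sys.modules["json"]  # "un-import": the next import re-executes the module
import json  # fresh import, module top-level code runs again

# torch is different: its C extensions stay loaded in the process, so
# deleting it from sys.modules does not reset its allocator state.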

So, I brute-forced it and added a utility dangerously_run_function_in_subprocess() to run a Python function in a subprocess. The utility takes a function without any args, runs it in a subprocess, and returns its stdout/stderr/exit code. This allows a test to run in total isolation! The function to be run cannot have any closures, of course.
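
A minimal sketch of the function-to-script idea (the utility's name matches, but the internals here are illustrative; the real implementation presumably handles error reporting and platform quirks more carefully):

import inspect
import subprocess
import sys
import textwrap


def dangerously_run_function_in_subprocess(func) -> tuple[str, str, int]:
    # Grab the function's source and append a call to it, producing a
    # standalone script. This is why the function must be fully
    # self-contained: closures and captured variables won't survive.
    source = textwrap.dedent(inspect.getsource(func))
    script = f"{source}\n\n{func.__name__}()\n"

    # Run the script with a fresh interpreter - a totally isolated process
    # in which torch has not yet been imported.
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.stdout, result.stderr, result.returncode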

The utility is tested separately and used to test the configure_torch_cuda_allocator() utility in an isolated environment.
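
Usage in a test might look like this (hypothetical test body and module path; the real tests differ in detail):

def test_configure_torch_cuda_allocator_runs_in_isolation():
    def isolated() -> None:
        # Executed in a fresh interpreter where torch has never been imported.
        from invokeai.app.util.torch_cuda_allocator import configure_torch_cuda_allocator

        configure_torch_cuda_allocator("backend:cudaMallocAsync")
        import torch  # torch reads PYTORCH_CUDA_ALLOC_CONF at import time

    _stdout, stderr, exit_code = dangerously_run_function_in_subprocess(isolated)
    assert exit_code == 0, stderr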

Related Issues / Discussions

Closes #7731

QA Instructions

Try combinations of "backend:native" and "backend:cudaMallocAsync" for the env var and for the setting in invokeai.yaml.

The app should:

  • Log at WARNING level when the values are mismatched
  • Log at INFO level when the values are matched
  • Never fail to start
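
Concretely, the expected matrix is:

  PYTORCH_CUDA_ALLOC_CONF   invokeai.yaml             expected log
  backend:native            backend:native            INFO (matched)
  backend:native            backend:cudaMallocAsync   WARNING (mismatched)
  backend:cudaMallocAsync   backend:native            WARNING (mismatched)
  backend:cudaMallocAsync   backend:cudaMallocAsync   INFO (matched)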

Merge Plan

n/a

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions bot added python PRs that change python files python-tests PRs that change python tests labels Mar 5, 2025
@psychedelicious psychedelicious force-pushed the psyche/feat/app/cuda-alloc-util branch from 372d3e8 to 9867b32 Compare March 5, 2025 02:25
@jazzhaiku
Collaborator

jazzhaiku commented Mar 5, 2025

@psychedelicious
No worries if you prefer the current implementation, but I thought this might be useful:

import sys
import traceback
from multiprocessing import Process, Queue


def subprocess_target(queue, func, *args, **kwargs):
    from io import StringIO

    # Capture the child's stdout/stderr so the parent can inspect them.
    stdout_capture = StringIO()
    stderr_capture = StringIO()

    sys.stdout = stdout_capture
    sys.stderr = stderr_capture

    try:
        func(*args, **kwargs)
    except Exception:
        # Route tracebacks to the captured stderr. SystemExit is not an
        # Exception, so sys.exit() still sets the process exit code.
        print(traceback.format_exc(), file=sys.stderr)
    finally:
        stdout = stdout_capture.getvalue().rstrip("\n")
        stderr = stderr_capture.getvalue().rstrip("\n")
        queue.put((stdout, stderr))


def run_function_in_subprocess(func, *args, **kwargs):
    queue = Queue()
    p = Process(target=subprocess_target, args=(queue, func, *args), kwargs=kwargs)
    p.start()

    # Drain the queue before joining: join() can deadlock if the child is
    # blocked writing to a queue whose buffer is full.
    stdout, stderr = queue.get()
    p.join()

    exit_code = p.exitcode
    return stdout, stderr, exit_code


def example():
    print("hi spencer")
    print("oh no", file=sys.stderr)
    sys.exit(42)


if __name__ == "__main__":
    stdout, stderr, exit_code = run_function_in_subprocess(example)
    assert stdout == "hi spencer"
    assert stderr == "oh no"
    assert exit_code == 42

@jazzhaiku
Collaborator

Great work figuring out a very subtle bug!

@jazzhaiku jazzhaiku self-requested a review March 5, 2025 05:32
@psychedelicious
Collaborator Author

Thanks! That's a nicer way to run the isolated code, but it is a bit awkward with arbitrary functions. If I understand correctly, you have to manually catch errors, reroute messages to stderr, and manually exit. The hacky function-to-script approach kinda gives you that for free. I'll keep the current implementation but keep this more Pythonic method in mind.

@psychedelicious psychedelicious enabled auto-merge (rebase) March 5, 2025 20:37
@psychedelicious psychedelicious force-pushed the psyche/feat/app/cuda-alloc-util branch from 9867b32 to bb0cd59 Compare March 5, 2025 20:39
@psychedelicious psychedelicious merged commit d2f8db9 into main Mar 5, 2025
15 checks passed
@psychedelicious psychedelicious deleted the psyche/feat/app/cuda-alloc-util branch March 5, 2025 20:49
@jazzhaiku
Collaborator

Hey @psychedelicious, just to clarify - in this example, the arbitrary function being executed is:

def example():
    print("hi spencer")
    print("oh no", file=sys.stderr)
    sys.exit(42)

The caller of run_function_in_subprocess doesn’t need to manually catch errors or redirect I/O - this is done within subprocess_target, which acts as a private helper function from the user’s perspective.
