Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provision step should store even guests not fully provisioned #3514

Open
wants to merge 5 commits into
base: cooperative-sigint-handling
Choose a base branch
from

Conversation

happz
Copy link
Collaborator

@happz happz commented Feb 7, 2025

This should free finish to clean up guests even partially provisioned, for example in the case of tmt termination while provision is still running.

Pull Request Checklist

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

@happz happz added step | provision Stuff related to the provision step ci | full test Pull request is ready for the full test execution labels Feb 7, 2025
@happz happz added this to the 1.43 milestone Feb 7, 2025
@happz
Copy link
Collaborator Author

happz commented Feb 7, 2025

Still a draft, I have no idea what would explode in tests, so, collecting results.

@happz happz force-pushed the provision-guest-eager-storage branch from 005af81 to 939fead Compare February 10, 2025 08:50
@happz happz force-pushed the provision-guest-eager-storage branch 2 times, most recently from d479195 to 1dc3e47 Compare February 12, 2025 11:58
Copy link
Collaborator

@psss psss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving this. Looks good. One difference I've noticed: When scheduling a guest in Beaker and it fails I get just a concise fail message after the change:

/var/tmp/tmt/run-137

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-137-XtMVALFw
        fail: Failed to create Beaker job
        multihost name: default-0
    finish
        queued finish task #1: login
        
        finish task #1: login
        login: Starting interactive shell
        warn: Failed to push workdir to the guest.
    
        queued pull task #1: pull from default-0
        
        pull task #1: pull from default-0
    fail: The guest is not available.

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    finish step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        The guest is not available.

Before the change there's more details to investigate:

/var/tmp/tmt/run-138

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-138-QgpCiVBX
        fail: Failed to create Beaker job
    finish
        warn: Nothing to finish, no guests provisioned.

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    provision step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        Failed to create Beaker job

        The exception was caused by 1 earlier exceptions

        Cause number 1:

            ({'response': '\tNo distro tree matches Recipe\n\t<distroRequires><and><distro_name op="like" value="Fedora-41%"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'}, {'name': 'fedora-x86_64', 'distro': 'Fedora-41%', 'os': 'fedora', 'job_group': None, 'meta_distro': False, 'arch': 'x86_64', 'variant': 'BaseOS', 'job_owner': None, 'ks_meta': None, 'kernel_options': None, 'kernel_options_post': None, 'retention_tag': 'audit', 'product': '[internal]', 'whiteboard': 'tmt-138-QgpCiVBX', 'priority': 'Normal', 'tasks': [{'name': '/distribution/dummy', 'role': 'STANDALONE'}], 'ks_append': [], 'hostRequires': None, 'distro_tags': [], 'host_id': 'fedora-x86_64'})

            The exception was caused by 1 earlier exceptions

            Cause number 1:

                <Fault 1: '<class \'bkr.common.bexceptions.BX\'>:No distro tree matches Recipe: <distroRequires><and><distro_name op="like" value="Fedora-41%"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'>

Also, it seems that the login action is attempted although there is no guest:

        login: Starting interactive shell
        warn: Failed to push workdir to the guest.

@happz
Copy link
Collaborator Author

happz commented Feb 14, 2025

Thanks for improving this. Looks good. One difference I've noticed: When scheduling a guest in Beaker and it fails I get just a concise fail message after the change:

/var/tmp/tmt/run-137

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-137-XtMVALFw
        fail: Failed to create Beaker job
        multihost name: default-0
    finish
        queued finish task #1: login
        
        finish task #1: login
        login: Starting interactive shell
        warn: Failed to push workdir to the guest.
    
        queued pull task #1: pull from default-0
        
        pull task #1: pull from default-0
    fail: The guest is not available.

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    finish step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        The guest is not available.

Before the change there's more details to investigate:

/var/tmp/tmt/run-138

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-138-QgpCiVBX
        fail: Failed to create Beaker job
    finish
        warn: Nothing to finish, no guests provisioned.

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    provision step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        Failed to create Beaker job

        The exception was caused by 1 earlier exceptions

        Cause number 1:

            ({'response': '\tNo distro tree matches Recipe\n\t<distroRequires><and><distro_name op="like" value="Fedora-41%"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'}, {'name': 'fedora-x86_64', 'distro': 'Fedora-41%', 'os': 'fedora', 'job_group': None, 'meta_distro': False, 'arch': 'x86_64', 'variant': 'BaseOS', 'job_owner': None, 'ks_meta': None, 'kernel_options': None, 'kernel_options_post': None, 'retention_tag': 'audit', 'product': '[internal]', 'whiteboard': 'tmt-138-QgpCiVBX', 'priority': 'Normal', 'tasks': [{'name': '/distribution/dummy', 'role': 'STANDALONE'}], 'ks_append': [], 'hostRequires': None, 'distro_tags': [], 'host_id': 'fedora-x86_64'})

            The exception was caused by 1 earlier exceptions

            Cause number 1:

                <Fault 1: '<class \'bkr.common.bexceptions.BX\'>:No distro tree matches Recipe: <distroRequires><and><distro_name op="like" value="Fedora-41%"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'>

Can you share your command line? I just tried it and the error seems to be rich, but maybe I'm trying different path:

$ tmt run provision -h beaker --image not-really-existing-compose plan -n /plans/features/core
/var/tmp/tmt/run-181

/plans/features/core
    provision
        queued provision.provision task #1: default-0

        provision.provision task #1: default-0
        how: beaker
        image: not-really-existing-compose
        whiteboard: tmt-181-jBvpMbGp
        fail: Failed to create Beaker job
        multihost name: default-0

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    provision step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        Failed to create Beaker job

        The exception was caused by 1 earlier exceptions

        Cause number 1:

            ({'response': '\tNo distro tree matches Recipe\n\t<distroRequires><and><distro_name op="=" value="not-really-existing-compose"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'}, {'name': 'not-really-existing-compose-x86_64', 'distro': 'not-really-existing-compose', 'os': 'not-really-existing-compose', 'job_group': None, 'meta_distro': False, 'arch': 'x86_64', 'variant': 'BaseOS', 'job_owner': None, 'ks_meta': None, 'kernel_options': None, 'kernel_options_post': None, 'retention_tag': 'audit', 'product': '[internal]', 'whiteboard': 'tmt-181-jBvpMbGp', 'priority': 'Normal', 'tasks': [{'name': '/distribution/dummy', 'role': 'STANDALONE'}], 'ks_append': [], 'hostRequires': None, 'distro_tags': [], 'host_id': 'not-really-existing-compose-x86_64'})

            The exception was caused by 1 earlier exceptions

            Cause number 1:

                <Fault 1: '<class \'bkr.common.bexceptions.BX\'>:No distro tree matches Recipe: <distroRequires><and><distro_name op="=" value="not-really-existing-compose"/><distro_variant op="=" value="BaseOS"/><distro_arch op="=" value="x86_64"/></and></distroRequires>'>

Also, it seems that the login action is attempted although there is no guest:

        login: Starting interactive shell
        warn: Failed to push workdir to the guest.

I believe that's caused by actions now seeing all guests and simply ignoring the fact some of them might be not ready enough. I'll try to fix that, can you share the command line for verification?

@happz
Copy link
Collaborator Author

happz commented Feb 14, 2025

The internal-plugins failure now has a downstream patch.

@happz happz force-pushed the provision-guest-eager-storage branch from 1dc3e47 to b72fadb Compare February 14, 2025 09:51
@happz happz force-pushed the provision-guest-eager-storage branch from 5ad340d to 5992e36 Compare February 18, 2025 08:16
@psss
Copy link
Collaborator

psss commented Feb 18, 2025

Can you share your command line? I just tried it and the error seems to be rich, but maybe I'm trying different path

The command line was just provision and login:

tmt run provision -h beaker -i bad-image login finish

Hmmm, but after the rebase, I'm not able to reproduce the problem anymore. I can see the detailed error now.

Also, it seems that the login action is attempted although there is no guest:

        login: Starting interactive shell
        warn: Failed to push workdir to the guest.

I believe that's caused by actions now seeing all guests and simply ignoring the fact some of them might be not ready enough. I'll try to fix that, can you share the command line for verification?

And this part has disappeared as well. So it seems we're good.

@psss
Copy link
Collaborator

psss commented Feb 18, 2025

Was trying to verify that ctrl-c in the middle of guest provision in beaker will cancel the job and it seems it's not the case:

> tmt run provision -h beaker login finish
/var/tmp/tmt/run-022

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-022-BiaACIEM
        guest: has been requested
        job id: J:10664547
        status: New^C^C
    finish
        warn: Nothing to finish, no guests provisioned.

Aborted!
^CException ignored on threading shutdown:
Traceback (most recent call last):
  File "/usr/lib64/python3.13/threading.py", line 1534, in _shutdown
    atexit_call()
  File "/usr/lib64/python3.13/threading.py", line 1505, in <lambda>
    _threading_atexits.append(lambda: func(*arg, **kwargs))
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/usr/lib64/python3.13/threading.py", line 1092, in join
    self._handle.join(timeout)
KeyboardInterrupt: 

Is it supposed to work only together with #3481 or should it clean up the running beaker job already now?

@happz
Copy link
Collaborator Author

happz commented Feb 18, 2025

Was trying to verify that ctrl-c in the middle of guest provision in beaker will cancel the job and it seems it's not the case:

> tmt run provision -h beaker login finish
/var/tmp/tmt/run-022

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-022-BiaACIEM
        guest: has been requested
        job id: J:10664547
        status: New^C^C
    finish
        warn: Nothing to finish, no guests provisioned.

Aborted!
^CException ignored on threading shutdown:
Traceback (most recent call last):
  File "/usr/lib64/python3.13/threading.py", line 1534, in _shutdown
    atexit_call()
  File "/usr/lib64/python3.13/threading.py", line 1505, in <lambda>
    _threading_atexits.append(lambda: func(*arg, **kwargs))
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/usr/lib64/python3.13/threading.py", line 1092, in join
    self._handle.join(timeout)
KeyboardInterrupt: 

Is it supposed to work only together with #3481 or should it clean up the running beaker job already now?

I would expect these two need to work, as I don't trust Ctrl+C handling in its current state.

@happz happz force-pushed the provision-guest-eager-storage branch 2 times, most recently from 3b49fa6 to f80376e Compare February 19, 2025 08:15
@happz happz changed the base branch from main to cooperative-sigint-handling February 19, 2025 08:16
@happz
Copy link
Collaborator Author

happz commented Feb 19, 2025

Was trying to verify that ctrl-c in the middle of guest provision in beaker will cancel the job and it seems it's not the case:

> tmt run provision -h beaker login finish
/var/tmp/tmt/run-022

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-022-BiaACIEM
        guest: has been requested
        job id: J:10664547
        status: New^C^C
    finish
        warn: Nothing to finish, no guests provisioned.

Aborted!
^CException ignored on threading shutdown:
Traceback (most recent call last):
  File "/usr/lib64/python3.13/threading.py", line 1534, in _shutdown
    atexit_call()
  File "/usr/lib64/python3.13/threading.py", line 1505, in <lambda>
    _threading_atexits.append(lambda: func(*arg, **kwargs))
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/usr/lib64/python3.13/threading.py", line 1092, in join
    self._handle.join(timeout)
KeyboardInterrupt: 

Is it supposed to work only together with #3481 or should it clean up the running beaker job already now?

I would expect these two need to work, as I don't trust Ctrl+C handling in its current state.

Oh, I just realized what's wrong, I forgot one patch. Ctrl+C is now captured, an exception is raised and tmt starts shutting down, but the thread checking on Beaker job status is not aware of this, because that thread has not been interrupted.

We have two options:

  1. threads spawned by tmt to manage a step phase need to be given a signal channel, e.g. threading.Event, which would be set by tmt core when the shutdown starts. Threads then periodically check whether the event is set or not, e.g. after each check of Beaker job status. How fast they would quit depends on the tick used by the respective "waiting for foo" block, but it's safe, sane, and explicit.
  2. use ctypes and raise exceptions in threads from outside. It's generally described as "good for debuggers and low-level tools", more or less an "I really hope you know what you're doing" idea. But it does work, and it isn't a long code: https://github.com/Bogdanp/dramatiq/blob/master/dramatiq/threading.py#L79. On the other hand, it's asynchronous by its nature.

I for one would prefer the first option, it's one extra argument to various go() methods, but it's clearly visible in the code.

@happz
Copy link
Collaborator Author

happz commented Feb 19, 2025

Was trying to verify that ctrl-c in the middle of guest provision in beaker will cancel the job and it seems it's not the case:

> tmt run provision -h beaker login finish
/var/tmp/tmt/run-022

/default/plan
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: beaker
        whiteboard: tmt-022-BiaACIEM
        guest: has been requested
        job id: J:10664547
        status: New^C^C
    finish
        warn: Nothing to finish, no guests provisioned.

Aborted!
^CException ignored on threading shutdown:
Traceback (most recent call last):
  File "/usr/lib64/python3.13/threading.py", line 1534, in _shutdown
    atexit_call()
  File "/usr/lib64/python3.13/threading.py", line 1505, in <lambda>
    _threading_atexits.append(lambda: func(*arg, **kwargs))
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/usr/lib64/python3.13/threading.py", line 1092, in join
    self._handle.join(timeout)
KeyboardInterrupt: 

Is it supposed to work only together with #3481 or should it clean up the running beaker job already now?

I would expect these two need to work, as I don't trust Ctrl+C handling in its current state.

@psss you can give it a try now, after 8305b3b.

@psss
Copy link
Collaborator

psss commented Feb 19, 2025

@psss you can give it a try now, after 8305b3b.

Nice! Now works as expected and cleans up the job in progress. Thanks!

@happz happz force-pushed the cooperative-sigint-handling branch from 407ec72 to 677e4f0 Compare February 20, 2025 13:02
@happz happz force-pushed the cooperative-sigint-handling branch from 677e4f0 to ffa2eac Compare February 20, 2025 13:55
This should free `finish` to clean up guests even partially provisioned,
for example in the case of tmt termination while provision is still
running.
@happz happz force-pushed the provision-guest-eager-storage branch from 8305b3b to 62eff6f Compare February 20, 2025 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci | full test Pull request is ready for the full test execution step | provision Stuff related to the provision step
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants