Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for GlobusComputeExecutor #3607

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Conversation

yadudoc
Copy link
Member

@yadudoc yadudoc commented Sep 5, 2024

Description

This PR adds a new GlobusComputeExecutor that wraps the Globus Compute SDK to allow Parsl to execute tasks via Globus Compute. This mechanism supports remote execution of tasks similar to the functionality that parsl.channels enabled and is a potential replacement.

Since GlobusCompute often runs on remote machines that do not have a shared-filesystem with the parsl runtime, tests have been updated with a new shared_fs and staging_required pytest markers. I have not added tests and CI actions to enable executing these tests against our CI system, but you can run tests locally with these steps:

  1. Install globus-compute-sdk with pip install .[globus-compute]
  2. Configure and start a globus-compute-endpoint, globus-compute-endpoint start <endpoint_name>
  3. Set an env var with the endpoint id for tests: export GLOBUS_COMPUTE_ENDPOINT=<endpoint_id>
  4. Run tests with pytest -v -k "not shared_fs" --config parsl/tests/configs/globus_compute.py parsl/tests/

Changed Behaviour

N/A

Fixes

Fixes # (issue)

Type of change

Choose which options apply, and delete the ones which do not apply.

  • New feature

@benclifford
Copy link
Collaborator

Aside from a lack of CI test, my main concern (from seeing people implement this before) is what goes on with different Parsl versions: there's one version in the endpoint (pinned to a specific version by Globus Compute) and another version on the submit side, and having different Parsl versions like that is out of scope for Parsl.

This might be a case of documenting what users should expect to work or not work, or might be something deeper. At the very least we should be expecting them to be able to use the same combination of versions as used in CI.

parsl/tests/conftest.py Outdated Show resolved Hide resolved
parsl/tests/configs/globus_compute.py Outdated Show resolved Hide resolved
parsl/executors/globus_compute.py Outdated Show resolved Hide resolved
@yadudoc yadudoc force-pushed the globus_compute_executor.py branch 2 times, most recently from b004cf2 to a3cad96 Compare October 17, 2024 17:36
@yadudoc yadudoc force-pushed the globus_compute_executor.py branch 3 times, most recently from 3b26095 to dbc14cf Compare October 24, 2024 19:18
github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2024
# Description

This PR adds a new `staging_required` marker and a marker to several
tests that assume a shared filesystem to work. Similarly, a few tests
that use staging are now marked with the `staging_required` marker.

This PR splits out changes from #3607.

# Changed Behaviour

These markers should not affect any test-behavior since none of the test
entries in the Makefile make use of this.
These change will kick-in once they are used by `GlobusComputeExecutor`
and potentially `KubernetesProvider`
for tests.

## Type of change

Choose which options apply, and delete the ones which do not apply.

- New feature
- Code maintenance/cleanup
@yadudoc yadudoc force-pushed the globus_compute_executor.py branch 4 times, most recently from 7d7e895 to 4f54526 Compare November 7, 2024 15:27
make clean_coverage

# Temporary fix, until changes make it into compute releases
git clone -b configure_tasks_working_dir https://github.com/globus/globus-compute.git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe its good to wait for Globus Compute to support the necessary features in a release, so that this can pin a relevant supported release?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(does this mean that right now a user shouldn't expect to be able to use an actual Globus Compute but instead needs to install this fork?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The github action uses ThreadPoolEngine because it is lightweight compared to running the default config that uses GlobusComputeEngine. The branch that I'm pointing to fixes a ThreadPoolEngine only bug. You can run these same tests against a local GlobusCompute endpoint, by setting GLOBUS_COMPUTE_ENDPOINT=<ep_id> and I've done those tests against a few configs. I believe Chris (NOAA) has also done a bunch of testing at this point.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been testing the GlobusComputeExecutor via this branch in combination with globus-compute-endpoint==2.30.1. I am using that release, not a special branch/tag. The one feature I haven't yet tested is passing in the user_endpoint_config at call time rather than at GlobusComputeExecutor instantiation time. What I've done so far is working great!


self._executor.resource_specification = res_spec
self._executor.user_endpoint_config = user_endpoint_config
return self._executor.submit(func, *args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit horribly thread unsafe, were Parsl ever to get multithreaded submission - the sort of thing DFK.submitter_lock was put in place to deal with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially yes. I don't see how this is only a GCE specific problem. Secondly, I think it's safe to wait until users report this as an issue, or ask for it before working on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, I don't think it is a GCE specific problem. But I think the GCE class needs to handle it all the same. What's wrong with wrapping this work in the lock?

Or is the point that this method is called from within the lock already?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this is used from the DFK, this is already in a lock (although as you're both well aware, there is strong community pressure to use pieces of Parsl outside of the DFK). That lock was introduced in PR #625 because executors are not expected to be thread safe on submission, and so in that context this code is not dangerous.

This is a more general backpressure against using this style of API that seems to conflate the state of the submission system as a whole with parameters to a single task execution - I've definitely fixed concurrency bugs in Parsl because of this coding style before, that lead not to Parsl errors but to subtle misexecutions that mostly look plausible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see #3492, #1413)

yadudoc and others added 10 commits November 14, 2024 14:09
* Support for testing GlobusComputeExecutor in a github action
* Adding shared_fs and staging_required tags to tests
* Adding GlobusComputeExecutor test config
* GCE now makes a deepcopy of the resource_specification to avoid modifying the user supplied object.
* stack of docstring changes
Adding doc link
* Removed python version test matrix
* Removed obsolete install of libpython5
* Renamed `.pytest` to `pytest-parsl`
@yadudoc
Copy link
Member Author

yadudoc commented Nov 15, 2024

@benclifford @khk-globus Thanks for the review and all the comments on this issue. I believe I've addressed all of them and at this point, we are only waiting for a release of globus-compute-endpoint with fixes from ComputePR #1689 to update the CI testing target.

Copy link
Collaborator

@khk-globus khk-globus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished reading through the PR, but these are my first thoughts. I'll chime in on Monday morning (-0400).

Comment on lines 25 to 26
echo "Python Version: 3.11" >> ci_job_info.txt
echo "CI Triggering Event: ${{ github.event_name }}" >> ci_job_info.txt
Copy link
Collaborator

@khk-globus khk-globus Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider converting 3.11 into a variable that references the actual python version, rather than "happens to match until the python-version is changed." In ci.yaml, for example, this refers to the matrix.python-version variable. (Even in if the matrix is only one, that reduces the points of authority.)

parsl/executors/globus_compute.py Outdated Show resolved Hide resolved
parsl/executors/globus_compute.py Outdated Show resolved Hide resolved

self._executor.resource_specification = res_spec
self._executor.user_endpoint_config = user_endpoint_config
return self._executor.submit(func, *args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, I don't think it is a GCE specific problem. But I think the GCE class needs to handle it all the same. What's wrong with wrapping this work in the lock?

Or is the point that this method is called from within the lock already?

parsl/executors/globus_compute.py Outdated Show resolved Hide resolved
parsl/executors/globus_compute.py Show resolved Hide resolved
Comment on lines +80 to +81
export GLOBUS_COMPUTE_ENDPOINT=$(globus-compute-endpoint list | grep default | cut -c 3-38)
echo "GLOBUS_COMPUTE_ENDPOINT = $GLOBUS_COMPUTE_ENDPOINT"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sense in starting all the heavyweight machinery just to eventually crudely parse the string. Consider:

$ jq -r .endpoint_id < ~/.globus_compute/default/endpoint.json

Comment on lines +65 to +73
cat << EOF > /home/runner/.globus_compute/default/config.yaml
engine:
type: ThreadPoolEngine
max_workers: 4
EOF
cat /home/runner/.globus_compute/default/config.yaml
mkdir ~/.globus_compute/default/tasks_working_dir
globus-compute-endpoint start default
globus-compute-endpoint list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a new endpoint every test invocation, right? To be kind to the GC infrastructure, we should cache this, or save and re-use the endpoint id.

Comment on lines +86 to +87
ln -s pytest-parsl/parsltest-current test_runinfo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not chuffed about it, given the context of CI, but I do note that the endpoint is never shutdown.

Comment on lines 92 to 93
name: runinfo-3.11-${{ steps.job-info.outputs.as-ascii }}-${{ github.sha }}
path: |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same context as earlier: it would be good to not hard code the python version but to collect it from the environment.

Comment on lines +90 to +91
5. `parsl.executors.globus_compute.GlobusComputeExecutor`: This executor uses `Globus Compute <https://globus-compute.readthedocs.io/en/latest/index.html>`_
as the execution backend to run tasks on remote systems.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding more context here. I'm not calling for complete documentation, but this currently only answers the "what." How about a "why" or in what cases the GCE would be a good fit?

parsl/executors/globus_compute.py Outdated Show resolved Hide resolved
Comment on lines +51 to +53
# Temporary fix until fixes make it to a release
git clone -b main https://github.com/globus/globus-compute.git
pip3 install globus-compute/compute_sdk globus-compute/compute_endpoint
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the git clone still necessary after the next GC release, come this Thursday?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a stopgap until the release.

* Added an env var to specify Python version
* Added a step to shutdown endpoint after tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants