Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor handling of orphaned resources in the coordinator #1580

Merged
merged 2 commits into from
Feb 4, 2025

Conversation

jluebbe
Copy link
Member

@jluebbe jluebbe commented Jan 24, 2025

Labgrid currently uses bidirectional streams between coordinator and
client/exporter. For the client side this is a good fit, since the
client sends requests and the coordinator can answer directly. However
for the exporter we have a case where nested calls were used in the old
crossbar infrastructure, namely when re-acquiring a resource after the
exporter was offline but the place was kept acquired. We call these
orphaned resources. They replace the real resource on the coordinator
side until the resource can be reacquired on the respective exporter
after it has restarted.

With crossbar, when seeing the resource update, the coordinator could
directly call the exporter to acquire the resource for the specific
place. This was possible since crossbar did the RPC route handling and
arbitrary services connected to the crossbar could provide RPC calls to
the service. With GRPC, we are more constrained. Since we only have a
single Input/Output stream which needs to multiplex different objects,
nested calls are not directly supported, since the exporter side would
still wait for the coordinator to answer its own request.

A different approach to orphaned resource handling is required. The
coordinator now uses a loop where it checks the orphaned resources and
tries to reacquire them if the exporter reappears. This however
introduces another problem, the exporter can be under high load and thus
the acquire request from the coordinator can time out. In this case, we
need to abort the acquisition during a regular lock and in case of an
orphaned resource need to replace the orphaned resource with the
eventually acquired resource from the exporter.

We also need to handle the case where the exporter has an acquired
resource, but the place has been released in the meantime (perhaps due
to a timeout on a normal place acquire), the same poll loop handles this
in the coordinator as well.

All in all this means that the resource acquired state for each place is
not necessarily consistent on the coordinator, but will reach an
eventual consistent state. This should be sufficient, since exporter
restarts with orphaned resources should be relatively rare.

Emantor
Emantor previously approved these changes Feb 4, 2025
Emantor and others added 2 commits February 4, 2025 09:03
Check if a place acquire request tries to lock a resource that is marked
as orphaned. This means we no longer need to reacquire orphaned resource
before trying to acquire a places resources since we now refuse to
acquire the resources.

This avoids long delays on aquire calls if the exporter responsible for
an (unrelated) orphaned resource doesn't process commands quickly.

Signed-off-by: Rouven Czerwinski <[email protected]>
Labgrid currently uses bidirectional streams between coordinator and
client/exporter. For the client side this is a good fit, since the
client sends requests and the coordinator can answer directly. However
for the exporter we have a case where nested calls were used in the old
crossbar infrastructure, namely when re-acquiring a resource after the
exporter was offline but the place was kept acquired. We call these
orphaned resources. They replace the real resource on the coordinator
side until the resource can be reacquired on the respective exporter
after it has restarted.

With crossbar, when seeing the resource update, the coordinator could
directly call the exporter to acquire the resource for the specific
place. This was possible since crossbar did the RPC route handling and
arbitrary services connected to the crossbar could provide RPC calls to
the service. With GRPC, we are more constrained. Since we only have a
single Input/Output stream which needs to multiplex different objects,
nested calls are not directly supported, since the exporter side would
still wait for the coordinator to answer its own request.

A different approach to orphaned resource handling is required. The
coordinator now uses a loop where it checks the orphaned resources and
tries to reacquire them if the exporter reappears. This however
introduces another problem, the exporter can be under high load and thus
the acquire request from the coordinator can time out. In this case, we
need to abort the acquisition during a regular lock and in case of an
orphaned resource need to replace the orphaned resource with the
eventually acquired resource from the exporter.

We also need to handle the case where the exporter has an acquired
resource, but the place has been released in the meantime (perhaps due
to a timeout on a normal place acquire), the same poll loop handles this
in the coordinator as well.

All in all this means that the resource acquired state for each place is
not necessarily consistent on the coordinator, but will reach an
eventual consistent state. This should be sufficient, since exporter
restarts with orphaned resources should be relatively rare.

Signed-off-by: Jan Luebbe <[email protected]>
@jluebbe
Copy link
Member Author

jluebbe commented Feb 4, 2025

I've fixed the error detected by pylint.

@jluebbe jluebbe requested a review from Emantor February 4, 2025 08:05
Copy link

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 90 lines in your changes missing coverage. Please review.

Project coverage is 55.8%. Comparing base (c609d2d) to head (109ef0d).
Report is 5 commits behind head on master.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
labgrid/remote/coordinator.py 0.0% 90 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1580     +/-   ##
========================================
- Coverage    56.0%   55.8%   -0.3%     
========================================
  Files         170     170             
  Lines       13307   13365     +58     
========================================
  Hits         7458    7458             
- Misses       5849    5907     +58     
Flag Coverage Δ
3.10 55.7% <0.0%> (-0.3%) ⬇️
3.11 55.7% <0.0%> (-0.3%) ⬇️
3.12 55.7% <0.0%> (-0.3%) ⬇️
3.13 55.7% <0.0%> (-0.3%) ⬇️
3.9 55.8% <0.0%> (-0.3%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Emantor Emantor merged commit a8b6ab3 into labgrid-project:master Feb 4, 2025
9 of 11 checks passed
@jluebbe jluebbe deleted the refactor-orphaned-resources branch February 4, 2025 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants