refactor handling of orphaned resources in the coordinator #1580

jluebbe · 2025-01-24T13:27:59Z

Labgrid currently uses bidirectional streams between coordinator and
client/exporter. For the client side this is a good fit, since the
client sends requests and the coordinator can answer directly. However
for the exporter we have a case where nested calls were used in the old
crossbar infrastructure, namely when re-acquiring a resource after the
exporter was offline but the place was kept acquired. We call these
orphaned resources. They replace the real resource on the coordinator
side until the resource can be reacquired on the respective exporter
after it has restarted.

With crossbar, when seeing the resource update, the coordinator could
directly call the exporter to acquire the resource for the specific
place. This was possible since crossbar did the RPC route handling and
arbitrary services connected to the crossbar could provide RPC calls to
the service. With GRPC, we are more constrained. Since we only have a
single Input/Output stream which needs to multiplex different objects,
nested calls are not directly supported, since the exporter side would
still wait for the coordinator to answer its own request.

A different approach to orphaned resource handling is required. The
coordinator now uses a loop where it checks the orphaned resources and
tries to reacquire them if the exporter reappears. This however
introduces another problem, the exporter can be under high load and thus
the acquire request from the coordinator can time out. In this case, we
need to abort the acquisition during a regular lock and in case of an
orphaned resource need to replace the orphaned resource with the
eventually acquired resource from the exporter.

We also need to handle the case where the exporter has an acquired
resource, but the place has been released in the meantime (perhaps due
to a timeout on a normal place acquire), the same poll loop handles this
in the coordinator as well.

All in all this means that the resource acquired state for each place is
not necessarily consistent on the coordinator, but will reach an
eventual consistent state. This should be sufficient, since exporter
restarts with orphaned resources should be relatively rare.

Check if a place acquire request tries to lock a resource that is marked as orphaned. This means we no longer need to reacquire orphaned resource before trying to acquire a places resources since we now refuse to acquire the resources. This avoids long delays on aquire calls if the exporter responsible for an (unrelated) orphaned resource doesn't process commands quickly. Signed-off-by: Rouven Czerwinski <[email protected]>

Labgrid currently uses bidirectional streams between coordinator and client/exporter. For the client side this is a good fit, since the client sends requests and the coordinator can answer directly. However for the exporter we have a case where nested calls were used in the old crossbar infrastructure, namely when re-acquiring a resource after the exporter was offline but the place was kept acquired. We call these orphaned resources. They replace the real resource on the coordinator side until the resource can be reacquired on the respective exporter after it has restarted. With crossbar, when seeing the resource update, the coordinator could directly call the exporter to acquire the resource for the specific place. This was possible since crossbar did the RPC route handling and arbitrary services connected to the crossbar could provide RPC calls to the service. With GRPC, we are more constrained. Since we only have a single Input/Output stream which needs to multiplex different objects, nested calls are not directly supported, since the exporter side would still wait for the coordinator to answer its own request. A different approach to orphaned resource handling is required. The coordinator now uses a loop where it checks the orphaned resources and tries to reacquire them if the exporter reappears. This however introduces another problem, the exporter can be under high load and thus the acquire request from the coordinator can time out. In this case, we need to abort the acquisition during a regular lock and in case of an orphaned resource need to replace the orphaned resource with the eventually acquired resource from the exporter. We also need to handle the case where the exporter has an acquired resource, but the place has been released in the meantime (perhaps due to a timeout on a normal place acquire), the same poll loop handles this in the coordinator as well. All in all this means that the resource acquired state for each place is not necessarily consistent on the coordinator, but will reach an eventual consistent state. This should be sufficient, since exporter restarts with orphaned resources should be relatively rare. Signed-off-by: Jan Luebbe <[email protected]>

jluebbe · 2025-02-04T08:05:20Z

I've fixed the error detected by pylint.

codecov · 2025-02-04T08:10:40Z

Codecov Report

Attention: Patch coverage is 0% with 90 lines in your changes missing coverage. Please review.

Project coverage is 55.8%. Comparing base (c609d2d) to head (109ef0d).
Report is 5 commits behind head on master.

✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
labgrid/remote/coordinator.py	0.0%	90 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #1580     +/-   ##
========================================
- Coverage    56.0%   55.8%   -0.3%     
========================================
  Files         170     170             
  Lines       13307   13365     +58     
========================================
  Hits         7458    7458             
- Misses       5849    5907     +58

Flag	Coverage Δ
3.10	`55.7% <0.0%> (-0.3%)`	⬇️
3.11	`55.7% <0.0%> (-0.3%)`	⬇️
3.12	`55.7% <0.0%> (-0.3%)`	⬇️
3.13	`55.7% <0.0%> (-0.3%)`	⬇️
3.9	`55.8% <0.0%> (-0.3%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jluebbe added the enhancement label Jan 24, 2025

jluebbe requested review from Emantor and Bastian-Krause January 24, 2025 13:27

Emantor previously approved these changes Feb 4, 2025

View reviewed changes

Emantor and others added 2 commits February 4, 2025 09:03

jluebbe dismissed Emantor’s stale review via 109ef0d February 4, 2025 08:04

jluebbe force-pushed the refactor-orphaned-resources branch from 8ca0da8 to 109ef0d Compare February 4, 2025 08:04

jluebbe requested a review from Emantor February 4, 2025 08:05

Emantor approved these changes Feb 4, 2025

View reviewed changes

Emantor merged commit a8b6ab3 into labgrid-project:master Feb 4, 2025
9 of 11 checks passed

jluebbe deleted the refactor-orphaned-resources branch February 4, 2025 08:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor handling of orphaned resources in the coordinator #1580

refactor handling of orphaned resources in the coordinator #1580

jluebbe commented Jan 24, 2025

jluebbe commented Feb 4, 2025

codecov bot commented Feb 4, 2025 •

edited

Loading

refactor handling of orphaned resources in the coordinator #1580

refactor handling of orphaned resources in the coordinator #1580

Conversation

jluebbe commented Jan 24, 2025

jluebbe commented Feb 4, 2025

codecov bot commented Feb 4, 2025 • edited Loading

Codecov Report

codecov bot commented Feb 4, 2025 •

edited

Loading