🐛 retry pulling image layers when they time out #7051

Open · wants to merge 7 commits into base: master
Conversation

@GitHK (Contributor) commented Jan 16, 2025

What do these changes do?

Sometimes image pulls from a registry time out. This is usually more evident when pulling very big images, especially if the registry is not located close to your machine. Most likely the Timeout error is a symptom of a lower-level networking error.

  • Added a retry around the image pulling (see the sketch below).
  • Extended the progress bar so its progress can be reset, in order to cope nicely with the progress bar display after an error and avoid warnings in the logs.

This avoids services failing at startup when this error happens.
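For illustration, a minimal, self-contained sketch of the retry pattern (not the PR's actual code; the image name and the default retry count are placeholders, and aiodocker/tenacity are assumed to be available):

import asyncio
import logging

import aiodocker
from tenacity import (
    AsyncRetrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

_logger = logging.getLogger(__name__)


async def pull_with_retry(image: str, *, retry_upon_error_count: int = 3) -> None:
    # retry the whole pull, but only when it times out
    async with aiodocker.Docker() as client:
        async for attempt in AsyncRetrying(
            wait=wait_random_exponential(),
            stop=stop_after_attempt(retry_upon_error_count),
            retry=retry_if_exception_type(asyncio.TimeoutError),
            reraise=True,
        ):
            with attempt:
                _logger.info(
                    "attempt='%s' to pull image='%s'",
                    attempt.retry_state.attempt_number,
                    image,
                )
                # each chunk is a dict with the pull status of one layer
                async for pull_progress in client.images.pull(image, stream=True):
                    _logger.debug("pulling %s: %s", image, pull_progress)


if __name__ == "__main__":
    asyncio.run(pull_with_retry("nginx:latest"))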

Related issue/s

How to test

  • point your registry in your deployment to master-zmt and try to open a notebook
  • the progress bar always goes up (it never goes down)
  • there are no Progress already reached maximum error logs in the dy-sidecar
  • if the error occurs, you will find multiple attempt='%s' to pull image='%s' log entries

Dev-ops checklist

@GitHK GitHK self-assigned this Jan 16, 2025
@GitHK GitHK added the a:services-library (issues on packages/service-libs), a:dynamic-sidecar (dynamic-sidecar service), t:maintenance (some planned maintenance work) and bug (buggy, it does not work as expected) labels Jan 16, 2025
@GitHK GitHK added this to the Singularity milestone Jan 16, 2025
@GitHK GitHK marked this pull request as ready for review January 16, 2025 15:46
@GitHK GitHK requested a review from pcrespov as a code owner January 16, 2025 15:46
packages/service-library/tests/test_progress_bar.py (outdated comment, resolved)
        pull_progress
    )
except ValidationError:
    _logger.exception(
Member commented:

Q: why do you want to log this error?

THOUGHT: the idea here is that you parse pull_progress if it is structured as defined in _DockerPullImage, and otherwise just ignore it.

BUT, if you want to "update" the structure of _DockerPullImage, then it might be interesting to "be aware" that you are missing some logs.

Otherwise I would just ignore it:

with suppress(ValidationError):
    parsed_progress = TypeAdapter(_DockerPullImage).validate_python(
        pull_progress
    )
    await _parse_pull_information(
        parsed_progress, layer_id_to_size=layer_id_to_size
    )

Member replied:

The idea here is to detect in the logs if we are missing (i.e. failing to parse) pull messages, without disrupting the pulling. I would still like to know when we have errors instead of silencing them with suppress. These messages have changed in the past and I guess they might still change.
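One possible middle ground between the two positions, sketched with a stand-in model (PullStatus below is hypothetical, standing in for _DockerPullImage): keep the try/except but log unknown messages at warning level, so they stay visible without an exception-level traceback per chunk.

import logging

from pydantic import BaseModel, TypeAdapter, ValidationError

_logger = logging.getLogger(__name__)


class PullStatus(BaseModel):
    # hypothetical stand-in for _DockerPullImage
    status: str
    id: str | None = None


def parse_pull_progress(pull_progress: dict) -> PullStatus | None:
    # returns the parsed status, or None (plus a warning) if the format is unknown
    try:
        return TypeAdapter(PullStatus).validate_python(pull_progress)
    except ValidationError:
        _logger.warning("unexpected pull status, skipping: %s", f"{pull_progress=}")
        return None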

@sanderegg (Member) left a comment:

I added a few comments. I would like you to reconsider the user-friendliness of the progress bar changes; we can discuss.

@@ -197,6 +197,9 @@ async def update(self, steps: float = 1) -> None:
        await self._update_parent(parent_update_value)
        await self._report_external(new_progress_value)

    def reset_progress(self) -> None:
        self._current_steps = _INITIAL_VALUE
Member commented:

Is that what you mean by the progress never going "down"? You silently reset the progress and the user does not know about it.

Let's discuss the following use case:

  • a large image of 20-60 GiB,
  • pulling it takes a very long time, and the progress is at 80%,
  • now comes an error and it will retry; if I get that right, you reset the progress bar, which kind of makes sense,
    --> but the progress shown stays at 80%??
    --> what does the user see until it eventually reaches 80% again? nothing?
    If that is the case, I don't think that is a nice user experience at all:
    --> it makes users not believe our progress bars at all,
    --> it probably makes the user shut down the website and call it a day; it's too much waiting without feedback.

As I said, I think what makes sense if I am a user is:

  • the progress bar goes back to 0,
  • I get a message saying "sorry, something bad happened while pulling your image, we will try again, please be patient...".

Member commented:

I think one option when you reset is to change the message by adding something like:

new_description = current_progress_bar.description + f" (retrying {attempt})"
current_progress_bar.reset_progress(new_message=new_description)

and reset_progress should then call set_ so that the update is transmitted.
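Roughly what that could look like, sketched on a stripped-down stand-in class (the real ProgressBarData has more machinery; _report_external here only mimics whatever transmits the update, and new_description is an assumed parameter name):

import asyncio
from dataclasses import dataclass, field

_INITIAL_VALUE = -1.0  # value assumed for the sketch


@dataclass
class MiniProgressBar:
    # stripped-down stand-in for ProgressBarData
    description: str
    _current_steps: float = field(default=_INITIAL_VALUE, init=False)

    async def _report_external(self, value: float) -> None:
        # stand-in for whatever transmits the update (callback, websocket, ...)
        print(f"[{self.description}] progress={value}")

    async def reset_progress(self, *, new_description: str | None = None) -> None:
        # reset to the initial value AND push the update so the UI reflects it
        if new_description is not None:
            self.description = new_description
        self._current_steps = _INITIAL_VALUE
        await self._report_external(0.0)


async def _demo() -> None:
    bar = MiniProgressBar(description="pulling image")
    await bar.reset_progress(new_description="pulling image (retrying 2)")


if __name__ == "__main__":
    asyncio.run(_demo())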

Comment on lines +265 to +278
async for attempt in AsyncRetrying(
    wait=wait_random_exponential(),
    stop=stop_after_attempt(retry_upon_error_count),
    reraise=True,
    retry=retry_if_exception_type(asyncio.TimeoutError),
):
            try:
                parsed_progress = TypeAdapter(_DockerPullImage).validate_python(
                    pull_progress
                )
            except ValidationError:
                _logger.exception(
                    "Unexpected error while validating '%s'. "
                    "TIP: This is probably an unforeseen pull status text that shall be added to the code. "
                    "The pulling process will still continue.",
                    f"{pull_progress=}",
                )
            else:
                await _parse_pull_information(
                    parsed_progress, layer_id_to_size=layer_id_to_size
                )

            # compute total progress
            total_downloaded_size = sum(
                layer.downloaded for layer in layer_id_to_size.values()
            )
            total_extracted_size = sum(
                layer.extracted for layer in layer_id_to_size.values()
            )
            total_progress = (total_downloaded_size + total_extracted_size) / 2.0
            progress_to_report = total_progress - reported_progress
            await progress_bar.update(progress_to_report)
            reported_progress = total_progress

            await log_cb(
                f"pulling {image_short_name}: {pull_progress}...",
                logging.DEBUG,
            )

    # each time there is an error progress starts from zero
    progress_bar.reset_progress()
    _logger.info(
        "attempt='%s' to pull image='%s'",
        attempt.retry_state.attempt_number,
        image,
    )

Member commented:

  1. I thought we already discussed that using AsyncRetrying like this creates bad logs where you do not see which function is retried, so I would recommend creating a separate function and using the retry decorator instead (see the sketch below this list).
  2. You do not log as warning like we usually do, which would allow dropping the log at line 273.
  3. You should not reset the progress on the very first attempt, as a consequence of the discussion in the progress bar module.
  4. I personally think that 10 retries for pulling images is too much. Pulling a large image 10 times is equivalent to infinity; I guess 3 is already more than enough, and people will get bored before it has tried 3 times.
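To make the suggestion concrete, a sketch of points 1, 2 and 4 using the decorator (function names are illustrative, not the PR's code); for point 3, a custom before_sleep callback would be a natural place to reset the progress bar, since it only runs after a failed attempt:

import asyncio
import logging

import aiodocker
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

_logger = logging.getLogger(__name__)

_MAX_PULL_ATTEMPTS = 3  # point 4: 3 attempts instead of 10


@retry(
    wait=wait_random_exponential(),
    stop=stop_after_attempt(_MAX_PULL_ATTEMPTS),
    retry=retry_if_exception_type(asyncio.TimeoutError),
    before_sleep=before_sleep_log(_logger, logging.WARNING),  # point 2: warning-level retry log
    reraise=True,
)
async def pull_image_with_retry(image: str) -> None:
    # point 1: a named, decorated function, so tenacity's logs show what is being retried
    async with aiodocker.Docker() as client:
        async for pull_progress in client.images.pull(image, stream=True):
            _logger.debug("pulling %s: %s", image, pull_progress)


if __name__ == "__main__":
    asyncio.run(pull_image_with_retry("nginx:latest"))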

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.62%. Comparing base (b50e7e8) to head (d27914b).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7051      +/-   ##
==========================================
- Coverage   87.70%   87.62%   -0.09%     
==========================================
  Files        1635     1276     -359     
  Lines       63875    53946    -9929     
  Branches     1173      651     -522     
==========================================
- Hits        56022    47270    -8752     
+ Misses       7543     6496    -1047     
+ Partials      310      180     -130     
Flag Coverage Δ
integrationtests 64.66% <100.00%> (-0.08%) ⬇️
unittests 85.60% <89.65%> (-0.40%) ⬇️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 74.41% <91.66%> (+0.02%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.46% <100.00%> (-0.05%) ⬇️
agent 96.46% <ø> (ø)
api_server 90.55% <ø> (ø)
autoscaling 96.08% <ø> (ø)
catalog 90.33% <ø> (ø)
clusters_keeper 99.24% <ø> (ø)
dask_sidecar 91.25% <ø> (-0.02%) ⬇️
datcore_adapter 93.19% <ø> (ø)
director 76.92% <ø> (ø)
director_v2 91.27% <ø> (-0.03%) ⬇️
dynamic_scheduler 97.21% <ø> (ø)
dynamic_sidecar 89.77% <ø> (+0.02%) ⬆️
efs_guardian 90.25% <ø> (ø)
invitations 93.28% <ø> (ø)
osparc_gateway_server ∅ <ø> (∅)
payments 92.66% <ø> (ø)
resource_usage_tracker 88.95% <ø> (-0.11%) ⬇️
storage 89.57% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 86.23% <ø> (+0.02%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b50e7e8...d27914b.
