Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add heartbeat endpoint for background services monitoring (Fixes #20602) #20603

Merged
merged 11 commits into from
May 3, 2023

Conversation

leplatrem
Copy link
Contributor

@leplatrem leplatrem commented Apr 20, 2023

Fixes mozilla/addons#9152

This pull-requests adds a simple monitor check that will read the heartbeat of the configured Remote Settings server

  • This PR relates to an existing open issue and there are no existing
    PRs open for the same issue.
  • Add Fixes #ISSUENUM at the top of your PR.
  • Add a description of the changes introduced in this PR.
  • The change has been successfully run locally.
  • Add tests to cover the changes added in this PR.
  • Add before and after screenshots (Only for changes that impact the UI).

@leplatrem leplatrem force-pushed the add-remote-settings-heartbeat branch 5 times, most recently from b8c7145 to 8bb9272 Compare April 20, 2023 23:00
@leplatrem leplatrem force-pushed the add-remote-settings-heartbeat branch from f41fa8b to a218561 Compare April 24, 2023 15:08
@diox diox requested review from a team and eviljeff and removed request for a team April 24, 2023 16:25
@bqbn
Copy link
Contributor

bqbn commented Apr 24, 2023

I suspect this will not work because among all AMO components only the celery workers have access to the remote-settings writer, but the monitor endpoint runs on the web component.

@eviljeff
Copy link
Member

only the celery workers have access to the remote-settings writer, but the monitor endpoint runs on the web component.

I can't recall, is that by design?

@leplatrem
Copy link
Contributor Author

The heartbeat already checks connectivity with Autograph, and I followed what was there, thinking Remote Settings would be in a similar position.

That being said, I can move the implementation of the connectivity check to the worker, and execute it "synchronously" from the monitor view.
I will update the code, and see how it looks like.

@leplatrem
Copy link
Contributor Author

Where do you think the tests should live now that the test is done in a task? Inolympia.blocklists or leave them with test_monitor

@diox
Copy link
Member

diox commented Apr 25, 2023

@bqbn how often do we call /services/monitor.json and what's the expected time for a response ?

Moving the check in a task is an interesting idea but I'm worried that sometimes the task might take too long to be processed...

@leplatrem
Copy link
Contributor Author

the task might take too long to be processed...

@diox what would be your pessimistic estimate here? More than 30 seconds?

From what @ahoneiser could see in Pingdom, heartbeat endpoints are pulled every minute.

@diox
Copy link
Member

diox commented Apr 25, 2023

Long comment about legacy decisions:

The main issue is that the priority queue is also used for indexing tasks. And every day, we update every add-on with fresh users & growth data, and index that as well, so that popularity ordering, whether an add-on is trending or not and users numbers in search are accurate. That is done relatively slowly over the course of ~ 15 minutes, which means that during that window, tasks in that queue can take up to 15 minutes. We should probably explicitly route such indexing tasks differently, but this a story for another day...

The best option in this case is going to be the amo queue. It's kind of a catch all, with a bunch of different tasks, so there is some risk that it may take longer, but most of the time it should be processed fairly quickly (< 30 seconds).

Since the goal here is not to test how fast tasks queues are being processed, maybe we should add code to ensure we don't wait too long, and add a soft fail mode that doesn't page anyone if it times out?

TL;DR: use amo queue instead of priority, consider adding a short timeout and not fail loudly when it's reached.

@bqbn
Copy link
Contributor

bqbn commented Apr 25, 2023

@bqbn how often do we call /services/monitor.json and what's the expected time for a response ?

We don't monitor this endpoint actually. The backend services this endpoint reports, such as memcache, database, elasticsearch and etc. are monitored separately.

@bqbn
Copy link
Contributor

bqbn commented Apr 25, 2023

only the celery workers have access to the remote-settings writer, but the monitor endpoint runs on the web component.

I can't recall, is that by design?

What is by design? Only celery workers can access the remote-settings writer, or the monitor endpoint runs on the web component?

@diox
Copy link
Member

diox commented Apr 25, 2023

@bqbn how often do we call /services/monitor.json and what's the expected time for a response ?

We don't monitor this endpoint actually. The backend services this endpoint reports, such as memcache, database, elasticsearch and etc. are monitored separately.

Sorry, it's actually an alias for https://addons.mozilla.org/__heartbeat__, which I assume is monitored ? So same question for /__heartbeat__

@eviljeff
Copy link
Member

only the celery workers have access to the remote-settings writer, but the monitor endpoint runs on the web component.

I can't recall, is that by design?

What is by design? Only celery workers can access the remote-settings writer, or the monitor endpoint runs on the web component?

That only celery workers can access the remote-settings writer.

@bqbn
Copy link
Contributor

bqbn commented Apr 25, 2023

@bqbn how often do we call /services/monitor.json and what's the expected time for a response ?

We don't monitor this endpoint actually. The backend services this endpoint reports, such as memcache, database, elasticsearch and etc. are monitored separately.

Sorry, it's actually an alias for https://addons.mozilla.org/__heartbeat__, which I assume is monitored ? So same question for /__heartbeat__

We monitor /en-US/android/ and /en-US/firefox/ for the site availability. /__heartbeat__ is not monitored.

@bqbn
Copy link
Contributor

bqbn commented Apr 25, 2023

only the celery workers have access to the remote-settings writer, but the monitor endpoint runs on the web component.

I can't recall, is that by design?

What is by design? Only celery workers can access the remote-settings writer, or the monitor endpoint runs on the web component?

That only celery workers can access the remote-settings writer.

It's more like to follow the least privilege access principle because the celery workers are the only component that needs to access the remote-settings writer.

@diox
Copy link
Member

diox commented Apr 26, 2023

After more discussion with @bqbn : we want to keep the __heartbeat__ endpoint for things that are essential to the user-facing part of AMO, so we can use it as a kubernetes readiness probe when we switch to GCP.

So autograph, rabbitmq and, more relevant to this PR, remote settings monitoring should move to a separate endpoint (and can use a celery task on the amo queue to unblock this, we may revisit that later). Then, we'll add monitoring of that endpoint separately.

@leplatrem
Copy link
Contributor Author

/__heartbeat__ is not monitored.

Oh, but that's our Dockerflow standard 😕

We rely on those standard endpoints for alerting and troubleshooting in other systems like https://delivery-checks.prod.mozaws.net

After more discussion with @bqbn : we want to keep the heartbeat endpoint for things that are essential to the user-facing part of AMO

Ok. Makes sense!
Shall I open a dedicated issue and do it? ie. move celery, autograph, and rabbit to a new endpoint like /services/__heartbeat__?

so we can use it as a kubernetes readiness probe when we switch to GCP.

Isn't it the role of __lbheartbeat__?
https://github.com/mozilla-services/Dockerflow#containerized-app-requirements

@diox
Copy link
Member

diox commented Apr 26, 2023

/heartbeat is not monitored.

Oh, but that's our Dockerflow standard confused

We rely on those standard endpoints for alerting and troubleshooting in other systems like
https://delivery-checks.prod.mozaws.net/

Agreed, and I think it would be useful to monitor that endpoint on AMO going forward.

After more discussion with @bqbn : we want to keep the heartbeat endpoint for things that are essential to the user-facing part of AMO

Ok. Makes sense!
Shall I open a dedicated issue and do it? ie. move celery, autograph, and rabbit to a new endpoint like /services/__heartbeat__?

If you want to take care of it, then I think it's fine to just repurpose the issue you opened about remote settings, editing its title/description. Then moving your new probe to that new endpoint.

so we can use it as a kubernetes readiness probe when we switch to GCP.

Isn't it the role of __lbheartbeat__?
https://github.com/mozilla-services/Dockerflow#containerized-app-requirements

__lbheartbeat__ (which we also implement in addons-server, in a middleware, bypassing url resolution) does not care about services we might depend on, so I don't think it makes a good kubernetes readiness probe: if we can't establish connection to database or elasticsearch for instance, we don't want to return a ready state. We might tweak that in the future though.

@diox diox requested review from diox and removed request for eviljeff April 26, 2023 11:53
@leplatrem leplatrem changed the title Add Remote Settings to heartbeat monitoring (Fixes #20602) Add heartbeat endpoint for background services monitoring (Fixes #20602) Apr 26, 2023
@leplatrem leplatrem marked this pull request as draft April 26, 2023 13:51
@leplatrem leplatrem marked this pull request as ready for review April 27, 2023 15:09
@leplatrem leplatrem requested a review from ahoneiser April 27, 2023 16:12
@leplatrem
Copy link
Contributor Author

I wondered:

  • whether monitor.json should be a permanent redirect instead of an alias.
  • whether there should be a /monitor.json, like there is a /services/monitor.json
  • why monitor.json is not listed in settings.SUPPORTED_NONAPPS
  • whether the remote settings check should catch all exceptions, or just timeout (knowing that blocklist.tasks.remotesettings_check already does catch all exceptions)

@diox
Copy link
Member

diox commented Apr 27, 2023

I wondered:

  • whether monitor.json should be a permanent redirect instead of an alias.
  • whether there should be a /monitor.json, like there is a /services/monitor.json
  • why monitor.json is not listed in settings.SUPPORTED_NONAPPS

We should probably just get rid of it entirely, so don't worry about it.

  • whether the remote settings check should catch all exceptions, or just timeout (knowing that blocklist.tasks.remotesettings_check already does catch all exceptions)

It should catch everything - ideally we don't want the endpoint to ever raise a 500 for anything.

@diox
Copy link
Member

diox commented May 2, 2023

No idea why yet, but when trying this locally, I get:

Traceback (most recent call last):
  File "/deps/lib/python3.10/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/deps/lib/python3.10/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/deps/lib/python3.10/site-packages/sentry_sdk/integrations/django/views.py", line 85, in sentry_wrapped_callback
    return callback(request, *args, **kwargs)
  File "/deps/lib/python3.10/site-packages/django/views/decorators/cache.py", line 44, in _wrapped_view_func
    response = view_func(request, *args, **kwargs)
  File "/data/olympia/src/olympia/amo/views.py", line 55, in services_heartbeat
    status_summary = monitors.execute_checks(
  File "/data/olympia/src/olympia/amo/monitors.py", line 28, in execute_checks
    status, _ = globals()[check]()
  File "/data/olympia/src/olympia/amo/monitors.py", line 224, in remotesettings
    status = result.get(timeout=settings.REMOTE_SETTINGS_CHECK_TIMEOUT_SECONDS)

Exception Type: AttributeError at /services/monitor.json
Exception Value: 'NoneType' object has no attribute 'get'

I think we at least need the following patch (CELERY_TASK_IGNORE_RESULT defaults to True in our settings):

diff --git a/src/olympia/blocklist/tasks.py b/src/olympia/blocklist/tasks.py
index 42d3e6e70c..cc4ba5c5bd 100644
--- a/src/olympia/blocklist/tasks.py
+++ b/src/olympia/blocklist/tasks.py
@@ -64,8 +64,8 @@ def process_blocklistsubmission(multi_block_submit_id, **kw):
         raise exc
 
 
-@task
-def monitor_remote_settings():
+@task(ignore_result=False)
+def monitor_remote_settings(*args, **kwargs):
     # check Remote Settings connection
     client = RemoteSettings(
         settings.REMOTE_SETTINGS_WRITER_BUCKET,

But that wasn't enough to fix it for me...

@diox
Copy link
Member

diox commented May 3, 2023

No idea why yet, but when trying this locally, I get:

This is our fault, a problem with the way our https://github.com/mozilla/django-post-request-task/ behaves for tasks queued from a view decorated with @non_atomic_requests. Doesn't happen in tests because for legacy reasons we execute tasks in eager mode in tests. I filed mozilla/django-post-request-task#32 but let's work around it in the meantime.

@diox
Copy link
Member

diox commented May 3, 2023

Hopefully fixed in leplatrem@a0a6da8

@leplatrem
Copy link
Contributor Author

leplatrem commented May 3, 2023

Hopefully fixed in leplatrem@a0a6da8

Great! 🎉

I think we at least need the following patch (CELERY_TASK_IGNORE_RESULT defaults to True in our settings):

I haven't fully understood the purpose of https://docs.celeryq.dev/en/stable/userguide/configuration.html#std-setting-task_ignore_result
If it's not obvious for other developers, maybe a comment justifying why we override the settings for the Remote Settings task could be relevant.

@diox
Copy link
Member

diox commented May 3, 2023

Agreed, added in leplatrem@3d97f16

@leplatrem
Copy link
Contributor Author

:shipit:

@diox diox merged commit cb203f4 into mozilla:master May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a heartbeat endpoint for background services
5 participants