Consider adding prometheus metrics for DMCI /insert, /update and maybe /validate endpoints #220

magnarem opened this issue Feb 6, 2024 · 11 comments

magnarem commented Feb 6, 2024

https://github.com/rycus86/prometheus_flask_exporter supports exporting Flask metrics out of the box.
This could be useful for monitoring HTTP error codes on the DMCI endpoints, e.g. when insert and update calls fail, so that we can get alerts if DMCI has a lot of failed inserts.
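
For reference, the basic wiring from the exporter's README would look roughly like this (a sketch only; `app` stands in for the actual Flask app object in dmci/api/app.py):

```python
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)

# Registers the default per-endpoint metrics (request count, duration,
# HTTP status) and serves them on /metrics.
metrics = PrometheusMetrics(app)

# Optional static info metric, as shown in the exporter's README.
metrics.info("app_info", "DMCI application info", version="0.0.0")
```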


magnarem commented Oct 4, 2024

What I also had in mind for this issue, once we have a working Prometheus exporter, is to see whether we can also create some custom metrics for the worker in dmci.

The worker returns a list of failed distributors, so we should have some metrics like:

# failed file_dist <counter>
# failed csw_dist <counter>
# failed solr_dist <counter>

From the failed array in https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/worker.py#L171.

The code may have to be rewritten a bit, since we don't currently get this array back in the /update and /insert endpoints.

So the lines at https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L74
and https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L69

need to get the failed array back, i.e. the call

msg, code = self._insert_update_method_post("insert", request)

needs to be changed to

msg, code, failed = self._insert_update_method_post("insert", request)

and if there are elements in the failed array, it should then be possible to record them as custom metrics via the prometheus_flask_exporter API.
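
A rough sketch of how that could look (the counter names follow the list above; the dictionary keys are an assumption about what the worker puts into the failed list, and the handler is simplified compared to the real app.py):

```python
from flask import request
from prometheus_client import Counter

# One counter per distributor, matching the metric names suggested above.
FAILED_DIST_COUNTERS = {
    "file": Counter("failed_file_dist", "Number of failed file_dist"),
    "csw": Counter("failed_csw_dist", "Number of failed csw_dist"),
    "solr": Counter("failed_solr_dist", "Number of failed solr_dist"),
}

def post_insert(self):
    # Assumes _insert_update_method_post is changed to also return the
    # worker's failed list, as described above.
    msg, code, failed = self._insert_update_method_post("insert", request)
    for dist in failed:
        counter = FAILED_DIST_COUNTERS.get(dist)
        if counter is not None:
            counter.inc()
    return msg, code
```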

@charlienegri

Do I read correctly that, in practice, the only place where this 'failed' array would be non-empty is in this if block? https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L159
It would have to be returned by _distributor_wrapper at L157, something like

err, failed = self._distributor_wrapper(worker)

and then it is only relevant for that next if block.
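
Something like this, keeping the existing logic and only changing what the wrapper returns (a sketch; the body here just stands in for the current error handling):

```python
def _distributor_wrapper(self, worker):
    err = []
    failed = []
    # ... existing call into worker.distribute(), appending error messages
    # to err and collecting the names of failed distributors into failed ...
    return err, failed

# in the insert/update path:
err, failed = self._distributor_wrapper(worker)
if failed:
    # increment the per-distributor counters here
    pass
```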


magnarem commented Oct 4, 2024

Yes, you are right. You will have to return the failed array from _distributor_wrapper, and then, if err and failed are not empty, add to the failed metrics.

I think that if failed is not empty, its elements are the names of the distributors that failed.

@charlienegri

OK, this probably means creating custom metric counters defined via prometheus_client and incrementing them in post_insert and post_update based on failed. But I have no idea whether that is compatible out of the box with the GunicornPrometheusMetrics wrapper, or whether there will have to be some extra custom collector registry layer...
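
From the exporter's README it looks like plain prometheus_client counters should work, assuming the app runs with prometheus_client multiprocess mode configured (the multiproc dir environment variable set), which the Gunicorn variant of the exporter uses. A sketch only, not verified against our setup:

```python
from prometheus_client import Counter
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics

# Default per-endpoint HTTP metrics, gunicorn-aware.
metrics = GunicornPrometheusMetrics(app)

# A plain prometheus_client counter: in multiprocess mode its samples are
# written to files under the multiproc directory, so values from all
# gunicorn workers should get aggregated without an extra registry layer.
FAILED_SOLR_DIST = Counter("failed_solr_dist", "Number of failed solr_dist")

def record_failed(failed):
    # Hypothetical helper called from post_insert/post_update.
    if "solr" in failed:
        FAILED_SOLR_DIST.inc()
```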


magnarem commented Oct 4, 2024

I am not sure myself how this works, but it should be possible somehow. I bet other projects built on Flask also want to expose custom Prometheus metrics for their business logic, and not just response codes from API endpoints.


charlienegri commented Oct 4, 2024

Anyway, for now there is still no default metrics exposure.
From the pod's /proc/net/tcp I see nothing about port 9200.
I think EXPOSE in the Dockerfile does nothing here; we have to expose the port in the tjeneste repo and also update it here: https://gitlab.met.no/tjenester/s-enda/-/blob/dev/base/dmci/statefulset.yaml?ref_type=heads#L128


magnarem commented Oct 8, 2024

Seems like it works now:
https://dmci.s-enda-dev.k8s.met.no/metrics

But now I no longer see the default prometheus_flask_exporter metrics (https://github.com/rycus86/prometheus_flask_exporter?tab=readme-ov-file#default-metrics) for the endpoints, just the custom failed counters and some Python-related metrics. The defaults would be nice to have as well, for checking request times and HTTP codes:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 976.0
python_gc_objects_collected_total{generation="1"} 569.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 154.0
python_gc_collections_total{generation="1"} 13.0
python_gc_collections_total{generation="2"} 1.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.550450688e+09
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.6800768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72837924519e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3.06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP failed_file_dist_total Number of failed file_dist
# TYPE failed_file_dist_total counter
# HELP failed_csw_dist_total Number of failed csw_dist
# TYPE failed_csw_dist_total counter
# HELP failed_solr_dist_total Number of failed solr_dist
# TYPE failed_solr_dist_total counter
# HELP flask_exporter_info Information about the Prometheus Flask exporter
# TYPE flask_exporter_info gauge
flask_exporter_info{version="0.23.1"} 1.0

@charlienegri

With this latest PR I am trying again with GunicornPrometheusMetrics + a CollectorRegistry call + having all the metrics on port 9200. Again, it works for me when I run the container locally.
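
For reference, the gunicorn-side hooks from the prometheus_flask_exporter README, adapted to port 9200 (a sketch; the actual gunicorn config in DMCI may differ):

```python
# gunicorn config file (e.g. passed with gunicorn -c)
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics

def when_ready(server):
    # Serve the aggregated metrics on a dedicated port instead of a Flask route.
    GunicornPrometheusMetrics.start_http_server_when_ready(9200)

def child_exit(server, worker):
    # Let the multiprocess collector clean up after exited workers.
    GunicornPrometheusMetrics.mark_process_dead_on_child_exit(worker.pid)
```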

@charlienegri

Maybe we can add the exporter's dashboard, customized with our extra metrics, to the mid-level instance. I'll look into it.
