Consider adding prometheus metrics for DMCI /insert, /update and maybe /validate endpoints #220

magnarem opened this issue Feb 6, 2024 · 11 comments

magnarem commented Feb 6, 2024

https://github.com/rycus86/prometheus_flask_exporter supports exporting Flask metrics out of the box.
This could be useful for monitoring HTTP error codes on the DMCI endpoints, e.g. when insert and update calls fail, so that we can get alerts if DMCI has a lot of failed inserts.
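
For reference, the basic wiring from the exporter's README would look roughly like this (a sketch only; `app` stands in for the actual Flask app object in dmci/api/app.py):

```python
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)

# Registers the default per-endpoint metrics (request count, duration,
# HTTP status) and serves them on /metrics.
metrics = PrometheusMetrics(app)

# Optional static info metric, as shown in the exporter's README.
metrics.info("app_info", "DMCI application info", version="0.0.0")
```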


magnarem commented Oct 4, 2024

What I also had in mind for this issue, once we have a working Prometheus exporter, is to see whether we can also create some custom metrics for the worker in dmci.

The worker returns a list of failed distributors, so we should have some metrics like:

# failed file_dist <counter>
# failed csw_dist <counter>
# failed solr_dist <counter>

From the failed array in https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/worker.py#L171.

The code may have to be rewritten a bit, since we don't currently get this array back in the /update and /insert endpoints.

So the lines at https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L74
and https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L69

need to get the failed array back, i.e. the call

msg, code = self._insert_update_method_post("insert", request)

needs to be changed to

msg, code, failed = self._insert_update_method_post("insert", request)

and if there are elements in the failed array, it should then be possible to record them as custom metrics via the prometheus_flask_exporter API.
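
A rough sketch of how that could look (the counter names follow the list above; the dictionary keys are an assumption about what the worker puts into the failed list, and the handler is simplified compared to the real app.py):

```python
from flask import request
from prometheus_client import Counter

# One counter per distributor, matching the metric names suggested above.
FAILED_DIST_COUNTERS = {
    "file": Counter("failed_file_dist", "Number of failed file_dist"),
    "csw": Counter("failed_csw_dist", "Number of failed csw_dist"),
    "solr": Counter("failed_solr_dist", "Number of failed solr_dist"),
}

def post_insert(self):
    # Assumes _insert_update_method_post is changed to also return the
    # worker's failed list, as described above.
    msg, code, failed = self._insert_update_method_post("insert", request)
    for dist in failed:
        counter = FAILED_DIST_COUNTERS.get(dist)
        if counter is not None:
            counter.inc()
    return msg, code
```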

@charlienegri

Do I read correctly that, in practice, the only place where this 'failed' array would be non-empty is in this if block? https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L159
It would have to be returned by _distributor_wrapper at L157, something like

err, failed = self._distributor_wrapper(worker)

and then it is only relevant for that next if block.
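
Something like this, keeping the existing logic and only changing what the wrapper returns (a sketch; the body here just stands in for the current error handling):

```python
def _distributor_wrapper(self, worker):
    err = []
    failed = []
    # ... existing call into worker.distribute(), appending error messages
    # to err and collecting the names of failed distributors into failed ...
    return err, failed

# in the insert/update path:
err, failed = self._distributor_wrapper(worker)
if failed:
    # increment the per-distributor counters here
    pass
```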


magnarem commented Oct 4, 2024

Yes, you are right. You will have to return the failed array from _distributor_wrapper, and then, if err and failed are not empty, add to the failed metrics.

I think that if failed is not empty, its elements are the names of the distributors that failed.

@charlienegri

OK, this probably means creating custom metric counters defined via prometheus_client and incrementing them in post_insert and post_update based on failed. But I have no idea whether that is compatible out of the box with the GunicornPrometheusMetrics wrapper, or whether there will have to be some extra custom collector registry layer...
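
From the exporter's README it looks like plain prometheus_client counters should work, assuming the app runs with prometheus_client multiprocess mode configured (the multiproc dir environment variable set), which the Gunicorn variant of the exporter uses. A sketch only, not verified against our setup:

```python
from prometheus_client import Counter
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics

# Default per-endpoint HTTP metrics, gunicorn-aware.
metrics = GunicornPrometheusMetrics(app)

# A plain prometheus_client counter: in multiprocess mode its samples are
# written to files under the multiproc directory, so values from all
# gunicorn workers should get aggregated without an extra registry layer.
FAILED_SOLR_DIST = Counter("failed_solr_dist", "Number of failed solr_dist")

def record_failed(failed):
    # Hypothetical helper called from post_insert/post_update.
    if "solr" in failed:
        FAILED_SOLR_DIST.inc()
```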


magnarem commented Oct 4, 2024

I am not sure myself how this works, but it should be possible somehow. I bet other projects built on Flask also want to expose custom Prometheus metrics for their business logic, and not just response codes from API endpoints.


charlienegri commented Oct 4, 2024

Anyway, for now there is still no default metrics exposure.
From the pod's /proc/net/tcp I see nothing about port 9200.
I think EXPOSE in the Dockerfile does nothing here; we have to expose the port in the tjeneste repo and also update it here: https://gitlab.met.no/tjenester/s-enda/-/blob/dev/base/dmci/statefulset.yaml?ref_type=heads#L128


magnarem commented Oct 8, 2024

Seems like it works now:
https://dmci.s-enda-dev.k8s.met.no/metrics

But now I no longer see the default prometheus_flask_exporter metrics (https://github.com/rycus86/prometheus_flask_exporter?tab=readme-ov-file#default-metrics) for the endpoints, just the custom failed counters and some Python-related metrics. The defaults would be nice to have as well, for checking request times and HTTP codes:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 976.0
python_gc_objects_collected_total{generation="1"} 569.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 154.0
python_gc_collections_total{generation="1"} 13.0
python_gc_collections_total{generation="2"} 1.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.550450688e+09
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.6800768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72837924519e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3.06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP failed_file_dist_total Number of failed file_dist
# TYPE failed_file_dist_total counter
# HELP failed_csw_dist_total Number of failed csw_dist
# TYPE failed_csw_dist_total counter
# HELP failed_solr_dist_total Number of failed solr_dist
# TYPE failed_solr_dist_total counter
# HELP flask_exporter_info Information about the Prometheus Flask exporter
# TYPE flask_exporter_info gauge
flask_exporter_info{version="0.23.1"} 1.0

@charlienegri

With this latest PR I am trying again with GunicornPrometheusMetrics + a CollectorRegistry call + having all the metrics on port 9200. Again, it works for me when I run the container locally.
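
For reference, the gunicorn-side hooks from the prometheus_flask_exporter README, adapted to port 9200 (a sketch; the actual gunicorn config in DMCI may differ):

```python
# gunicorn config file (e.g. passed with gunicorn -c)
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics

def when_ready(server):
    # Serve the aggregated metrics on a dedicated port instead of a Flask route.
    GunicornPrometheusMetrics.start_http_server_when_ready(9200)

def child_exit(server, worker):
    # Let the multiprocess collector clean up after exited workers.
    GunicornPrometheusMetrics.mark_process_dead_on_child_exit(worker.pid)
```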

@charlienegri

Maybe we can add the exporter's dashboard, customized with our extra metrics, to the mid-level instance. I'll look into it.
