feat(resilience4j): add new integration #2581

Open
wants to merge 11 commits into
base: master
Changes from 7 commits
9 changes: 9 additions & 0 deletions .codecov.yml
@@ -171,6 +171,10 @@ coverage:
        target: 75
        flags:
        - redpanda
      Resilience4j:
        target: 75
        flags:
        - resilience4j
      Riak_MDC_Replication:
        target: 75
        flags:
@@ -577,6 +581,11 @@ flags:
    paths:
    - redpanda/datadog_checks/redpanda
    - redpanda/tests
  resilience4j:
    carryforward: true
    paths:
    - resilience4j/datadog_checks/resilience4j
    - resilience4j/tests
  riak_repl:
    carryforward: true
    paths:
7 changes: 4 additions & 3 deletions .github/CODEOWNERS
@@ -164,6 +164,7 @@
/redis_enterprise/ @redis-field-engineering [email protected]
/redis_sentinel/ @DataDog/agent-integrations
/redpanda/ @redpanda-data [email protected]
/resilience4j/ @willianccs [email protected]
/resin/ @brentm5
/retool/ @jamiecuffe @DataDog/ecosystems-review
/riak_repl/ @abtreece
@@ -318,7 +319,7 @@
/aqua/*metadata.csv @DataDog/container-integrations @DataDog/documentation
/aqua/manifest.json @DataDog/container-integrations @DataDog/documentation
/aqua/README.md @DataDog/container-integrations @DataDog/documentation
/aqua/assets/dashboards @DataDog/container-integrations @DataDog/documentation @DataDog/reporting-and-sharing
/aqua/assets/dashboards @DataDog/container-integrations @DataDog/documentation @DataDog/reporting-and-sharing
/aqua/assets/monitors @DataDog/container-integrations @DataDog/documentation @DataDog/alerting-product

/auth0/*metadata.csv @DataDog/agent-integrations @DataDog/documentation
@@ -1098,8 +1099,8 @@
/sosivio/assets/dashboards @danarlowski @DataDog/documentation @DataDog/reporting-and-sharing @DataDog/agent-integrations
/sosivio/assets/monitors @danarlowski @DataDog/documentation @DataDog/alerting-product @DataDog/agent-integrations

/emnify/*metadata.csv @EMnify/development @EMnify/rademade @DataDog/documentation
/emnify/manifest.json @EMnify/development @EMnify/rademade @DataDog/documentation
/emnify/*metadata.csv @EMnify/development @EMnify/rademade @DataDog/documentation
/emnify/manifest.json @EMnify/development @EMnify/rademade @DataDog/documentation
/emnify/README.md @EMnify/development @EMnify/rademade @DataDog/documentation
/emnify/assets/dashboards @EMnify/development @EMnify/rademade @DataDog/documentation @DataDog/reporting-and-sharing
/emnify/assets/monitors @EMnify/development @EMnify/rademade @DataDog/documentation @DataDog/alerting-product
19 changes: 19 additions & 0 deletions .github/workflows/test-all.yml
@@ -1056,6 +1056,25 @@ jobs:
      test-py3: ${{ inputs.test-py3 }}
      setup-env-vars: "${{ inputs.setup-env-vars }}"
    secrets: inherit
  j09df637:
    uses: DataDog/integrations-core/.github/workflows/test-target.yml@master
    with:
      job-name: Resilience4j
      target: resilience4j
      platform: linux
      runner: '["ubuntu-22.04"]'
      repo: "${{ inputs.repo }}"
      python-version: "${{ inputs.python-version }}"
      standard: ${{ inputs.standard }}
      latest: ${{ inputs.latest }}
      agent-image: "${{ inputs.agent-image }}"
      agent-image-py2: "${{ inputs.agent-image-py2 }}"
      agent-image-windows: "${{ inputs.agent-image-windows }}"
      agent-image-windows-py2: "${{ inputs.agent-image-windows-py2 }}"
      test-py2: ${{ inputs.test-py2 }}
      test-py3: ${{ inputs.test-py3 }}
      setup-env-vars: "${{ inputs.setup-env-vars }}"
    secrets: inherit
  jc5ec7c0:
    uses: DataDog/integrations-core/.github/workflows/test-target.yml@master
    with:
8 changes: 8 additions & 0 deletions resilience4j/CHANGELOG.md
@@ -0,0 +1,8 @@
# CHANGELOG - Resilience4j

## 1.0.0 / 2025-01-24

***Added***:

* Initial Release

60 changes: 60 additions & 0 deletions resilience4j/README.md
@@ -0,0 +1,60 @@
# Agent Check: Resilience4j

## Overview

[Resilience4j](https://github.com/resilience4j/resilience4j) is a lightweight fault tolerance library inspired by Netflix Hystrix, but designed for functional programming. This check monitors [Resilience4j][1] through the Datadog Agent.

## Setup

### Installation

To install the Resilience4j check on your host:

1. Install the [developer toolkit](https://docs.datadoghq.com/developers/integrations/python/) on any machine.

2. Run `ddev release build resilience4j` to build the package.

3. [Download the Datadog Agent][2].

4. Upload the build artifact to any host with an Agent and run `datadog-agent integration install -w path/to/resilience4j/dist/<ARTIFACT_NAME>.whl`.

### Configuration

1. Edit the `resilience4j/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Resilience4j performance data. See the [sample resilience4j/conf.yaml][4] for all available configuration options.

2. [Restart the Agent][5].
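
As a rough illustration of step 1, the sketch below renders the text of a minimal `conf.yaml`. Only the `prometheus_url` key comes from the check code in this PR. The helper name `render_conf` and the endpoint URL are assumptions made for the example (a Spring Boot Actuator endpoint is assumed); the sample `conf.yaml` linked above remains the reference for available options.

```python
def render_conf(prometheus_url: str) -> str:
    """Return the text of a minimal resilience4j/conf.yaml (illustrative only)."""
    if not prometheus_url:
        raise ValueError("prometheus_url is required")
    return (
        "init_config:\n"
        "\n"
        "instances:\n"
        f"  - prometheus_url: {prometheus_url}\n"
    )


if __name__ == "__main__":
    # Hypothetical endpoint: assumes metrics are exposed through Spring Boot Actuator.
    print(render_conf("http://localhost:8080/actuator/prometheus"))
```

The check raises a `ConfigurationError` when `prometheus_url` is missing, so the helper rejects an empty value in the same spirit.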

### Validation

[Run the Agent's status subcommand][6] and look for `resilience4j` under the Checks section.

## Data Collected

### Metrics

See [metadata.csv][7] for a list of metrics provided by this integration.

### Service Checks

See [service_checks.json][8] for a list of service checks provided by this integration.

### Events

Resilience4j does not include any events.

## Troubleshooting

Need help? Contact [Datadog support][3].

[1]: https://resilience4j.readme.io/docs/micrometer#prometheus
[2]: https://app.datadoghq.com/account/settings/agent/latest
[3]: https://docs.datadoghq.com/help/
[4]: https://github.com/DataDog/integrations-extras/blob/master/resilience4j/datadog_checks/resilience4j/data/conf.yaml.example
[5]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent
[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[7]: https://github.com/DataDog/integrations-extras/blob/master/resilience4j/metadata.csv
[8]: https://github.com/DataDog/integrations-extras/blob/master/resilience4j/assets/service_checks.json
10 changes: 10 additions & 0 deletions resilience4j/assets/configuration/spec.yaml
@@ -0,0 +1,10 @@
name: Resilience4j
files:
- name: resilience4j.yaml
  options:
  - template: init_config
    options:
    - template: init_config/default
  - template: instances
    options:
    - template: instances/default
@@ -0,0 +1 @@
{"title":"Resilience4j Circuit Breaker Overview","description":"This dashboard monitors the health and performance of Resilience4j circuit breakers and provides insight into their behavior in a Spring Boot application. It uses Spring Boot Actuator to expose metrics and Prometheus to collect and store these metrics.\n\n- [Resilience4j Metrics](https://resilience4j.readme.io/docs/micrometer#prometheus)\nClone this template dashboard to make changes and add your own graph widgets.","widgets":[{"id":3501442757873750,"definition":{"title":"Summary","background_color":"gray","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":3494549748379606,"definition":{"title":"Number of closed CircuitBreaker","title_size":"16","title_align":"left","type":"query_value","requests":[{"response_format":"scalar","queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.state{state:closed}","aggregator":"last"}],"conditional_formats":[{"comparator":">","value":0,"palette":"white_on_green"}],"formulas":[{"formula":"query1"}]}],"autoscale":true,"precision":0,"timeseries_background":{"yaxis":{"include_zero":true},"type":"area"}},"layout":{"x":0,"y":0,"width":4,"height":2}},{"id":5787006276604120,"definition":{"title":"Number of open CircuitBreaker","title_size":"16","title_align":"left","type":"query_value","requests":[{"response_format":"scalar","queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.state{state:open}","aggregator":"last"}],"conditional_formats":[{"comparator":"=","value":0,"palette":"white_on_green"},{"comparator":">=","value":1,"palette":"white_on_yellow"}],"formulas":[{"formula":"query1"}]}],"autoscale":true,"precision":0,"timeseries_background":{"yaxis":{"include_zero":true},"type":"area"}},"layout":{"x":4,"y":0,"width":4,"height":2}},{"id":5152242796496788,"definition":{"title":"Number of half_open CircuitBreaker","title_size":"16","title_align":"left","type":"query_value","requests":[{"response_format":"scalar","queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.state{state:half_open}","aggregator":"last"}],"conditional_formats":[{"comparator":"=","value":0,"palette":"white_on_green"},{"comparator":">=","value":1,"palette":"white_on_yellow"}],"formulas":[{"formula":"query1"}]}],"autoscale":true,"precision":0,"timeseries_background":{"yaxis":{"include_zero":true},"type":"area"}},"layout":{"x":8,"y":0,"width":4,"height":2}},{"id":5317812649799226,"definition":{"title":"CircuitBreaker states","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.state{$service,$state} by {state,service}"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":0,"y":2,"width":12,"height":4}}]},"layout":{"x":0,"y":0,"width":12,"height":7}},{"id":911329350165930,"definition":{"title":"Circuit Breaker","background_color":"yellow","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":1469478086748770,"definition":{"title":"Failure Rate: $circuit_breaker_name","title_size":"16","title_align":"left","type":"query_value","requests":[{"response_format":"scalar","queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.failure.rate{$circuit_breaker_name}","aggregator":"last"}],"conditional_formats":[{"comparator":">","value":50,"palette":"white_on_red"},{"comparator":">=","value":40,"palette":"white_on_yellow"},{"comparator":"<","value":40,"palette":"white_on_green"}],"formulas":[{"formula":"default_zero(cutoff_min(query1, 0))"}]}],"autoscale":true,"precision":0,"timeseries_background":{"type":"area"}},"layout":{"x":0,"y":0,"width":6,"height":3}},{"id":5088831156945942,"definition":{"title":"Call rate: $circuit_breaker_name","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"},{"formula":"query2"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:resilience4j.circuitbreaker.calls.seconds.count{$service, $state}.as_rate()"},{"data_source":"metrics","name":"query2","query":"sum:resilience4j.circuitbreaker.calls.seconds.sum{$service, $state} by {service}.as_rate()"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":6,"y":0,"width":6,"height":3}},{"id":8277186327531272,"definition":{"title":"Buffered calls: $circuit_breaker_name","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.buffered.calls{$service, $circuit_breaker_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":0,"y":3,"width":6,"height":3}},{"id":8901593457473692,"definition":{"title":"Average call durations","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"per_minute(query1)"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:resilience4j.circuitbreaker.calls.seconds.sum{$service}.as_rate()"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":6,"y":3,"width":6,"height":3}}]},"layout":{"x":0,"y":0,"width":12,"height":7,"is_column_break":true}}],"template_variables":[{"name":"service","prefix":"service","available_values":[],"default":"*"},{"name":"state","prefix":"state","available_values":["closed","disabled","forced_open","half_open","metrics_only","open"],"default":"*"},{"name":"circuit_breaker_name","prefix":"circuit_breaker_name","available_values":[],"default":"*"}],"layout_type":"ordered","notify_list":[],"reflow_type":"fixed"}
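
The "Failure Rate" widget in this dashboard wraps its query in `default_zero(cutoff_min(query1, 0))`. Resilience4j reports a failure rate of `-1` until enough calls have been recorded, so the formula appears intended to hide that sentinel and show `0` instead. The sketch below imitates that behavior on a plain list of points. The two functions are simplified stand-ins written for this example, not Datadog's implementation, and the sample values are made up.

```python
from typing import List, Optional

Point = Optional[float]  # None stands for "no data at this timestamp"


def cutoff_min(points: List[Point], threshold: float) -> List[Point]:
    """Stand-in for the dashboard function: drop values below the threshold."""
    return [p if p is not None and p >= threshold else None for p in points]


def default_zero(points: List[Point]) -> List[Point]:
    """Stand-in for the dashboard function: fill missing points with zero."""
    return [0 if p is None else p for p in points]


if __name__ == "__main__":
    # Made-up series: -1 is the "not enough calls yet" sentinel, None is a gap.
    raw = [-1.0, 12.5, None, 60.0]
    print(default_zero(cutoff_min(raw, 0)))
```

With this reading, a breaker that has not yet seen enough calls is displayed as a 0% failure rate rather than as a negative value.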
38 changes: 38 additions & 0 deletions resilience4j/assets/monitors/circuitbreaker_state_open.json
@@ -0,0 +1,38 @@
{
"version": 2,
"created_at": "2025-01-24",
"last_updated_at": "2025-01-24",
"title": "Circuit Breaker State Alert",
"tags": [
"integration:resilience4j"
],
"description": "This monitor alerts when a circuit breaker enters the open, half-open, or forced-open state.",
"definition": {
"message": "The percentage of impacted services due to circuit breaker state is above 50%.\n\nService: {{name.name}}\n\nState: {{state.name}}",
"name": "[Resilience4j] Circuit Breaker State Alert for {{name.name}} with state {{state.name}}",
"options": {
"escalation_message": "The circuit breaker `{{name.name}}` has been `{{state.name}}` for more than **30 minutes**.",
"include_tags": true,
"new_group_delay": 300,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": 30,
"renotify_statuses": [
"alert"
],
"require_full_window": false,
"thresholds": {
"critical": 50,
"critical_recovery": 0,
"warning": 15,
"warning_recovery": 10
},
"timeout_h": 0
},
"query": "sum(last_5m):max:resilience4j.circuitbreaker.state{state IN (open,half_open,forced_open)} by {name,state} / max:kubernetes.pods.running{*} * 100 >= 50",
"tags": [
"integration:resilience4j"
],
"type": "query alert"
}
}
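
As a worked reading of the query in this monitor, the sketch below computes the same ratio for a single evaluation point and compares it with the `critical` and `warning` thresholds from the JSON. It is a simplification: the `sum(last_5m)` aggregation and the recovery thresholds are ignored, and the function names are hypothetical.

```python
def impacted_percentage(state_value: float, running_pods: float) -> float:
    """Ratio from the query: breaker state value / running pods * 100."""
    if running_pods <= 0:
        raise ValueError("running_pods must be positive")
    return state_value / running_pods * 100


def classify(value: float, critical: float = 50, warning: float = 15) -> str:
    """Map a value onto alert levels using the monitor's `>=` comparator."""
    if value >= critical:
        return "alert"
    if value >= warning:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Example inputs are made up: a state value of 3 against 4 running pods.
    pct = impacted_percentage(3, 4)
    print(pct, classify(pct))
```

Because the denominator is `kubernetes.pods.running{*}`, the result depends on how many pods the Agent sees in total, which is worth keeping in mind when tuning the thresholds.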
@@ -0,0 +1,32 @@
{
"version": 2,
"created_at": "2025-01-24",
"last_updated_at": "2025-01-24",
"title": "Circuit Breaker State Alert with Slow Calls",
"tags": [
"integration:resilience4j"
],
"description": "This monitor alerts when the percentage of slow calls for a circuit breaker is too high.",
"definition": {
"message": "The percentage of slow calls for the circuit breaker {{name.name}} is above 85%.\n\nService: {{name.name}}\n\nState: {{state.name}}",
"name": "[Resilience4j] Circuit Breaker State Alert Slow Calls for {{name.name}}",
"options": {
"include_tags": true,
"locked": false,
"new_group_delay": 300,
"notify_audit": false,
"notify_no_data": false,
"require_full_window": false,
"thresholds": {
"critical": 85,
"warning": 70
},
"evaluation_delay": 30
},
"query": "avg(last_5m):sum:resilience4j.circuitbreaker.slow.calls.count{*} by {name,state} / sum:resilience4j.circuitbreaker.calls.seconds.count{*} by {name,state} * 100 > 85",
"tags": [
"integration:resilience4j"
],
"type": "query alert"
}
}
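
The slow-call query divides two series grouped `by {name,state}`. A minimal sketch of that per-group division follows, assuming the counts have already been summed per group. The dictionary layout, function names, and sample numbers are all hypothetical, and groups with zero calls are simply skipped here, which is an assumption rather than documented monitor behavior.

```python
from typing import Dict, List, Tuple

Group = Tuple[str, str]  # (name, state), mirroring `by {name,state}`


def slow_call_percentages(
    slow: Dict[Group, float], total: Dict[Group, float]
) -> Dict[Group, float]:
    """Compute slow / total * 100 for every group that recorded calls."""
    result = {}
    for group, total_calls in total.items():
        if total_calls == 0:
            continue  # assumption: no calls means nothing to evaluate
        result[group] = slow.get(group, 0) * 100 / total_calls
    return result


def breaching(percentages: Dict[Group, float], critical: float = 85) -> List[Group]:
    """Groups strictly above the critical threshold, as in the `> 85` query."""
    return sorted(g for g, p in percentages.items() if p > critical)


if __name__ == "__main__":
    # Made-up counts for two breakers and one idle breaker.
    slow = {("payments", "closed"): 90, ("orders", "closed"): 10}
    total = {("payments", "closed"): 100, ("orders", "closed"): 100, ("idle", "closed"): 0}
    print(breaching(slow_call_percentages(slow, total)))
```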
Binary file added resilience4j/assets/resilience4j.png
17 changes: 17 additions & 0 deletions resilience4j/assets/service_checks.json
@@ -0,0 +1,17 @@
[
{
"agent_version": "7.59.0",
"integration": "Resilience4j",
"check": "resilience4j.prometheus.health",
"statuses": [
"ok",
"critical"
],
"groups": [
"host",
"endpoint"
],
"name": "Resilience4j endpoint health",
"description": "Returns `CRITICAL` if the Agent is unable to connect to the Resilience4j endpoint, otherwise returns `OK`."
}
]
1 change: 1 addition & 0 deletions resilience4j/datadog_checks/__init__.py
@@ -0,0 +1 @@
__path__ = __import__('pkgutil').extend_path(__path__, __name__) # type: ignore
1 change: 1 addition & 0 deletions resilience4j/datadog_checks/resilience4j/__about__.py
@@ -0,0 +1 @@
__version__ = '1.0.0'
4 changes: 4 additions & 0 deletions resilience4j/datadog_checks/resilience4j/__init__.py
@@ -0,0 +1,4 @@
from .__about__ import __version__
from .check import Resilience4jCheck

__all__ = ['__version__', 'Resilience4jCheck']
39 changes: 39 additions & 0 deletions resilience4j/datadog_checks/resilience4j/check.py
@@ -0,0 +1,39 @@
from datadog_checks.base import ConfigurationError, OpenMetricsBaseCheck

from .metrics import METRIC_MAP


class Resilience4jCheck(OpenMetricsBaseCheck):
    # A limit of 0 disables the cap on metrics collected per run.
    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances):
        default_instances = {
            'resilience4j': {
                'metrics': [METRIC_MAP],
                # Submit histogram/summary `.sum` and `.count` as monotonic counts.
                'send_distribution_sums_as_monotonic': True,
                'send_distribution_counts_as_monotonic': True,
            }
        }

        super(Resilience4jCheck, self).__init__(
            name, init_config, instances, default_instances=default_instances, default_namespace='resilience4j'
        )

    def _http_check(self, url, check_name):
        # CRITICAL on connection errors or 4xx/5xx, OK otherwise (the two statuses in service_checks.json).
        try:
            response = self.http.get(url)
            response.raise_for_status()
        except Exception as e:
            self.service_check(check_name, self.CRITICAL, message=str(e))
        else:
            self.service_check(check_name, self.OK)

    def check(self, instance):
        prometheus_url = instance.get("prometheus_url")
        if prometheus_url is None:
            raise ConfigurationError("Each instance must have a `prometheus_url` pointing to the metrics endpoint")

        super(Resilience4jCheck, self).check(instance)
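
To make the OK/CRITICAL mapping described in `service_checks.json` concrete, here is a standalone sketch that does not depend on `datadog_checks`. `FakeResponse`, `FakeHTTP`, and `http_health_status` are hypothetical names invented for this example; the real check would go through `self.http` and `self.service_check` instead.

```python
OK, CRITICAL = 0, 2  # numeric service check statuses (WARNING would be 1)


class FakeResponse:
    """Stand-in for an HTTP response with a requests-like raise_for_status()."""

    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


class FakeHTTP:
    """Stand-in HTTP client that returns a canned response or raises an error."""

    def __init__(self, response=None, error=None):
        self.response = response
        self.error = error

    def get(self, url):
        if self.error is not None:
            raise self.error
        return self.response


def http_health_status(http, url):
    """Return (status, message): CRITICAL on any failure, OK otherwise."""
    try:
        response = http.get(url)
        response.raise_for_status()
    except Exception as e:
        return CRITICAL, str(e)
    return OK, ""


if __name__ == "__main__":
    print(http_health_status(FakeHTTP(FakeResponse(200)), "http://example.invalid/metrics"))
    print(http_health_status(FakeHTTP(FakeResponse(503)), "http://example.invalid/metrics"))
```

Keeping the mapping to two outcomes matches the `ok` and `critical` statuses declared for `resilience4j.prometheus.health`.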
20 changes: 20 additions & 0 deletions resilience4j/datadog_checks/resilience4j/config_models/__init__.py
@@ -0,0 +1,20 @@
# This file is autogenerated.
# To change this file you should edit assets/configuration/spec.yaml and then run the following commands:
#     ddev -x validate config -s <INTEGRATION_NAME>
#     ddev -x validate models -s <INTEGRATION_NAME>

from .instance import InstanceConfig
from .shared import SharedConfig


class ConfigMixin:
    _config_model_instance: InstanceConfig
    _config_model_shared: SharedConfig

    @property
    def config(self) -> InstanceConfig:
        return self._config_model_instance

    @property
    def shared_config(self) -> SharedConfig:
        return self._config_model_shared
16 changes: 16 additions & 0 deletions resilience4j/datadog_checks/resilience4j/config_models/defaults.py
@@ -0,0 +1,16 @@
# This file is autogenerated.
# To change this file you should edit assets/configuration/spec.yaml and then run the following commands:
#     ddev -x validate config -s <INTEGRATION_NAME>
#     ddev -x validate models -s <INTEGRATION_NAME>


def instance_disable_generic_tags():
    return False


def instance_empty_default_hostname():
    return False


def instance_min_collection_interval():
    return 15