Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[system] - Add a flag to the system module to control whether metrics failures mark agent as degraded #42160

Merged
merged 14 commits into from
Jan 17, 2025

Conversation

VihasMakwana
Copy link
Contributor

@VihasMakwana VihasMakwana commented Dec 26, 2024

Add a new config to mark metricsets as degarded if partial metrics are enabled.

Currently, it's disabled by default and we will only enable it in integration test cases in elastic-agent. Once we're confident enough, we will enable this feature by default in beats and eventually, remove the flag.

Closes #40543

@VihasMakwana VihasMakwana added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-8.x Automated backport to the 8.x branch with mergify labels Dec 26, 2024
@VihasMakwana VihasMakwana self-assigned this Dec 26, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 26, 2024
@VihasMakwana VihasMakwana marked this pull request as ready for review December 31, 2024 12:57
@VihasMakwana VihasMakwana requested a review from a team as a code owner December 31, 2024 12:57
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

metricbeat/mb/event.go Outdated Show resolved Hide resolved
Copy link
Member

@mauri870 mauri870 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 150 to 151
# Enable below config to mark metricset as degraded if partial metrics are emitted
#degrade_on_partial: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This either needs more documentation or a more specific name. Users reading this will have no idea what partial metrics mean or what can lead to this happening.

I like the hostmetrics receiver's mute_* settings as a model: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver#process

Maybe we can't detect in code the specific type of error yet, but we can at least document the error classes this actually covers. We can also expect we may need or want to extend this to be more specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll document this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmacknz I've documented in detail in reference.yml. I'll also open up a separate ticket to update https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-metricset-system-process.html, once we're satisfied with the wordings here. Let me know what do you think

@VihasMakwana VihasMakwana requested a review from cmacknz January 16, 2025 09:04
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a minor nit on the wording but in reviewing this again I realized documenting this for standalone Beats doesn't really make sense. It has to be documented and exposed in the system integration or nobody can use it.

metricbeat/docs/modules/system.asciidoc Outdated Show resolved Hide resolved
metricbeat/mb/event.go Outdated Show resolved Hide resolved
metricbeat/metricbeat.reference.yml Outdated Show resolved Hide resolved
metricbeat/module/system/_meta/config.reference.yml Outdated Show resolved Hide resolved
metricbeat/module/system/_meta/config.reference.yml Outdated Show resolved Hide resolved
@VihasMakwana
Copy link
Contributor Author

@cmacknz I've removed references of config from beats, as it doesn't make sense for standalone beats.

I'll work on a followup PR for agent. Let me know if you're good with this PR! Thanks!

@VihasMakwana VihasMakwana requested a review from cmacknz January 16, 2025 16:21
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nit on the sleep in one of the unit tests.

metricbeat/module/system/process/process_test.go Outdated Show resolved Hide resolved
@VihasMakwana VihasMakwana merged commit 7a2f8d4 into elastic:main Jan 17, 2025
31 checks passed
mergify bot pushed a commit that referenced this pull request Jan 17, 2025
… failures mark agent as degraded (#42160)

* chore: initial commit

* doc and test cases

* lint

* remove errors.Is

* chore: notice

* update docs

* Update metricbeat/mb/event.go

Co-authored-by: Craig MacKenzie <[email protected]>

* remove references

* remove sleep

---------

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit 7a2f8d4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
4 participants