
Configuration for the StackHPC fork of Redfish Exporter #1530

Draft · wants to merge 1 commit into base: stackhpc/2024.1
Conversation

@m-bull (Contributor) commented Feb 22, 2025

  • Use updated container image
  • Update scrape jobs so that logs are not collected during frequent scrapes; instead, collect logs once per hour in a separate job (see the sketch below)
  • Clean up Redfish dashboard to work with Lenovo hardware and remove deprecated panel types

The dashboard needs testing for compatibility with metrics produced by other manufacturers' Redfish implementations.
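For reference, the split described above ends up looking roughly like the following Prometheus scrape configuration. This is only a sketch: the job names and the `collectlog` parameter come from this PR's discussion, while the intervals, timeout, and target are assumptions for illustration.

```yaml
scrape_configs:
  # Frequent scrape: metrics only, log collection disabled.
  - job_name: redfish-exporter
    scrape_interval: 5m              # assumed frequent interval
    params:
      collectlog: ["false"]
    static_configs:
      - targets: ["bmc-01.example.com"]   # hypothetical BMC target

  # Hourly scrape: the only job that asks the exporter to collect BMC logs.
  - job_name: redfish-exporter-collectlog
    scrape_interval: 1h
    scrape_timeout: 10m              # assumed generous timeout for slow log fetches
    params:
      collectlog: ["true"]
    static_configs:
      - targets: ["bmc-01.example.com"]   # hypothetical BMC target
```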

product-auto-label bot added the size: xl and monitoring (All things related to observability & telemetry) labels on Feb 22, 2025
m-bull force-pushed the redfish-exporter-2.0 branch from 79f8936 to 690623c on February 22, 2025 10:11
@dougszumski (Member) left a comment

Very nice, many thanks for adding this.

```yaml
        env: "{{ kayobe_environment | default('openstack') }}"
        group: "{{ hostvars[host]['redfish_exporter_scrape_group'] | default('overcloud') }}"
{% endfor %}
  - job_name: redfish-exporter-collectlog
```

Member commented on the `redfish-exporter-collectlog` job:
I wondered if we should put this behind a redfish_exporter_collect_logs flag so we can easily disable it at sites if it causes issues. Having said that, it should be a lot more robust now it lives in a separate scrape job. Many thanks for adding it.

@m-bull (Contributor, Author) commented Feb 28, 2025

I think it's more nuanced than that, and I couldn't quite get my brain around it when I made this PR, but I think it's a bit clearer to me now...

There are two scrape styles (currently, anyway):

  1. [iDRAC style] Scrape normally in a single job with `collectlog` not present in the job, just using the defaults; this is what we've always done and should be the default IMO
  2. [Lenovo XCC style] Two jobs: one with `collectlog=true` and another, more frequent, job with `collectlog=false`

I think we should put the second style behind a feature flag as you suggest.
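For what it's worth, a minimal sketch of what that could look like in the Jinja2 template, assuming the flag name suggested above (`redfish_exporter_collect_logs`) and the job layout from this PR; the interval and target handling are illustrative only:

```yaml
{% if redfish_exporter_collect_logs | default(false) | bool %}
  # Lenovo XCC style: a separate, infrequent job that asks the exporter for BMC logs.
  - job_name: redfish-exporter-collectlog
    scrape_interval: 1h
    params:
      collectlog: ["true"]
    static_configs:
      - targets: ["bmc-01.example.com"]   # hypothetical BMC target
{% endif %}
```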

Member replied:

That sounds good: a 'limp mode' flag enabling style (2) for cases where the logs take too long to fetch, with (1) as the default. On SMSlab (iDRAC) I noticed that most of the logs fetched are actually logs from logging into and out of the BMC, so I am hoping that once we switch to using persistent sessions, the scrape time will improve. I see about 5 minutes per scrape for an iDRAC there, which easily causes trouble.

@jovial (Contributor) commented Feb 28, 2025

For me, a bunch of stuff doesn't work with Dell. I will try to fix up a few bits. We have also lost the health summary; did that not work on Lenovo? That was one of the more useful bits for me.

@m-bull (Contributor, Author) commented Feb 28, 2025

This is just up as a record of things that worked on Lenovo; unfortunately I don't have the systems needed to coalesce the dashboards to work on both types of hardware :(. On top of the metrics and panel names not really matching up, I didn't make any real attempt to remain compatible with the Dell metrics.

I also had to remove some bits of the dashboard because of the Angular deprecation, though I don't remember if the health summary was one of those.
