feat(k8s-infra): introduce prometheus scraper using pod annotations #455
Conversation
Signed-off-by: Prashant Shahi <[email protected]>
Signed-off-by: Prashant Shahi <[email protected]>
Signed-off-by: Prashant Shahi <[email protected]>
…s by default Signed-off-by: Prashant Shahi <[email protected]>
One issue: this will only work for minimal workloads. The pod will quickly face OOM issues for non-trivial workloads. When you get some time, please check how the otel helm charts enable the Prometheus target allocator.
Walkthrough

The pull request introduces enhanced configuration support for Prometheus metrics scraping in the k8s-infra Helm chart. It adds new configuration templates and values that enable flexible collection of application-level metrics from Kubernetes pods. The changes include conditional Prometheus scraper configuration, namespace-specific deployment options, and improved resource detection settings for the OpenTelemetry Collector.

Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Actionable comments posted: 0
🧹 Nitpick comments (5)
charts/k8s-infra/templates/_config.tpl (3)
109-111: **Expand logging capabilities conditionally**

You are conditionally adding the `logging` exporter for the `metrics/scraper` pipeline, which is good. Double-check that the `logging` exporter is only appended if you indeed want logs of scraping activities. Otherwise, consider making this behavior optional.
140-142: **Append OTLP exporter carefully**

Similar logic to the logging exporter. Adding the `otlp` exporter to the `metrics/scraper` pipeline is helpful, but ensure it is correctly enabled or disabled based on user configuration to avoid sending unnecessary data.
204-210: **Merge with caution: metrics/internal**

In `applyPrometheusScraperConfig`, adding `"prometheus/scraper"` to `"metrics/internal"` might cause confusion between the internal pipeline and the new scraper pipeline. Consider documenting how the two pipelines differ in function.

charts/k8s-infra/values.yaml (2)
312-346: **Clarify default enabling of Prometheus scrapers**

Both `signoz.enabled` and `prometheus.enabled` default to `true`, effectively enabling scraping. This is convenient out of the box, but can lead to unintended overhead in large clusters. Document the resource implications for each new user.
1012-1015: **Assess concurrency & memory overhead**

Introducing the `metrics/scraper` pipeline can escalate memory usage if many pods are scraped collectively. Monitor usage or provide best-practice guidance for production environments to avoid out-of-memory issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `charts/k8s-infra/templates/_config.tpl` (5 hunks)
- `charts/k8s-infra/templates/otel-agent/tests/test-connection.yaml` (1 hunks)
- `charts/k8s-infra/templates/otel-deployment/tests/test-connection.yaml` (1 hunks)
- `charts/k8s-infra/values.yaml` (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (5)
charts/k8s-infra/templates/_config.tpl (3)
74-76: **Check Prometheus scraper usage logic**

The condition to enable the Prometheus scraper looks correct. Ensure that the relevant documentation clarifies when both `.Values.presets.prometheusScraper.signoz.enabled` and `.Values.presets.prometheusScraper.prometheus.enabled` can be toggled together, to avoid confusion or conflicting scrape configurations.
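For illustration, the two presets would be toggled in values along these lines (a sketch against this revision's value names; the inline comments are assumptions about intent):

```yaml
presets:
  prometheusScraper:
    signoz:
      enabled: true      # scrape pods carrying signoz.io/* annotations
    prometheus:
      enabled: false     # leave prometheus.io/* annotations to an existing Prometheus
```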
89-91: **Prevent potential reference errors**

When unsetting the `"metrics/internal"` pipeline, confirm that no references to it remain anywhere else in the final merged configuration. Leaving references in the code or the chart's documentation can cause confusion.
212-355: **Validate relabeling rules**

The `prometheusScraperConfig` includes multiple relabel configs. Verify that they align with best practices, especially around dropping `-init` containers. For high-volume workloads, ensure these rules are optimal for performance.

charts/k8s-infra/templates/otel-agent/tests/test-connection.yaml (1)
7-7: **Namespace definition**

Adding `namespace: {{ include "k8s-infra.namespace" . }}` is good for ensuring the Pod deploys into the correct namespace. No concerns here.

charts/k8s-infra/templates/otel-deployment/tests/test-connection.yaml (1)
7-7: **Consistent namespace handling**

Similarly, adding `namespace: {{ include "k8s-infra.namespace" . }}` is consistent with other test Pod definitions. Looks clean and maintains predictable deployment.
Signed-off-by: Prashant Shahi <[email protected]>
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `charts/k8s-infra/templates/_config.tpl` (5 hunks)
- `charts/k8s-infra/values.yaml` (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (3)
charts/k8s-infra/templates/_config.tpl (2)
212-219: **Well-implemented annotation conversion function!**

The function follows best practices for label name sanitization with proper ordering of replacements.
89-91: **Well-structured pipeline configuration!**

The metrics/scraper pipeline follows the established patterns for pipeline configuration and cleanup.
Also applies to: 109-111, 140-142
charts/k8s-infra/values.yaml (1)
996-999: **LGTM: Pipeline configuration follows established patterns.**

The metrics/scraper pipeline configuration is consistent with other pipeline configurations in the chart.
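For reference, the resulting pipeline shape would be roughly as follows (a sketch; only the `prometheus/scraper` receiver and `otlp` exporter are named in this PR, the processor list is an assumption):

```yaml
service:
  pipelines:
    metrics/scraper:
      receivers: [prometheus/scraper]
      processors: [batch]   # assumed; the chart wires its own processors
      exporters: [otlp]
```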
Actionable comments posted: 0
🧹 Nitpick comments (1)
charts/k8s-infra/templates/_config.tpl (1)
212-219: **Optimize annotation conversion for better performance.**

The current implementation makes multiple regex replacements. Consider combining them into a single regex operation for better performance.
```diff
 {{- define "opentelemetry-collector.convertAnnotationToPrometheusMetaLabel" -}}
 {{- $name := . | lower -}}
-{{- $name = regexReplaceAll "\\." $name "_" -}}
-{{- $name := regexReplaceAll "/" $name "_" -}}
-{{- $name := regexReplaceAll "-" $name "_" -}}
-{{- $name := regexReplaceAll "[^a-z0-9_]" $name "_" -}}
+{{- $name := regexReplaceAll "[^a-z0-9]+" $name "_" -}}
 {{- $name -}}
 {{- end -}}
```
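Worth noting: the combined pattern collapses runs of special characters into a single underscore, whereas the per-character replacements emit one underscore per character, so the two variants are not byte-for-byte equivalent for every input. A few illustrative conversions (hypothetical prefixes):

```yaml
# annotation prefix    -> emitted meta-label fragment
# signoz.io            -> signoz_io           (same under both variants)
# my-org.com/metrics   -> my_org_io_metrics? no: my_org_com_metrics (same under both variants)
# team..name           -> team_name with [^a-z0-9]+, team__name with per-character replacement
```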
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `charts/k8s-infra/templates/_config.tpl` (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (4)
charts/k8s-infra/templates/_config.tpl (4)
74-76: **LGTM: Pipeline configuration is well-structured.**

The metrics scraper pipeline configuration follows the established pattern and includes proper cleanup logic.
Also applies to: 89-91
109-111: **LGTM: Exporter configuration is consistent.**

The logging and OTLP exporter configurations for the metrics/scraper pipeline follow the established pattern.
Also applies to: 140-142
221-295: 🛠️ Refactor suggestion

**Add safeguards to prevent OOM issues.**

To address the concerns about out-of-memory issues with non-trivial workloads, add the following configurations:
```diff
 receivers:
   prometheus/scraper:
     config:
+      global:
+        scrape_timeout: 10s
+        sample_limit: {{ .Values.presets.prometheus.sampleLimit | default 50000 }}
       scrape_configs:
         {{- if .Values.presets.prometheus.enabled }}
         - job_name: "signoz-scraper"
           scrape_interval: {{ .Values.presets.prometheus.scrapeInterval }}
+          target_limit: {{ .Values.presets.prometheus.targetLimit | default 1000 }}
+          label_limit: {{ .Values.presets.prometheus.labelLimit | default 100 }}
+          label_name_length_limit: {{ .Values.presets.prometheus.labelNameLengthLimit | default 100 }}
+          label_value_length_limit: {{ .Values.presets.prometheus.labelValueLengthLimit | default 100 }}
```

Also, consider implementing the OpenTelemetry target allocator for better scalability:
- Use the target allocator to distribute scraping load across multiple collector instances
- Implement sharding to prevent duplicate scraping
- Consider using service discovery cache to reduce API server load
Would you like me to provide an example configuration for the target allocator?
Likely invalid or redundant comment.
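For reference, a minimal sketch of a target-allocator setup using the OpenTelemetry Operator's `OpenTelemetryCollector` resource (this is the operator's CRD, not part of this chart; the values shown are illustrative):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: scraper
spec:
  mode: statefulset           # a stable replica set, so targets can be sharded
  replicas: 3
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing   # shards targets across replicas
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []  # populated at runtime by the target allocator
```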
204-210: 🛠️ Refactor suggestion

**Add performance optimizations for large-scale deployments.**
Based on the concerns about OOM issues with non-trivial workloads, consider adding the following configurations:
- Memory management settings
- Scrape limits
- Target refresh intervals
```diff
 {{- define "opentelemetry-collector.applyPrometheusConfig" -}}
 {{- $config := mustMergeOverwrite (include "opentelemetry-collector.prometheusConfig" .Values | fromYaml) .config }}
+{{- if not $config.extensions }}
+{{- $_ := set $config "extensions" dict }}
+{{- end }}
+{{- if not $config.extensions.memory_ballast }}
+{{- $_ := set $config.extensions "memory_ballast" dict }}
+{{- $_ := set $config.extensions.memory_ballast "size_mib" 683 }}
+{{- end }}
+{{- if not $config.extensions.memory_limiter }}
+{{- $_ := set $config.extensions "memory_limiter" dict }}
+{{- $_ := set $config.extensions.memory_limiter "check_interval" "5s" }}
+{{- $_ := set $config.extensions.memory_limiter "limit_mib" 900 }}
+{{- end }}
 {{- if index $config.service.pipelines "metrics/internal" }}
 {{- $_ := set (index $config.service.pipelines "metrics/internal") "receivers" (append (index (index $config.service.pipelines "metrics/internal") "receivers") "prometheus/scraper" | uniq) }}
 {{- end }}
```

Likely invalid or redundant comment.
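A caveat on the suggestion above: in the OpenTelemetry Collector, `memory_limiter` is a processor, not an extension, and `memory_ballast` has been deprecated in recent collector releases in favor of the Go soft memory limit. A more conventional sketch (limit values are illustrative, and `batch` is assumed as the follow-on processor):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 900
    spike_limit_mib: 200
service:
  pipelines:
    metrics/scraper:
      processors: [memory_limiter, batch]   # memory_limiter should run first
```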
Signed-off-by: Prashant Shahi <[email protected]>
…en empty Signed-off-by: Prashant Shahi <[email protected]>
Signed-off-by: Prashant Shahi <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (1)
charts/k8s-infra/templates/_config.tpl (1)
224-298: ⚠️ Potential issue

**Add performance optimization configurations.**
To address the performance concerns with non-trivial workloads:
- Add sample and target limits
- Configure scrape timeout
- Add target refresh interval
This is a duplicate of the performance concerns raised in the previous review. Apply the suggested configurations to prevent OOM issues with large workloads.
🧹 Nitpick comments (1)
charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml (1)
34-139: **Add test cases for error scenarios.**

While the current test cases validate the happy path, consider adding tests for the following (see the sketch after this list):
- Invalid annotation values
- Missing required annotations
- Malformed port numbers
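A sketch of one such case in helm-unittest form (the template path, rendered data key, and value names are assumptions to adapt to the chart's actual test layout):

```yaml
suite: prometheus scraper error scenarios
templates:
  - otel-deployment/configmap.yaml
tests:
  - it: renders no scrape job when the prometheus preset is disabled
    set:
      presets:
        prometheus:
          enabled: false
    asserts:
      - notMatchRegex:
          path: data["otel-deployment.yaml"]
          pattern: signoz-scraper
```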
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `charts/k8s-infra/templates/_config.tpl` (5 hunks)
- `charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (3)
charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml (2)
1-7: **LGTM! Well-structured test suite configuration.**

The test suite is properly configured with schema validation and correct template specification.
8-33: **Verify scrape interval for production workloads.**

The scrape interval is set to 60s, which might need adjustment based on the workload size and resource constraints. Consider the following recommendations:
- For large clusters, a longer interval might be necessary to prevent OOM issues.
- Make the interval configurable based on workload size.
Based on the user's concern about performance with non-trivial workloads, run this script to check if other users have reported similar issues or shared recommended configurations:
✅ Verification successful
Current scrape interval configuration is appropriate ✅
The 60s default interval is a standard practice, and the configuration already provides flexibility through:
- Configurable interval via values.yaml
- Optional namespace scoping to limit collection scope
- Ability to target specific namespaces
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for discussions about Prometheus scraper performance
gh issue list --repo SigNoz/charts --label performance --json title,body | jq '.[] | select(.title | contains("prometheus") or .body | contains("prometheus"))'

# Check for existing configurations in other charts
rg -g "values.yaml" -A 5 "scrapeInterval" charts/
```

Length of output: 682
charts/k8s-infra/templates/_config.tpl (1)
215-222: **Add test cases for annotation conversion edge cases.**

The conversion utility should be tested with various edge cases:
- Unicode characters
- Multiple consecutive special characters
- Leading/trailing special characters
Run this script to verify the conversion utility with edge cases:
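(The script itself did not survive extraction. As a stand-in, a helm-unittest case exercising one edge case might look like this — the value names, rendered data key, and expected label fragment are assumptions:)

```yaml
tests:
  - it: converts special characters in annotationsPrefix to underscores
    set:
      presets:
        prometheus:
          enabled: true
          annotationsPrefix: "My-Org.io/Metrics"
    asserts:
      - matchRegex:
          path: data["otel-deployment.yaml"]
          pattern: my_org_io_metrics
```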
Actionable comments posted: 1
♻️ Duplicate comments (1)
charts/k8s-infra/templates/_config.tpl (1)
226-317: ⚠️ Potential issue

**Add performance optimization configurations.**
The Prometheus scraper configuration should include performance optimization settings to prevent OOM issues.
Add the following configurations to the scrape_configs:
```diff
   scrape_configs:
     {{- if .Values.presets.prometheus.enabled }}
     - job_name: "signoz-scraper"
       scrape_interval: {{ .Values.presets.prometheus.scrapeInterval }}
+      scrape_timeout: {{ .Values.presets.prometheus.scrapeTimeout }}
+      sample_limit: {{ .Values.presets.prometheus.sampleLimit }}
+      target_limit: {{ .Values.presets.prometheus.targetLimit }}
```
🧹 Nitpick comments (2)
charts/k8s-infra/templates/_config.tpl (2)
217-224: **Add validation for annotation prefix.**

The annotation conversion utility should validate the input to prevent empty or invalid prefixes.
Apply this diff:
```diff
 {{- define "opentelemetry-collector.convertAnnotationToPrometheusMetaLabel" -}}
+{{- if not . }}
+{{- fail "Annotation prefix cannot be empty" }}
+{{- end }}
 {{- $name := . | lower -}}
```
460-462: **Consolidate processor configurations.**

The processor configurations for metrics/scraper are scattered across multiple sections. Consider consolidating them for better maintainability.
Create a new template for metrics/scraper processors and reuse it across sections.
Also applies to: 499-501, 519-521
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `charts/k8s-infra/templates/_config.tpl` (8 hunks)
- `charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml` (1 hunks)
- `charts/k8s-infra/values.yaml` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (2)
charts/k8s-infra/values.yaml (1)
312-329: ⚠️ Potential issue

**Add performance tuning configurations to prevent OOM issues.**
Based on the PR comments about potential OOM issues with non-trivial workloads, consider adding the following configuration parameters:
- `sampleLimit`: Maximum number of samples per scrape
- `targetLimit`: Maximum number of targets to scrape
- `scrapeTimeout`: Timeout for scrape requests

Apply this diff:
```diff
   prometheus:
     enabled: false
     annotationsPrefix: signoz.io
     scrapeInterval: 60s
+    # Maximum number of samples per scrape
+    sampleLimit: 1000
+    # Maximum number of targets to scrape
+    targetLimit: 100
+    # Timeout for scrape requests
+    scrapeTimeout: 10s
     namespaceScoped: false
```

Likely invalid or redundant comment.
charts/k8s-infra/templates/_config.tpl (1)
207-215: 🛠️ Refactor suggestion

**Add error handling for missing pipeline.**

The current implementation assumes the metrics/scraper pipeline exists. Add error handling to create the pipeline if it doesn't exist.
The current implementation assumes the metrics/scraper pipeline exists. Add error handling to create the pipeline if it doesn't exist.
Apply this diff:
```diff
 {{- define "opentelemetry-collector.applyPrometheusConfig" -}}
 {{- $config := mustMergeOverwrite (include "opentelemetry-collector.prometheusConfig" .Values | fromYaml) .config }}
+{{- if not (index $config.service.pipelines "metrics/scraper") }}
+{{- $_ := set $config.service.pipelines "metrics/scraper" dict }}
+{{- $_ := set (index $config.service.pipelines "metrics/scraper") "receivers" list }}
+{{- $_ := set (index $config.service.pipelines "metrics/scraper") "processors" list }}
+{{- $_ := set (index $config.service.pipelines "metrics/scraper") "exporters" list }}
+{{- end }}
 {{- if index $config.service.pipelines "metrics/scraper" }}
```

Likely invalid or redundant comment.
Signed-off-by: Prashant Shahi <[email protected]>
Signed-off-by: Prashant Shahi <[email protected]>
Signed-off-by: Prashant Shahi <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (1)
charts/k8s-infra/templates/NOTES.txt (1)
11-11: **Consider enhancing warning messages with configuration guidance.**

The warnings about configMap creation are clear, but consider adding guidance on how to enable configMap creation or where to find more information.
```diff
-[WARNING] otel-agent "configMap" will not be created and "otelAgent.config" will not take effect.
+[WARNING] otel-agent "configMap" will not be created and "otelAgent.config" will not take effect. Set otelAgent.configMap.create=true to enable.
-[WARNING] otel-deployment "configMap" will not be created and "otelDeployment.config" will not take effect.
+[WARNING] otel-deployment "configMap" will not be created and "otelDeployment.config" will not take effect. Set otelDeployment.configMap.create=true to enable.
```

Also applies to: 15-15
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `charts/k8s-infra/Chart.yaml` (1 hunks)
- `charts/k8s-infra/templates/NOTES.txt` (1 hunks)
- `charts/k8s-infra/templates/_config.tpl` (8 hunks)
- `charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- charts/k8s-infra/Chart.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- charts/k8s-infra/tests/otel-deployment_prometheus_test.yaml
🧰 Additional context used
🪛 LanguageTool
charts/k8s-infra/templates/NOTES.txt
[uncategorized] ~7-~7: You might be missing the article “the” here.
Context: ...ment.image.tag }}' NOTE: If you enable Prometheus preset in this chart and have installed...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: test-chart
🔇 Additional comments (7)
charts/k8s-infra/templates/NOTES.txt (2)
4-5: **LGTM! Clear version information display.**

The version information for both components is clearly displayed using proper Helm templating syntax.
7-8: **Improve grammar and verify configuration option.**

The note about data duplication is crucial. Consider these improvements:
- Add "the" before "Prometheus preset" for better grammar.
- Consider adding a reference to the specific configuration path in the SigNoz chart.
```diff
-NOTE: If you enable Prometheus preset in this chart and have installed SigNoz helm chart in the same cluster,
+NOTE: If you enable the Prometheus preset in this chart and have installed the SigNoz helm chart in the same cluster,
```

Let's verify the existence of this configuration option:
✅ Verification successful
Grammar improvements approved, configuration verified.
The warning about potential data duplication is accurate and important. The `otelCollectorMetrics.enabled` configuration exists in the SigNoz chart and is enabled by default, making this warning particularly relevant.

The suggested grammar improvements should be implemented:
```diff
-NOTE: If you enable Prometheus preset in this chart and have installed SigNoz helm chart in the same cluster,
+NOTE: If you enable the Prometheus preset in this chart and have installed the SigNoz helm chart in the same cluster,
```

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Verify the otelCollectorMetrics.enabled configuration exists in SigNoz chart
rg -l "otelCollectorMetrics:\s*\n\s*enabled:" charts/
```

Length of output: 247
Script:
```shell
#!/bin/bash
# Search for otelCollectorMetrics configuration with proper multiline support
rg -U "otelCollectorMetrics:(\s+enabled:|.*\n.*enabled:)" charts/

# Search for any mentions of this configuration in documentation
rg -i "otelcollectormetrics" --type md charts/

# Look for values.yaml files that might contain this configuration
fd "values.yaml" charts/ --exec grep -l "otelCollectorMetrics" {} \;
```

Length of output: 9500
charts/k8s-infra/templates/_config.tpl (5)
74-76: **LGTM: Clean integration of Prometheus config.**

The conditional inclusion of Prometheus configuration is properly integrated into the deployment configuration pipeline.
89-91: **LGTM: Consistent pipeline configuration.**

The metrics/scraper pipeline is properly integrated with all necessary components:
- Pipeline cleanup
- Logging exporter
- OTLP exporter
- Kubernetes attributes processor
- Resource detection processor
Also applies to: 112-114, 143-145, 441-443, 480-482, 500-502
215-222: **LGTM: Robust annotation conversion utility.**

The `convertAnnotationToPrometheusMetaLabel` function properly handles various special characters and follows Prometheus naming conventions.
235-243: **LGTM: Flexible namespace configuration.**

The implementation properly handles both namespace-scoped and explicit namespace configurations, providing flexibility in deployment scenarios.
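As a sketch of what that flexibility looks like in values (`namespaceScoped` appears in this PR's values; the `namespaces` allow-list key is a hypothetical illustration):

```yaml
presets:
  prometheus:
    enabled: true
    namespaceScoped: true            # restrict scraping to the release namespace
    # namespaces: [team-a, team-b]   # hypothetical: an explicit namespace allow-list
```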
244-296: **LGTM: Comprehensive relabeling configuration.**

The relabeling configuration:
- Properly filters pods based on annotations
- Adds essential Kubernetes metadata
- Handles container name inclusion/exclusion
- Maintains consistent label naming
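To make the annotation-driven flow concrete, a pod opting into scraping would look roughly like this (the key names under the default `signoz.io` prefix are assumptions to check against the actual relabel rules):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
  annotations:
    signoz.io/scrape: "true"     # opt this pod into scraping (assumed key)
    signoz.io/port: "8080"       # metrics port (assumed key)
    signoz.io/path: "/metrics"   # metrics path (assumed key)
spec:
  containers:
    - name: app
      image: registry.example.com/sample-app:1.0   # placeholder image
      ports:
        - containerPort: 8080
```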
…or now Signed-off-by: Prashant Shahi <[email protected]>
Track #596 for enabling the prom operator
Summary
Fixes #445
Fixes #196
This has been requested by many OSS and cloud users who wish to scrape metrics from various clusters and send them to SigNoz running in a different cluster.
Note: If you enable the Prometheus preset in the k8s-infra chart and have SigNoz in the same cluster, please disable `otelCollectorMetrics` to avoid data duplication.
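In the SigNoz chart's values, that means (a minimal sketch; `otelCollectorMetrics.enabled` is confirmed above to exist and default to true):

```yaml
otelCollectorMetrics:
  enabled: false
```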
Signed-off-by: Prashant Shahi [email protected]
Summary by CodeRabbit
New Features
Improvements
Configuration Updates