Validate new OTLP log setup #1631

Open
3 of 6 tasks
a-thaler opened this issue Nov 22, 2024 · 0 comments · May be fixed by #1705
Labels: area/logs LogPipeline

a-thaler (Collaborator) commented Nov 22, 2024

Description
Review and collect the results of the PoCs done for #556 and come up with a final config for the OTLP-based log agent.
The agent should (a minimal config sketch follows the list):

  • tail logs from the container runtime
  • map the content to OTLP
  • parse the JSON payload, if applicable
  • send the data synchronously to the log gateway
  • pause tailing in case of gateway refusals
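
This is not the final config this issue is meant to produce, but a minimal sketch of how these requirements could map to a collector agent configuration, assuming the filelog receiver with the container and json_parser operators and a plain OTLP exporter; the gateway endpoint and all values are placeholders:

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]   # tail logs from the container runtime
    start_at: beginning
    operators:
      - type: container                  # map containerd/CRI-O/Docker lines to OTLP log records
      - type: json_parser                # if applicable, parse a JSON payload in the body
        if: 'body matches "^{.*}$"'
        parse_from: body

exporters:
  otlp:
    endpoint: telemetry-otlp-logs.kyma-system:4317   # assumed gateway address, placeholder
    tls:
      insecure: true                     # placeholder, depends on the gateway setup
    sending_queue:
      enabled: false        # no async queue: export synchronously ...
    retry_on_failure:
      enabled: true         # ... and retry on refusals, which pauses the tail (backpressure)

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```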

With that setup, perform a load test to see that the maximum ingestion rate of a Cloud Logging instance can be achieved.

  • Document the final ConfigMap and the load test results

Considerations:

  • The maximum observed output across the whole Fluent Bit DaemonSet was 60 MB/s or 44K records/s (at which point the backend could not sustain the load)
  • The maximum observed output per Fluent Bit instance was 4.4 MB/s or 15K records/s
  • In Fluent Bit format, Cloud Logging can handle 2.5K logs/s in the standard plan or an average of 15K logs/s (max 25K logs/s) in the large plan
  • In OTLP format, Cloud Logging can handle 10K logs/s in the standard plan or 30K logs/s in the large plan

After the first performance test results, the following aspects should be double-checked:

  • Check the official performance test results and see whether they show similar numbers. Is the resource utilization similarly low?
  • The agent throughput was halved by switching from the debug exporter to an OTLP exporter with a mock backend; can the impact really be that drastic?
  • Try out the new batching exporter and check the performance increase for the agent
  • Based on the actual requirements (surviving spike scenarios where backends refuse data for a short period of time), check whether a persistent queue really brings a benefit over a simple in-memory setup with backpressure (see the queue sketch after this list)
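
For the queue question, a hedged exporter-only fragment of the two variants to compare, assuming the file_storage extension for the persistent case; the endpoint, the storage path, and the batching-exporter details are assumptions and depend on the collector version:

```yaml
extensions:
  file_storage:                          # only needed for the persistent-queue variant
    directory: /var/lib/otelcol/queue    # assumed path; must survive pod restarts (hostPath/PVC)

exporters:
  otlp:
    endpoint: telemetry-otlp-logs.kyma-system:4317   # assumed gateway address, placeholder
    retry_on_failure:
      enabled: true                      # retries are what produce backpressure on the tail
    # Variant A: simple in-memory behaviour, no queue at all
    sending_queue:
      enabled: false
    # Variant B: persistent queue backed by the file_storage extension
    # sending_queue:
    #   enabled: true
    #   storage: file_storage
    # The new batching exporter would also be enabled on this exporter for the
    # perf comparison (exact fields depend on the collector version).
```

Variant A relies on the synchronous pipeline to pause tailing during refusals; Variant B trades extra disk I/O and node storage for buffering that survives agent restarts while the backend is refusing data.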

The key decision points are:

  • Is the performance good enough to base logging on the otel-collector already?
  • Decide on the architecture (agent pushes directly or via the gateway), considering aspects around persistence and memory consumption (API server caching)
@a-thaler a-thaler added the area/logs LogPipeline label Nov 22, 2024
@a-thaler a-thaler mentioned this issue Nov 22, 2024
24 tasks
@TeodorSAP TeodorSAP self-assigned this Nov 27, 2024