
Code review comments from @plkokanov #6

Open
4 of 16 tasks
plkokanov opened this issue Feb 28, 2024 · 0 comments
plkokanov commented Feb 28, 2024

As @ialidzhikov mentioned in #5, please address @oliver-goetz's comments first. I will try to post my thoughts on them directly in his review.

Note that I have not managed to look deeper into the metrics_scraper and input_data_registry packages and only had a rough overview of them. I plan to take a look at them when I have a bit more capacity in the coming days.


General questions

  • I see in the PR for GEP-23 that there was a lengthy discussion about enabling HA. However, I feel that the current approach is a bit hacky, as @ialidzhikov and @oliver-goetz outlined. I also understand the reasoning behind wanting to save network traffic by running an active/passive replica. Why did you choose not to forward requests for metrics from the passive replica to the active one in this setup? Additionally, the metrics-server already runs in active/active mode if its HA is enabled.
    • [andrerun]: AFAICT, in a three-AZ arrangement, forwarding from the passive to the active replica incurs an extra cross-AZ trip for 1/6 of the metrics requests. It also adds a bit of extra complexity, both in implementation and operationally (new failure modes). [under discussion]
  • Did you consider using the https://github.com/kubernetes/client-go/tree/master/tools/cache package instead of the pod and secret controllers, and listing the pods/secrets right before scraping metrics? The pods/secrets should be available in the cache. (See the informer cache sketch after this list.)
  • Instead of fetching the root CA and the secret used for Prometheus, we should use a dedicated secret for this component. This can be deployed and managed by gardenlet, similar to what we already do for the dependency watchdog.
    • This will be addressed when deploying via gardenlet.
  • Are stale metric points deleted? I didn't see a parameter which can be used to configure the maximum age of metrics. If we never delete older metrics, won't this lead to high memory consumption? Are you sure that the kube-apiserver pods are volatile enough (deleted frequently enough) that this will never occur?
    • A: We only keep the last 2 request metrics for each kube-apiserver pod. (See the retention sketch after this list.)
  • Do you plan to add metrics for the gardener-custom-metrics component itself? For instance, exposing information about how long scrape operations take, how many scrapes have been done, etc. could be useful and would also allow us to monitor whether the scraper is running properly and fire alerts if it is not. (See the instrumentation sketch after this list.)
    • [andrerun]: The stated goal for the first release was along the lines of a minimum viable product. I'd absolutely want such observability in the long run, including having the data recorded in Prometheus.
  • https://github.com/gardener/gardener-custom-metrics/blob/4b9dafe19127d22272a2613b46179bdc70532662/pkg/input/metrics_scraper/scraper.go#L21
    • How was this range (20-6000) evaluated?
      • [andrerun]: It was not an evaluation after the fact, but rather an advance statement of design intent. On the lower end, it reflects the view that for <20 pods, compact data structures (e.g. a flat array) would be a better fit for the internal implementation. The existing implementation would work, but it's not the optimal one for that scale. On the upper end, the statement means "that is the scale I had in mind when designing; more than that could bring surprises".
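
A hedged sketch of what the client-go cache suggestion above could look like: a shared informer factory keeps pods and secrets in an in-memory cache, so they can simply be listed right before each scrape instead of being tracked by dedicated controllers. The namespace and the label selector below are illustrative assumptions, not values taken from the repository.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// One shared factory maintains a single watch per resource type and keeps
	// the objects in memory, so repeated List calls are served from the cache.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 30*time.Second, informers.WithNamespace("shoot--foo--bar"))

	podLister := factory.Core().V1().Pods().Lister()
	secretLister := factory.Core().V1().Secrets().Lister()

	ctx := context.Background()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Right before a scrape: read from the cache, no API server round trip.
	pods, err := podLister.Pods("shoot--foo--bar").List(
		labels.SelectorFromSet(labels.Set{"app": "kubernetes", "role": "apiserver"}))
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d kube-apiserver pods in cache\n", len(pods))

	_ = secretLister // secrets would be listed the same way before building scrape targets
}
```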
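For illustration, a minimal sketch of the bounded retention described in the answer above ("only keep the last 2 request metrics"): per pod, only the latest and the previous sample are stored, so memory per pod stays constant and stale history cannot accumulate. The type and field names are hypothetical and do not mirror the actual input_data_registry implementation.

```go
package main

import (
	"fmt"
	"time"
)

// metricPoint is one scraped total request count for a kube-apiserver pod.
type metricPoint struct {
	Timestamp    time.Time
	RequestCount int64
}

// podMetrics retains only the latest and the previous sample.
type podMetrics struct {
	previous *metricPoint
	latest   *metricPoint
}

// Add records a new sample, discarding everything older than the previous one.
func (p *podMetrics) Add(point metricPoint) {
	p.previous = p.latest
	p.latest = &point
}

// Rate returns requests/second between the two retained samples, if both exist.
func (p *podMetrics) Rate() (float64, bool) {
	if p.previous == nil || p.latest == nil {
		return 0, false
	}
	seconds := p.latest.Timestamp.Sub(p.previous.Timestamp).Seconds()
	if seconds <= 0 {
		return 0, false
	}
	return float64(p.latest.RequestCount-p.previous.RequestCount) / seconds, true
}

func main() {
	var m podMetrics
	m.Add(metricPoint{Timestamp: time.Now().Add(-time.Minute), RequestCount: 1000})
	m.Add(metricPoint{Timestamp: time.Now(), RequestCount: 1600})
	if rate, ok := m.Rate(); ok {
		fmt.Printf("request rate: %.1f req/s\n", rate)
	}
}
```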
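A hedged sketch of the kind of self-observability suggested above, using prometheus/client_golang: a scrape counter and a scrape-duration histogram exposed on /metrics. The metric names and the instrumentedScrape helper are made up for the example and are not part of the current component.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	scrapesTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "gardener_custom_metrics_scrapes_total",
			Help: "Number of kube-apiserver scrape attempts, by result.",
		},
		[]string{"result"},
	)
	scrapeDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "gardener_custom_metrics_scrape_duration_seconds",
			Help:    "Duration of individual kube-apiserver scrapes.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func init() {
	prometheus.MustRegister(scrapesTotal, scrapeDuration)
}

// instrumentedScrape wraps one scrape and records its outcome and latency.
func instrumentedScrape(scrape func() error) error {
	start := time.Now()
	err := scrape()
	scrapeDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		scrapesTotal.WithLabelValues("error").Inc()
		return err
	}
	scrapesTotal.WithLabelValues("success").Inc()
	return nil
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	_ = instrumentedScrape(func() error { return nil }) // placeholder scrape
	_ = http.ListenAndServe(":8080", nil)
}
```

Having these series in Prometheus would also make it straightforward to alert when scrapes stop succeeding, as mentioned in the question.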

Mid Findings

  • https://github.com/gardener/gardener-custom-metrics/blob/4b9dafe19127d22272a2613b46179bdc70532662/pkg/input/input_data_registry/input_data_registry.go#L211-L213
    • Shouldn't we use only a read lock here, and a sync.RWMutex instead of sync.Mutex? (See the locking sketch after this list.)
      • [andrerun]: My superficial take on it was "no need for the complication at the current level of contention". With <1000 rapid lock-unlock cycles per second, I expect the lock to be sitting free >99% of the time. I guess a deeper look, one where we look into actual usage statistics, might also conclude "it's a write-heavy scenario, RWMutex is slower".
  • https://github.com/gardener/gardener-custom-metrics/blob/4b9dafe19127d22272a2613b46179bdc70532662/pkg/input/input_data_registry/input_data_registry.go#L507-L512
    • This is related to the previous point. In one of the comments you mention that a read lock should be acquired; however, with sync.Mutex there are no separate read and write locks. Additionally, this function is called from functions that modify the registry (starting with Set* and Remove*), so maybe I'm just confused by this comment.
      • [andrerun]: You're right that it's confusing, but I don't know how to do it better. Since it's an internal implementation comment, I strive to keep it resilient to change and to decouple it from what is another piece of code's current choice of locking implementation. What needs to be held is indeed a read lock. Now, if the currently chosen lock implementation does not allow acquiring a read lock separately, then the read lock will have to be acquired along with a write lock, but that does not concern the commented function. Compare that to what happens if I say just "lock" instead of "read lock", and we later switch to an RW lock. Suddenly the comment becomes ambiguous. I changed the comment to "Caller must acquire read lock before calling this function (or a semantic extension of a read lock - e.g. a read-write lock)".
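
For reference, a small sketch contrasting the two options discussed above: with sync.RWMutex a read-only accessor can take RLock while mutating methods take the exclusive Lock; with sync.Mutex that distinction disappears and "read lock" can only be satisfied by the one lock. The registry type below is illustrative, not the real input_data_registry implementation.

```go
package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mu   sync.RWMutex // swapping this for sync.Mutex removes the read/write distinction
	data map[string]int64
}

// Get only reads shared state, so a read lock (RLock) is sufficient; many
// readers can hold it concurrently.
func (r *registry) Get(key string) (int64, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.data[key]
	return v, ok
}

// Set mutates shared state and therefore needs the exclusive write lock.
func (r *registry) Set(key string, value int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[key] = value
}

func main() {
	r := &registry{data: map[string]int64{}}
	r.Set("kube-apiserver-0", 42)
	if v, ok := r.Get("kube-apiserver-0"); ok {
		fmt.Println(v)
	}
}
```

Whether RWMutex actually pays off depends on the read/write mix and the level of contention, which matches the reservation expressed above.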

Minor Findings


Nits


Other suggestions
