As @ialidzhikov mentioned in #5 please address @oliver-goetz's comments first. I will try to post my thoughts on them directly in his review.
Note that I have not managed to look deeper into the metrics_scraper and input_data_registry packages and only had a rough overview of them. I plan to take a look at them when I have a bit more capacity in the following days.
General questions
I see in the PR for GEP-23 that there was a lengthy discussion about enabling HA. However, I feel that the current approach is a bit hacky as @ialidzhikov and @oliver-goetz outlined. I also understand the reasoning behind wanting to save network traffic by running an active/passive replica. Why did you choose to not forward requests for metrics from the passive replica to the active one in this setup? Additionally, the metrics-server already runs in active/active mode if its HA is enabled.
[andrerun]: AFAICT, in a three-AZ arrangement, forwarding from the passive to the active replica incurs an extra cross-AZ trip for 1/6 of the metrics requests. It also adds a bit of extra complexity, both to the implementation and operationally (it adds new failure modes). [under discussion]
Did you consider the https://github.com/kubernetes/client-go/tree/master/tools/cache package instead of the pod and secret controllers, listing the pods/secrets right before scraping metrics? The pods/secrets should be available in the cache.
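[For illustration only] A cache-based setup along these lines might look roughly as follows. This is a sketch using client-go's SharedInformerFactory, not the project's code; clientset, ctx, seedNamespace, and scrapeSecretName are all assumed/hypothetical names:

```go
// Sketch: back the scraper with informer caches instead of dedicated controllers.
factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
podLister := factory.Core().V1().Pods().Lister()
secretLister := factory.Core().V1().Secrets().Lister()

factory.Start(ctx.Done())
factory.WaitForCacheSync(ctx.Done())

// Right before scraping: read from the local cache, no API server round trip.
pods, err := podLister.Pods(seedNamespace).List(labels.Everything())
if err != nil {
    return err
}
secret, err := secretLister.Secrets(seedNamespace).Get(scrapeSecretName)
```

The listers serve from the shared in-memory cache that the informers keep up to date via watches, which avoids both per-scrape API calls and the custom controller bookkeeping.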
Instead of fetching the root ca and the secret used for prometheus, we should use a dedicated secret for this component. This can be deployed and managed by gardenlet similar to what we already do for the dependency watchdog.
This will be addressed when deploying via gardenlet.
Are stale metric points deleted? I didn't see a parameter which can be used to configure the maximum age of metrics. If we never delete older metrics, won't this lead to high memory consumption? Are you sure that there is high enough volatility with kube-apiservers (pods getting deleted frequently enough) that this will never occur?
A: we only keep the last 2 request metrics for each kube-apiserver pod.
Do you plan to add metrics for the gardener-custom-metrics itself? For instance, exposing information about how long scrape operations take, how many scrapes have been done, etc. could be useful and will also allow us to monitor if the scraper is running properly and fire alerts if it is not.
[andrerun]: The stated goal for the first release was along the lines of a minimum viable product. I'd absolutely want such observability in the long run, including having the data recorded in Prometheus.
[andrerun]: It was not an evaluation after the fact, but rather an advance statement of design intent. On the lower end, it reflects the view that for <20 pods, compact data structures (e.g. a flat array) would be a better fit for the internal implementation. The existing implementation would work, but it's not the optimal one for that scale. On the upper end, the statement means "that is the scale I had in mind when designing; more than that could bring surprises".
Mid Findings
Shouldn't we use only a read lock here, and a sync.RWMutex instead of sync.Mutex?
[andrerun]: My superficial take on it was "no need for the complication at the current level of contention". With <1000 rapid lock-unlock cycles per second, I expect that the lock is sitting free >99% of the time. I guess a deeper look - one where we look into actual usage statistics - might also end in "it's a write-heavy scenario, RWMutex is slower".
This is related to the previous point. In one of the comments you mention that a read lock should be acquired; however, with sync.Mutex there are no separate read or write locks. Additionally, this function is called from functions that modify the registry (starting with Set* and Remove*), so maybe I'm just confused by this comment.
[andrerun]: You're right that it's confusing, but I don't know how to do it better. Since it's an internal implementation comment, I strive to keep it resilient to change and decouple it from what is another piece of code's current choice of locking implementation. What needs to be held is indeed a read lock. Now, if the currently chosen lock implementation does not allow acquiring a read lock separately, then the read lock will have to be acquired along with a write lock, but that does not concern the commented function. Compare to what happens if I say just "lock" instead of "read lock", and we later switch to a RW lock. Suddenly the comment becomes ambiguous. I changed the comment to "Caller must acquire read lock before calling this function (or a semantic extension of a read lock - e.g. a read-write lock)"
We follow a naming scheme for package aliases that makes it easy to identify the package when reading the code. gcmctl is rather hard to understand and pronounce. Using something like inputcontroller makes more sense.
You could simply use the "k8s.io/component-base/version/verflag" package and its --version flag as we do in other components. There is probably no need for a version subcommand. You could also get rid of the pkg/version package.
This allows you to reimplement the InputDataServiceFactory for tests by creating a new InputDataServiceFactoryFunc which returns a mock/fake/stub InputDataService. This could be applied to other factories as well.
Minor Findings
LD_FLAGS is defined here, however it is then overwritten on line 33.
make start fails with no Go files in /Users/i077286/SAPDevelop/go/src/github.com/gardener/gardener-custom-metrics/cmd. To fix this you should change https://github.com/gardener/gardener-custom-metrics/blob/4b9dafe19127d22272a2613b46179bdc70532662/Makefile#L47 to ./cmd/gardener-custom-metrics/... \
[andrerun]: Done
Reconcile functions similar to what we do in gardener: https://github.com/gardener/gardener/blob/2a2240a0e1000dda21a883a1028ea5cf0e16369d/pkg/gardenlet/controller/managedseed/reconciler.go#L52-L53
k8s.io/apiserver dependencies: check if you also need to include openapi/v3 for k8s >= 1.27. Refs: https://github.com/gardener/gardener/blob/4520dfe7e1742dbcbf0baf914c6bc2cbd0f01991/cmd/gardener-apiserver/app/gardener_apiserver.go#L374-L384, https://kubernetes.io/blog/2023/04/24/openapi-v3-field-validation-ga/, and "Add an OpenAPIV3Config to gardener-apiserver" (gardener/gardener#8468).
Use cmd.RunE to return errors from the run function.
true?
Nits
input-data-service fits better for the log name here. There might be other similar cases where the name of the log can be improved.
Using namespaceName, it is a lot clearer like that.
Other suggestions
There is a different way you can do this in Go, as functions can also be used to implement interfaces (see the InputDataServiceFactoryFunc suggestion above).
pkg/input/input_data_registry/fakes/input_data_registry.go
Interfaces could go into dedicated interface.go files instead of mixing them with other types.