As @ialidzhikov mentioned in #5 please address @oliver-goetz's comments first. I will try to post my thoughts on them directly in his review.
Note that I have not managed to look deeper into the metrics_scraper and input_data_registry packages and only had a rough overview of them. I plan to take a look at them when I have a bit more capacity in the following days.
General questions
I see in the PR for GEP-23 that there was a lengthy discussion about enabling HA. However, I feel that the current approach is a bit hacky as @ialidzhikov and @oliver-goetz outlined. I also understand the reasoning behind wanting to save network traffic by running an active/passive replica. Why did you choose to not forward requests for metrics from the passive replica to the active one in this setup? Additionally, the metrics-server already runs in active/active mode if its HA is enabled.
[andrerun]: AFAICT, in a three-AZ arrangement, forwarding from the passive to the active replica incurs an extra cross-AZ trip for 1/6 of the metrics requests. It also adds a bit of extra complexity, both to the implementation and operationally (it adds new failure modes). [under discussion]
Did you consider the https://github.com/kubernetes/client-go/tree/master/tools/cache package instead of the pod and secret controllers, listing the pods/secrets right before scraping metrics? The pods/secrets should be available in the cache.
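[For illustration only] A cache-based setup along these lines might look roughly as follows. This is a sketch using client-go's SharedInformerFactory, not the project's code; clientset, ctx, seedNamespace, and scrapeSecretName are all assumed/hypothetical names:

```go
// Sketch: back the scraper with informer caches instead of dedicated controllers.
factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
podLister := factory.Core().V1().Pods().Lister()
secretLister := factory.Core().V1().Secrets().Lister()

factory.Start(ctx.Done())
factory.WaitForCacheSync(ctx.Done())

// Right before scraping: read from the local cache, no API server round trip.
pods, err := podLister.Pods(seedNamespace).List(labels.Everything())
if err != nil {
    return err
}
secret, err := secretLister.Secrets(seedNamespace).Get(scrapeSecretName)
```

The listers serve from the shared in-memory cache that the informers keep up to date via watches, which avoids both per-scrape API calls and the custom controller bookkeeping.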
Instead of fetching the root ca and the secret used for prometheus, we should use a dedicated secret for this component. This can be deployed and managed by gardenlet similar to what we already do for the dependency watchdog.
This will be addressed when deploying via gardenlet.
Are stale metric points deleted? I didn't see a parameter which can be used to configure the maximum age of metrics. If we never delete older metrics, won't this lead to high memory consumption? Are you sure that there is high enough volatility with kube-apiservers (pods getting deleted frequently enough) that this will never occur?
A: we only keep the last 2 request metrics for each kube-apiserver pod.
Do you plan to add metrics for the gardener-custom-metrics itself? For instance, exposing information about how long scrape operations take, how many scrapes have been done, etc. could be useful and will also allow us to monitor if the scraper is running properly and fire alerts if it is not.
[andrerun]: The stated goal for the first release was along the lines of a minimum viable product. I'd absolutely want such observability in the long run, including having the data recorded in Prometheus.
[andrerun]: It was not an evaluation after the fact, but rather an advance statement of design intent. On the lower end, it reflects the view that for <20 pods, compact data structures (e.g. a flat array) would be a better fit for the internal implementation. The existing implementation would work, but it's not the optimal one for that scale. On the upper end, the statement means "that is the scale I had in mind when designing; more than that could bring surprises".
Mid Findings
Shouldn't we use only a read lock here, and a sync.RWMutex instead of sync.Mutex?
[andrerun]: My superficial take on it was "no need for the complication at the current level of contention". With <1000 rapid lock-unlock cycles per second, I expect that the lock is sitting free >99% of the time. I guess a deeper look - one where we look into actual usage statistics - might also end in "it's a write-heavy scenario, RWMutex is slower".
This is related to the previous point. In one of the comments you mention that a read lock should be acquired; however, with sync.Mutex there are no separate read or write locks. Additionally, this function is called from functions that modify the registry (starting with Set* and Remove*), so maybe I'm just confused by this comment.
[andrerun]: You're right that it's confusing, but I don't know how to do it better. Since it's an internal implementation comment, I strive to keep it resilient to change and decouple it from what is another piece of code's current choice of locking implementation. What needs to be held is indeed a read lock. Now, if the currently chosen lock implementation does not allow acquiring a read lock separately, then the read lock will have to be acquired along with a write lock, but that does not concern the commented function. Compare to what happens if I say just "lock" instead of "read lock", and we later switch to a RW lock. Suddenly the comment becomes ambiguous. I changed the comment to "Caller must acquire read lock before calling this function (or a semantic extension of a read lock - e.g. a read-write lock)"
We follow a naming scheme for package aliases that makes it easy to identify the package when reading the code. gcmctl is rather hard to understand and pronounce. Using something like inputcontroller makes more sense.
You could simply use the "k8s.io/component-base/version/verflag" package and its --version flag as we do in other components. There is probably no need for a version subcommand. You could also get rid of the pkg/version package.
This allows you to reimplement the InputDataServiceFactory for tests by creating a new InputDataServiceFactoryFunc which returns a mock/fake/stub InputDataService. This could be applied to other factories as well.
Minor Findings
LD_FLAGS is defined here, however it is then overwritten on line 33.
make start fails with no Go files in /Users/i077286/SAPDevelop/go/src/github.com/gardener/gardener-custom-metrics/cmd. To fix this you should change https://github.com/gardener/gardener-custom-metrics/blob/4b9dafe19127d22272a2613b46179bdc70532662/Makefile#L47 to ./cmd/gardener-custom-metrics/... \
[andrerun]: Done
Reconcile functions similar to what we do in gardener: https://github.com/gardener/gardener/blob/2a2240a0e1000dda21a883a1028ea5cf0e16369d/pkg/gardenlet/controller/managedseed/reconciler.go#L52-L53
k8s.io/apiserver dependencies: check if you also need to include openapi/v3 for k8s >= 1.27. Refs: https://github.com/gardener/gardener/blob/4520dfe7e1742dbcbf0baf914c6bc2cbd0f01991/cmd/gardener-apiserver/app/gardener_apiserver.go#L374-L384, https://kubernetes.io/blog/2023/04/24/openapi-v3-field-validation-ga/, and "Add an OpenAPIV3Config to gardener-apiserver" (gardener/gardener#8468).
Use cmd.RunE to return errors from the run function.
true?
Nits
input-data-service fits better for the log name here. There might be other similar cases where the name of the log can be improved.
Using namespaceName, it is a lot clearer like that.
Other suggestions
There is a different way you can do this in Go, as functions can also be used to implement interfaces (see the InputDataServiceFactoryFunc suggestion above).
pkg/input/input_data_registry/fakes/input_data_registry.go
Interfaces could go into dedicated interface.go files instead of mixing them with other types.