Warehouse does not reconcile #2836

Open
4 tasks done
wmiller112 opened this issue Oct 24, 2024 · 17 comments

Comments

@wmiller112
Contributor

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

This is a continuation of this issue. I continue to experience Warehouses not respecting spec.interval or reconciling when changed, and I've observed the same behavior across ~20 different projects. It does not seem related to the git/image repos, because when the reconcile does finally trigger it executes in a matter of seconds, as can be seen in the screenshot of logs. The reconcile occurs at a fairly regular interval of roughly 15-20 minutes, regardless of whether spec.interval is set to anything from 2m to 10m. Triggering a refresh via the GUI adds the annotation to the Warehouse, but even this does not trigger a reconcile; it eventually reconciles on the same ~15-20 minute cadence.

Screenshots

(Screenshot: logs, taken 2024-10-24 at 8:59:21 AM)

Steps to Reproduce

  • Create a Warehouse with a spec similar to:
spec:
  freightCreationPolicy: Automatic
  interval: 2m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
          - regex:path/to/manifests/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>
        imageSelectionStrategy: Lexical
        repoURL: xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo>
  • Click refresh on the Warehouse in the GUI
  • Watch it spin until it eventually reconciles

I'd expect that immediately after clicking refresh, the modified Warehouse would be seen by the controller and reconciled, triggering a freight fetch.

Version

Client Version: v1.0.3
Server Version: v1.0.3
@krancour
Member

@wmiller112 first, my apologies. I really thought we'd found it and nailed it in #2544 and #2754.

I know you don't suspect rate limiting at this point (nor do I), but I'm still curious if you observe any change in this behavior when you're subscribed to just one repo. In particular, if you subscribe only to the Git repo, does the situation improve?

Also, just to get more data: is there any sharding involved here?

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

It may be an idea for us to bake an image with net/http/pprof enabled so we can collect a profile.
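
For context, wiring net/http/pprof into a Go binary is usually just a blank import plus a listener. A minimal sketch, assuming the profiler is served on a separate local-only port; the address and wiring here are illustrative, not Kargo's actual code:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // Serve the profiling endpoints on a separate, non-public port.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the controller itself would start here; block forever for this sketch ...
    select {}
}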

@wmiller112
Contributor Author

wmiller112 commented Oct 24, 2024

It's the same for a warehouse with only a git subscription.

I've built and am running the controller image with pprof enabled and am able to collect profiles/traces, but I'm not exactly sure what to look for. I also added some additional logging around the requeue interval and confirmed the requeue is being set to the defined interval when the reconcile does run.

Edit: Ah, I see the debug README that was added in the PR says to collect a heap profile.

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

The endpoints typically of interest are the following:

  • /debug/pprof/goroutine?debug=2 to see what goroutines are currently running, waiting, or blocked.
  • /debug/pprof/profile to see what the CPU is spending most time on.
  • /debug/pprof/heap to get a better insight into memory allocations.
  • /debug/pprof/trace?seconds=<number of seconds> to collect a trace (i.e., almost anything the program is doing over these seconds).
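
If it helps, a small throwaway Go program can pull all of those at once. This assumes the controller's pprof server has been port-forwarded to localhost:6060; the address, durations, and output file names are just examples:

package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    // Assumes `kubectl port-forward` (or similar) exposes the controller's
    // pprof server on localhost:6060.
    base := "http://localhost:6060/debug/pprof/"
    targets := map[string]string{
        "goroutine?debug=2":  "goroutines.txt",
        "profile?seconds=30": "cpu.pprof",
        "heap":               "heap.pprof",
        "trace?seconds=30":   "trace.out",
    }
    for path, file := range targets {
        resp, err := http.Get(base + path)
        if err != nil {
            fmt.Fprintln(os.Stderr, "fetching", path, "failed:", err)
            continue
        }
        out, err := os.Create(file)
        if err == nil {
            io.Copy(out, resp.Body) // write the profile to disk
            out.Close()
        }
        resp.Body.Close()
    }
}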

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

Some of these profiles can hypothetically contain slightly sensitive data (e.g., memory addresses or minimal operational details). If you want to share them, you can do so via mail: [email protected].

@hiddeco
Contributor

hiddeco commented Oct 24, 2024

My current working theory, based on the traces @wmiller112 shared in private, is that the concurrency configuration of our reconcilers is currently set to the default (which equals 1). Things are not being picked up in time because the controller is busy gathering the latest image build information for another Warehouse (which can be quite an expensive task, depending on the number of images you have).

I have shared a branch with him (https://github.com/hiddeco/kargo/tree/allow-concurrency-config) that allows configuring the number of concurrent reconciliations, to see if this proves me right. If it does, we should create a proper solution for this and think about the default configuration we want to ship with.
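
For anyone following along, in controller-runtime this boils down to the MaxConcurrentReconciles option set when a controller is built. A rough sketch of where that knob lives; the Kargo import path, type name, and helper below are assumptions for illustration, not code from the branch:

package reconcilers

import (
    kargoapi "github.com/akuity/kargo/api/v1alpha1" // assumed import path for the Warehouse type

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWarehouseReconciler is a hypothetical helper showing where the knob lives.
func setupWarehouseReconciler(mgr ctrl.Manager, r reconcile.Reconciler, workers int) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&kargoapi.Warehouse{}).
        // The default is 1 worker, so a single slow Warehouse blocks all others;
        // raising this lets several Warehouses reconcile in parallel.
        WithOptions(controller.Options{MaxConcurrentReconciles: workers}).
        Complete(r)
}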

@hiddeco hiddeco self-assigned this Oct 24, 2024
@wmiller112
Contributor Author

With MAX_CONCURRENT_RECONCILES set to 4 I definitely see more timely Warehouse reconciles (down to about 5-10 minutes, with the interval still set to 2m). Interestingly, the execution of the reconcile now takes longer (it was 10-20 seconds, and is now about 1-2 minutes). These logs are for a single warehouse, the same as in the initial post:
(Screenshot: logs, taken 2024-10-24 at 7:30:14 PM)
It seems there is a point of (completely) diminishing returns: increasing beyond 4 (I tried 6 and 8), reconciles seem to start on some Warehouses and then just hang for 20+ minutes. I have been keeping an eye on resource utilization and am staying well within requests/limits.

@hiddeco
Contributor

hiddeco commented Oct 25, 2024

Would it be possible for you to send me another set of profiles with this configuration enabled?

@hiddeco
Contributor

hiddeco commented Oct 28, 2024

After some more debugging through private channels, the reason for this can be summarized as follows:

  • At present, all reconcilers have a single worker configured. Because of this, anything that takes a long time will delay the processing of all other resources of the same kind.
  • When using (many) "newest build" selectors for container images in Warehouses, there is a high possibility they will run into rate limits from the registry at some point. Because there is a fair chance multiple Warehouses will be rate-limited at once (the requests come from the same IP, the credentials are shared with other Warehouses, the rate limit is tied to the account/organization, or a combination of these), this can cause a flood of HTTP 429 responses. Combined with our fairly simple registry implementation, I suspect these are not detected correctly and retried; in addition, I suspect that goroutines that have already kicked off are not canceled as soon as a first error occurs (see the cancellation sketch after this list).
  • We are not configuring GOMAXPROCS and GOMEMLIMIT to match the CPU and memory limits, while this is required for Go to properly detect the limits of the Pod rather than the limits of the host.
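
To illustrate the cancellation point from the second bullet: with errgroup.WithContext, the first error cancels a shared context so sibling goroutines can stop early instead of continuing to hammer a rate-limited registry. This is a generic sketch, not Kargo's actual discovery code; discoverTags, discoverAll, and the repository names are made up:

package main

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// discoverTags stands in for per-repository discovery work; the real Kargo
// code looks different, this only illustrates the cancellation pattern.
func discoverTags(ctx context.Context, repo string) error {
    select {
    case <-ctx.Done():
        // A sibling goroutine already failed (e.g. got an HTTP 429), so stop
        // early instead of issuing more requests to a rate-limited registry.
        return ctx.Err()
    default:
        // ... talk to the registry here ...
        return nil
    }
}

func discoverAll(ctx context.Context, repos []string) error {
    // errgroup.WithContext cancels the shared context as soon as any
    // goroutine returns an error, so the remaining workers can bail out.
    g, ctx := errgroup.WithContext(ctx)
    for _, repo := range repos {
        repo := repo // capture loop variable (pre-Go 1.22 semantics)
        g.Go(func() error {
            return discoverTags(ctx, repo)
        })
    }
    return g.Wait()
}

func main() {
    if err := discoverAll(context.Background(), []string{"repo/a", "repo/b"}); err != nil {
        fmt.Println("discovery failed:", err)
    }
}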

To address this, we should take multiple actions:

  • Allow configuration of the number of workers, preferably per reconciler. The rationale for this is that not every kind is expected to involve long-running reconciliations: Warehouses and Promotions can be, but Stages should be relatively quick.

  • Ensure the chart properly detects CPU and memory limits and configures the earlier-mentioned environment variables. For example:

    env:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          containerName: controller
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          containerName: controller
          resource: limits.cpu
  • Revise the registry logic, find the gaps that result in a performance impact, and close them so this won't happen again (a sketch of 429 handling follows below).
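
On that last point, the kind of gap I mean: a registry client should treat HTTP 429 as retryable and honor the Retry-After header. A generic sketch of that behavior (not Kargo's actual registry code; getWithRetry and its parameters are invented for illustration):

package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// getWithRetry treats HTTP 429 as retryable and honors Retry-After,
// returning the last response once it succeeds or attempts run out.
func getWithRetry(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
    for attempt := 1; ; attempt++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests || attempt == maxAttempts {
            return resp, nil
        }
        // Back off before retrying; fall back to a small default if the
        // registry did not send a numeric Retry-After header.
        delay := 2 * time.Second
        if s := resp.Header.Get("Retry-After"); s != "" {
            if secs, err := strconv.Atoi(s); err == nil {
                delay = time.Duration(secs) * time.Second
            }
        }
        resp.Body.Close()
        time.Sleep(delay)
    }
}

func main() {
    resp, err := getWithRetry(http.DefaultClient, "https://registry.example.com/v2/", 3)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}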

@ddeath
Contributor

ddeath commented Oct 29, 2024

Hi, I am on 1.0.3 and I am observing similar behavior.

I have 1 project, 1 warehouse, and 1 stage, and it does not discover new Docker images automatically. If I press the button to refresh, it will find them, but otherwise it does not do anything.

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: testing-playground-warehouse
  namespace: app
spec:
  interval: "1m0s"
  subscriptions:
    - image:
        repoURL: registry.gitlab.com/xxxxxxx/testing-playground-service
        strictSemvers: false   

@ddeath
Contributor

ddeath commented Oct 29, 2024

I deleted the Project, Warehouse, and Stage and recreated them, and now it works fine. I will try to do some stuff to get it stuck again.

@ddeath
Contributor

ddeath commented Oct 29, 2024

Just a note, I am observing this behavior again. I still have just 1 Project, 1 Stage, 1 Warehouse.

I don't see any errors except:

time="2024-10-29T14:40:32Z" level=error msg="finished streaming call" connect.code=internal connect.duration=17.922126759s connect.method=WatchWarehouses connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-29T14:40:14Z" error="internal: context canceled"
time="2024-10-29T14:40:32Z" level=error msg="finished streaming call" connect.code=internal connect.duration=17.92214805s connect.method=WatchStages connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-29T14:40:14Z" error="internal: context canceled"

So if it were a rate limit, I would expect to see some errors in the pod logs. I don't really have reproducible steps yet; the only thing I was doing is changing the Stage manifest (testing different imageSelectionStrategy values).

Let me know if you think this is a completely different issue; I can open a new one.

@krancour
Member

@ddeath you're looking at API server logs when you should be looking at controller logs.

@ddeath
Contributor

ddeath commented Oct 29, 2024

@krancour I checked all pods (kargo-api, kargo-controller, kargo-management-controller, kargo-webhooks-server) and the only errors there were those from kargo-api.

@krancour
Member

@ddeath, you'll notice @hiddeco and @wmiller112 had a lot of back and forth over the course of two issues to figure out what was going on here. @hiddeco has summarized the findings nicely above. You and I also had some recent discussion about what sort of things can make Warehouse reconciliation slow (and therefore become a bottleneck). Please try to use this information to determine whether you may have one or more poorly configured Warehouses that are taking a long time to reconcile and blocking others from reconciling at regular intervals. Enabling debug or even trace-level logging may help as well.

@ddeath
Contributor

ddeath commented Oct 29, 2024

Yeah, I am slowly doing that. However, as I noted, I have only 1 Warehouse with only 1 subscription (with a private token, not a public one, so rate limits will be higher), and I am not seeing any reconcile happening for a long time (30 minutes). So just reporting my findings 🤷
But ok, I promise I will not write another comment without some sort of hard evidence of what is wrong in my case...

@krancour
Member

with only 1 subscription

What are its details? Is it a subscription you'd expect to perform poorly based on our other discussion?
