Discussion on closed-loop and control theory #24

Closed
todaywasawesome opened this issue Oct 1, 2021 · 7 comments · Fixed by #31

@todaywasawesome
Member

todaywasawesome commented Oct 1, 2021

How you can help now

  1. Please read Dan’s good issue summary (original issue below), which came out of our last group meeting. The gist of the question is: do we think the originally included “closed loop” principle should be added back before the planned v1.0.0 release milestone (scheduled for EOW, no later than Monday Oct 11)?

  2. We are now asking YOU – WG members & maintainers – to 👍 or 👎 this issue by Friday Oct 8th. (ideally, also comment a short reason).

    Note this is not a decision for all time, but just for this first full release. We can continue to discuss for possible inclusion after v1.0.0 if there is too much divergence in opinion. It was left out of RC 1 (see “items left out of this PR”) and not yet included in RC 2 because there was not enough group response so far.


Original issue

This came up as part of an ongoing discussion around #22 with @squaremo @lloydchang @scottrigby @murillodigital.

#22 (comment)

When we did #21 we removed the closed-loop because we couldn't really articulate why it was there. Some of the discussions referenced above brought back some of the ideas of why we had included closed-loop in the first place.

I'm not very familiar with control theory, so the use of the word "feedback" really threw me off. It sounded as though something happening to your actual state would then somehow inform your desired state. After going through it with @scottrigby, I understand that control theory uses "feedback" to mean that what is actually occurring in the system is recognized and taken into account.

On some level, I think this is encapsulated in two principles "declarative" + "continuously reconciled". With those two ideas, I think you can probably get to closed-loop. For example, if you were to create a progressive delivery plan that involved checking metrics and then making a decision to rollback or move ahead I would view that as part of GitOps as long as it is declaratively expressed and continuously reconciled. The reconciliation implies the idea of closed-loop feedback.
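To make the progressive delivery example above concrete, here is a minimal Python sketch. It assumes a hypothetical `RolloutPlan` (the declared traffic steps and error-rate limit are invented for illustration) and a `reconcile` function that is called repeatedly with observed metrics. It is not taken from any GitOps tool; it only shows how a declared plan plus repeated reconciliation already contains a feedback step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPlan:
    """Hypothetical declarative plan: traffic steps plus an error-rate limit."""
    steps: tuple = (10, 50, 100)   # percent of traffic on the new version
    max_error_rate: float = 0.05


def reconcile(plan: RolloutPlan, current_weight: int, error_rate: float) -> int:
    """One reconciliation pass: return the traffic weight to apply next."""
    if error_rate > plan.max_error_rate:
        return 0  # observed metrics violate the plan: roll back
    for step in plan.steps:
        if step > current_weight:
            return step  # healthy: advance to the next declared step
    return current_weight  # already at the final step: nothing to do


plan = RolloutPlan()
weight = 0
for observed_error_rate in (0.01, 0.02, 0.01):
    weight = reconcile(plan, weight, observed_error_rate)
print(weight)  # 100
```

The decision to roll back or move ahead depends only on the declared plan and the observed state, which is the sense in which reconciliation implies closed-loop feedback.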

However, as it has come up over and over again and @scottrigby has pointed out, it may be worthwhile making that idea clearly explicit by making it a principle.

For #22 I think we're time boxing to move toward GitOps V1 but if we:

  1. Believe that bringing closed-loop back adds important value, and
  2. Can articulate that value clearly

then we may move ahead to include it in v1, so long as it does not introduce a delay.

In other words: if this is important to you, please fight for it here and give your ideas for how to communicate it.

@scottrigby scottrigby mentioned this issue Oct 2, 2021
3 tasks
@scottrigby scottrigby added this to the v1.0.0 milestone Oct 2, 2021
@lloydchang
Contributor

lloydchang commented Oct 2, 2021

Hi @todaywasawesome @scottrigby

I view Continuous Delivery (CD) as "weak" GitOps, and Continuous Operations as "strong" GitOps. My understanding of our agenda here is whether to require "strong" GitOps by adding a principle about a closed loop control system.


From @todaywasawesome

On some level, I think this is encapsulated in two principles "declarative" + "continuously reconciled". With those two ideas, I think you can probably get to closed-loop.

However, as it has come up over and over again and @scottrigby has pointed out, it may be worthwhile making that idea clearly explicit by making it a principle.


From @scottrigby

  1. Operated in a closed loop
    Software agents observe desired state and meta-state to take actions based on policy when the desired state cannot be reached. This may include things like notifications, rollbacks, etc.

#21 (comment)
and
#22 (comment)


From @cdavisafc

Kubernetes itself doesn't address continuous operations - CO is sold separately. GitOps has emerged as the way to continuously operate your Kubernetes ecosystem and the applications running thereon.

https://www.linkedin.com/pulse/gitops-fan-here-cornelia-davis


@joebowbeer's description about "weak" GitOps versus "strong" GitOps captures the essence of this topic:

From @joebowbeer via aws-samples/eks-workshop#1162 (comment)

@briancaffey regarding CDK's suitability for GitOps, I think the assessment depends on your definition of GitOps.

In "weak" GitOps, which is solely "operations by pull request", CDK seems fairly suitable.

I think CDK is a more difficult fit in "strong" GitOps, which, in addition to operations by pull request also leverages k8s operators/controllers such as Flux's GitOps operator and Amazon's controllers (ACK). The GitOps operators reconcile the cluster state with the desired state stored in git. The desired state is periodically synced to an in-cluster clone of the repo.


Concrete examples of "weak" GitOps are Operations by Pull Request via Amazon ECS; there are examples by @rizblie at

Amazon ECS: AWS CloudFormation, AWS CodePipeline, Container Registry (Amazon ECR), Git (AWS CodeCommit)

https://cicd-for-ecs.workshop.aws/en/5-advanced/lab4-gitops.html

and

Amazon ECS: Hashicorp Terraform, AWS CodeBuild, Container Registry (Amazon ECR), Git (AWS CodeCommit)

https://cicd-for-ecs.workshop.aws/en/6-other/lab-terraform.html


Concrete examples of "strong" GitOps are Integrated Policy Enforcement via Amazon EKS, on which @mikestef9 is looking for input at

Amazon EKS: AWS Containers Roadmap: [EKS] [request]: Integrated Policy Enforcement

aws/containers-roadmap#1435


In the above examples, a single vendor, Amazon Web Services (AWS), offers "weak" GitOps via Amazon ECS, and "strong" GitOps via Amazon EKS.


Concrete examples from different vendors:


"strong" GitOps via Microsoft Azure, by @v-thepet, @EdPrice-MSFT, @v-kents, @alexhart11

GitOps for Azure Kubernetes Service: Azure Kubernetes Service (AKS), GitHub, Flux, Open Policy Agent (OPA) Gatekeeper, Syncier Security Tower

This solution follows a strong GitOps approach.

https://docs.microsoft.com/en-us/azure/architecture/example-scenario/gitops-aks/gitops-blueprint-aks

https://github.com/MicrosoftDocs/architecture-center/blob/master/docs/example-scenario/gitops-aks/gitops-blueprint-aks-content.md


"strong" GitOps via Google Cloud, by @crcsmnky and @ggalloro

GitOps Con 2021: Shifting Policy Enforcement to the Left using GitOps: Open Policy Agent (OPA) Gatekeeper by @crcsmnky

https://www.youtube.com/watch?v=XvQZ3ZDjRls

GitOps Days 2021: Using Source Code Management Patterns to Configure & Secure Kubernetes Clusters: ACM Policy Controller, based on Open Policy Agent (OPA) Gatekeeper by @ggalloro

https://www.youtube.com/watch?v=u2rmx-2MwNA


"strong" GitOps via @redhat-cop Red Hat Community of Practice

Automate Your Security Practices and Policies on OpenShift With Open Policy Agent by @garethahealy, @wmcdonald404, @noelo, @monodot

This blog post aims to explain Open Policy Agent (OPA) basics and how the Red Hat Containers Community of Practice (CoP) has started to implement a collection of policies using the toolset.

As a member of the Red Hat UK&I Consulting team, I work with customers who are in the process of onboarding their applications onto OpenShift Container Platform (OCP). One type of question customers typically ask is: "How do I stop an application team from deploying images with the latest tag or from using requests and limits that are disruptive to the platform?"

Previously, I would have suggested building a process around their continuous integration/continuous delivery (CI/CD) pipelines to validate the Kubernetes resources and, based on company policy, allow or deny the release. Although this works in most situations, it has one major flaw. It is not natively built into or on top of Kubernetes, which allows teams to bypass policies if they are not mandated or manually change the released resources via oc or the web console. This type of implementation always has aspects of "security through obscurity," which is doomed to fail.

So what do I think the answer could be? OPA!

One quick note before we begin: Open Policy Agent is an open source project. It is not Red Hat sponsored, nor is it supported under a Red Hat subscription.

https://cloud.redhat.com/blog/automate-your-security-practices-and-policies-on-openshift-with-open-policy-agent
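As a rough illustration of the kind of rule the quoted post describes (no `latest` tag, and resource requests and limits always set), here is a plain Python sketch. It is not Rego and does not use the OPA or Gatekeeper APIs; the function name `violations` and the messages are assumptions made for this example. The input follows the usual shape of a Kubernetes pod spec (`containers[].image`, `containers[].resources`).

```python
def violations(pod_spec: dict) -> list:
    """Return human-readable policy violations for a pod spec dictionary."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Look only at the last path segment so a registry port is not
        # mistaken for a tag; an image with no tag defaults to "latest".
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.rsplit(":", 1)[1] if ":" in last_segment else "latest"
        if tag == "latest":
            problems.append(f"{name}: image must not use the latest tag")
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                problems.append(f"{name}: missing resource {kind}")
    return problems


bad = {"containers": [{"name": "web", "image": "registry.example.com:5000/app"}]}
good = {"containers": [{
    "name": "web",
    "image": "registry.example.com:5000/app:1.2.0",
    "resources": {"requests": {"cpu": "100m"}, "limits": {"cpu": "200m"}},
}]}
print(violations(bad))
print(violations(good))  # []
```

The point made in the quote is about where such a check runs: enforced inside the cluster at admission time, it cannot be bypassed the way a pipeline-only check can.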


As for less-vendor-specific examples:


"weak" GitOps:

• Spinnaker when used without Kubernetes at all

• Spinnaker: Git (GitHub)

https://spinnaker.io/docs/guides/user/pipeline/triggers/github/

• Spinnaker: Application Deployment

https://spinnaker.io/docs/concepts/#application-deployment


• Spinnaker supports application deployments without Kubernetes, and with Kubernetes


"strong" GitOps:

• Spinnaker when used with Kubernetes

https://spinnaker.io/docs/setup/install/providers/kubernetes-v2/

• Argo CD which requires Kubernetes

https://argo-cd.readthedocs.io/en/stable/operator-manual/architecture/#application-controller

• Flux CD which requires Kubernetes

https://fluxcd.io/docs/components/

• Jenkins X which requires Kubernetes

https://jenkins-x.io/v3/develop/reference/jx/gitops/

Open Policy Agent (OPA) Gatekeeper - Policy Controller for Kubernetes

https://github.com/open-policy-agent/gatekeeper

When K8s is used, there is implicitly "strong" GitOps because K8s is a closed loop control system, explained at https://kubernetes.io/docs/concepts/architecture/controller/

In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.

... similar to the definition from https://en.wikipedia.org/wiki/Control_theory#Open-loop_and_closed-loop_(feedback)_control

The definition of a closed loop control system according to the British Standard Institution is "a control system possessing monitoring feedback, the deviation signal formed as a result of this feedback being used to control the action of a final control element in such a way as to tend to reduce the deviation to zero."
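The quoted definition can be illustrated with a small Python sketch. It assumes a toy replica count and a hypothetical `control_step` function; it is not how any Kubernetes controller is implemented, only a picture of "the deviation signal being used to reduce the deviation to zero".

```python
def control_step(desired: int, observed: int, max_change: int = 2) -> int:
    """Return the correction to apply, derived from the deviation signal."""
    deviation = desired - observed
    # Clamp the correction so each pass makes a bounded change.
    return max(-max_change, min(max_change, deviation))


desired_replicas = 5
replicas = 0
history = []
for tick in range(6):
    if tick == 4:
        replicas -= 1  # disturbance: one replica crashes
    replicas += control_step(desired_replicas, replicas)
    history.append(replicas)
print(history)  # [2, 4, 5, 5, 5, 5]
```

The crash at tick 4 never shows up in the recorded history because the loop observes the deviation and corrects it in the same pass. An open-loop system that applied the configuration once would stay at 4 replicas.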


The above relate to my note at #22 (comment)

@moshloop
Contributor

moshloop commented Oct 3, 2021

For me feedback in a GitOps system is from the perspective of the desired state, i.e. If I look at the desired state in Git can I see if that state has been applied before I make a decision as to whether I can safely make a change to that desired state?

If you do not have this feedback, you cannot determine if you are making changes based on an actual state or a desired state that has not yet been realized. Deploying either https://fluxcd.io/docs/components/notification/ or https://github.com/Azure/gitops-connector would close the loop on a standard flux implementation.

PID loops (closed loops that use metrics to make decisions) are orthogonal to GitOps based systems - they occur at an abstraction layer 1 level above (e.g. automatically committing the results of the decision to git) or 1 level below (as a controller that uses a policy defined in the desired state).

In either case, I don't think this can be a principle, as it would raise the barrier to entry significantly. We should, however, add closed loops as a best practice and possibly create a guide on implementing PID loops in a GitOps-compatible way.
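For readers unfamiliar with the term, a PID loop computes its correction from the current error (proportional), the accumulated error (integral), and the rate of change of the error (derivative). The sketch below is a textbook discrete form in Python. The gains, the setpoint, and the toy "plant" that simply adds the controller output to the measured value are arbitrary assumptions chosen for illustration, not a recommendation for any real system.

```python
class PID:
    """Textbook discrete PID controller; the gains used below are arbitrary."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement: float, dt: float = 1.0) -> float:
        error = self.setpoint - measurement
        self.integral += error * dt                   # accumulated error
        derivative = (error - self.prev_error) / dt   # rate of change of error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(kp=0.5, ki=0.1, kd=0.05, setpoint=100.0)
value = 0.0
for _ in range(50):
    value += pid.update(value)  # toy plant: the output is applied directly
print(round(value, 2))
```

In the terms used above, such a controller could sit one level above GitOps (committing its decisions to Git) or one level below (running as a controller whose gains and setpoint are declared in the desired state).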

@lloydchang
Contributor

Thank you, @moshloop, for clarifying PID loops at different abstraction layers.

If "Operated in a closed loop" isn't a principle, then I agree with:

We should, however, add closed loops as a best practice and possibly create a guide on implementing PID loops in a GitOps-compatible way.

While this discussion about adding a principle on a closed-loop control system remains open:

... I respectfully disagree with the assertion, "it would raise the barrier to entry significantly", because the assertion seems to consider principles as prescriptive and exclusionary.

In practice, principles are open for interpretation.

Some people view principles holistically, while other people view principles incrementally as maturity levels (from 1 to n, or working backwards from n to 1). Principles are something to strive for, and pragmatic practitioners would admit that some principles won't be accomplished during initial attempts at implementation; there is usually room for improvement.

Regardless of the content within principles... The marketing and branding of GitOps or GitOps-like would be used, either with or without GitOps certification, because GitOps is usually a means, and not an end.

To use an analogy:
• Like UNIX, GitOps is usually a means, and not an end
• People are willing to use UNIX-like systems such as Linux, and UNIX-certified systems such as macOS, as a means, and not an end
• Similarly, people are willing to use either GitOps-like or GitOps-certified systems, as a means, and not an end

Our agenda is about making the idea, "Operated in a closed loop", clearly explicit by making it a principle.


@moshloop @todaywasawesome @scottrigby Our GitOps Working Group may be interested in @colmmacc's talk, PID Loops and the Art of Keeping Systems Stable: Control Theory: Where the fruit is hanging so low IT IS TOUCHING THE GROUND
• Video & Transcript https://www.infoq.com/presentations/pid-loops/
• Video https://www.youtube.com/watch?v=3AxSwCC7I4s
• Slides https://www.slideshare.net/InfoQ/pid-loops-and-the-art-of-keeping-systems-stable

In particular, starting at 22:13 until 28:07

I can't count the number of customers I've gone through control systems with and they told me, "We have this system that pushes out some states, some configuration and sometimes it doesn't do it." They don't really know why, but they have built this other button that they press that basically starts everything all over, and it gets there the next time, and in some cases, they even have their support personnel at the end of a phone line. That's what they do, they get a complaint from one of their customers saying, “I took an action, I set a setting and it didn't happen.” and they have this magic button, they press it and it syncs all the config out again and it's fixed.

I find that scary, because what it's saying is nothing's actually monitoring the system. Nothing's really checking that everything is as it should be. Already every day they're getting this creep from what the state should be, and if they ever had a really big problem, like a big shock in the system, it clearly wouldn't be able to self-repair healthily, which is not what you want. Another common reason for Open Loops is when actions are just infrequent. If it's an action that you're not taking really often, odds are it's just an Open Loop. It's relying on people to fix things, not the system itself.

[...]

The magic to fixing these Open Loops is to really think about measuring first, like I said with that earliest example about just taking feedback and integrating that, but approaching systems design as, “I'm not going to write a script or a control system that just does X, then Y, then Z.” Instead, I'm going to approach it as, “I'm going to describe my desired state, so it's a bit more declarative, and then I'm going to write a system that drives everything to that desired state.” Those are very different shapes of systems. You'll just write your code very differently when you've got that mental model.

In my experience, that model is far better, because it is a closed-loop from day one, far better because they tend to be more succinct ways to just describe these systems, far better because it can also be dual purpose. Often your provisioning system can be the same as a control system.
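The "describe my desired state, then drive everything to it" shape from the talk can be sketched in a few lines of Python. The `diff` and `apply` helpers and the configuration keys are hypothetical; the sketch also assumes `None` is never a legitimate configuration value, since it is used here as the delete marker.

```python
def diff(desired: dict, actual: dict) -> dict:
    """Return the changes needed to make `actual` match `desired`."""
    changes = {}
    for key, value in desired.items():
        if actual.get(key) != value:
            changes[key] = value
    for key in actual.keys() - desired.keys():
        changes[key] = None  # None marks a key that should be removed
    return changes


def apply(actual: dict, changes: dict) -> None:
    """Apply the computed changes to the actual state in place."""
    for key, value in changes.items():
        if value is None:
            actual.pop(key, None)
        else:
            actual[key] = value


desired = {"replicas": 3, "image": "app:1.2.0"}
actual = {"replicas": 3, "image": "app:1.1.0", "debug": True}  # drifted state
apply(actual, diff(desired, actual))
print(diff(desired, actual))  # {}
```

Because the code measures first and acts only on the difference, running it again is harmless, and the same code serves for both provisioning and ongoing control, which is the dual purpose the talk mentions.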

@christianh814
Member

My instinct is to include it in the principles. I don't think it's exclusionary and it solidifies it as an "actual thing".

@scottrigby
Member

@christianh814 I personally lean that way too. I believe this also helps explain how declarative configurations for things like notification on divergence, rollbacks, and even intelligent agent responses like progressive delivery, can fit into GitOps principles when done properly.

@moshloop Thanks. Yes, IMO raising the barrier to entry is a very important concern for principles. But I think that, if addressed properly, this does not need to raise the barrier; rather, it helps differentiate GitOps from systems where the software agents take zero feedback from the system (including a count of previous reconciliation attempts) into account.

🙏 Also requesting response from @chrispat @csand-msft @todaywasawesome @jlbutler @murillodigital, all other WG members, and anyone else who wants to weigh in on this today or tomorrow, before the v1.0.0 release is scheduled. Thanks!

OK, here was the wording we were debating for RC 1 (see “items left out of this PR”) but did not yet include in that or RC 2 due to not enough group response by that point. How do you think this reads?:

The desired state of a GitOps managed system must be:
...
5. Operated in a closed loop
Software agents observe desired state and meta-state to take actions based on policy when the desired state cannot be reached. This may include things like notifications, rollbacks, etc.
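One possible reading of the proposed wording, sketched in Python. It assumes the "meta-state" is just a count of failed reconciliation attempts (the example mentioned earlier in this comment) and that the policy is a pair of thresholds. The names `MetaState`, `Policy`, and `next_action` are invented for this sketch and do not come from any existing tool.

```python
from dataclasses import dataclass


@dataclass
class MetaState:
    failed_attempts: int = 0  # count of previous failed reconciliations


@dataclass(frozen=True)
class Policy:
    notify_after: int = 2     # hypothetical thresholds, declared with the
    rollback_after: int = 3   # desired state


def next_action(converged: bool, meta: MetaState, policy: Policy) -> str:
    """Choose an action from the observed result, the meta-state, and policy."""
    if converged:
        meta.failed_attempts = 0
        return "none"
    meta.failed_attempts += 1
    if meta.failed_attempts >= policy.rollback_after:
        return "rollback"
    if meta.failed_attempts >= policy.notify_after:
        return "notify"
    return "retry"


meta = MetaState()
policy = Policy()
actions = [next_action(False, meta, policy) for _ in range(3)]
print(actions)  # ['retry', 'notify', 'rollback']
```

An agent with no such feedback would return "retry" forever, which is the kind of system this wording is meant to rule out.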

@moshloop
Contributor

moshloop commented Oct 7, 2021

Even if this is made a principle it would still need to distinguish between the 2 different loops in operation.

  1. The reconciliation loop between the actual state and the agent applying changes (of which I agree that it can be included as a principle)
  2. The loop between the agent and the desired state

i.e. if the principle includes both loops, then a Flux deployment without the notification controller would not be "GitOps compliant", and that would be pretty confusing.
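The distinction between the two loops can be sketched as follows. This is a toy Python model with hypothetical `Repo` and `Cluster` types and invented revision identifiers, not a description of how Flux or its notification controller works.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    revision: str                                  # current desired state
    statuses: dict = field(default_factory=dict)   # revision -> reported status


@dataclass
class Cluster:
    applied_revision: str = ""


def reconcile(repo: Repo, cluster: Cluster) -> None:
    """Loop 1: the agent drives the actual state toward the desired state."""
    cluster.applied_revision = repo.revision


def report(repo: Repo, cluster: Cluster) -> None:
    """Loop 2: the agent writes the outcome back next to the desired state."""
    repo.statuses[cluster.applied_revision] = "applied"


def safe_to_change(repo: Repo) -> bool:
    """Can someone looking only at the repo see that its state was applied?"""
    return repo.statuses.get(repo.revision) == "applied"


repo, cluster = Repo(revision="abc123"), Cluster()
reconcile(repo, cluster)
print(safe_to_change(repo))  # False: loop 1 ran, but nothing reported back
report(repo, cluster)
print(safe_to_change(repo))  # True: loop 2 closed
```

With only `reconcile`, the cluster converges but the repository gives no indication of it, which matches the Flux-without-notifications case described above.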

@todaywasawesome
Member Author

@scottrigby ah, sorry, I was working on a PR and didn't see your comment. @moshloop check out the PR and see if that resolves your concern. I'm not completely clear on what you mean. #31

5 participants