docs(GLOSSARY.md): reintroduce Break Glass #42

Closed
wants to merge 1 commit into from

lloydchang
Contributor

• reintroduce Break Glass from RC1 at open-gitops#21
• prepend with · · · — — — · · · 🆘 (SOS in Morse code), which:
1. makes this idiom/analogy more accessible for a global audience
2. alphabetizes it to the bottom of the glossary

Signed-off-by: lloydchang <[email protected]>

grmhay commented Oct 27, 2021

Lloyd, I'm happy to prepend something so it doesn't become a negative for GitOps adoption. The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.

Contributor Author

lloydchang commented Oct 30, 2021

TL;DR:

• It's fine to append to Break Glass in the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:

❔ Where exactly are the root causes of the problem that @grmhay summarized, and why?

❔ Are we defining source of truth very differently? (please see below)


@grmhay wrote at #37 (comment):

We (Morgan Stanley) believe that the situation where the source of truth for desired state (e.g. github.com or a git-equivalent that an enterprise may run) is less available than your users' expected SLA for making configuration changes is being left by the community as an issue for the implementer to overcome.

Put succinctly, if Github is unavailable and you want to make changes to your System State, there should be one approach and a set of tooling to allow reconciliation after the fact.

This will both harm adoption of gitops and is inefficient as I believe we shared a common challenge that we can solve once within the project.

The first step, as this project has so well established, is a glossary of terms to allow us to describe the problem and a draft principle to add. I have included these in this PR.

Hi @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome

I don't know which GitOps tool @grmhay is using. If @grmhay is using Flux CD v2, then the commands flux suspend source git and flux resume source git are available.

One plausible approach and set of tooling could look like this:

  1. Optionally(?), suspend reconciliation of the GitRepository resource, i.e. flux suspend source git
  2. Run git commit locally
  3. As a temporary measure, run kubectl apply
  4. Once the remote git is available again, run git push from the local repository
  5. Optionally(?), resume the suspended GitRepository, i.e. flux resume source git
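
The five steps above can be sketched as a small script. This is only a dry-run illustration under stated assumptions: SOURCE_NAME and MANIFEST are hypothetical placeholders (not values from this thread), the git add step is added here so the commit has something to record, and by default the script only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the five steps. Commands are only printed unless DRY_RUN=0.
# SOURCE_NAME and MANIFEST are hypothetical placeholders, not values from this thread.
DRY_RUN="${DRY_RUN:-1}"
SOURCE_NAME="${SOURCE_NAME:-my-app}"
MANIFEST="${MANIFEST:-deploy/app.yaml}"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-3: performed while the remote git is unavailable.
break_glass() {
  run flux suspend source git "$SOURCE_NAME"   # 1. optional: pause reconciliation
  run git add "$MANIFEST"                      #    stage the change (added for this sketch)
  run git commit -m "break-glass change"       # 2. record the change locally
  run kubectl apply -f "$MANIFEST"             # 3. temporary direct apply
}

# Steps 4-5: performed once the remote git is reachable again.
restore() {
  run git push origin HEAD                     # 4. publish the local commit
  run flux resume source git "$SOURCE_NAME"    # 5. optional: resume reconciliation
}

break_glass
restore
```

Splitting the sketch into break_glass and restore mirrors the gap in time between step 3 and step 4; whether steps 1 and 5 are needed at all remains an open question, as noted in the list.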

Rather than GitOps itself, the remote git (e.g. github.com or GitHub Enterprise Server and/or Cloud) seems to be one of many(?) root causes of the problem that @grmhay described.

❔ Can the problem be better solved at the remote git implementation level?

❔ Out of curiosity, would @grmhay consider a conversation about Service Level Agreements (SLA), High Availability (HA), Active-active, Geo-replication, etc. with the administrator(s) of @grmhay's GitHub Enterprise on-premises installation, GitHub Enterprise Cloud Support, and/or github.com Support?

Git (originating from git.kernel.org) is designed to be distributed. Out of the box, git works fine locally, e.g. git commit needs no network connection. A connection is needed only for operations such as git push from one local repository to one or many remotes.
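
As a minimal sketch of that point, the script below commits with no network involved and later pushes the same commit to two remotes. It assumes git is installed; the two "remotes" are local bare repositories standing in for hosted services, and the file name, identity, and branch name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: commit offline, then push the same commit to two remotes.
# The "remotes" are local bare repositories standing in for hosted services.
work="$(mktemp -d)"
git init -q --bare "$work/remote-a.git"
git init -q --bare "$work/remote-b.git"
git init -q "$work/local"
cd "$work/local"

# Offline: the commit is recorded in the local repository only.
echo "replicas: 3" > app.yaml
git add app.yaml
git -c user.name="Example" -c user.email="example@example.invalid" \
    commit -q -m "scale app"

# Online: push the current commit to both remotes as branch main.
git remote add a "$work/remote-a.git"
git remote add b "$work/remote-b.git"
git push -q a HEAD:refs/heads/main
git push -q b HEAD:refs/heads/main

# All three repositories now report the same commit hash.
local_sha="$(git rev-parse HEAD)"
sha_a="$(git --git-dir="$work/remote-a.git" rev-parse refs/heads/main)"
sha_b="$(git --git-dir="$work/remote-b.git" rev-parse refs/heads/main)"
echo "local:    $local_sha"
echo "remote a: $sha_a"
echo "remote b: $sha_b"
```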

Remote gits can be consumed as a managed cloud service, or hosted and replicated on-premises, e.g.

• GH: github.com: "We expect that most of these monthly updates will recap periods of time where GitHub was >99% available"

• GHEC: GitHub Enterprise Cloud: "99.95% uptime SLA"

• GHE: Geo-replication on GitHub Enterprise Server

• BDC: Bitbucket Data Center

• WGM: WANDisco Git Multisite

• GL: GitLab active-active git replication
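
To put the quoted uptime figures in perspective, here is a rough conversion from an uptime percentage to the downtime it permits. The 30-day month is an assumption made for this sketch; real agreements may measure availability over a different window and define downtime differently.

```shell
#!/usr/bin/env bash
# Sketch: convert an uptime percentage into permitted downtime, in minutes,
# over an assumed 30-day month (30 * 24 * 60 = 43200 minutes).
downtime_minutes() {  # $1 = uptime percentage, e.g. 99.95
  awk -v up="$1" 'BEGIN { printf "%.1f\n", (100 - up) / 100 * 30 * 24 * 60 }'
}

echo "99%    uptime allows $(downtime_minutes 99) minutes of downtime"
echo "99.95% uptime allows $(downtime_minutes 99.95) minutes of downtime"
```

Under that assumption, 99% corresponds to 432 minutes (about 7.2 hours) per month and 99.95% to about 21.6 minutes, which is the kind of number to compare against users' expected SLA for configuration changes.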

Source of truth:

@grmhay wrote at #42 (comment)

The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.

❔ Perhaps @grmhay's "source of truth" refers to a managed service specifically because the word "unavailability" is used?

• From my perspective, the source of truth does not depend on a managed remote git service or its uptime at all.

• While github.com may be >99% available, or GitHub Enterprise Cloud 99.95% available, that availability is unrelated to the source of truth, which takes the form of a unique SHA hash.

• There is a source of truth because each Git commit is identified by a SHA hash of its content and history: the same commit has the same hash in every Git repository in the Universe, and distinct commits are, for practical purposes, never assigned the same hash.
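
A small sketch of why the hash, rather than the host, identifies a commit: two repositories that have never exchanged data produce the same commit hash when the content, message, identity, and timestamps are identical, and a different hash when the content differs. The file name, identity, and fixed date below are made up for illustration; git is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch: a commit hash is derived from content and metadata, not from the host.
# Identity and timestamps are pinned so that independent repositories can
# reproduce the same commit.
make_commit() {  # $1 = directory, $2 = file content; prints the commit hash
  git init -q "$1"
  (
    cd "$1" || exit 1
    printf '%s\n' "$2" > app.yaml
    git add app.yaml
    GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.invalid" \
    GIT_AUTHOR_DATE="2021-10-30 00:00:00 +0000" \
    GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.invalid" \
    GIT_COMMITTER_DATE="2021-10-30 00:00:00 +0000" \
    git commit -q -m "desired state"
    git rev-parse HEAD
  )
}

tmp="$(mktemp -d)"
sha_one="$(make_commit "$tmp/one" "replicas: 3")"     # first repository
sha_two="$(make_commit "$tmp/two" "replicas: 3")"     # unrelated repository, same input
sha_three="$(make_commit "$tmp/three" "replicas: 4")" # different content

echo "one:   $sha_one"
echo "two:   $sha_two"
echo "three: $sha_three"
```

The first two hashes match and the third differs, so a given hash names the same desired state wherever the repository happens to live.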

I empathize with folks and don't want them to get stuck. For any single point of failure (SPOF), though, folks will still need to think it through on their own, depending on exactly where the root causes of the problem are, and why.

Points of failure can occur in many places: from one central system lacking active-active high availability, to federated identity, to distributed computer networking (BGP hijacking or DNS hijacking).

I don't know if @grmhay's specific setup is air-gapped or not. For what it's worth, an air-gapped use case is described in "How the U.S. Army Software Factory and Enterprise Cloud Management Agency are using Carvel and Cluster API to declaratively manage Kubernetes workloads and clusters in secure air-gapped environments".

GitHub, Git, GitOps are different things:

If one of the many(?) root causes lies with GitHub Enterprise Cloud, GitHub Enterprise on-premises, or github.com, then it is at least two layers removed from GitOps:

  1. GitHub with pull requests
  2. Git with commits
  3. GitOps with principles

Concretely, if a pull request cannot happen because github.com is unavailable, then the root cause lies directly with GitHub. It does not lie with the local gits on end users' computers, nor with the GitOps controllers running in Kubernetes and their own set of gits.

That being said, one of the GitOps tool implementations, Flux CD v2, provides flux suspend source git and flux resume source git. Optionally(?), they may be applicable when GitHub is unavailable.

To recap:

• It's fine to append to Break Glass in the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:

❔ Where exactly are the root causes of the problem that @grmhay summarized, and why?

❔ Are we defining source of truth very differently? (please see above)

Thank you @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome for your time 🙂

@lloydchang lloydchang closed this Dec 11, 2021
Contributor Author

lloydchang commented Dec 15, 2021

The comment above, #42 (comment), relates to:

https://cloud-native.slack.com/archives/C01G9DEE88M/p1639546167386800?thread_ts=1639072194.295800

Sociotechnical Considerations for GitOps:

While Availability is a basic principle of Information Security…

Why do companies centralize Git? ... for a lot of companies, it doesn’t make sense to spend resources on re-engineering Git hosting for higher availability to mitigate issues that will most likely not affect their business.

The trade-offs between High Availability and a centralized version control system (CVCS) are unique to each organization’s nuances.

Is it frugality?

A distributed version control system (DVCS) with auditable code reviews, multi-master replication, and multi-site support exists for organizations willing to spend the resources.

Solutions exist in free and open-source software, and from vendors, for example:

Gerrit Multi-Master Configuration: With multiple Gerrit masters it is possible to mitigate server load by allowing users to access a server which has more free resources, and it is also possible to provide higher availability by allowing service to be transferred to any remaining masters when a master fails.

Gerrit Multi-master and Multi-site: an OpenSource solution:
• In 2018, Qualcomm went live with a Gerrit multi-master setup
• In 2019, GerritHub.io went multi-master and multi-site

and

https://cloud-native.slack.com/archives/C01G9DEE88M/p1639553689387200?thread_ts=1639072194.295800

• Git DVCS communities already have solutions for DVCS replication
• GitOps community doesn’t need to reinvent the wheel at all

The fundamental issues are sociotechnical: when many organizations want both Frugality and to Think Big, the two goals conflict, creating (un-)healthy tension.

These fundamental & sociotechnical issues are far beyond the scope of the GitOps Working Group Charter.

Thank you all 🙂
