Site Reliability Engineering (SRE) Support

This repository contains all the SRE (Site Reliability Engineering) principles and guidelines for managing the Operate First services.

What is SRE?

SRE is a software engineering approach to managing operations for systems, applications, and services. We use software as a tool to manage systems, solve problems, and automate operations tasks.

Services Monitored

Open Data Hub applications are deployed and running in a MOC (Mass Open Cloud) cluster. Open Data Hub is an end-to-end AI/ML platform built on top of OpenShift Container Platform that provides various tools for data scientists and engineers.

The components we currently have available are:

JupyterHub

SLIs and SLOs

For each service managed by Operate First, we define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to help us monitor the service's availability and improve its reliability.

The sli-slo folder lists the SLIs and SLOs for each application in a separate markdown file. For example, the SLIs and SLOs defined for the JupyterHub application are in that folder.
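
To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch of a request-based availability check. It is an illustration only: the function name, the request counts, and the 99.9% target are assumptions made for the example, not values taken from the sli-slo folder.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    sli: float                     # fraction of successful requests (0.0 to 1.0)
    slo_met: bool                  # True when the SLI is at or above the target
    error_budget_remaining: float  # fraction of the allowed failures still unused


def availability_report(good: int, total: int, slo_target: float) -> AvailabilityReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good / total
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    remaining = 1.0 - actual_failures / allowed_failures
    return AvailabilityReport(sli, sli >= slo_target, remaining)


# Hypothetical numbers: 99,950 successful requests out of 100,000, 99.9% target.
report = availability_report(good=99_950, total=100_000, slo_target=0.999)
print(report)
```

A negative `error_budget_remaining` means the service failed more often than the SLO allows for the measured window.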

Operational Tools

Prometheus - Monitoring tool capturing time series metrics for available Operate First services
Prometheus Alertmanager - Handles alerts and routes them to the appropriate receiver integration, such as email, Slack, or PagerDuty
Grafana - Visualization tool for creating monitoring dashboards
GitHub Receiver - Incident reporting tool for handling outages/incidents
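
As a rough illustration of how the metrics Prometheus collects can be read programmatically, the sketch below builds a URL for the Prometheus instant-query endpoint (`/api/v1/query`) and extracts the targets that are down from a response. The base URL, the `jupyterhub` job label, the instance addresses, and the sample response are all hypothetical; only the endpoint path and the response shape follow the Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body: str) -> list[str]:
    """Return the `instance` label of every series whose sample value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value is a string
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


# Hand-written sample response for the query up{job="jupyterhub"}.
SAMPLE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "job": "jupyterhub", "instance": "10.0.0.1:8081"},
             "value": [1612310400.0, "1"]},
            {"metric": {"__name__": "up", "job": "jupyterhub", "instance": "10.0.0.2:8081"},
             "value": [1612310400.0, "0"]}]}}
"""

print(instant_query_url("http://prometheus.example:9090", 'up{job="jupyterhub"}'))
print(down_targets(SAMPLE))
```

The parsing is kept separate from any HTTP call so that it can be exercised against a saved response without a running Prometheus server.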

Incident Management

All of the Operate First services are monitored by Prometheus. We use Prometheus metrics to identify and define alerts on possible outages/incidents.

The GitHub Alertmanager Receiver is our chosen tool for handling and reporting outages/incidents to our users.

The incident-management folder contains more information on how to set up the GitHub Alertmanager receiver and configure Prometheus alerts.
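
The flow from alert to incident report can be sketched as follows. Alertmanager delivers grouped alerts to webhook receivers as a JSON payload, and a receiver can turn that payload into an issue title and body. This Python sketch is not the GitHub receiver's actual implementation: the function name, the output format, and the `JupyterHubDown` alert are assumptions made for illustration. Only the payload fields (`status`, `commonLabels`, `alerts`, `annotations`, `startsAt`) follow the Alertmanager webhook format.

```python
def issue_from_webhook(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a title/body pair for an issue."""
    labels = payload.get("commonLabels", {})
    alertname = labels.get("alertname", "UnknownAlert")
    status = payload.get("status", "firing")
    title = f"[{status.upper()}] {alertname}"

    # One bullet per alert in the group, with its start time and summary.
    lines = []
    for alert in payload.get("alerts", []):
        summary = alert.get("annotations", {}).get("summary", "(no summary)")
        started = alert.get("startsAt", "?")
        lines.append(f"- {alert.get('status', status)} since {started}: {summary}")
    return {"title": title, "body": "\n".join(lines)}


# Hypothetical payload for an alert named JupyterHubDown.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "commonLabels": {"alertname": "JupyterHubDown", "severity": "critical"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "JupyterHubDown", "severity": "critical"},
            "annotations": {"summary": "JupyterHub has been unreachable for 5 minutes"},
            "startsAt": "2021-02-03T10:00:00Z",
        }
    ],
}

issue = issue_from_webhook(SAMPLE_PAYLOAD)
print(issue["title"])
print(issue["body"])
```

Keeping the formatting in a pure function makes it easy to check against saved payloads before wiring it to a real issue tracker.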

Runbooks

For each monitored service, we aim to define runbooks that explain how to administer, debug, and troubleshoot common problems with that service.

These runbooks, once created, will live in the runbooks folder.
