HumairAK/SRE

Site Reliability Engineering (SRE) Support

This repository contains all the SRE (Site Reliability Engineering) principles and guidelines for managing the Operate First services.

What is SRE?

SRE is a software engineering approach to managing operations for systems, applications, and services. We use software as a tool to manage systems, solve problems, and automate operations tasks.

Services Monitored

We have Open Data Hub applications deployed and running in a MOC (Mass Open Cloud) cluster. Open Data Hub is an end-to-end AI/ML platform built on OpenShift Container Platform that provides various tools for data scientists and engineers.

The components we currently have available are:

SLI and SLOs

For each Operate First managed service, we define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to help us monitor the service's availability and improve its reliability.

The sli-slo folder lists the SLIs and SLOs for each application in separate markdown files. For example, you can find the SLI/SLO defined for the JupyterHub application here.
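As a rough illustration of how an availability SLI relates to an SLO, the sketch below computes the SLI from request counts and reports how much of the error budget is left. The 99.5% target, the request counts, and the names `AvailabilityWindow` and `error_budget_remaining` are assumptions made up for this example; they are not taken from the sli-slo folder.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityWindow:
    """Request counts observed over one SLO window (hypothetical numbers)."""

    total_requests: int
    failed_requests: int

    def sli(self) -> float:
        """Availability SLI: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # no traffic in the window: treat as fully available
        return 1 - self.failed_requests / self.total_requests


def error_budget_remaining(window: AvailabilityWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # No failures are allowed: the budget is intact only if nothing failed.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


# Hypothetical window: 100,000 requests, 200 of which failed.
window = AvailabilityWindow(total_requests=100_000, failed_requests=200)
SLO_TARGET = 0.995  # assumed target, for illustration only

print(f"SLI: {window.sli():.4f}")
print(f"SLO met: {window.sli() >= SLO_TARGET}")
print(f"Error budget remaining: {error_budget_remaining(window, SLO_TARGET):.0%}")
```

In practice the counts would come from Prometheus metrics rather than hard-coded values; the arithmetic stays the same.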

Operational Tools

  • Prometheus - Monitoring tool capturing time series metrics for available Operate First services
  • Prometheus Alertmanager - Handles alerts and routes them to the appropriate receiver integration, such as email, Slack, or PagerDuty
  • Grafana - Visualization tool for creating monitoring dashboards
  • GitHub Receiver - Incident reporting tool for handling outages/incidents

Incident Management

All of the Operate First services are monitored by Prometheus. We use Prometheus metrics to identify and define alerts on possible outages/incidents.

The GitHub Alertmanager Receiver is our chosen tool for handling and reporting outages/incidents to our users.

The incident-management folder contains more information on how to set up the GitHub Alertmanager receiver and configure Prometheus alerts.
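To make the alert-to-incident flow more concrete, here is a minimal sketch of how a receiver could turn an Alertmanager webhook payload into an issue title and body. It is not the GitHub Alertmanager Receiver's actual implementation. The payload follows the general shape of Alertmanager's webhook JSON, while the alert name `JupyterHubDown`, the label values, and the helper names `firing_alerts` and `issue_from_alert` are hypothetical.

```python
import json

# Hypothetical payload in the general shape of an Alertmanager webhook request.
SAMPLE_PAYLOAD = json.loads("""
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "JupyterHubDown", "severity": "critical"},
      "annotations": {"summary": "JupyterHub is not responding"},
      "startsAt": "2021-02-03T10:00:00Z"
    },
    {
      "status": "resolved",
      "labels": {"alertname": "GrafanaDown", "severity": "warning"},
      "annotations": {"summary": "Grafana was unreachable"},
      "startsAt": "2021-02-03T09:00:00Z"
    }
  ]
}
""")


def firing_alerts(payload: dict) -> list:
    """Keep only the alerts that are still firing."""
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def issue_from_alert(alert: dict) -> dict:
    """Build an issue title and body from one alert (illustrative format)."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    title = f"[{labels.get('severity', 'unknown')}] {labels.get('alertname', 'UnnamedAlert')}"
    body = "\n".join(
        [
            f"Summary: {annotations.get('summary', 'n/a')}",
            f"Started at: {alert.get('startsAt', 'n/a')}",
        ]
    )
    return {"title": title, "body": body}


issues = [issue_from_alert(a) for a in firing_alerts(SAMPLE_PAYLOAD)]
for issue in issues:
    print(issue["title"])
    print(issue["body"])
```

A real receiver would also need to create the issue through the GitHub API and avoid opening duplicates for the same alert; those steps are left out here.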

Runbooks

For each monitored service, we aim to define runbooks that explain how to administer, debug, and troubleshoot common problems with that service.

These runbooks, once created, will be kept in the runbooks folder.
