This repository contains all the SRE (Site Reliability Engineering) principles and guidelines for managing the Operate First services.
SRE is a software engineering approach to managing operations for systems, applications, and services. We use software as a tool to manage systems, solve problems, and automate operations tasks.
We have Open Data Hub applications deployed and running in an MOC (Mass Open Cloud) cluster. Open Data Hub is an end-to-end AI/ML platform built on top of OpenShift Container Platform that provides various tools for data scientists and engineers.
The components we currently have available are:
For each of the Operate First managed services, we define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to help us monitor the availability of the service and improve its reliability.
The sli-slo folder lists the SLI/SLO for each application in a separate Markdown file. For example, you can find the SLI/SLO defined for the JupyterHub application here.
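
As a rough illustration of how an availability SLI relates to an SLO, the following Python sketch computes the SLI as the ratio of good events to total events and reports how much of the error budget is left. The function name, the `SloReport` type, and the numbers in the usage note are hypothetical and are not taken from the files in the sli-slo folder.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                     # measured fraction of good events (0.0 to 1.0)
    slo_target: float              # objective the SLI is compared against
    error_budget_remaining: float  # fraction of allowed failures still unused
    slo_met: bool                  # True when the SLI is at or above the target


def evaluate_availability(good_events: int, total_events: int,
                          slo_target: float) -> SloReport:
    """Evaluate an availability SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")

    sli = good_events / total_events
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    # Goes negative once more failures occurred than the budget allows.
    remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli, slo_target, remaining, sli >= slo_target)
```

For instance, `evaluate_availability(99950, 100000, 0.999)` describes a service that failed 50 of 100,000 requests against a 99.9% objective: the SLO is met and roughly half of the error budget remains.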
- Prometheus - Monitoring tool capturing time series metrics for available Operate First services
- Prometheus Alertmanager - Handles alerts and routes them to the appropriate receiver integration, such as email, Slack, or PagerDuty
- Grafana - Visualization tool for creating monitoring dashboards
- GitHub Receiver - Incident reporting tool for handling outages/incidents
All of the Operate First services are monitored by Prometheus. We use Prometheus metrics to identify and define alerts on possible outages/incidents.
The GitHub Alertmanager Receiver is our chosen tool for handling and reporting outages/incidents to our users.
The incident-management folder contains more information on how to set up the GitHub Alertmanager Receiver and configure Prometheus alerts.
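
To give a sense of how Prometheus metrics can point to a possible outage, here is a minimal Python sketch that inspects the JSON body returned by an instant query for the built-in `up` metric (Prometheus HTTP API, `/api/v1/query`) and lists the targets reporting `0`. The helper name and the job and instance labels in the sample are hypothetical; the alert rules actually used for Operate First are the ones described in the incident-management folder, not this code.

```python
import json
from typing import List


def find_down_targets(response_body: str) -> List[str]:
    """Return "job/instance" names whose `up` sample is 0 (hypothetical helper)."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    down = []
    # An instant query returns a vector: one sample per scraped target.
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0.0:
            labels = sample["metric"]
            job = labels.get("job", "unknown")
            instance = labels.get("instance", "unknown")
            down.append(f"{job}/{instance}")
    return sorted(down)


# Sample response with made-up targets, shaped like an instant-query result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "jupyterhub",
                        "instance": "10.0.0.1:8081"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "grafana",
                        "instance": "10.0.0.2:3000"},
             "value": [1700000000.0, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(find_down_targets(SAMPLE_RESPONSE))
```

In a real setup this kind of condition would normally be written as a Prometheus alerting rule and routed through Alertmanager rather than polled from a script; the sketch only shows the shape of the data involved.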
For each of the services being monitored, we aim to define runbooks that explain how to administer, debug, and troubleshoot common problems with the service.
These runbooks, once created, will be kept in the runbooks folder.