This repository implements an AI-powered alerting system that uses a Hugging Face BERT model to classify and prioritize log alerts based on severity, specifically notifying only for critical alerts. The system integrates with Prometheus for metrics collection and Grafana for visualization and alerting, and is built with Python for log processing.
- Introduction
- Features
- Prerequisites
- Project Structure
- Installation
- Docker-Related Files
- Configuration
- Usage
- Testing and Alerts
- Prometheus and Grafana Setup
- Demo
- Additional Improvements
- Roadmap: Next Steps for Improvements
This project demonstrates how to classify log events using Hugging Face's BERT model to filter critical log messages and trigger alerts only when critical issues arise. Prometheus is used to scrape the log metrics, and Grafana is used for visualization and alert notifications. This approach reduces noise by ensuring that only critical logs are flagged and alerted.
- AI-Based Log Classification: Uses machine learning to classify log messages based on severity.
- Critical Alerts: Alerts are triggered only for critical logs, reducing noise and improving response time.
- Prometheus & Grafana Integration: Real-time metrics collection and visualization.
- Production-Ready Deployment: Uses Gunicorn to run the Flask app in a production environment.
- Kubernetes Support: Kubernetes manifests for deploying the system in a scalable environment.
- Lazy Loading: The system optimizes resource usage with lazy loading of machine learning models.
Before starting, make sure you have the following tools installed:
- Python 3.8+: The application is built using Python.
- Prometheus: For metrics collection. Prometheus will scrape metrics from the Python app.
- Grafana: For data visualization and alerting. Grafana is used to monitor log metrics from Prometheus.
- Gunicorn: For running the Python app in a production environment. It replaces the Flask development server.
- Docker (Optional but recommended): Simplifies the setup for Prometheus, Grafana, and the Python app, and is useful for running the services in containers.
- Kubernetes (Optional): If you plan to deploy the app in a Kubernetes cluster, ensure you have a working Kubernetes environment.
- Pip: For managing Python packages and installing dependencies.
- Pipenv (Optional): For virtual environment and dependency management, if you prefer using Pipenv over pip.
You'll need to install the following Python libraries:
- transformers: For Hugging Face's BERT model, which is used to classify log events based on their content.
- prometheus-client: For exposing log metrics to Prometheus.
- torch: The PyTorch library is used to run the Hugging Face BERT model. It provides an efficient and flexible way to run and classify log events.
- flask: The Flask web framework is used to create a simple web API for the AI-powered alerting system. The API allows you to send log messages for classification and trigger alerts if needed.
- gunicorn: Gunicorn is a WSGI HTTP server for running Python web applications like Flask in a production environment. It allows handling multiple requests efficiently, providing better performance and scalability compared to Flask's built-in development server.
- requests: For sending Slack notifications (optional).
- smtplib: For sending email notifications (optional). It is part of the Python standard library, so it needs no separate installation.
- Flask: Flask is a lightweight web framework, well suited to building and exposing APIs, especially in a development or small-scale environment. It provides easy setup and flexibility for defining routes and handling requests.
- Gunicorn: Flask's built-in development server is not suitable for production, as it is not designed for security, stability, or high-performance workloads. Gunicorn, a robust WSGI server, is typically used in production environments. It runs the Flask app across multiple worker processes, so concurrent requests are handled more effectively.
CONCLUSION |
---|
In short, Flask handles the logic of the web application, while Gunicorn ensures that the application can serve requests at scale in a production environment. |
Here's the structure of the project:
.
├── docker-compose.yml
├── k8s
│   ├── grafana-deployment.yaml
│   ├── grafana-service.yaml
│   ├── prometheus-configmap.yaml
│   ├── prometheus-deployment.yaml
│   ├── prometheus-pvc.yaml
│   ├── prometheus-service.yaml
│   ├── python-app-deployment.yaml
│   └── python-app-service.yaml
├── LICENSE
├── my_app
│   ├── app.py
│   ├── Dockerfile.app
│   ├── requirements.txt
│   ├── start_app.py
│   └── static
│       └── favicon.ico
├── prometheus-grafana
│   ├── alert_rules.yml
│   ├── Dockerfile.grafana
│   └── prometheus.yml
├── Prometheus_Grafana_Python_Hugging_Face.png
└── README.md
5 directories, 20 files
git clone https://github.com/meleksabit/ai-powered-alerting-system.git
cd ai-powered-alerting-system
Step 2: Install Python dependencies (only for manual installation, without Docker or Docker Compose)
Install the required Python libraries using pip:
pip install -r my_app/requirements.txt
You can run the application using Docker or Docker Compose.
This will set up the Python app, Prometheus, and Grafana services in containers.
Run the following command to start the services:
docker-compose up --build
- docker-compose up: Starts the services defined in the docker-compose.yml file.
- --build: Builds the images before starting the containers. You can skip --build for subsequent runs if no changes are made to the Dockerfiles or dependencies.
The services will be available at:
- Prometheus: Accessible at http://localhost:9090
- Grafana: Accessible at http://localhost:3000
- Python app:
  - Flask app running on http://localhost:5000
  - Prometheus metrics exposed at http://localhost:8000/metrics
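Once the containers are up, a quick reachability check can confirm that each endpoint answers. The following is a minimal sketch using only the Python standard library; the names `SERVICES`, `probe`, and `report` are illustrative and not part of the repository.

```python
import urllib.error
import urllib.request

# Endpoints listed above; adjust if you changed the port mappings.
SERVICES = {
    "prometheus": "http://localhost:9090",
    "grafana": "http://localhost:3000",
    "flask-app": "http://localhost:5000",
    "metrics": "http://localhost:8000/metrics",
}


def probe(url, timeout=2.0):
    """Return True if the URL answers with any HTTP response, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def report(services=SERVICES, check=probe):
    """Return one status line per service."""
    return [f"{name}: {'up' if check(url) else 'down'}" for name, url in services.items()]
```

Calling `print("\n".join(report()))` after `docker-compose up` prints one line per service. A service reported as down may simply still be starting, so retry after a few seconds before investigating.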
You can also manually install Prometheus and Grafana on your local machine. Follow the links below for instructions:
# Use a slim version of Python to reduce image size
FROM python:3.11-slim-buster
# Install necessary system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Set the working directory in the container
WORKDIR /app
# Copy requirements file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Preload Hugging Face models to avoid downloading on startup
RUN python -c "from transformers import AutoModelForSequenceClassification, AutoTokenizer; \
AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'); \
AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')"
# Copy the rest of the application code
COPY . .
# Expose necessary ports for Flask (5000) and Prometheus metrics (8000)
EXPOSE 5000
EXPOSE 8000
# Run the application (starting both Prometheus and Gunicorn from Python)
CMD ["python", "start_app.py"]
# Use Official Grafana image
FROM grafana/grafana:latest
# Set environment variables if needed
ENV GF_SECURITY_ADMIN_PASSWORD=admin
# Expose Grafana port
EXPOSE 3000
# Set the default command
CMD ["grafana-server", "--homepath=/usr/share/grafana", "--config=/etc/grafana/grafana.ini"]
Here's the docker-compose.yml that sets up Prometheus, Grafana, and the Python app:
services:
  # Prometheus service
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus-grafana/prometheus.yml:/etc/prometheus/prometheus.yml  # Mount config
      - ./prometheus-grafana/alert_rules.yml:/etc/prometheus/alert_rules.yml  # Mount alert rules
      - /home/angel3/data/:/etc/prometheus/data  # Data storage
    user: "65534"  # Run Prometheus as `nobody`
    ports:
      - "9090:9090"  # Expose Prometheus on port 9090
    command: ["--config.file=/etc/prometheus/prometheus.yml", "--storage.tsdb.path=/etc/prometheus/data"]
    restart: unless-stopped
    networks:
      - monitor-net
  # Grafana service
  grafana:
    build:
      context: ./prometheus-grafana
      dockerfile: Dockerfile.grafana
    ports:
      - "3000:3000"  # Expose Grafana on port 3000
    restart: unless-stopped
    networks:
      - monitor-net
  # Python Flask app service
  python-app:
    build:
      context: ./my_app
      dockerfile: Dockerfile.app
    ports:
      - "5000:5000"  # Expose Flask app on port 5000
      - "8000:8000"  # Expose Prometheus metrics on port 8000
    volumes:
      - ./my_app:/app  # Mount app source code
    restart: unless-stopped
    depends_on:
      - prometheus
      - grafana
    networks:
      - monitor-net
# Define a shared network
networks:
  monitor-net:
    driver: bridge
# Define a volume for Prometheus data storage
volumes:
  prometheus_data:
Edit the prometheus-grafana/prometheus.yml file to add a scrape config for your Python app, which exposes metrics on port 8000 (reachable as python-app:8000 from the Prometheus container):
# Global settings
global:
  scrape_interval: 15s  # Scrape every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds
# Alertmanager configuration (if using Alertmanager)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # Define Alertmanager target
# Reference to rule files
rule_files:
  - "/etc/prometheus/alert_rules.yml"  # Points to your alert rules file
# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # Scrape metrics from the Python AI-powered alerting app (now on port 8000)
  - job_name: "ai-powered-alerting-app"
    static_configs:
      - targets: ["python-app:8000"]  # Python app exposing metrics on port 8000
- Scrapes metrics from your Python app (ai-powered-alerting-app) at python-app:8000, the app's service name on the Compose network.
- Includes the alert_rules.yml file for Prometheus to evaluate alert rules.
Create the alert_rules.yml file in your Prometheus configuration directory (/etc/prometheus/).
alert_rules.yml:
groups:
  - name: critical_alert_rules
    rules:
      - alert: CriticalLogAlert
        # The label name must match the one the app sets: log_severity.labels(severity=...)
        expr: log_severity{severity="critical"} > 0  # Alert when critical logs are detected
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Critical log detected"
          description: "A critical log event was detected in the AI-powered alerting system."
- prometheus.yml: This file tells Prometheus to scrape metrics from both Prometheus itself and the AI-powered alerting app (your Python app).
- alert_rules.yml: This file defines alerting rules that notify you when a critical log event is detected (based on the log_severity metric exposed by the Python app).
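The `for: 1m` clause means the expression must stay true across consecutive evaluations for a full minute before the alert fires; until then the alert is only pending. The sketch below simulates that behavior in plain Python under simplified assumptions (one sample per evaluation, no missing data). The function name `alert_states` is hypothetical and this is not Prometheus code.

```python
def alert_states(samples, for_seconds):
    """Return the alert state at each evaluation.

    samples: list of (timestamp_seconds, value) pairs, one per rule evaluation.
    for_seconds: how long the condition (value > 0) must hold before firing.
    """
    states = []
    active_since = None  # Timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > 0:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Condition no longer holds, so the timer resets.
            active_since = None
            states.append("inactive")
    return states


# Evaluations every 15 seconds, matching evaluation_interval: 15s
samples = [(0, 0), (15, 1), (30, 1), (45, 1), (60, 1), (75, 1)]
print(alert_states(samples, for_seconds=60))
```

With these samples the condition first holds at 15 seconds, so the alert stays pending through 60 seconds and fires at the 75-second evaluation. A single critical log that clears within the minute would never leave the pending state.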
In the my_app/app.py file, we'll load the BERT model from Hugging Face and classify log messages.
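import logging

from prometheus_client import Gauge
from transformers import pipeline

logging.basicConfig(level=logging.INFO)

# Assumed definition (not shown in the original snippet): a metric labeled by severity
log_severity = Gauge("log_severity", "Classified log events by severity", ["severity"])

classifier = None  # Created on first use by lazy_load_model()

def lazy_load_model():
    """Load Hugging Face's DistilBERT pipeline once (sentiment analysis as a placeholder)."""
    global classifier
    if classifier is None:
        classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

def classify_log_event(log_message):
    """
    Classify log messages using Hugging Face DistilBERT model for sentiment analysis.
    Lazily loads the model and tokenizer if they are not already loaded.
    """
    lazy_load_model()
    result = classifier(log_message)
    # Determine severity based on sentiment
    if result[0]['label'] == 'POSITIVE':
        severity = 'not_critical'
    else:
        severity = 'critical'
    log_severity.labels(severity=severity).inc()
    logging.info(f"Classified log '{log_message}' as {severity}")
    return severity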
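For local testing without downloading the model, the label-to-severity mapping can be exercised with a stand-in classifier that returns results in the same shape as the Hugging Face pipeline (a list of dicts with `label` and `score`). The sketch below is only an illustration: `fake_classifier`, its keyword list, and `severity_from_result` are hypothetical names, not part of app.py.

```python
# Illustrative keywords only; the real model does not work from a keyword list.
NEGATIVE_HINTS = ("injection", "vulnerability", "failed", "error")


def fake_classifier(log_message):
    """Stand-in for the pipeline: same output shape, keyword-based labels."""
    text = log_message.lower()
    label = "NEGATIVE" if any(hint in text for hint in NEGATIVE_HINTS) else "POSITIVE"
    return [{"label": label, "score": 1.0}]


def severity_from_result(result):
    """Map a sentiment label to a severity, mirroring classify_log_event."""
    return "not_critical" if result[0]["label"] == "POSITIVE" else "critical"


for message in ("User logged in successfully", "SQL injection attempt detected in API"):
    print(message, "->", severity_from_result(fake_classifier(message)))
```

Because the stand-in keeps the pipeline's output shape, the same mapping function can later be pointed at the real classifier without changes.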
Now you can run the AI-powered alerting system:
docker-compose up --build
The Python app will expose Prometheus metrics at http://localhost:8000/metrics. Prometheus will scrape these metrics to monitor the log severity levels (e.g., critical, not_critical).
- Metrics URL: http://localhost:8000/metrics
Prometheus will automatically scrape this endpoint based on the scrape configuration.
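To see what Prometheus reads from that endpoint, the text exposition format can be parsed by hand. This is a minimal sketch with only the standard library; the sample payload is an assumption about what the app emits (the help text and metric type are made up for illustration), and if the app defines the metric as a Counter the client library may expose it as `log_severity_total` instead.

```python
import re

# Matches lines such as: log_severity{severity="critical"} 3.0
SAMPLE_RE = re.compile(r'^log_severity\{severity="(?P<severity>[^"]+)"\}\s+(?P<value>\S+)')


def parse_log_severity(metrics_text):
    """Extract {severity: value} from Prometheus text-format output."""
    values = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if match:
            values[match.group("severity")] = float(match.group("value"))
    return values


# Hypothetical payload in the shape served at /metrics
sample = """# HELP log_severity Classified log events by severity
# TYPE log_severity gauge
log_severity{severity="critical"} 2.0
log_severity{severity="not_critical"} 5.0
"""
print(parse_log_severity(sample))
```

Against a running app, the same function can be fed the body of `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`.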
You can test the log classification functionality by generating various log messages through the app's HTTP API.
- Use the /log/<message> endpoint to send log messages to be classified by the Hugging Face BERT model.
- The model will classify each log as either critical or not critical, based on the message's sentiment (this uses a sentiment analysis model as a placeholder).
Example log classifications:
- Test Log 1: Classifying a user log-in message as "not critical":
curl http://localhost:5000/log/User%20logged%20in%20successfully
- Test Log 2: Classifying an SQL injection attempt as "critical":
curl http://localhost:5000/log/SQL%20injection%20attempt%20detected%20in%20API
- Test Log 3: Classifying a critical vulnerability detection as "critical":
curl http://localhost:5000/log/Critical%20vulnerability%20found%20in%20package%20xyz
Each of these log messages will be classified by the AI-powered system, and the classification will be reflected in the Prometheus metrics.
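The curl commands above percent-encode spaces by hand. When sending many test messages, the encoding can be left to Python. This is a small sketch using the standard library; `build_log_url` is a hypothetical helper name, and the base URL assumes the default port mapping.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # Flask app address from the default port mapping


def build_log_url(message, base_url=BASE_URL):
    """Percent-encode the message so it is safe as a single URL path segment."""
    return f"{base_url}/log/{quote(message, safe='')}"


for message in (
    "User logged in successfully",
    "SQL injection attempt detected in API",
    "Critical vulnerability found in package xyz",
):
    print(build_log_url(message))
```

Each printed URL matches the corresponding curl example and can be requested with `urllib.request.urlopen` once the app is running.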
The Python app automatically updates the Prometheus metric log_severity with the corresponding severity label (critical or not_critical), which Prometheus will scrape.
You can now set up Grafana to visualize and alert based on the log_severity metrics.
- Open Grafana: Access Grafana by navigating to http://localhost:3000 in your browser.
- Add Data Source: Add Prometheus as the data source in Grafana:
  - Name: Prometheus
  - Type: Prometheus
  - URL: http://prometheus:9090 (use the container name, prometheus, when Grafana and Prometheus are both running in Docker)
- Create a Dashboard:
  - Build a dashboard in Grafana to visualize the log severity metrics being scraped from Prometheus.
  - For example, create a time series graph to display the metric log_severity with labels for critical and not_critical logs.
- Set Up Alerts:
  - Create an alert rule in Grafana to send notifications when the log_severity metric for critical logs exceeds 0.
Example Grafana alert rule:
# Condition: Trigger an alert if any critical logs are detected
expr: log_severity{severity="critical"} > 0
for: 1m
labels:
  severity: "critical"
annotations:
  summary: "Critical log detected"
  description: "A critical log was detected in the application"
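Adding the Prometheus data source can also be scripted against Grafana's HTTP API instead of clicking through the UI. The sketch below only builds the request and does not send it; it assumes the `admin`/`admin` credentials implied by Dockerfile.grafana and the `/api/datasources` endpoint, and `build_datasource_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url="http://localhost:3000", user="admin", password="admin"):
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",  # Container name, as in the steps above
        "access": "proxy",  # Grafana's backend performs the queries
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_datasource_request()
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would submit it. Change the default password before using anything like this outside a local demo.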
After setting up Prometheus and Grafana with the Python AI-powered alerting system, you'll be able to:
- Monitor Logs:
  - View the log severity metrics in Grafana to monitor the number of critical and non-critical logs processed by the system.
- Trigger Alerts:
  - Grafana will trigger alerts based on the log_severity metric.
  - Only logs classified as critical by the BERT model will trigger alerts, reducing noise and focusing on important events.
You can also deploy the system using Kubernetes. This section includes the Kubernetes manifests for deploying Prometheus, Grafana, and the Python app.
- Apply the Kubernetes manifests:
kubectl apply -f k8s/
- Scale the Python app: If you want to scale the Python app deployment, run:
kubectl scale deployment python-app --replicas=3
Below are the Kubernetes manifest files located in the k8s/ directory:
# k8s/python-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: python-app
  template:
    metadata:
      labels:
        app: python-app
    spec:
      containers:
        - name: python-app
          image: angel3/ai-powered-alerting-system:v1.0.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"
          env:
            - name: SENDER_EMAIL
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: sender-email
            - name: NOTIFICATION_RECEIVER
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: notification-receiver
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: SLACK_BOT_TOKEN
            - name: SLACK_SIGNING_SECRET
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: SLACK_SIGNING_SECRET
          ports:
            - containerPort: 5000
          startupProbe:
            httpGet:
              path: /startup
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /readiness
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
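The startupProbe values in this Deployment bound how long the container may take to come up, which matters when the model is slow to load. As a rough worked calculation (an approximation of the kubelet's behavior, not an exact schedule), the budget is the initial delay plus one probe period per allowed failure. The function name below is hypothetical.

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate upper bound on the time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


# Values taken from the startupProbe in the Deployment above
budget = startup_budget_seconds(initial_delay=30, period=10, failure_threshold=5)
print(f"Startup budget: about {budget} seconds")
```

If the app needs longer than this to answer on /startup, raising failureThreshold or periodSeconds extends the budget without delaying the first check.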
# k8s/python-app-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: python-app-service
spec:
  type: NodePort
  selector:
    app: python-app
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
# k8s/prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config  # Reference the ConfigMap
# k8s/prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
  type: NodePort
# k8s/prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: default
data:
  prometheus.yml: |
    # Global settings
    global:
      scrape_interval: 15s  # Scrape every 15 seconds
      evaluation_interval: 15s  # Evaluate rules every 15 seconds
    # Alertmanager configuration (if using Alertmanager)
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']  # Define Alertmanager target if in use
    # Reference to rule files
    rule_files:
      - "/etc/prometheus/alert_rules.yml"  # Points to the alert rules file
    # Scrape configurations
    scrape_configs:
      # Scrape Prometheus itself
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]
      # Scrape metrics from the Python AI-powered alerting app via localhost (requires port-forwarding)
      - job_name: "ai-powered-alerting-app"
        static_configs:
          - targets: ["localhost:8000"]  # Python app exposing metrics, accessible on localhost via port-forwarding
# k8s/prometheus-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi  # Adjust storage size as needed
# k8s/grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
# k8s/grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
spec:
  selector:
    app: grafana
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  type: NodePort
Note
These manifest files help you set up the Python app, Prometheus, and Grafana in a Kubernetes cluster.
Tip
Handling Gunicorn Worker Timeouts
If you encounter issues such as worker timeouts in Gunicorn (e.g., WORKER TIMEOUT errors in the logs), you can adjust the worker timeout directly in the start_app.py script. The current configuration in start_app.py sets a timeout of 30 seconds, which can be increased if necessary to prevent premature worker timeouts during long-running processes or slow startup times.
The configuration looks like this:
options = {
    'bind': '0.0.0.0:5000',
    'workers': 4,
    'timeout': 30,  # Default timeout set to 30 seconds
}
Raising the timeout value in this script gives the Gunicorn workers more time to finish long-running requests or slow startups before they are restarted.
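One way to avoid editing the script for every environment is to read the timeout from an environment variable and fall back to the current value. This is a sketch only: `GUNICORN_TIMEOUT` is a hypothetical variable name, `build_gunicorn_options` is not part of start_app.py, and the other settings simply mirror the options shown above.

```python
import os


def build_gunicorn_options(env=None):
    """Build the Gunicorn options dict, letting an environment variable override the timeout."""
    env = os.environ if env is None else env
    return {
        "bind": "0.0.0.0:5000",
        "workers": 4,
        # GUNICORN_TIMEOUT is a hypothetical variable name; 30 matches the current setting.
        "timeout": int(env.get("GUNICORN_TIMEOUT", "30")),
    }


print(build_gunicorn_options({"GUNICORN_TIMEOUT": "120"}))
```

With this approach the timeout could be set per deployment, for example through an `env` entry in the Kubernetes Deployment or an `environment` key in docker-compose.yml.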