Prometheus Monitoring & Alerting

Monitor
Published: January 19, 2016

Modified: May 16, 2024

Prometheus 1 …open-source systems monitoring & alerting toolkit

Installation

Installation using containers 2

mkdir -p /etc/prometheus /srv/prometheus
# Simple Prometheus configuration …scrapes the local node-exporter
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'localhost'
    static_configs:
      - targets:
        - localhost:9100
EOF

# Create a pod and expose the Prometheus server port
podman pod create --publish '9090:9090' --name prometheus

# Add the node-exporter to the pod
podman run --pod=prometheus --name=node-exporter -d \
           --volume=/:/host:ro,rslave \
           quay.io/prometheus/node-exporter:v1.7.0 \
           --path.rootfs=/host

# Add the Prometheus server to the pod
podman run --pod=prometheus --name=prometheus-server -d \
           --volume=/etc/prometheus:/etc/prometheus:ro \
           --volume=/srv/prometheus:/prometheus:rw,U,Z \
           quay.io/prometheus/prometheus:v2.45.4

Basic life-cycle…

podman pod list --ctr-names --filter name=prometheus
podman pod <restart|stop|start> prometheus

# Resource usage
pod_id=$(podman pod list --quiet --filter name=prometheus)
podman pod stats $pod_id
podman pod top $pod_id

User systemd service units:

mkdir -p $HOME/.config/systemd/user && cd $HOME/.config/systemd/user
podman generate systemd --new --files --name prometheus
systemctl --user daemon-reload
systemctl --user enable --now pod-prometheus.service

Find a complete Vagrant example using Podman Pods on GitHub 3.

Configuration

Configuration file …option --config.file

  • …file written in YAML format …example prometheus.yml
  • …details about the configuration in the Prometheus documentation 4
  • Top-level configuration sections:
    • global …set defaults
    • scrape_configs, scrape_config_files …which targets to monitor

Scrape

When Prometheus scrapes a target, the following labels are attached automatically:

  • job …configured job_name that the target belongs to
  • instance …<host>:<port> part of the target’s URL that was scraped

List scrape targets in the configuration…

  • …statically configured with static_configs
  • …dedicated files with scrape_config_files
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s 
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'localhost'
    static_configs:
    - targets:
      - localhost:9090
      - localhost:9100

scrape_config_files:
  - '/etc/prometheus/scrape_config.d/*.yml'
EOF

Example of a dedicated scrape configuration file:

cat > /etc/prometheus/scrape_config.d/promlab.yml <<EOF
scrape_configs:
  - job_name: 'promlab'
    static_configs:
    - targets:
      - demo.promlabs.com:10000
      - demo.promlabs.com:10001
      - demo.promlabs.com:10002
EOF
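
A dedicated file like the one above can also be generated from a list of targets. The following is a minimal sketch; the helper name gen_scrape_config is hypothetical and not part of Prometheus:

```shell
# Hypothetical helper: print a scrape configuration for a single job.
# Usage: gen_scrape_config <job_name> <host:port>...
gen_scrape_config() (
  job=$1
  shift
  printf 'scrape_configs:\n'
  printf "  - job_name: '%s'\n" "$job"
  printf '    static_configs:\n'
  printf '    - targets:\n'
  for target in "$@"; do
    printf '      - %s\n' "$target"
  done
)

# Example (produces the same content as the promlab.yml file above):
# gen_scrape_config promlab demo.promlabs.com:10000 demo.promlabs.com:10001 \
#   demo.promlabs.com:10002 > /etc/prometheus/scrape_config.d/promlab.yml
```

The function body runs in a subshell, so the variables job and target do not leak into the calling shell.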

Usage

After startup, the web interface should be available on port 9090:

Endpoint   Description
/targets   List of endpoints to scrape
/graph     Expression browser
/metrics   Metrics from Prometheus

Metric Types

Gauge …value that goes up & down …single time series

# HELP queue_length The number of elements in a queue
# TYPE queue_length gauge
queue_length 42

Counter …cumulative count over time (only goes up, never down)

  • …usually queried as a rate of increase averaged over a preceding time window
  • …a counter can only be reset to 0 (e.g. when the process restarts)
  • PromQL rate functions …rate(), irate(), increase()
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total 123987
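
The idea behind rate() can be illustrated offline. This sketch assumes input lines of the form "<unix_timestamp> <value>" and a hypothetical helper named counter_rate. It treats any decrease as a reset to 0 and omits the extrapolation to the window boundaries that Prometheus performs, so results will differ slightly from the real function:

```shell
# Hypothetical helper: average per-second increase of a counter.
# Reads "<timestamp> <value>" pairs, oldest first, from STDIN.
counter_rate() {
  awk '
    NR == 1 { first_t = $1 }
    NR > 1 {
      delta = $2 - prev_v
      if (delta < 0) delta = $2   # counter reset: count the increase from 0
      increase += delta
    }
    { prev_v = $2; last_t = $1 }
    END {
      if (NR < 2 || last_t == first_t) exit 1   # need two samples
      printf "%.3f\n", increase / (last_t - first_t)
    }'
}

# Example: five samples, 15 seconds apart, with one reset in between
# printf '0 100\n15 130\n30 10\n45 40\n60 70\n' | counter_rate
```

In the example the total increase is 30 + 10 + 30 + 30 = 100 over 60 seconds.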

Summary …distribution of numeric values

  • …as a percentile or quantile
  • …along with error margins
  • …collection of gauge and counter metrics
# HELP http_request_duration_seconds A summary of the HTTP request duration in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.90"} 0.564
http_request_duration_seconds{quantile="0.99"} 2.376
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 223423
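
Besides the quantiles, a summary allows computing the average observed value as _sum divided by _count. A minimal sketch that does this on exposition text; summary_mean is a hypothetical helper and assumes the _sum and _count lines carry no labels:

```shell
# Hypothetical helper: mean observation of a summary (or histogram),
# computed from the *_sum and *_count lines read from STDIN.
summary_mean() {
  awk '
    $1 ~ /_sum$/   { sum = $2 }
    $1 ~ /_count$/ { count = $2 }
    END {
      if (count == 0) exit 1   # no observations, mean is undefined
      printf "%.4f\n", sum / count
    }'
}

# Example with the values above: 88364.234 / 223423
# printf 'http_request_duration_seconds_sum 88364.234\nhttp_request_duration_seconds_count 223423\n' | summary_mean
```

This divides lifetime totals. On a live server the same idea is usually applied to the rate() of both series so that the average covers a recent time window only.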

Histogram …distribution of numeric values counted in a set of ranged buckets

  • …Prometheus uses cumulative histograms …each bucket includes the previous
  • le label (less than or equal) …indicates the upper boundary of a specific bucket
  • …each bucket creates one output time series …trade off between cost & resolution
  • PromQL function histogram_quantile()
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
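
How a quantile is estimated from cumulative buckets can be sketched with linear interpolation inside the bucket that contains the requested rank. This is a simplified illustration, not the Prometheus implementation; the helper name is hypothetical and the input is assumed to be a single histogram with buckets sorted by le and a lower bound of 0 for the first bucket:

```shell
# Hypothetical helper: estimate a quantile (0..1) from *_bucket lines on STDIN.
# Usage: histogram_quantile_sketch <quantile> < histogram.prom
histogram_quantile_sketch() {
  awk -v q="$1" '
    /_bucket[{]/ {
      match($0, /le="[^"]+"/)
      n++
      le[n]  = substr($0, RSTART + 4, RLENGTH - 5)   # upper bound of the bucket
      cnt[n] = $NF                                   # cumulative count
    }
    END {
      if (n < 2) exit 1
      rank = q * cnt[n]                              # last bucket (+Inf) holds the total
      for (i = 1; i <= n; i++) if (cnt[i] >= rank) break
      # rank falls into the +Inf bucket: report the highest finite bound
      if (le[i] == "+Inf") { print le[n - 1]; exit }
      lo   = (i > 1) ? le[i - 1]  : 0
      prev = (i > 1) ? cnt[i - 1] : 0
      printf "%.3f\n", lo + (le[i] - lo) * (rank - prev) / (cnt[i] - prev)
    }'
}
```

With the buckets above, the median has rank 72160, which lies in the bucket between 0.1 and 0.2, so the estimate is 0.1 + 0.1 × (72160 − 33444) / (100392 − 33444). More buckets give a finer estimate at the cost of more time series.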

Metrics Names & Labels

Notation, Metric and Label Naming

  • Metric names
    • …must match regex [a-zA-Z_:][a-zA-Z0-9_:]*
    • …identifies a time series, specifies a measured feature
    • …should allow a good guess as to what a metric means
    • …names for applications should generally be prefixed
    • …apply to exactly one subsystem and should be named accordingly
    • …should never be procedurally generated
  • Labels
    • …key-value pairs used by the query language for filtering and aggregation
    • …label name must match regex [a-zA-Z_][a-zA-Z0-9_]*
    • …label value may contain any Unicode character

Graph

PromQL …query language (not SQL) designed to read and compute metrics
https://prometheus.io/docs/querying/basics

Vector selectors:

<metric_name>                                    # select all time series for metric name
{__name__=~"<metric_name_regex>"}                # use a regex to match metric names
<metric_name>{<label_name>="<label_value>", ...} # filter on labels, operators: = != =~ !~ (the latter two match a regex)
<metric_name>{...}[<range>]                      # range selector, integer with unit suffix ms,s,m,h,d,w,y
<metric_name>{...} offset <duration>             # shift back relative to the current query evaluation time
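
Selectors can be tried from the command line through the HTTP API endpoint /api/v1/query. A minimal sketch, assuming a server reachable at http://localhost:9090 and node-exporter metrics such as node_load1 and node_cpu_seconds_total; prom_query and the PROM_URL variable are hypothetical names:

```shell
# Base URL of the Prometheus server (assumption, override as needed)
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant query, print the JSON response.
prom_query() {
  curl --silent --show-error --get "${PROM_URL}/api/v1/query" \
       --data-urlencode "query=$1"
}

# Examples:
# prom_query 'up'                                       # all series of a metric
# prom_query 'up{job="localhost"}'                      # filter by label
# prom_query '{__name__=~"node_load.*"}'                # match metric names by regex
# prom_query 'node_cpu_seconds_total{mode="idle"}[5m]'  # samples of the last 5 minutes
# prom_query 'node_load1 offset 1h'                     # value one hour ago
```

Single quotes around the expression keep the shell from interpreting braces and double quotes; --data-urlencode takes care of the URL encoding.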

Aggregation Operators & Query Functions
https://prometheus.io/docs/querying/functions

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

sum (calculate sum over dimensions)
min (select minimum over dimensions)
max (select maximum over dimensions)
avg (calculate the average over dimensions)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population standard variance over dimensions)
count (count number of elements in the vector)
count_values (count number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

Exporters

What is an exporter?

  • Libraries & services used to export monitoring metrics to Prometheus
  • Some exporters maintained by the Prometheus community on GitHub…

Among them the Prometheus Node-Exporter 5

# /usr/lib/systemd/system/prometheus-node-exporter.service
[Unit]
Description=Prometheus Node Exporter

[Service]
EnvironmentFile=/etc/default/prometheus-node-exporter
ExecStart=/usr/bin/node_exporter $OPTIONS
User=prometheus

[Install]
WantedBy=multi-user.target
  • …local configuration /etc/default/prometheus-node-exporter
  • …after startup the metrics are served on port 9100

Textfile Collector

  • Include custom Prometheus formatted metrics from text files
    • …exposed with every scrape via the integrated Textfile Collector
    • …set the --collector.textfile.directory flag on the node_exporter command-line
    • …parses all files in the specified directory matching the glob *.prom

Make sure to adjust the configuration of the node-exporter to read text files:

# /etc/default/prometheus-node-exporter
OPTIONS='--collector.textfile.directory /var/lib/node_exporter/textfile_collector'

Example

cat > dummy.prom <<EOF
# HELP dummy_counter Some dummy counter
# TYPE dummy_counter counter
dummy_counter 123
EOF

# start the exporter on a non-default port
node_exporter \
      --web.listen-address=":9999" \
      --web.disable-exporter-metrics \
      --collector.disable-defaults \
      --collector.textfile \
      --collector.textfile.directory="$PWD"

# query the metrics
curl localhost:9999/metrics
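
The exporter may read a file while it is still being written. A common precaution is to write to a temporary file in the same directory and rename it afterwards, since a rename within one file system is atomic. A minimal sketch; write_prom_file is a hypothetical helper:

```shell
# Hypothetical helper: atomically publish metrics read from STDIN.
# Usage: <command> | write_prom_file <directory> <name>
write_prom_file() (
  dir=$1
  name=$2
  # the temporary name does not match *.prom, so the collector ignores it
  tmp=$(mktemp "${dir}/.${name}.XXXXXX") || exit 1
  cat > "$tmp" &&
    chmod 0644 "$tmp" &&               # mktemp creates the file with mode 0600
    mv "$tmp" "${dir}/${name}.prom"
)

# Example:
# printf 'dummy_counter 123\n' | write_prom_file "$PWD" dummy
```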

Exposition Format

The text files use the Prometheus text-based exposition format

  • …line oriented …lines separated by a line feed
  • Each non-comment line describes a sample …syntax specified in extended Backus–Naur form (EBNF)
  • Comments …# as the first non-whitespace character …ignored unless followed by HELP or TYPE
    • # HELP …metric name followed by a documentation line
    • # TYPE …metric name followed by the metric type
  • Cf. text collector examples from the Prometheus community
# HELP metric_name Text describing the metric
# TYPE metric_name metric_type
metric_name{label_name="label_value", ...} metric_value
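
A collector script only needs to print these three lines per metric. A minimal sketch of a shell function that does so; emit_metric is a hypothetical helper, and the optional label argument is passed through unchanged, so the caller is responsible for quoting label values:

```shell
# Hypothetical helper: print one metric in the text exposition format.
# Usage: emit_metric <name> <type> <help text> <value> [<labels>]
emit_metric() (
  name=$1 type=$2 help=$3 value=$4 labels=${5:-}
  printf '# HELP %s %s\n' "$name" "$help"
  printf '# TYPE %s %s\n' "$name" "$type"
  if [ -n "$labels" ]; then
    printf '%s{%s} %s\n' "$name" "$labels" "$value"
  else
    printf '%s %s\n' "$name" "$value"
  fi
)

# Examples:
# emit_metric queue_length gauge 'The number of elements in a queue' 42
# emit_metric queue_length gauge 'The number of elements in a queue' 7 'queue="mail"'
```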

Systemd Units

Create a Systemd service unit that executes a script to collect metrics

  • …metrics printed to STDOUT …redirected to a file with StandardOutput
  • …metrics written to a file by the script …set WorkingDirectory
# /usr/lib/systemd/system/prometheus-storcli-textfile.service
[Unit]
Description=MegaRAID metrics for the Prometheus Node-Exporter

[Service]
Type=simple
ExecStart=/usr/sbin/prometheus-storcli-textfile
StandardOutput=file:/var/lib/node_exporter/textfile_collector/storecli.prom

[Install]
WantedBy=multi-user.target
# /usr/lib/systemd/system/prometheus-slurm-textfile.service
[Unit]
Description=Slurm metrics for the Prometheus Node-Exporter

[Service]
Type=oneshot
ExecStart=/usr/sbin/prometheus-slurm-sinfo-textfile
ExecStart=/usr/sbin/prometheus-slurm-squeue-textfile
ExecStart=/usr/sbin/prometheus-slurm-sprio-textfile
ExecStart=/usr/sbin/prometheus-slurm-sdiag-textfile
ExecStart=/usr/sbin/prometheus-slurm-sshare-textfile
WorkingDirectory=/var/lib/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target

Use a Systemd timer unit to run these scripts periodically

# /usr/lib/systemd/system/prometheus-storcli-textfile.timer
[Unit]
Description=Periodically generate MegaRAID metrics for the Prometheus Node-Exporter
RefuseManualStart=no
RefuseManualStop=no

[Timer]
Persistent=true
OnBootSec=120
OnUnitActiveSec=120
Unit=prometheus-storcli-textfile.service

[Install]
WantedBy=timers.target