Prometheus Monitoring & Alerting
Prometheus 1 …open-source systems monitoring & alerting toolkit
- …time-series database …stores operation and service metrics
- …metrics collected by HTTP requests from a central server
- …targets configuration read from service discovery (or local configuration)
- …Pushgateway acts as a metric cache for entities that need to push metrics
- …an endpoint you can scrape is called an instance
Installation
Installation using containers 2…
mkdir -p /etc/prometheus /srv/prometheus
# Simple Prometheus configuration …scrapes from the local node-exporter
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'localhost'
    static_configs:
      - targets:
          - localhost:9100
EOF
# Create a pod and expose the Prometheus server port
podman pod create --publish '9090:9090' --name prometheus
# Add the node-exporter to the pod
podman run --pod=prometheus --name=node-exporter -d \
--volume=/:/host:ro,rslave \
quay.io/prometheus/node-exporter:v1.7.0
# Add the Prometheus server to the pod
podman run --pod=prometheus --name=prometheus-server -d \
--volume=/etc/prometheus:/etc/prometheus:ro \
--volume=/srv/prometheus:/prometheus:rw,U,Z \
quay.io/prometheus/prometheus:v2.45.4
Basic life-cycle…
podman pod list --ctr-names --filter name=prometheus
podman pod <restart|stop|start> prometheus
# Resource usage
pod_id=$(podman pod list --quiet --filter name=prometheus)
podman pod stats $pod_id
podman pod top $pod_id
User systemd service units:
mkdir -p $HOME/.config/systemd/user && cd $HOME/.config/systemd/user
podman generate systemd --new --files --name prometheus
systemctl --user daemon-reload
systemctl --user enable --now pod-prometheus.service
Find a complete Vagrant example using Podman Pods on GitHub 3.
Configuration
Configuration file …option --config.file
- …file written in YAML format …example prometheus.yml
- …details about the configuration in the Prometheus documentation 4
- Top-level configuration sections:
  - global …set defaults
  - scrape_config* …which targets to monitor
Scrape
When Prometheus scrapes a target, the following labels are attached automatically:
- job …the configured job_name that the target belongs to
- instance …the <host>:<port> part of the target's URL that was scraped
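A minimal sketch of how the two automatic labels relate to a scrape target, assuming the default behaviour described above. The function name target_labels is hypothetical and not part of any Prometheus API; Prometheus derives these labels internally during a scrape.

```python
from urllib.parse import urlsplit


def target_labels(job_name: str, scrape_url: str) -> dict[str, str]:
    """Return the labels attached automatically to a scraped target (illustration only)."""
    parts = urlsplit(scrape_url)
    # netloc holds the <host>:<port> part of the scraped URL
    return {"job": job_name, "instance": parts.netloc}


labels = target_labels("localhost", "http://localhost:9100/metrics")
print(labels)
```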
List scrape targets in the configuration…
- …statically configured with static_configs
- …dedicated files with scrape_config_files
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'localhost'
    static_configs:
      - targets:
          - localhost:9090
          - localhost:9100
scrape_config_files:
  - '/etc/prometheus/scrape_config.d/*.yml'
EOF
Example of a dedicated scrape configuration file:
cat > /etc/prometheus/scrape_config.d/promlab.yml <<EOF
scrape_configs:
  - job_name: 'promlab'
    static_configs:
      - targets:
          - demo.promlabs.com:10000
          - demo.promlabs.com:10001
          - demo.promlabs.com:10002
EOF
Usage
After startup, the web interface should be available on port 9090:
Endpoint | Description |
---|---|
/targets | List of endpoints to scrape |
/graph | Expression browser |
/metrics | Metrics from Prometheus |
Metric Types
Gauge …value that goes up & down …single time series
# HELP queue_length The number of elements in a queue
# TYPE queue_length gauge
queue_length 42
Counter …cumulative count over time (only goes up, never down)
- …rate of increase averaged over preceding time window
- …counter can only reset to 0
- PromQL rate functions …rate(), irate(), increase()
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total 123987
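A simplified sketch of how a rate is derived from raw counter samples while tolerating resets to 0. The sample values and function names are hypothetical; the real PromQL rate() and increase() additionally extrapolate to the boundaries of the time window, which is left out here.

```python
def increase(samples: list[float]) -> float:
    """Total increase of a counter across consecutive samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # A drop means the counter restarted from 0,
            # so the whole current value counts as increase
            total += current
    return total


def rate(samples: list[float], window_seconds: float) -> float:
    """Average per-second increase over the window."""
    return increase(samples) / window_seconds


# Hypothetical samples; the counter resets after the third one
samples = [100.0, 130.0, 160.0, 10.0, 40.0]
print(increase(samples))
print(rate(samples, 60.0))
```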
Summary …distribution of numeric values
- …as a percentile or quantile
- …along with error margins
- …collection of gauge and counter metrics
# HELP http_request_duration_seconds A summary of the HTTP request duration in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.90"} 0.564
http_request_duration_seconds{quantile="0.99"} 2.376
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 223423
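The _sum and _count series of a summary behave like counters, so their ratio gives the mean observed value. A small sketch using the numbers from the example above; the variable names are only illustrative.

```python
# Values copied from the summary example
duration_sum = 88364.234   # http_request_duration_seconds_sum
duration_count = 223423    # http_request_duration_seconds_count

# Mean request duration since the process started
mean_duration = duration_sum / duration_count
print(f"mean request duration: {mean_duration:.4f}s")
```

In PromQL the same idea is usually applied to the rates of both series, which yields the mean over a recent time window instead of the whole process lifetime.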
Histogram …distribution of numeric values counted in a set of ranged buckets
- …Prometheus uses cumulative histograms …each bucket includes the previous
- …le label (less than or equal) …indicates upper boundary of a specific bucket
- …each bucket creates one output time series …trade-off between cost & resolution
- PromQL function histogram_quantile()
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
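A rough sketch of the estimation idea behind histogram_quantile(): locate the bucket containing the requested rank, then interpolate linearly inside it. This is a simplified Python illustration, not the Prometheus implementation. It assumes buckets sorted by upper boundary, ending with +Inf, and non-negative observations (the lowest bucket starts at 0).

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls into the +Inf bucket:
                # report the highest finite boundary instead
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation between the bucket boundaries
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Bucket counts copied from the histogram example
buckets = [
    (0.05, 24054), (0.1, 33444), (0.2, 100392),
    (0.5, 129389), (1.0, 133988), (math.inf, 144320),
]
p50 = histogram_quantile(0.5, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p50, p99)
```

With these buckets the median lands between 0.1 and 0.2 seconds, while the 99th percentile falls into the +Inf bucket and is therefore capped at the last finite boundary. Adding more buckets improves the resolution at the cost of more time series.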
Metrics Names & Labels
Notation, Metric and Label Naming…
- Metric names…
  - …must match regex [a-zA-Z_:][a-zA-Z0-9_:]*
  - …identifies a time series, specifies a measured feature
  - …should allow good guess as to what a metric means
  - …names for applications should generally be prefixed
  - …apply to exactly one subsystem and should be named accordingly
  - …should never be procedurally generated
- Labels…
  - …key-value pairs used by the query language for filtering and aggregation
  - …label name must match regex [a-zA-Z_][a-zA-Z0-9_]*
  - …label value may contain any Unicode character
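The two regular expressions above can be checked with a few lines of Python. The helper names are hypothetical; only the patterns come from the notes. The full name has to match, hence fullmatch.

```python
import re

# Patterns as given in the naming rules
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string is a valid metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """True if the whole string is a valid label name (no colons allowed)."""
    return LABEL_NAME_RE.fullmatch(name) is not None


for candidate in ["http_requests_total", "job:http_requests:rate5m", "2xx_responses"]:
    print(candidate, is_valid_metric_name(candidate), is_valid_label_name(candidate))
```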
Graph
PromQL …query language (not SQL) designed to read and compute metrics
https://prometheus.io/docs/querying/basics
Vector selectors:
<metric_name> # select all time series for metric name
{__name__=~"<metric_name_regex>"} # use a regex to match for metric names
<metric_name>{<label_name>=<label_value>, ...} # filter for label=value, supports = != =~ !~ regex
<metric_name>{...}[<range>] # integer time range selector with suffix s,m,h,d,w,y
<metric_name>{...} offset <duration> # shift back relative to the current query evaluation time
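A sketch of the semantics of the four label matchers, applied to hypothetical series. Two details are modelled on purpose: regex matchers are fully anchored, and a missing label is treated like an empty value. Python's re module stands in for the RE2 syntax Prometheus uses, so exotic patterns may behave differently.

```python
import re


def matches(labels: dict[str, str], name: str, op: str, value: str) -> bool:
    """Evaluate one label matcher against the label set of a series."""
    actual = labels.get(name, "")  # missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # anchored at both ends
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


# Hypothetical series of one metric
series = [
    {"job": "localhost", "instance": "localhost:9090"},
    {"job": "localhost", "instance": "localhost:9100"},
    {"job": "promlab", "instance": "demo.promlabs.com:10000"},
]

# Equivalent of the selector {instance=~"localhost:.*"}
selected = [s for s in series if matches(s, "instance", "=~", "localhost:.*")]
print(selected)
```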
Aggregation Operators & Query Functions
https://prometheus.io/docs/querying/functions
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
sum (calculate sum over dimensions)
min (select minimum over dimensions)
max (select maximum over dimensions)
avg (calculate the average over dimensions)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population standard variance over dimensions)
count (count number of elements in the vector)
count_values (count number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
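A sketch of what an aggregation such as sum by (job) does to an instant vector: samples are grouped by the listed labels and all other labels are dropped. The data and the function name sum_by are hypothetical.

```python
from collections import defaultdict


def sum_by(vector: list[tuple[dict[str, str], float]], by: list[str]) -> dict[tuple, float]:
    """Sum sample values per group, where a group is defined by the 'by' labels."""
    groups: dict[tuple, float] = defaultdict(float)
    for labels, value in vector:
        # Only the grouping labels survive the aggregation
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


# Hypothetical instant vector: (labels, value) pairs
vector = [
    ({"job": "api", "instance": "a:80"}, 10.0),
    ({"job": "api", "instance": "b:80"}, 5.0),
    ({"job": "web", "instance": "c:80"}, 7.0),
]
result = sum_by(vector, ["job"])
print(result)
```

The other operators follow the same grouping step and only differ in the function applied per group.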
Exporters
What is an exporter?
- Libraries & services used to export monitoring metrics to Prometheus
- Some exporters maintained by the Prometheus community on GitHub…
Among them the Prometheus Node-Exporter 5…
# /usr/lib/systemd/system/prometheus-node-exporter.service
[Unit]
Description=Prometheus Node Exporter
[Service]
EnvironmentFile=/etc/default/prometheus-node-exporter
ExecStart=/usr/bin/node_exporter $OPTIONS
User=prometheus
[Install]
WantedBy=multi-user.target
- …local configuration /etc/default/prometheus-node-exporter
- …after startup the interface is available on port 9100
Textfile Collector
- Include custom Prometheus formatted metrics from text files…
- …query responses via the integrated Textfile Collector
- …set the --collector.textfile.directory flag on the node_exporter command-line
- …parses all files in the specified directory matching the glob *.prom
Make sure to adjust the configuration of the node-exporter to read textfiles:
# /etc/default/prometheus-node-exporter
OPTIONS='--collector.textfile.directory /var/lib/node_exporter/textfile_collector'
Example
cat > dummy.prom <<EOF
# HELP dummy_counter Some dummy counter
# TYPE dummy_counter gauge
dummy_counter 123
EOF
# start the exporter on a non-default port
node_exporter \
--web.listen-address=":9999" \
--web.disable-exporter-metrics \
--collector.disable-defaults \
--collector.textfile \
--collector.textfile.directory $PWD
# query the metrics
curl localhost:9999/metrics
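Scripts that generate .prom files should avoid exposing half-written files to the collector. A common approach, sketched here in Python, is to write to a temporary file in the same directory and rename it afterwards. The function name write_prom_file is hypothetical, and a temporary directory stands in for the real textfile directory.

```python
import os
import tempfile


def write_prom_file(directory: str, filename: str, content: str) -> str:
    """Write metrics to <directory>/<filename> via a temporary file and a rename."""
    final_path = os.path.join(directory, filename)
    # The temporary name does not end in .prom, so the *.prom glob skips it
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)         # readable by the node_exporter user
        os.replace(tmp_path, final_path)  # atomic within one filesystem
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
    return final_path


metrics = (
    "# HELP dummy_counter Some dummy counter\n"
    "# TYPE dummy_counter gauge\n"
    "dummy_counter 123\n"
)

with tempfile.TemporaryDirectory() as demo_dir:  # stand-in for the textfile directory
    path = write_prom_file(demo_dir, "dummy.prom", metrics)
    with open(path) as handle:
        written = handle.read()
    leftovers = sorted(os.listdir(demo_dir))

print(written)
```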
Exposition Format
The text files use a text-based exposition format…
- …line oriented …separated by a line feed
- Each line describes a sample …syntax specified in extended Backus–Naur form (EBNF)
- Comments …# as the first non-whitespace character …ignored unless
  - …# HELP …metric name followed by documentation line
  - …# TYPE …metric name followed by the metric type
- Cf. text collector examples from the Prometheus community
# HELP metric_name Text describing the metric
# TYPE metric_name metric_type
metric_name{label_name="label_value", ...} metric_value
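A small sketch that renders a single sample line in the text format. Label values are enclosed in double quotes, and backslash, double quote and line feed inside a value have to be escaped. The helper names are hypothetical, and sorting the labels is only done to get stable output.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and line feed inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line: name, optional {labels}, then the value."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


print(format_sample("dummy_counter", {}, 123))
print(format_sample("http_requests_total", {"method": "GET", "path": '/a"b'}, 5))
```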
Systemd Units
Create a Systemd service unit to execute a script to collect metrics
- …metrics printed to STDOUT …redirected with StandardOutput
- …written to a file …set WorkingDirectory
# /usr/lib/systemd/system/prometheus-storcli-textfile.service
[Unit]
Description=MegaRAID metrics for the Prometheus Node-Exporter
[Service]
Type=simple
ExecStart=/usr/sbin/prometheus-storcli-textfile
StandardOutput=file:/var/lib/node_exporter/textfile_collector/storecli.prom
[Install]
WantedBy=multi-user.target
# /usr/lib/systemd/system/prometheus-slurm-textfile.service
[Unit]
Description=Slurm metrics for the Prometheus Node-Exporter
[Service]
Type=oneshot
ExecStart=/usr/sbin/prometheus-slurm-sinfo-textfile
ExecStart=/usr/sbin/prometheus-slurm-squeue-textfile
ExecStart=/usr/sbin/prometheus-slurm-sprio-textfile
ExecStart=/usr/sbin/prometheus-slurm-sdiag-textfile
ExecStart=/usr/sbin/prometheus-slurm-sshare-textfile
WorkingDirectory=/var/lib/prometheus/node-exporter/textfile_collector
[Install]
WantedBy=multi-user.target
Use a Systemd timer unit to run these scripts periodically
# /usr/lib/systemd/system/prometheus-storcli-textfile.timer
[Unit]
Description=Periodically generate MegaRAID metrics for the Prometheus Node-Exporter
RefuseManualStart=no
RefuseManualStop=no
[Timer]
Persistent=true
OnBootSec=120
OnUnitActiveSec=120
Unit=prometheus-storcli-textfile.service
[Install]
WantedBy=timers.target
Footnotes
1. Prometheus Project
   https://github.com/prometheus/prometheus
   https://prometheus.io/docs
2. Installation, Prometheus Documentation
   https://prometheus.io/docs/prometheus/latest/installation/
3. Vagrant Prometheus Example, GitHub
   https://github.com/vpenso/vagrant-playground/tree/master/prometheus/podman
4. Configuration, Prometheus Documentation
   https://prometheus.io/docs/prometheus/latest/configuration/configuration
5. Prometheus Node Exporter
   https://github.com/prometheus/node_exporter