Since we are on the subject of metrics and Prometheus, I would like to open this article with: “Yes… yet another DevOpsTribe article, the Deadman’s switch of DevOps chores.”

Recently I had to use a few Chaos Engineering tools to carry a very interesting deliverable over the line.

Scenario

An OpenShift 4.10 cluster with the Prometheus stack. Prometheus exposes a huge number of metrics, which can confuse operations teams and raise doubts about which data matters more than the rest. Measuring and monitoring every object is useful, but at some point you realize you want to select only the metrics that express the operational health of compute nodes, master nodes, and pods in general.

Anyone who works with this kind of technology, and with K8s and its distributions in general, knows more or less which alerts matter most, but it is useful to structure a deterministic process that generates that list automatically.
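
For example, on the OpenShift stack above that list can be produced straight from the monitoring APIs while an experiment is running. A minimal sketch, assuming a logged-in oc session with permission to query the default thanos-querier route in openshift-monitoring, and jq installed on the workstation:

# Dump the names of the alerts currently firing, to diff before/during/after an experiment
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -skG -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=ALERTS{alertstate="firing"}' \
  "https://$HOST/api/v1/query" \
  | jq -r '.data.result[].metric.alertname' | sort -u

Running it before, during, and after each experiment gives a reproducible diff of which rules actually reacted.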

Why Chaos Engineering Can Help

Chaos Engineering is the discipline of injecting entropy into a system to verify its resilience. In my case, though, what matters is a specific angle of it: the cause-and-effect relationship between a chaos experiment and the controlled production of alerts according to the rules defined in Prometheus.
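
Concretely, that cause-and-effect link can be read from the built-in ALERTS series, which records when each rule starts and stops firing. A query like the following, as a sketch, lists everything that fired during the experiment window (the 30m range is just an example and should match the duration of the experiment):

count by (alertname, severity) (max_over_time(ALERTS{alertstate="firing"}[30m]))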

Which Tool to Choose?

There are really a lot of them. It is always a good idea to pick something that is supported and present in the CNCF landscape. In my case I chose the legendary Chaos Mesh.

Of course, I want to be clear that everything you can do with Chaos Engineering tools can also be implemented with commands that already exist: think of dd, kill, stress-ng, fio, and others. The point is that with raw commands you lose the whole layer that makes the data produced by an experiment easy to understand. It is a bit like overusing “shell” statements in an Ansible playbook when ad-hoc modules already exist that take care of important aspects such as idempotency.
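
To make that comparison concrete, this is roughly what the memory pressure used in the first experiment below looks like when declared as a Chaos Mesh resource. It is only a sketch: name, namespace, and selector are purely illustrative, and the chaosd commands used in the rest of the article act directly at node level instead.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mem-stress-demo        # illustrative name
  namespace: chaos-testing     # illustrative namespace
spec:
  mode: one                    # pick one pod matching the selector
  selector:
    namespaces:
      - my-workload            # illustrative target namespace
  stressors:
    memory:
      workers: 2               # same shape as stress-ng --vm 2
      size: "15GB"             # same shape as stress-ng --vm-bytes 15G
  duration: "10m"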

Below are a few Chaos Engineering experiments and the Prometheus alerts they produced.

MEM Attack (/usr/local/chaosd-v1.0.0-linux-amd64/tools/stress-ng --vm 2 --vm-bytes 15G)

It generates heavy RAM usage.

Alerts

name: etcdMembersDown
expr: max without(endpoint) (sum without(instance) (up{job=~".*etcd.*"} == bool 0) or count without(To) (sum without(instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[2m])) > 0.01)) > 0
for: 10m
labels:
  severity: critical
annotations:
  description: etcd cluster "{{ $labels.job }}": members are down ({{ $value }}).
  runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdMembersDown.md
  summary: etcd cluster members are down.

name: etcdNoLeader
expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
for: 1m
labels:
  severity: critical
annotations:
  description: etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.
  runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdNoLeader.md
  summary: etcd cluster has no leader.

name: TargetDown
expr: 100 * (count by(job, namespace, service) (up == 0 unless on(node) max by(node) (kube_node_spec_unschedulable == 1)) / count by(job, namespace, service) (up unless on(node) max by(node) (kube_node_spec_unschedulable == 1))) > 10
for: 15m
labels:
  severity: warning
annotations:
  description: {{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
  summary: Some targets were not reachable from the monitoring server for an extended period of time.

name: KubeClientErrors
expr: (sum by(instance, job, namespace) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(instance, job, namespace) (rate(rest_client_requests_total[5m]))) > 0.01
for: 15m
labels:
  severity: warning
annotations:
  description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.
  summary: Kubernetes API server client is experiencing errors.

Disk Attack – ./chaosd attack disk fill -s95G -p /var/lib/containers/foo.bar

Basically a controlled dd 😀

Alerts

name: NodeFilesystemAlmostOutOfSpace
expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
for: 30m
labels:
  severity: warning
annotations:
  description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
  runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemAlmostOutOfSpace.md
  summary: Filesystem has less than 5% space left.

CPU Attack – ./chaosd attack stress cpu -w 4

Alerts

name: etcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.15
for: 10m
labels:
  severity: warning
annotations:
  description: etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.
  summary: etcd cluster member communication is slow.

name: etcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
for: 10m
labels:
  severity: warning
annotations:
  description: etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.
  summary: etcd cluster 99th percentile commit durations are too high.

name: KubePodNotReady
expr: sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0
for: 15m
labels:
  severity: warning
annotations:
  description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.
  runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubePodNotReady.md
  summary: Pod has been in a non-ready state for more than 15 minutes.

name: HighOverallControlPlaneCPU
expr: sum(100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) and on(instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) / count(kube_node_role{role="master"}) > 60
for: 10m
labels:
  namespace: openshift-kube-apiserver
  severity: warning
annotations:
  description: Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes.
  runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md
  summary: CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; a single control plane node outage may cause a cascading failure; increase available CPU.

I can confirm: under the hood it is the good old stress-ng.

Net Attack – Network Fault – Delay 3s
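
The exact invocation is not shown here; as a reference, the same 3-second delay can be expressed declaratively with a Chaos Mesh NetworkChaos resource, roughly like this (a sketch: name, namespace, selector, and duration are illustrative, and the pod-level fault is the declarative counterpart of a node-level network attack):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: net-delay-demo         # illustrative name
  namespace: chaos-testing     # illustrative namespace
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - my-workload            # illustrative target namespace
  delay:
    latency: "3s"
  duration: "5m"               # illustrative duration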

Alerts

name: TargetDown
expr: 100 * (count by(job, namespace, service) (up == 0 unless on(node) max by(node) (kube_node_spec_unschedulable == 1)) / count by(job, namespace, service) (up unless on(node) max by(node) (kube_node_spec_unschedulable == 1))) > 10
for: 15m
labels:
  severity: warning
annotations:
  description: {{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

name: KubeAPIErrorBudgetBurn
expr: sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01)
for: 1h
labels:
  long: 1d
  severity: warning
  short: 2h
annotations:
  description: The API server is burning too much error budget. This alert fires when too many requests are failing with high latency. Use the 'API Performance' monitoring dashboards to narrow down the request states and latency. The 'etcd' monitoring dashboards also provides metrics to help determine etcd stability and performance.
  summary: The API server is burning too much error budget.

Of course, these are only a subset of the alerts that were generated, but the goal of this article is to show how some Chaos Engineering tools can come in handy. Don’t forget the Deadman’s switch!
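
As a reminder, the Deadman’s switch pattern is usually built on an always-firing rule, like the Watchdog alert shipped with the in-cluster monitoring stack; a rough sketch (annotations omitted, exact wording varies by version):

name: Watchdog
expr: vector(1)
labels:
  severity: none
# This rule must always fire; route it to an external dead man's switch service,
# so that its absence (not its presence) is what pages you when the alerting pipeline breaks.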