Basic event-generation rules

The basic alerting rules are built on node-exporter metrics, so this exporter must be installed, started automatically, and working correctly.

The basic rules are grouped into the General rule group, defined in helm/alert-rules/general.yaml or docker-compose/vmalert/config/general.yaml.
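Each row in the tables below corresponds to one vmalert rule in these files. As an illustration, the Node_Down rule might look roughly like this in general.yaml (a minimal sketch; the exact label and annotation names in the shipped file may differ):

```yaml
groups:
  - name: General
    rules:
      - alert: Node_Down
        expr: up{instance=~".*:9100"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Node {{ $labels.hostname }} is possibly down'
          description: >-
            Node-exporter on {{ $labels.hostname }} ({{ $labels.instance }})
            does not respond, so the host is possibly down
```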

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| Node_Down | Node <hostname> is possibly down | `up{instance=~".*:9100"} == 0` | 5m | critical | Node-exporter on <hostname> (<instance>) does not respond, so the host is possibly down |
| Node_Reboot | Node <hostname> has been restarted | `node_time_seconds - node_boot_time_seconds < 600` | | critical | Node <hostname> (<instance>) has been restarted (uptime < 10m) |
| CPU_Utilization | High CPU utilization on <hostname> (<value>% idle) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) < 20` | 5m | warning | <hostname> (<instance>) has had high CPU utilization for more than 5 minutes |
| CPU_Utilization | Critical CPU utilization on <hostname> (<value>% idle) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) < 10` | 5m | critical | <hostname> (<instance>) has had critical CPU utilization for more than 5 minutes |
| CPU_HighIOwait | High CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound |
| Memory_Utilization | High Memory utilization on <hostname> (<value>% available) | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20` | 5m | warning | <hostname> (<instance>) has had high Memory utilization for more than 5 minutes |
| Memory_Utilization | Critical Memory utilization on <hostname> (<value>% available) | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10` | 5m | critical | <hostname> (<instance>) has had critical Memory utilization for more than 5 minutes |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost full (< 20% left) |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | Disk is almost full (< 10% left) |
| HostOutOfInodes | Host out of inodes (instance <instance>) | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostFilesystemDeviceError | Host <hostname> filesystem <mountpoint> device error | `node_filesystem_device_error == 1` | | critical | <hostname> (<instance>): Device error with the <mountpoint> filesystem |

Consolidated alerting rules for AIC

General rules (general.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| Node_Down | Node <hostname> is possibly down | `up{instance=~".*:9100"} == 0` | 5m | critical | Node-exporter on <hostname> (<instance>) does not respond, so the host is possibly down |
| Node_Reboot | Node <hostname> has been restarted | `node_time_seconds - node_boot_time_seconds < 600` | | critical | Node <hostname> (<instance>) has been restarted (uptime < 10m) |
| CPU_Utilization | High CPU utilization on <hostname> (<value>% used) | `100 - avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) > 60` | 5m | warning | <hostname> (<instance>) has had high CPU utilization for more than 5 minutes |
| CPU_Utilization | Critical CPU utilization on <hostname> (<value>% used) | `100 - avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) > 85` | 5m | critical | <hostname> (<instance>) has had critical CPU utilization for more than 5 minutes |
| CPU_System_Utilization | Critical System CPU utilization on <hostname> (<value>% used) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="system"}[1m]) * 100) > 50` | 5m | critical | <hostname> (<instance>) has had more than 50% CPU utilization at the system (kernel) level for more than 5 minutes. |
| CPU_HighIOwait | High CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound |
| CPU_CritIOwait | Critical CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 30) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | CPU iowait > 30%. A high iowait means that you are disk or network bound. |
| Memory_Utilization | High Memory utilization on <hostname> (<value>% used) | `100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 80` | 5m | warning | <hostname> (<instance>) has had high Memory utilization for more than 5 minutes |
| Memory_Utilization | Critical Memory utilization on <hostname> (<value>% used) | `100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 95` | 5m | critical | <hostname> (<instance>) has had critical Memory utilization for more than 5 minutes |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost full (< 20% left) |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | Disk is almost full (< 10% left) |
| HostOutOfInodes | Host out of inodes (instance <instance>) | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostFilesystemDeviceError | Host <hostname> filesystem <mountpoint> device error | `node_filesystem_device_error == 1` | | critical | <hostname> (<instance>): Device error with the <mountpoint> filesystem |
| NetworkInterfaceDown | Network interface <device> on <hostname> is in "Down" state | `node_network_info{operstate="down", device!~"eno[0-9]+"} == 1` | | critical | <hostname> (<instance>): network interface <device> is in "Down" state |

Rules for the virtualization subsystem (brest.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| Brest_vCPU_usage | High vCPU usage on Brest cluster <cluster>: >80% | `(one_cluster_cpuusage / one_cluster_totalcpu) * 100 > 80` | 5m | critical | vCPU usage on Brest cluster <cluster> is above 80% |
| Brest_virtualization_service_status | Service <service> on Brest virtualization server <hostname> is possibly down | `systemd_unit_state_id{name=~"libvirtd.*\|postgresql@.*\|chrony.*\|sssd.*\|opennebula.*", product="brest"} != 1` | | critical | On Brest virtualization server <hostname> (<instance>) service <service> is not running. |
| Brest_front_service_status | Service <service> on Brest front server <hostname> is possibly down | `systemd_unit_state_id{name=~"libvirtd.*\|postgresql@.*\|chrony.*\|sssd.*\|opennebula.*", product="brest"} != 1` | | critical | On Brest front server <hostname> (<instance>) service <service> is not running. |
| Brest_RAFT_status | Brest server <hostname> has an issue with RAFT status | `one_zone_raft{} == 10` | 1m | critical | Brest server <hostname> (<instance>) has an issue with RAFT status. |
| Brest_API_status | Brest server <hostname> has an issue with API connection | `one_api_connect{} != 1` | 1m | critical | Brest server <hostname> (<instance>) has an issue with API connection. |
| Brest_web_portal_status | Brest has an issue with web portal connection | `one_web_connect{} != 200` | 2m | warning | Brest has an issue with web portal <hostname> connection (using <instance> exporter) |
| Brest_web_portal_duration | On Brest server <hostname> web portal connection takes too long | `one_web_connect_duration{} >= 2000` | 5m | warning | On Brest server <hostname> (<instance>) the web portal connection takes more than 2 seconds. |
| Brest_front_host_status | Brest front server <hostname> is possibly down | `node_exporter_build_info{product="brest", component="front"} != 1` | 5m | critical | Brest front server <hostname> (<instance>) is not responding. It may be down. |
| Brest_virtualization_host_error | Brest virtualization server <hostname> is in ERROR state | `one_host_state == 3` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in ERROR state. |
| Brest_virtualization_host_init | Brest virtualization server <hostname> is in INIT state | `one_host_state == 1` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in INIT state. |
| Brest_virtualization_host_disabled | Brest virtualization server <hostname> is in DISABLED state | `one_host_state == 4` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in DISABLED state. |
| Brest_virtualization_host_offline | Brest virtualization server <hostname> is in OFFLINE state | `one_host_state == 8` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in OFFLINE state. |
| Brest_virtualization_host_monitored | Brest virtualization server <hostname> is in MONITORED state | `one_host_state == 2` | 2m | info | Brest virtualization server <hostname> (<instance>) is in MONITORED state. |
| Brest_changes_RAFT_status | On Brest server <hostname> RAFT status has changed | `sum by() (changes(one_zone_raft{}[5m])) > 0` | | warning | On Brest server <hostname> (<instance>) RAFT status has changed in the last 5 minutes. |
| Brest_new_running_VMs | A lot of new VMs on <hostname> in the last 10 minutes | `delta(sum(one_vms_states_count{})[10m:]) > 50` | | warning | More than 50 new VMs were created on <hostname> in the last 10 minutes |
| Brest_new_running_VMs | More than 500 new VMs on <hostname> in the last 10 minutes | `delta(sum(one_vms_states_count{})[10m:]) > 500` | | warning | More than 500 new VMs were created on <hostname> in the last 10 minutes |
| Brest_RAFT_issues | More than 50% of Brest fronts are in error state | `(count(one_zone_raft{} == 10) or vector(0)) / count(one_zone_raft{}) * 100 > 50` | 5m | critical | More than 50% of Brest front servers are in "Issue" RAFT status |

IPMI rules (ipmi_exporter.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| IPMI_temperature_celsius | Temperature sensor "<name>" on server <hostname> has high temperature | `ipmi_temperature_celsius >= 75` | 5m | warning | Temperature sensor "<name>" on server <hostname> has temperature <value> degrees Celsius. |
| IPMI_temperature_celsius | Temperature sensor "<name>" on server <hostname> has critical temperature | `ipmi_temperature_celsius >= 90` | 5m | critical | Temperature sensor "<name>" on server <hostname> has critical temperature <value> degrees Celsius. |
| IPMI_chassis_power_state | Chassis of server <hostname> is powered off | `ipmi_chassis_power_state != 1` | | critical | Chassis power on server <hostname> is switched off or failed |
| IPMI_collector_status | IPMI collector on server <hostname> is possibly not working | `ipmi_up != 1` | 2m | critical | Status of IPMI collector on server <hostname> is down. |
| IPMI_current_state | Problem with <name> current state on server <hostname> | `ipmi_current_state != 0` | | critical | Status of <name> current on server <hostname> is not OK |
| IPMI_voltage_state | Problem with <name> voltage sensor on server <hostname> | `ipmi_voltage_state != 0` | | critical | Status of voltage sensor <name> on server <hostname> is not OK |
| IPMI_voltage_volts | Voltage failure <name> on server <hostname> | `ipmi_voltage_volts == 0` | | critical | <name> voltage input or output on server <hostname> is 0 |

SNMP rules (snmp_exporter.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| Device_Down | Network device <hostname> <instance> is possibly down | `up{instance=~".*:9116"} == 0` | 2m | critical | Network device <hostname> <instance> does not respond via SNMP, so the host is possibly down |
| IfOperStatus | Interface <interface> on <instance> is DOWN | `(ifOperStatus * on(ifIndex, job) group_left (ifName) ifName != 1) and (ifAdminStatus * on(ifIndex, job) group_left (ifName) ifName == 1)` | | warning | Interface <interface> on <instance> is in DOWN state while AdminStatus is configured as UP |
| IfErrors | Too many errors on interface <interface> of <instance> | `(rate(ifOutErrors[1m]) * on(ifIndex, job) group_left (ifName) ifName > 0) or (rate(ifInErrors[1m]) * on(ifIndex, job) group_left (ifName) ifName > 0)` | 2m | warning | Too many errors on interface <interface> of <instance> for the last 2 minutes |
| Device_Restart | Network device <instance> was restarted | `sysUpTime / 100 < 600` | | critical | Network device <instance> was restarted in the last 10 minutes |
| If_Admin_Status | ifAdminStatus on host <hostname> is down | `ifAdminStatus{product="SNMP"} != 1` | 5m | warning | ifAdminStatus on server <hostname> (<instance>) is down. |

Rules for Tatlin storage systems (tatlin.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| TatlinDiskState | Disk <disk_id> ERROR in slot <disk_slot> on <device> | `(tatlinHwDiskState * on(tatlinHwDiskDiskId, job) group_left (tatlinHwDiskSlot) tatlinHwDiskSlot) * on (tatlinHwDiskDiskId, job) group_left (tatlinHwDiskModel) tatlinHwDiskModel != 1` | | warning | Disk <disk_id> (<disk_model>) in slot <disk_slot> on <device> is in ERROR state |
| TatlinEthDown | Network interface <port_name> on <device> is DOWN | `tatlinHwEthState != 1` | | critical | Network interface <port_name> (<sp_name>) on <device> is in DOWN state |
| TatlinSPDown | Storage processor <sp_name> on <device> is DOWN | `tatlinHwSpState != 1` | | critical | Storage processor <sp_name> on <device> is in DOWN state |

UserGate rules (usergate.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| usergate_powerSupply1Status | Power supply 1 on <hostname> is down | `usergate_powerSupply1Status != 1` | | critical | Power supply 1 on <hostname> (<instance>) is down. |
| usergate_powerSupply2Status | Power supply 2 on <hostname> is down | `usergate_powerSupply2Status != 1` | | critical | Power supply 2 on <hostname> (<instance>) is down. |
| usergate_haStatus | Usergate <hostname> HA state is changing | `changes(usergate_haStatus) > 0` | 5m | warning | HA status has been changing on Usergate <hostname> for 5 minutes |
| usergate_cpuLoad | CPU load on Usergate <hostname> is too high | `usergate_cpuLoad > 60` | 2m | warning | CPU load on Usergate <hostname> is too high (<value>) |
| usergate_raidStatus | Problem with RAID status on Usergate <hostname> | `usergate_raidStatus != 1` | | warning | Problem with RAID status on Usergate <hostname> |
| usergate_memoryUsed | Memory usage on Usergate <hostname> is too high | `usergate_memoryUsed > 60` | 2m | warning | Memory usage on Usergate <hostname> is too high (<value>) |

Node Exporter rules (node-exporter.yaml)

This is an extended set compared to the basic rules; overlaps are possible.

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| HostOutOfMemory | Host <hostname> out of memory | `(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Node memory is filling up (< 10% left) |
| HostMemoryUnderMemoryPressure | Host memory under memory pressure on <hostname> | `(rate(node_vmstat_pgmajfault[1m]) > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | The node is under heavy memory pressure. High rate of major page faults |
| HostMemoryIsUnderutilized | Host Memory is underutilized on <hostname> | `(100 - (avg_over_time(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 1w | info | Node memory is < 20% for 1 week. Consider reducing memory space. (instance <instance>) |
| HostUnusualNetworkThroughputIn | Unusual network input throughput on <hostname> | `(rate(node_network_receive_bytes_total[2m]) / 1024 / 1024 > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Host network interfaces are probably receiving too much data (> 1000 MB/s) for 5 minutes |
| HostUnusualNetworkThroughputOut | Unusual network output throughput on <hostname> | `(rate(node_network_transmit_bytes_total[2m]) / 1024 / 1024 > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Host network interfaces are probably sending too much data (> 1000 MB/s) for 5 minutes |
| HostUnusualDiskReadRate | Unusual disk read rate on <hostname> | `(rate(node_disk_read_bytes_total[2m]) / 1024 / 1024 > 200) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is probably reading too much data (> 200 MB/s) for 5 minutes |
| HostUnusualDiskWriteRate | Unusual disk write rate on <hostname> | `(rate(node_disk_written_bytes_total[2m]) / 1024 / 1024 > 200) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is probably writing too much data (> 200 MB/s) for 5 minutes |
| HostOutOfDiskSpaceWarn | Host <hostname> is nearly out of disk space | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk is almost full (< 20% left) |
| HostOutOfDiskSpaceCrit | Host <hostname> is out of disk space | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | critical | Disk is almost full (< 10% left) |
| HostDiskWillFillIn24Hours | Host disk will fill in 24 hours on <hostname> | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Filesystem is predicted to run out of space within the next 24 hours at current write rate |
| HostOutOfInodes | Host <hostname> is out of inodes | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostInodesWillFillIn24Hours | Host inodes will fill in 24 hours on <hostname> | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Filesystem is predicted to run out of inodes within the next 24 hours at current write rate |
| HostUnusualDiskReadLatency | Unusual disk read latency on <hostname> | `(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk latency is growing (read operations > 100ms) |
| HostUnusualDiskWriteLatency | Unusual disk write latency on <hostname> | `(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk latency is growing (write operations > 100ms) |
| HostHighCpuLoad | High CPU load on <hostname> | `(sum by (instance, hostname, job, group) (avg by (mode, instance, hostname) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | CPU load is > 80% |
| HostCpuIsUnderutilized | CPU is underutilized on <hostname> | `(100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 1w | warning | CPU load is < 20% for 1 week. Consider reducing the number of CPUs. |
| HostCpuStealNoisyNeighbor | CPU steal noisy neighbor on <hostname> | `(avg by(instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | CPU steal is > 10%. A noisy neighbor is killing VM performance or a spot instance may be out of credit. |
| HostCpuHighIowait | High CPU iowait on <hostname> | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound. |
| HostUnusualDiskIo | Unusual disk IO on <hostname> | `(rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Time spent in IO is too high on <hostname>. Check storage for issues. |
| HostContextSwitching | High context switching on <hostname> | `((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | Context switching is growing on the node (> 10000 / CPU / s) |
| HostSwapIsFillingUp | Host swap is filling up on <hostname> | `((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Swap is filling up (> 80%) |
| HostSystemdServiceCrashed | Service <name> crashed on <hostname> | `(node_systemd_unit_state{state="failed"} == 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | systemd service <name> has crashed |
| HostPhysicalComponentTooHot | Host physical component too hot on <hostname> | `((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node, sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Physical hardware component too hot |
| HostNodeOvertemperatureAlarm | Host node overtemperature alarm on <hostname> | `(node_hwmon_temp_crit_alarm_celsius == 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | critical | Physical node temperature alarm triggered |
| HostKernelVersionDeviations | Host kernel version deviations on <hostname> | `(count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 6h | warning | Different kernel versions are running |
| HostOomKillDetected | Host OOM kill detected on <hostname> | `(increase(node_vmstat_oom_kill[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | OOM kill detected |
| HostEdacCorrectableErrorsDetected | Host EDAC Correctable Errors detected on <hostname> | `(increase(node_edac_correctable_errors_total[5m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | info | Host <hostname> has had <value> correctable memory errors reported by EDAC in the last 5 minutes |
| HostEdacUncorrectableErrorsDetected | Host EDAC Uncorrectable Errors detected on <hostname> | `(node_edac_uncorrectable_errors_total > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | Host <hostname> has had <value> uncorrectable memory errors reported by EDAC. |
| HostNetworkReceiveErrors | Host Network Receive Errors on <hostname> | `(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Host <hostname> interface <device> has encountered <value> receive errors in the last two minutes. |
| HostNetworkTransmitErrors | Host Network Transmit Errors on <hostname> | `(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Host <hostname> interface <device> has encountered <value> transmit errors in the last two minutes. |
| HostNetworkBondDegraded | Host Network Bond Degraded on <hostname> | `((node_bonding_active - node_bonding_slaves) != 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Bond "<master>" degraded on "<hostname>" |
| HostConntrackLimit | Host conntrack limit on <hostname> | `(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 10m | warning | The number of conntrack entries is approaching the limit |
| HostClockSkew | Host clock skew on <hostname> | `((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 10m | warning | Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host. |
| HostClockNotSynchronising | Host clock not synchronising on <hostname> | `(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Clock not synchronising. Ensure NTP is configured on this host. |
| HostRequiresReboot | Host <hostname> requires reboot | `(node_reboot_required > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 4h | warning | <instance> requires a reboot. |
| HostCPUCountChanged | Host <hostname> CPU count has changed | `changes(count(count(node_cpu_seconds_total != 0) by (cpu, hostname, group, job, instance)) by (hostname, group, job, instance)) > 0` | | critical | CPU count has changed on <hostname> (<instance>) |

Vector rules (log-based detection)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
| --- | --- | --- | --- | --- | --- |
| event_megaraid_error | MEGARAID VD <device> failure | Created when the expression `kernel: megaraid_sas .* - VD (.+) is now (PARTIALLY DEGRADED\|DEGRADED\|OFFLINE)` is found in /var/log/syslog; resolved on the line `kernel: megaraid_sas .* - VD (.+) is now OPTIMAL` | | critical | RAID array state change on servers with megaraid_sas installed |
| event_aldpro_error | Not listening for new connections | Created when the expression `ERR - .*? - Not listening for new connections - too many fds open` is found in /var/log/dirsrv/slapd-<DOMAIN>/errors; resolved on the line `ERR - .*? - Listening for new connections again` | | critical | |
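These log-based events are produced by matching regular expressions against log files. A hypothetical sketch of the matching side as a Vector pipeline fragment (source and transform names are illustrative; the configuration shipped with the product may be structured differently):

```yaml
# Illustrative sketch only: tail syslog and keep lines that signal
# a degraded or offline MEGARAID virtual disk.
sources:
  syslog_file:
    type: file
    include:
      - /var/log/syslog
transforms:
  event_megaraid_error:
    type: filter
    inputs: [syslog_file]
    condition: |
      match(string!(.message), r'kernel: megaraid_sas .* - VD (.+) is now (PARTIALLY DEGRADED|DEGRADED|OFFLINE)')
```

The matching events would then be routed to a sink that raises the alert; the resolve regex from the table would be handled by a symmetric transform.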