## Basic event rules

The basic alerting rules are built on node-exporter metrics, so this exporter must be installed, start automatically, and work correctly.

The basic rules are grouped into the General rule group, defined in the file helm/alert-rules/general.yaml or docker-compose/vmalert/config/general.yaml.
| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| Node_Down | Node <hostname> is possibly down | `up{instance=~".*:9100"} == 0` | 5m | critical | Node-exporter on <hostname> (<instance>) does not respond, so the host is possibly down |
| Node_Reboot | Node <hostname> has been restarted | `node_time_seconds - node_boot_time_seconds < 600` | | critical | Node <hostname> (<instance>) has been restarted (uptime < 10m) |
| CPU_Utilization | High CPU utilization on <hostname> (<value>% idle) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) < 20` | 5m | warning | <hostname> (<instance>) has high CPU utilization for more than 5 minutes |
| CPU_Utilization | Critical CPU utilization on <hostname> (<value>% idle) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) < 10` | 5m | critical | <hostname> (<instance>) has critical CPU utilization for more than 5 minutes |
| CPU_HighIOwait | High CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound |
| Memory_Utilization | High Memory utilization on <hostname> (<value>% available) | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20` | 5m | warning | <hostname> (<instance>) has high memory utilization for more than 5 minutes |
| Memory_Utilization | Critical Memory utilization on <hostname> (<value>% available) | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10` | 5m | critical | <hostname> (<instance>) has critical memory utilization for more than 5 minutes |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost full (< 20% left) |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | Disk is almost full (< 10% left) |
| HostOutOfInodes | Host out of inodes (instance <instance>) | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostFilesystemDeviceError | Host <hostname> filesystem <mountpoint> device error | `node_filesystem_device_error == 1` | | critical | <hostname> (<instance>): device error with the <mountpoint> filesystem |
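Each row of the table maps onto one vmalert rule definition. As a minimal sketch (the exact label and annotation layout of the actual general.yaml is an assumption; the values come from the Node_Down row above):

```yaml
# Sketch of the Node_Down row as a vmalert/Prometheus-style alerting rule.
# expr, for, severity, and texts are taken from the table; the annotation
# names and layout are illustrative, not the project's verbatim file.
groups:
  - name: General
    rules:
      - alert: Node_Down
        expr: up{instance=~".*:9100"} == 0
        for: 5m                     # "Minimum duration" column
        labels:
          severity: critical       # "Severity" column
        annotations:
          summary: 'Node {{ $labels.hostname }} is possibly down'
          description: >-
            Node-exporter on {{ $labels.hostname }} ({{ $labels.instance }})
            does not respond, so the host is possibly down
```

Rows with an empty "Minimum duration" cell (such as Node_Reboot) simply omit the `for:` clause, so the alert fires on the first matching evaluation.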
## Consolidated alerting rules for AIC
### General rules (general.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| Node_Down | Node <hostname> is possibly down | `up{instance=~".*:9100"} == 0` | 5m | critical | Node-exporter on <hostname> (<instance>) does not respond, so the host is possibly down |
| Node_Reboot | Node <hostname> has been restarted | `node_time_seconds - node_boot_time_seconds < 600` | | critical | Node <hostname> (<instance>) has been restarted (uptime < 10m) |
| CPU_Utilization | High CPU utilization on <hostname> (<value>% used) | `100 - avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) > 60` | 5m | warning | <hostname> (<instance>) has high CPU utilization for more than 5 minutes |
| CPU_Utilization | Critical CPU utilization on <hostname> (<value>% used) | `100 - avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="idle"}[1m]) * 100) > 85` | 5m | critical | <hostname> (<instance>) has critical CPU utilization for more than 5 minutes |
| CPU_System_Utilization | Critical system CPU utilization on <hostname> (<value>% used) | `avg by (hostname, instance, job, group) (irate(node_cpu_seconds_total{mode="system"}[1m]) * 100) > 50` | 5m | critical | <hostname> (<instance>) has more than 50% CPU utilization at the system (kernel) level for more than 5 minutes. |
| CPU_HighIOwait | High CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound |
| CPU_CritIOwait | Critical CPU iowait on <hostname> (<value>%) | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[1m])) * 100 > 30) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | CPU iowait > 30%. A high iowait means that you are disk or network bound. |
| Memory_Utilization | High Memory utilization on <hostname> (<value>% used) | `100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 80` | 5m | warning | <hostname> (<instance>) has high memory utilization for more than 5 minutes |
| Memory_Utilization | Critical Memory utilization on <hostname> (<value>% used) | `100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 95` | 5m | critical | <hostname> (<instance>) has critical memory utilization for more than 5 minutes |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost full (< 20% left) |
| DiskSpace_Utilization | Host out of disk space (instance <instance>) | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | critical | Disk is almost full (< 10% left) |
| HostOutOfInodes | Host out of inodes (instance <instance>) | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostFilesystemDeviceError | Host <hostname> filesystem <mountpoint> device error | `node_filesystem_device_error == 1` | | critical | <hostname> (<instance>): device error with the <mountpoint> filesystem |
| NetworkInterfaceDown | Network interface <device> on <hostname> is in "Down" state | `node_network_info{operstate="down", device!~"eno[0-9]+"} == 1` | | critical | <hostname> (<instance>): network interface <device> is in "Down" state |
### Virtualization subsystem rules (brest.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| Brest_vCPU_usage | High vCPU usage on Brest cluster <cluster>: >80% | `(one_cluster_cpuusage / one_cluster_totalcpu) * 100 > 80` | 5m | critical | vCPU usage on Brest cluster <cluster> is above 80% of the cluster total |
| Brest_virtualization_service_status | Service <service> on Brest virtualization server <hostname> is possibly down | `systemd_unit_state_id{name=~"libvirtd.*\|postgresql@.*\|chrony.*\|sssd.*\|opennebula.*", product="brest"} != 1` | | critical | On Brest virtualization server <hostname> (<instance>) service <service> is not running. |
| Brest_front_service_status | Service <service> on Brest front server <hostname> is possibly down | `systemd_unit_state_id{name=~"libvirtd.*\|postgresql@.*\|chrony.*\|sssd.*\|opennebula.*", product="brest"} != 1` | | critical | On Brest front server <hostname> (<instance>) service <service> is not running. |
| Brest_RAFT_status | Brest server <hostname> has an issue with RAFT status | `one_zone_raft{} == 10` | 1m | critical | Brest server <hostname> (<instance>) has an issue with RAFT status. |
| Brest_API_status | Brest server <hostname> has an issue with API connection | `one_api_connect{} != 1` | 1m | critical | Brest server <hostname> (<instance>) has an issue with API connection. |
| Brest_web_portal_status | Brest has an issue with web portal connection | `one_web_connect{} != 200` | 2m | warning | Brest has an issue with web portal <hostname> connection (using <instance> exporter) |
| Brest_web_portal_duration | Web portal connection time on Brest server <hostname> is too long | `one_web_connect_duration{} >= 2000` | 5m | warning | On Brest server <hostname> (<instance>) the web portal connection takes more than 2 seconds. |
| Brest_front_host_status | Brest front server <hostname> is possibly down | `node_exporter_build_info{product="brest", component="front"} != 1` | 5m | critical | Brest front server <hostname> (<instance>) is not responding. It may be down. |
| Brest_virtualization_host_error | Brest virtualization server <hostname> is in ERROR state | `one_host_state == 3` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in ERROR state. |
| Brest_virtualization_host_init | Brest virtualization server <hostname> is in INIT state | `one_host_state == 1` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in INIT state. |
| Brest_virtualization_host_disabled | Brest virtualization server <hostname> is in DISABLED state | `one_host_state == 4` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in DISABLED state. |
| Brest_virtualization_host_offline | Brest virtualization server <hostname> is in OFFLINE state | `one_host_state == 8` | 2m | critical | Brest virtualization server <hostname> (<instance>) is in OFFLINE state. |
| Brest_virtualization_host_monitored | Brest virtualization server <hostname> is in MONITORED state | `one_host_state == 2` | 2m | info | Brest virtualization server <hostname> (<instance>) is in MONITORED state. |
| Brest_changes_RAFT_status | RAFT status on Brest server <hostname> has changed | `sum by() (changes(one_zone_raft{}[5m])) > 0` | | warning | RAFT status on Brest server <hostname> (<instance>) has changed in the last 5 minutes. |
| Brest_new_running_VMs | More than 50 new VMs on <hostname> in the last 10 minutes | `delta(sum(one_vms_states_count{}))[10m] > 50` | | warning | More than 50 new VMs were created on <hostname> in the last 10 minutes |
| Brest_new_running_VMs | More than 500 new VMs on <hostname> in the last 10 minutes | `delta(sum(one_vms_states_count{}))[10m] > 500` | | warning | More than 500 new VMs were created on <hostname> in the last 10 minutes |
| Brest_RAFT_issues | More than 50% of Brest fronts are in error state | `(count(one_zone_raft{} == 10) or vector(0)) / count(one_zone_raft{}) * 100 > 50` | 5m | critical | More than 50% of Brest front servers are in "Issue" RAFT status |
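The five Brest_virtualization_host_* rows differ only in the state code compared against `one_host_state`. A sketch of the ERROR variant, with the same caveat as before (annotation layout is illustrative, taken from the table rather than the project's verbatim brest.yaml):

```yaml
# Sketch of Brest_virtualization_host_error. Per the table above, the
# sibling rules swap only the constant: 1 = INIT, 4 = DISABLED,
# 8 = OFFLINE, and 2 = MONITORED (severity info instead of critical).
- alert: Brest_virtualization_host_error
  expr: one_host_state == 3
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: 'Brest virtualization server {{ $labels.hostname }} is in ERROR state'
```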
### IPMI rules (ipmi_exporter.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| IPMI_temperature_celsius | Temperature sensor "<name>" on server <hostname> has high temperature | `ipmi_temperature_celsius >= 75` | 5m | warning | Temperature sensor "<name>" on server <hostname> reports <value> degrees Celsius. |
| IPMI_temperature_celsius | Temperature sensor "<name>" on server <hostname> has critical temperature | `ipmi_temperature_celsius >= 90` | 5m | critical | Temperature sensor "<name>" on server <hostname> reports critical temperature <value> degrees Celsius. |
| IPMI_chassis_power_state | Chassis of server <hostname> is powered off | `ipmi_chassis_power_state != 1` | | critical | Chassis power on server <hostname> is switched off or failed |
| IPMI_collector_status | IPMI collector on server <hostname> is possibly not working | `ipmi_up != 1` | 2m | critical | The IPMI collector on server <hostname> is down. |
| IPMI_current_state | Problem with current sensor <name> on server <hostname> | `ipmi_current_state != 0` | | critical | Status of current sensor <name> on server <hostname> is not OK |
| IPMI_voltage_state | Problem with voltage sensor <name> on server <hostname> | `ipmi_voltage_state != 0` | | critical | Status of voltage sensor <name> on server <hostname> is not OK |
| IPMI_voltage_volts | Voltage failure <name> on server <hostname> | `ipmi_voltage_volts == 0` | | critical | <name> voltage input or output on server <hostname> is 0 |
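The two temperature rows form a warning/critical pair over the same metric: the alert name is shared and only the threshold and severity differ. A sketch under the same assumption as above (annotation wording is illustrative):

```yaml
# Sketch of the IPMI temperature warning/critical pair from the table.
# Both rules watch ipmi_temperature_celsius; vmalert evaluates them
# independently, so the critical alert fires alongside the warning one.
- alert: IPMI_temperature_celsius
  expr: ipmi_temperature_celsius >= 75
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'Temperature sensor "{{ $labels.name }}" on server {{ $labels.hostname }} has high temperature'
- alert: IPMI_temperature_celsius
  expr: ipmi_temperature_celsius >= 90
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Temperature sensor "{{ $labels.name }}" on server {{ $labels.hostname }} has critical temperature'
```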
### SNMP rules (snmp_exporter.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| Device_Down | Network device <hostname> <instance> is possibly down | `up{instance=~".*:9116"} == 0` | 2m | critical | Network device <hostname> <instance> does not respond over SNMP, so the host is possibly down |
| IfOperStatus | Interface <interface> on <instance> is DOWN | `(ifOperStatus * on(ifIndex, job) group_left (ifName) ifName != 1) and (ifAdminStatus * on(ifIndex, job) group_left (ifName) ifName == 1)` | | warning | Interface <interface> on <instance> is in DOWN state while AdminState is configured as UP |
| IfErrors | Too many errors on interface <interface> of <instance> | `(rate(ifOutErrors[1m]) * on(ifIndex, job) group_left (ifName) ifName > 0) or (rate(ifInErrors[1m]) * on(ifIndex, job) group_left (ifName) ifName > 0)` | 2m | warning | Too many errors on interface <interface> of <instance> in the last 2 minutes |
| Device_Restart | Network device <instance> was restarted | `sysUpTime / 100 < 600` | | critical | Network device <instance> was restarted in the last 10 minutes |
| If_Admin_Status | ifAdminStatus on host <hostname> is down | `ifAdminStatus{product="SNMP"} != 1` | 5m | warning | ifAdminStatus on server <hostname> (<instance>) is down. |
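Several conditions in this and the following tables use a one-to-many vector match to copy a human-readable label (here `ifName`) onto the alerting series so it can be used in annotations. A sketch of the IfOperStatus row (annotation wording is an assumption):

```yaml
# Sketch of IfOperStatus. The "* on(ifIndex, job) group_left (ifName) ifName"
# join multiplies by the ifName info series and, via group_left, copies its
# ifName label onto the result, making {{ $labels.ifName }} available below.
- alert: IfOperStatus
  expr: >-
    (ifOperStatus * on(ifIndex, job) group_left (ifName) ifName != 1)
    and
    (ifAdminStatus * on(ifIndex, job) group_left (ifName) ifName == 1)
  labels:
    severity: warning
  annotations:
    summary: 'Interface {{ $labels.ifName }} on {{ $labels.instance }} is DOWN'
```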
### Tatlin storage rules (tatlin.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| TatlinDiskState | Disk <disk_id> ERROR in slot <disk_slot> on <device> | `(tatlinHwDiskState * on(tatlinHwDiskDiskId, job) group_left (tatlinHwDiskSlot) tatlinHwDiskSlot) * on (tatlinHwDiskDiskId, job) group_left (tatlinHwDiskModel) tatlinHwDiskModel != 1` | | warning | Disk <disk_id> (<disk_model>) in slot <disk_slot> on <device> is in ERROR state |
| TatlinEthDown | Network interface <port_name> on <device> is DOWN | `tatlinHwEthState != 1` | | critical | Network interface <port_name> (<sp_name>) on <device> is in DOWN state |
| TatlinSPDown | Storage processor <sp_name> on <device> is DOWN | `tatlinHwSpState != 1` | | critical | Storage processor <sp_name> on <device> is in DOWN state |
### UserGate rules (usergate.yaml)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| usergate_powerSupply1Status | Power supply 1 on <hostname> is down | `usergate_powerSupply1Status != 1` | | critical | Power supply 1 on <hostname> (<instance>) is down. |
| usergate_powerSupply2Status | Power supply 2 on <hostname> is down | `usergate_powerSupply2Status != 1` | | critical | Power supply 2 on <hostname> (<instance>) is down. |
| usergate_haStatus | UserGate <hostname> HA state is changing | `changes(usergate_haStatus) > 0` | 5m | warning | HA status has been changing on UserGate <hostname> for 5 minutes |
| usergate_cpuLoad | CPU load on UserGate <hostname> is too high | `usergate_cpuLoad > 60` | 2m | warning | CPU load on UserGate <hostname> is too high (<value>) |
| usergate_raidStatus | Problem with RAID status on UserGate <hostname> | `usergate_raidStatus != 1` | | warning | Problem with RAID status on UserGate <hostname> |
| usergate_memoryUsed | Memory usage on UserGate <hostname> is too high | `usergate_memoryUsed > 60` | 2m | warning | Memory usage on UserGate <hostname> is too high (<value>) |
### Node Exporter rules (node-exporter.yaml)

An extended set compared to the basic rules; overlaps are possible.

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| HostOutOfMemory | Host <hostname> out of memory | `(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Node memory is filling up (< 10% left) |
| HostMemoryUnderMemoryPressure | Host memory under memory pressure on <hostname> | `(rate(node_vmstat_pgmajfault[1m]) > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | The node is under heavy memory pressure. High rate of major page faults |
| HostMemoryIsUnderutilized | Host memory is underutilized on <hostname> | `(100 - (avg_over_time(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 1w | info | Node memory usage has been < 20% for 1 week. Consider reducing the allocated memory. (instance <instance>) |
| HostUnusualNetworkThroughputIn | Unusual network input throughput on <hostname> | `(rate(node_network_receive_bytes_total[2m]) / 1024 / 1024 > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Host network interfaces are probably receiving too much data (> 1000 MB/s) for 5 minutes |
| HostUnusualNetworkThroughputOut | Unusual network output throughput on <hostname> | `(rate(node_network_transmit_bytes_total[2m]) / 1024 / 1024 > 1000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Host network interfaces are probably sending too much data (> 1000 MB/s) for 5 minutes |
| HostUnusualDiskReadRate | Unusual disk read rate on <hostname> | `(rate(node_disk_read_bytes_total[2m]) / 1024 / 1024 > 200) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is probably reading too much data (> 200 MB/s) for 5 minutes |
| HostUnusualDiskWriteRate | Unusual disk write rate on <hostname> | `(rate(node_disk_written_bytes_total[2m]) / 1024 / 1024 > 200) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 5m | warning | Disk is probably writing too much data (> 200 MB/s) for 5 minutes |
| HostOutOfDiskSpaceWarn | Host <hostname> is nearly out of disk space | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk is almost full (< 20% left) |
| HostOutOfDiskSpaceCrit | Host <hostname> is out of disk space | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | critical | Disk is almost full (< 10% left) |
| HostDiskWillFillIn24Hours | Host disk will fill in 24 hours on <hostname> | `((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Filesystem is predicted to run out of space within the next 24 hours at the current write rate |
| HostOutOfInodes | Host <hostname> is out of inodes | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk is almost running out of available inodes (< 10% left) |
| HostInodesWillFillIn24Hours | Host inodes will fill in 24 hours on <hostname> | `(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Filesystem is predicted to run out of inodes within the next 24 hours at the current write rate |
| HostUnusualDiskReadLatency | Unusual disk read latency on <hostname> | `(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk latency is growing (read operations > 100ms) |
| HostUnusualDiskWriteLatency | Unusual disk write latency on <hostname> | `(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Disk latency is growing (write operations > 100ms) |
| HostHighCpuLoad | High CPU load on <hostname> | `(sum by (instance, hostname, job, group) (avg by (mode, instance, hostname) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | CPU load is > 80% |
| HostCpuIsUnderutilized | CPU is underutilized on <hostname> | `(100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 1w | warning | CPU load has been < 20% for 1 week. Consider reducing the number of CPUs. |
| HostCpuStealNoisyNeighbor | CPU steal noisy neighbor on <hostname> | `(avg by(instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | CPU steal is > 10%. A noisy neighbor is killing VM performance or a spot instance may be out of credit. |
| HostCpuHighIowait | High CPU iowait on <hostname> | `(avg by (instance, hostname, job, group) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | CPU iowait > 10%. A high iowait means that you are disk or network bound. |
| HostUnusualDiskIo | Unusual disk IO on <hostname> | `(rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Time spent in IO is too high on <hostname>. Check storage for issues. |
| HostContextSwitching | High context switching on <hostname> | `((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | Context switching is growing on the node (> 10000 / CPU / s) |
| HostSwapIsFillingUp | Host swap is filling up on <hostname> | `((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Swap is filling up (> 80%) |
| HostSystemdServiceCrashed | Service <name> crashed on <hostname> | `(node_systemd_unit_state{state="failed"} == 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | systemd service <name> has crashed |
| HostPhysicalComponentTooHot | Host physical component too hot on <hostname> | `((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node, sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Physical hardware component too hot |
| HostNodeOvertemperatureAlarm | Host node overtemperature alarm on <hostname> | `(node_hwmon_temp_crit_alarm_celsius == 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | critical | Physical node temperature alarm triggered |
| HostKernelVersionDeviations | Host kernel version deviations on <hostname> | `(count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 6h | warning | Different kernel versions are running |
| HostOomKillDetected | Host OOM kill detected on <hostname> | `(increase(node_vmstat_oom_kill[1m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | OOM kill detected |
| HostEdacCorrectableErrorsDetected | Host EDAC correctable errors detected on <hostname> | `(increase(node_edac_correctable_errors_total[5m]) > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | info | Host <hostname> has had <value> correctable memory errors reported by EDAC in the last 5 minutes |
| HostEdacUncorrectableErrorsDetected | Host EDAC uncorrectable errors detected on <hostname> | `(node_edac_uncorrectable_errors_total > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | | warning | Host <hostname> has had <value> uncorrectable memory errors reported by EDAC. |
| HostNetworkReceiveErrors | Host network receive errors on <hostname> | `(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Host <hostname> interface <device> has encountered <value> receive errors in the last two minutes. |
| HostNetworkTransmitErrors | Host network transmit errors on <hostname> | `(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Host <hostname> interface <device> has encountered <value> transmit errors in the last two minutes. |
| HostNetworkBondDegraded | Host network bond degraded on <hostname> | `((node_bonding_active - node_bonding_slaves) != 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Bond "<master>" degraded on "<hostname>" |
| HostConntrackLimit | Host conntrack limit on <hostname> | `(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 10m | warning | The number of conntrack entries is approaching the limit |
| HostClockSkew | Host clock skew on <hostname> | `((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 10m | warning | Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host. |
| HostClockNotSynchronising | Host clock not synchronising on <hostname> | `(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 2m | warning | Clock not synchronising. Ensure NTP is configured on this host. |
| HostRequiresReboot | Host <hostname> requires reboot | `(node_reboot_required > 0) * on(instance, hostname, job, group) group_left (nodename) node_uname_info{nodename=~".+"}` | 4h | warning | <instance> requires a reboot. |
| HostCPUCountChanged | Host <hostname> CPU count has changed | `changes(count(count(node_cpu_seconds_total != 0) by (cpu, hostname, group, job, instance)) by (hostname, group, job, instance)) > 0` | | critical | CPU count has changed on <hostname> (<instance>) |
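The two *WillFillIn24Hours rows combine a static threshold with a linear forecast: `predict_linear()` fits a regression over the last hour of samples and extrapolates 24 hours ahead, alerting only if the extrapolated value drops below zero. A sketch of the disk-space variant (annotation layout omitted; labels are as in the table):

```yaml
# Sketch of HostDiskWillFillIn24Hours. predict_linear(v[1h], 24 * 3600)
# evaluates the fitted line 24 h in the future; < 0 means the filesystem
# is forecast to run out of space at the current write rate.
- alert: HostDiskWillFillIn24Hours
  expr: >-
    ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
      and ON (instance, device, mountpoint)
        predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0
      and ON (instance, device, mountpoint) node_filesystem_readonly == 0)
    * on(instance, hostname, job, group) group_left (nodename)
      node_uname_info{nodename=~".+"}
  for: 2m
  labels:
    severity: warning
```

Combining the forecast with the `< 10%` threshold keeps the rule quiet on large, mostly empty filesystems whose trend line is noisy.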
### Vector rules (log-based detection)

| Event name | Summary | Trigger condition | Minimum duration | Severity | Description |
|---|---|---|---|---|---|
| event_megaraid_error | MEGARAID VD <device> failure | Raised when the pattern `kernel: megaraid_sas .* - VD (.+) is now (PARTIALLY DEGRADED\|DEGRADED\|OFFLINE)` is found in /var/log/syslog; resolved on the line `kernel: megaraid_sas .* - VD (.+) is now OPTIMAL` | | critical | RAID array state change on servers with megaraid_sas installed |
| event_aldpro_error | Not listening for new connections | Raised when the pattern `ERR - .*? - Not listening for new connections - too many fds open` is found in /var/log/dirsrv/slapd-<DOMAIN>/errors; resolved on the line `ERR - .*? - Listening for new connections again` | | critical | |
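Unlike the previous groups, these rules are log pattern matches rather than PromQL expressions. A hedged sketch of how the megaraid pattern could be matched in a Vector pipeline; the component names (`syslog_in`, `megaraid_error`) and overall layout are assumptions for illustration, not the project's actual Vector configuration:

```yaml
# Illustrative only: a Vector "filter" transform that keeps syslog lines
# matching the megaraid degradation pattern from the table above.
sources:
  syslog_in:
    type: file
    include:
      - /var/log/syslog
transforms:
  megaraid_error:
    type: filter
    inputs:
      - syslog_in
    condition: >-
      match(string!(.message),
        r'kernel: megaraid_sas .* - VD (.+) is now (PARTIALLY DEGRADED|DEGRADED|OFFLINE)')
```

The recovery line from the table (`... is now OPTIMAL`) would be handled by a sibling transform that clears the event.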