Nagios Server
=============

Description
-----------

The `nagios_server` role installs and configures the [Nagios](https://www.nagios.org/)
monitoring system. The check configuration is generated from the Ansible inventory
using various `nagios_`-prefixed host variables.

This role does not install a webserver; it configures the Nagios application only.

The config file templates are tightly coupled to the group names in the [example inventory](../../inventory-example/).
If you change the group names, then you'll also need to modify the [object templates](templates/etc/nagios/objects/).

### Why Nagios?

I admit, Nagios is long in the tooth, and the interface leaves much to be
desired. In my view, it has three major advantages:

  - You can just `dnf install nagios`, and you're ready to go. No supporting
    infrastructure required.

  - The configuration syntax is extremely simple, making it easy to
    automatically generate the config files.

  - Extending it with your own plugins is trivial: you just write a script that
    exits with `0` (OK), `1` (warning), or `2` (critical).

You _can_ use Nagios for metrics gathering, but it's not well-suited to the task.
In this project, it's used purely for health checks.

I would have preferred to use [Icinga](https://icinga.com/) for its slick
interface, but they sadly put RPM packages behind a paywall recently.


Variables
---------

This role **accepts** the following variables:

Variable                              | Default                   | Description
--------------------------------------|---------------------------|------------
`nagios_admin_email`                  | `root@{{ email_domain }}` | Administrator's email address
`nagios_admin_pager`                  | `root@{{ email_domain }}` | Administrator's "pager" (not really used)
`nagios_access_group`                 | `role-nagios-access`      | FreeIPA group of users allowed to access the web interface (will be created)
`nagios_email`                        | `root@{{ email_domain }}` | Default contact email for alerts
`nagios_reboot_window`                | `03:00-05:00`             | Daily [Time Period](https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#timeperiod) for host reboots
`nagios_ssh_privkey`                  |                           | SSH private key for `nagios` user
`nagios_excluded_groups`              | `[]`                      | List of Ansible group names to exclude from checks
`nagios_ssh_control_persist`          | `20m`                     | Timeout of persistent SSH connection
`nagios_snmp_max_size`                | 10000                     | Maximum size of SNMP responses (bytes)
`nagios_manubulon_version`            | `master`                  | Git version of [Manubulon](https://github.com/SteScho/manubulon-snmp) to install
`nagios_check_dns`                    | `[]`                      | DNS checks to perform (see [format](#nagios_check_dns) below)
`nagios_connectivity_check_host`      | `8.8.8.8`                 | Host to use for upstream connectivity check
`nagios_connectivity_check_count`     | 20                        | Number of ICMP packets to use for connectivity check
`nagios_connectivity_check_rtt_warn`  | 50.0                      | Round-trip-time warning threshold for connectivity check (ms)
`nagios_connectivity_check_rtt_crit`  | 100.0                     | Round-trip-time critical threshold for connectivity check (ms)
`nagios_connectivity_check_loss_warn` | `5%`                      | Packet loss warning threshold for connectivity check
`nagios_connectivity_check_loss_crit` | `20%`                     | Packet loss critical threshold for connectivity check
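
As a rough illustration, a few of these role variables could be overridden in the inventory or in the play's `vars`. The values and the `workstations` group name below are placeholders chosen for this sketch, not recommendations from the role:

````yaml
# Hypothetical overrides; all values are illustrative assumptions.
nagios_admin_email: admin@example.com
nagios_reboot_window: '02:00-04:00'
nagios_excluded_groups:
  - workstations            # assumed group name, not from the example inventory
nagios_connectivity_check_host: 1.1.1.1
nagios_connectivity_check_rtt_warn: 80.0    # ms
nagios_connectivity_check_rtt_crit: 150.0   # ms
````

Any variable left unset keeps the default shown in the table above.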


### nagios\_check\_dns

The `nagios_check_dns` variable lists DNS checks to perform. It should contain
a list of dictionaries of the following format:

Variable | Default | Description
---------|---------|------------
`name`   |         | FQDN to query
`qtype`  | `A`     | Query type
`server` |         | Upstream DNS server to query
`expect` |         | Expected response

### Exports

This role **exports** the following variables:

Variable               | Description
-----------------------|------------
`nagios_html_dir`      | Nagios webroot path
`nagios_apache_config` | Apache config block for Nagios CGI application


Host Variables
--------------

In addition to variables for the `nagios_server` role itself, you can set various
`nagios_`-prefixed hostvars to influence the check behavior for each host.
Defaults for these host-specific variables are set in [group\_vars/all/nagios.yml](../../inventory-example/group_vars/all/nagios.yml)
in the example inventory.

Variable                          | Description
----------------------------------|------------
`nagios_snmp_user`                | SNMPv3 username
`nagios_snmp_community`           | SNMP community string
`nagios_snmp_auth_proto`          | SNMPv3 authentication protocol
`nagios_snmp_priv_proto`          | SNMPv3 encryption protocol
`nagios_snmp_auth_pass`           | SNMPv3 authentication password
`nagios_snmp_priv_pass`           | SNMPv3 encryption password
`nagios_ping_count`               | ICMP packet count for `hostalive` check
`nagios_ping_rtt_warn`            | Round-trip time warning threshold for `hostalive` check
`nagios_ping_rtt_crit`            | Round-trip time critical threshold for `hostalive` check
`nagios_ping_loss_warn`           | Packet loss warning threshold for `hostalive` check
`nagios_ping_loss_crit`           | Packet loss critical threshold for `hostalive` check
`nagios_temp_warn`                | Temperature warning threshold (C)
`nagios_temp_crit`                | Temperature critical threshold (C)
`nagios_power_draw_warn`          | Power draw warning threshold (%)
`nagios_power_draw_crit`          | Power draw critical threshold (%)
`nagios_load_1m_warn`             | 1m load average (warn)
`nagios_load_5m_warn`             | 5m load average (warn)
`nagios_load_15m_warn`            | 15m load average (warn)
`nagios_load_1m_crit`             | 1m load average (crit)
`nagios_load_5m_crit`             | 5m load average (crit)
`nagios_load_15m_crit`            | 15m load average (crit)
`nagios_mem_warn`                 | Memory usage warning threshold (%)
`nagios_mem_crit`                 | Memory usage critical threshold (%)
`nagios_swap_warn`                | Swap usage warning threshold (%)
`nagios_swap_crit`                | Swap usage critical threshold (%)
`nagios_interface_bandwidth_warn` | Interface bandwidth warning threshold (Mbps)
`nagios_interface_bandwidth_crit` | Interface bandwidth critical threshold (Mbps)
`nagios_interface_discard_warn`   | Interface discards warning threshold (per second)
`nagios_interface_discard_crit`   | Interface discards critical threshold (per second)
`nagios_interface_error_warn`     | Interface errors warning threshold (per second)
`nagios_interface_error_crit`     | Interface errors critical threshold (per second)
`nagios_interfaces`               | Per-interface threshold overrides (see [format](#nagios_interfaces) below)
`nagios_disk_warn`                | Disk usage warning threshold (%)
`nagios_disk_crit`                | Disk usage critical threshold (%)
`nagios_disks`                    | Per-filesystem threshold overrides (see [format](#nagios_disks) below)
`nagios_certificate_warn`         | Certificate validity days remaining (warning)
`nagios_certificate_crit`         | Certificate validity days remaining (critical)
`nagios_smtp_warn`                | SMTP response time warning threshold (seconds)
`nagios_smtp_crit`                | SMTP response time critical threshold (seconds)
`nagios_mailq_warn`               | Mail queue warning size
`nagios_mailq_crit`               | Mail queue critical size
`nagios_imap_warn`                | IMAP response time warning threshold (seconds)
`nagios_imap_crit`                | IMAP response time critical threshold (seconds)
`nagios_http_warn`                | HTTP response time warning threshold (seconds)
`nagios_http_crit`                | HTTP response time critical threshold (seconds)
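
For example, a busy host might get looser thresholds than the inventory-wide defaults. This is only a sketch: the file name, hostname, and numbers are assumptions, and the units follow the table above:

````yaml
# host_vars/db1.example.com.yml (hypothetical host)
nagios_load_1m_warn: 8
nagios_load_5m_warn: 6
nagios_load_15m_warn: 4
nagios_load_1m_crit: 16
nagios_load_5m_crit: 12
nagios_load_15m_crit: 8

nagios_mem_warn: 85          # percent
nagios_mem_crit: 95          # percent

nagios_certificate_warn: 14  # days remaining
nagios_certificate_crit: 7   # days remaining
````

The same variables can be set at the group level instead, relying on normal Ansible variable precedence.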


### nagios\_interfaces

The `nagios_interfaces` variable is used to specify check thresholds for each
network interface independently. It should contain a list of dictionaries of
the following format:

Variable         | Default                                 | Description
-----------------|-----------------------------------------|-----------
`name`           |                                         | Interface name
`regex`          |                                         | Regular expression matching one or more interfaces
`description`    | `interface name`                        | Nagios check name
`down_ok`        | no                                      | Don't alert when interface is down
`bandwidth_warn` | `{{ nagios_interface_bandwidth_warn }}` | Bandwidth warning threshold (Mbps)
`bandwidth_crit` | `{{ nagios_interface_bandwidth_crit }}` | Bandwidth critical threshold (Mbps)
`discard_warn`   | `{{ nagios_interface_discard_warn }}`   | Discard warning threshold (per second)
`discard_crit`   | `{{ nagios_interface_discard_crit }}`   | Discard critical threshold (per second)
`error_warn`     | `{{ nagios_interface_error_warn }}`     | Error warning threshold (per second)
`error_crit`     | `{{ nagios_interface_error_crit }}`     | Error critical threshold (per second)

The `nagios_interfaces` variable can also contain a simple list of interface
names, in which case the default check thresholds will be used.
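
Both forms might look roughly like the following. The interface names, descriptions, and thresholds are illustrative, and the sketch assumes each entry uses either `name` or `regex` (the exact regex dialect depends on the underlying check plugin):

````yaml
# Dictionary form: per-interface overrides
nagios_interfaces:
  - name: eth0
    description: uplink
    bandwidth_warn: 800   # Mbps
    bandwidth_crit: 950   # Mbps

  - regex: '^vlan[0-9]+$'
    down_ok: yes
````

````yaml
# Simple list form: default thresholds for every listed interface
nagios_interfaces:
  - eth0
  - eth1
````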


### nagios\_disks

The `nagios_disks` variable is used to specify check thresholds for each
filesystem independently. It should contain a list of dictionaries of
the following format:

Variable      | Default                  | Description
--------------|--------------------------|------------
`path`        |                          | Path of the disk's mountpoint
`regex`       |                          | Regular expression matching one or more mountpoints
`description` | `mount path`             | Nagios check name
`exclude`     | no                       | Treat mountpoint as exclusion pattern
`terse`       | no                       | Use shorter check output
`warn`        | `{{ nagios_disk_warn }}` | Disk usage warning threshold (%)
`crit`        | `{{ nagios_disk_crit }}` | Disk usage critical threshold (%)

The `nagios_disks` variable can also contain a simple list of mountpoints, in
which case the default check thresholds will be used.
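
A sketch of both forms, with made-up mountpoints and thresholds. It assumes each entry uses either `path` or `regex`, and that `exclude: yes` on a `regex` entry removes the matching mountpoints from the check:

````yaml
# Dictionary form: per-filesystem overrides
nagios_disks:
  - path: /
    warn: 80   # percent
    crit: 90   # percent

  - path: /var/lib/pgsql
    description: database
    warn: 70
    crit: 85

  - regex: '^/mnt/backup'
    exclude: yes
````

````yaml
# Simple list form: default thresholds for every listed mountpoint
nagios_disks:
  - /
  - /home
````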


Usage
-----

Example playbook:

````yaml
- hosts: nagios_servers
  roles:
    - role: nagios_server
      vars:
        nagios_check_dns:
          - name: example.com
            qtype: A
            server: 8.8.8.8
            expect: 1.2.3.4

    - role: apache_vhost
      vars:
        apache_document_root: '{{ nagios_html_dir }}'
        apache_config: '{{ nagios_apache_config }}'
````