Let’s say you’ve got a fairly extensive Nagios configuration, with multiple Nagios services that depend on specific pieces of infrastructure, such as a network link.
Occasionally, external providers schedule outages, or internally you arrange for outage periods for services or infrastructure to go offline.
In Nagios, the best way to handle this is to use the “schedule downtime” feature on the affected service.
However, sometimes there are multiple services, and it can be tedious to schedule them all for downtime – and doing it that way doesn’t accurately show that there is a single outage.
Matt’s solution? Let’s create a ‘downtime’ service that is ‘OK’ normally, but goes into ‘WARNING’ while scheduled downtime is in effect. We can do that with the following pieces of Nagios configuration.
define host {
    use                     generic-host
    host_name               downtime
    alias                   downtime
    check_command           return-ok
}

define service {
    host_name               downtime
    service_description     downtime 1
    check_command           return-numeric!$SERVICEDOWNTIME$
    use                     generic-service
    max_check_attempts      1
    normal_check_interval   5
}
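That ‘check_command’ basically means “if this service is not currently in scheduled downtime, everything is good”: the $SERVICEDOWNTIME$ macro expands to the service’s downtime depth, which is 0 outside of downtime and greater than zero during it.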
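The ‘return-ok’ and ‘return-numeric’ commands aren’t defined in this post. Here is a minimal sketch of what they might look like, assuming they are thin wrappers around the check_dummy plugin from the standard plugins package, and assuming $USER1$ points at your plugin directory. Your installation may already provide equivalents under these names.

```
# Assumed definitions, not taken from the original setup; adjust paths to suit.
define command {
    command_name    return-ok
    # check_dummy exits with the state number it is given: 0 = OK
    command_line    $USER1$/check_dummy 0
}

define command {
    command_name    return-numeric
    # $ARG1$ is whatever follows the "!" in check_command,
    # here the expanded value of $SERVICEDOWNTIME$
    command_line    $USER1$/check_dummy $ARG1$
}
```

With definitions like these, a downtime depth of 0 gives OK and a depth of 1 gives WARNING. Overlapping downtime entries raise the depth, so two at once would show as CRITICAL (state 2) rather than WARNING.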
Now we can use a servicedependency to make your normal services depend on the “downtime” service… and “bingo” – scheduling an outage on the ‘downtime’ service will have a cascading effect.
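As a rough sketch, such a dependency could look like the following. The dependent host ‘router1’ and service ‘WAN link’ are made-up names for illustration, and the failure criteria are one reasonable choice rather than the only one.

```
define servicedependency {
    # The master: our 'downtime' service
    host_name                       downtime
    service_description             downtime 1
    # The dependent: hypothetical names, substitute your own
    dependent_host_name             router1
    dependent_service_description   WAN link
    # Keep checking the dependent service regardless of the master's state
    execution_failure_criteria      n
    # Suppress its notifications while the master is WARNING, UNKNOWN or CRITICAL
    notification_failure_criteria   w,u,c
}
```

Keeping execution enabled means the dependent service still shows its real state during the outage; only the notifications are held back.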
I’m currently using this for some external provider network links (when I get a Planned Maintenance Event Notice I can schedule that in nagios, then forget about it – nagios will remember it for me, and if I look during the outage, it’ll show me that it is in downtime) and for some power circuits.
One of the main reasons I currently like this approach is that it agrees with my “can we see what is going wrong and why” view, and it can show more clearly in nagios-dashboard applications what the cause of a problem is.
It would be sensible to extend this further – I have replaced the 'return-numeric' check_command with a check script that checks $SERVICEDOWNTIME$ as well as checking for upcoming scheduled downtime in the nagios database.