[Cadre-politics] Mon monitoring service (alert procedure)
Dan MacNeil
dan at thecsl.org
Sun Jun 4 11:11:17 EDT 2006
The circumstances under which people get woken up at 3am seem like an
important political question, worth careful consideration.
Specifications are generally the hardest part to get right, and they are
an area I have particular trouble with.
Having things written down, even crudely, gives people another way to
look at the information. It is also a good way to clarify thinking: if
you can't write it down, your thinking probably isn't that clear.
> Once detection of failure occurs the mon project will alert said person:
>
> Immediately through email notification
> then another email within 5 minutes
> finally after 10 minutes from initial failure a page will be
> sent( also not ready)
>
> continuing notifications are still up for debate
============== version 2 features ============
Paging has the advantage of not requiring a network connection to work.
Its disadvantages are that we would have to install and configure a
modem and ask the university for a modem-capable line.
We can get the same advantage, notifications that still work when the
University's connection is down, by running mon (the monitoring software)
offsite. Running offsite is also necessary to account for things like
firewall changes. My inclination is to leave paging to version 2.0 of
the project.
We'd looked at AIM notification, but the notifier that comes with mon is
old and uses a version of the AIM protocol that is no longer supported.
Jabber is an open protocol and is supported by Google. Writing a
notification plug-in for Google IM/Jabber doesn't look too hard. See:
    apt-cache show libnet-jabber-perl
IM is an ideal notification method as it only bugs the people who have
indicated their availability for bugging and isn't filtered like email
and phone.
============== end version 2 features ============
============== my thoughts on when to notify ============
> Immediately through email notification
> then another email within 5 minutes
Odds are good that the first email will go through and the second email
will be an annoyance.
Maybe:
For crucial non-redundant systems:
  immediate email to:
    dan at thecsl.org, cadre-config at lists.thecsl.org
  5 minute email to:
    9784551885 < at > vtext.com
    9783498209 < at > tmomail.net
For redundant systems that go offline between 10:00am and 11:59pm:
  same as above
For redundant systems that go offline between midnight and 9:59am:
  same as above, but omit the email to cell phones.
Crucial non-redundant would be:
  postgresql, imap, pop3, http, ldap
Redundant or non-critical:
  mysql, smtp, dns, backup servers
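To make the proposal above easier to argue about, here is a rough sketch of it as a lookup function. It is not mon configuration, just the rules written out in Python. The addresses are hypothetical placeholders, the name `recipients` is made up for illustration, and it assumes the overnight rule for redundant systems is meant to cover midnight to 9:59am.

```python
from datetime import datetime, time

# Service groupings taken from the lists above.
CRUCIAL = {"postgresql", "imap", "pop3", "http", "ldap"}
REDUNDANT = {"mysql", "smtp", "dns", "backup"}

# Hypothetical placeholder addresses, standing in for the real ones.
EMAIL = ["admin@example.org", "config-list@example.org"]
CELL = ["5550100@sms.example.net", "5550101@sms.example.net"]

# Redundant systems only reach cell phones from 10:00am through 11:59pm.
DAY_START = time(10, 0)


def recipients(service, failed_at, minutes_down):
    """Return the addresses to notify for a service that has been down
    for `minutes_down` minutes, having failed at `failed_at`."""
    if service not in CRUCIAL and service not in REDUNDANT:
        raise ValueError("unknown service: " + service)

    notify = list(EMAIL)  # the immediate email always goes out

    if minutes_down >= 5:
        daytime = failed_at.time() >= DAY_START
        # Crucial systems reach cell phones at any hour;
        # redundant ones only during the daytime window.
        if service in CRUCIAL or daytime:
            notify += CELL
    return notify
```

For example, `recipients("dns", datetime(2006, 6, 4, 3, 0), 10)` returns only the email addresses, while the same call for "ldap" also includes the cell phone addresses.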
============== notification when things are back up ============
Past experience with a free service that is no longer available showed
a lot of flicker in the UML network: it goes up/down/up rapidly within a
short period of time.
Getting notified when something comes back up is helpful as it lets
people turn around and go back to the barbeque.
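One possible way to handle the flicker together with "back up" notices, sketched in Python. This is only an illustration, not how mon does it: the class name `FlapFilter` and the `hold_seconds` value are assumptions. A "down" message goes out only once a failure has lasted past the hold period, and an "up" message goes out only if a "down" was actually sent, so short blips bother nobody.

```python
class FlapFilter:
    """Suppress alerts for brief outages and pair every 'down' with an 'up'."""

    def __init__(self, hold_seconds=300):
        # hold_seconds is an assumed tuning knob; 0 means alert immediately.
        self.hold = hold_seconds
        self.down_since = None   # time the current outage started, if any
        self.alerted = False     # whether 'down' was sent for this outage

    def observe(self, is_up, now):
        """Feed one check result taken at time `now` (in seconds).
        Returns 'down', 'up', or None when nothing should be sent."""
        if is_up:
            was_alerted = self.alerted
            self.down_since = None
            self.alerted = False
            # Only announce recovery if people were told about the outage.
            return "up" if was_alerted else None

        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.hold:
            self.alerted = True
            return "down"
        return None
```

With the default hold, a service that drops for a minute and comes back produces no messages at all, while one that stays down for five minutes produces one "down" and later one "up".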