[Cadre-politics] Mon monitoring service (alert procedure)

Dan MacNeil dan at thecsl.org
Sun Jun 4 11:11:17 EDT 2006


Under what circumstances people get woken up at 3am is an important 
political question, worth careful consideration.

Specifications are generally the hardest part to get right, and 
specifications are an area I have particular trouble with.

Having things written down, even crudely, gives people another way to 
look at the information. It is also a good way to clarify thinking: if 
you can't write it down, your thinking probably isn't that clear.

 > Once detection of failure occurs the mon project will alert said person:
 >
 >        Immediately through email notification
 >        then another email within 5 minutes
 >        finally after 10 minutes from initial failure a page will be
 > sent (also not ready)
 >
 >        continuing notifications are still up for debate
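For discussion, the escalation proposed above can be sketched as a small per-outage tracker. This is only an illustration in Python, not something mon provides: the `ESCALATION` table and `Outage` class are hypothetical names, and "continuing notifications" are left out because they are still undecided.

```python
from dataclasses import dataclass

# Hypothetical escalation table mirroring the quoted proposal:
# (minutes after the initial failure, notification channel).
ESCALATION = [
    (0, "email"),   # immediately
    (5, "email"),   # a second email within 5 minutes
    (10, "page"),   # a page after 10 minutes (paging not ready yet)
]


@dataclass
class Outage:
    """Tracks which escalation steps have already fired for one failure."""
    started_at: float      # seconds since epoch of the first failed check
    sent_steps: int = 0    # how many ESCALATION entries have gone out

    def due(self, now: float) -> list:
        """Return the channels that became due since the previous call."""
        elapsed_min = (now - self.started_at) / 60.0
        fired = []
        while (self.sent_steps < len(ESCALATION)
               and ESCALATION[self.sent_steps][0] <= elapsed_min):
            fired.append(ESCALATION[self.sent_steps][1])
            self.sent_steps += 1
        return fired
```

Calling `due()` on every check means a step is never sent twice, and a check that is late (say, one at 10 minutes with none at 5) sends both outstanding steps at once.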

============== version 2 features ============

Paging has the advantage of not requiring a network connection to work. 
Its disadvantages are that we would have to install and configure a 
modem and ask the university for a modem-capable line.

We can get the same advantage (alerts still go out when the university's 
connection is down) by running mon (the monitoring software) offsite. We 
have to run it offsite to account for things like firewall changes. My 
inclination is to leave paging to version 2.0 of the project.

We'd looked at AIM notification, but the notifier that comes with mon is 
old and uses an old version of the AIM protocol that is no longer 
supported.

Jabber is an open protocol and is supported by Google. Writing a 
notification plug-in for Google IM/Jabber doesn't look too hard.

	apt-cache show libnet-jabber-perl
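A rough skeleton of such a plug-in, in Python purely for illustration (the Perl module above would do the real sending). It assumes mon calls alert scripts with `-s` service, `-g` hostgroup, `-h` host list, `-t` time, `-u` for an up-alert (plus `-T`/`-O` for traps), followed by any extra arguments from mon.cf, here taken to be Jabber IDs, and that the first line on stdin is a summary. Check that option set against the mon man page; `send_jabber` is a hypothetical stub.

```python
import getopt
import sys
import time

# Assumed mon alert-script options (verify against the mon man page):
#   -s service  -g hostgroup  -h "host list"  -t epoch time
#   -u up-alert  -T trap  -O trap timeout
MON_OPTS = "s:g:h:t:uTO"


def parse_mon_alert_args(argv):
    """Split mon's options from the trailing recipients (assumed to be JIDs)."""
    opts, recipients = getopt.getopt(argv, MON_OPTS)
    o = dict(opts)
    return {
        "service": o.get("-s", "?"),
        "group": o.get("-g", "?"),
        "hosts": o.get("-h", "").split(),
        "time": int(o["-t"]) if "-t" in o else int(time.time()),
        "up": "-u" in o,
        "recipients": recipients,
    }


def format_message(alert, summary):
    """Build a one-line IM message from the parsed args and mon's summary."""
    state = "UP" if alert["up"] else "DOWN"
    when = time.strftime("%Y-%m-%d %H:%M", time.localtime(alert["time"]))
    hosts = " ".join(alert["hosts"]) or "(no hosts)"
    text = f"{state} {alert['group']}/{alert['service']} at {when}: {hosts} {summary}"
    return text.rstrip()


def send_jabber(jid, text):
    """Hypothetical stub: a real plug-in would send an XMPP message here."""
    print(f"to {jid}: {text}")


def main(argv, stdin):
    alert = parse_mon_alert_args(argv)
    if not alert["recipients"]:
        return  # nobody to notify, so don't wait on stdin
    summary = stdin.readline().strip()  # assumed: first line is the summary
    for jid in alert["recipients"]:
        send_jabber(jid, format_message(alert, summary))


if __name__ == "__main__":
    main(sys.argv[1:], sys.stdin)
```

Keeping parsing and formatting separate from sending means the stub can be swapped for whichever Jabber library we end up using without touching the rest.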

IM is an ideal notification method, as it only bugs the people who have 
indicated they are available to be bugged, and it isn't filtered the way 
email and phone messages are.
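The "only bugs people who are available" part comes from XMPP presence. A minimal sketch, assuming the plug-in keeps the latest presence it has seen for each contact (contacts who went offline are simply dropped from the map). The `<show/>` values are the standard XMPP ones; treating only plain "online" and "chat" as willing is an assumed policy, and `who_to_bug` is a hypothetical name.

```python
from typing import Dict, List, Optional

# Standard XMPP <show/> values are "chat", "away", "xa" and "dnd";
# an available presence with no <show/> element means plain "online".
WILLING = {None, "chat"}  # assumed policy: skip away, xa and dnd


def who_to_bug(presence: Dict[str, Optional[str]]) -> List[str]:
    """Return JIDs that are online and not away/xa/dnd, sorted."""
    return sorted(jid for jid, show in presence.items() if show in WILLING)
```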

============== end version 2 features ============

============== my when-to-notify thoughts ============

 >        Immediately through email notification
 >        then another email within 5 minutes

Odds are good that the first email will go through, so the second email 
will just be an annoyance.

Maybe:

For crucial, non-redundant systems, send an immediate email to:

		dan at thecsl.org, cadre-config at lists.thecsl.org

After 5 minutes, email to:
		9784551885 < at > vtext.com
		9783498209 < at > tmomail.net

For redundant systems that go offline between 10:00am and 11:59pm:

	same as above

For redundant systems that go offline between 12:00am and 9:59am, omit 
the email to cell phones.

Crucial, non-redundant services would be:

	postgresql, imap, pop3, http, ldap

Redundant or non-critical:
	mysql, smtp, dns, backup servers
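The routing proposed above fits in a few lines. This sketch assumes the failure time (not the current time) decides the window, that cell-phone texts for redundant services go out only for failures between 10:00am and 11:59pm, and that a service on neither list is treated as crucial. The addresses are hypothetical placeholders standing in for the real lists, and `recipients` is an illustrative name.

```python
from datetime import datetime

# Service classes from the lists above.
CRUCIAL = {"postgresql", "imap", "pop3", "http", "ldap"}
REDUNDANT = {"mysql", "smtp", "dns", "backup"}

# Hypothetical placeholder addresses, not the real ones.
IMMEDIATE = ["admin@example.org", "config-list@example.org"]
FIVE_MINUTE = ["phone1@sms.example", "phone2@sms.example"]


def daytime(t: datetime) -> bool:
    """True for failures between 10:00am and 11:59pm."""
    return 10 <= t.hour <= 23


def recipients(service: str, failed_at: datetime, minutes_down: float) -> list:
    """Everyone who should have been notified so far for this outage."""
    out = list(IMMEDIATE)                # everyone gets the immediate email
    if minutes_down < 5:
        return out
    crucial = service not in REDUNDANT   # unknown services treated as crucial
    if crucial or daytime(failed_at):
        out += FIVE_MINUTE               # cell-phone texts after 5 minutes
    return out
```

So an overnight `dns` failure produces email only, while an overnight `imap` failure still reaches the phones.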

===================== notification when things are back up =================

From past experience with a free service that is no longer available, 
there is a lot of flicker on a UML network: things go up/down/up rapidly 
in a short period of time.

Getting notified when something comes back up is helpful, as it lets 
people turn around and go back to the barbecue.
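One way to keep flicker from producing a storm of down/up messages is to report a change only after it has held for several consecutive checks. A minimal sketch with a hypothetical `Debouncer` class and an assumed threshold of 3; mon's own `alertafter` setting may already handle the down side, so check the mon.cf documentation before writing anything custom.

```python
class Debouncer:
    """Report a state change only after it holds for `needed` consecutive checks."""

    def __init__(self, needed: int = 3, initial_up: bool = True):
        self.needed = needed
        self.reported = initial_up  # last state people were told about
        self.streak = 0             # consecutive checks disagreeing with it

    def observe(self, up: bool):
        """Feed one check result; return "up"/"down" when an alert should go out."""
        if up == self.reported:
            self.streak = 0         # a flicker back to normal resets the count
            return None
        self.streak += 1
        if self.streak < self.needed:
            return None
        self.reported = up
        self.streak = 0
        return "up" if up else "down"
```

The cost is latency: with a threshold of 3 and, say, a 1-minute check interval, both outages and recoveries are reported about 2 minutes later than they otherwise would be.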

