Presenters: Michael Tate & Galen Charlton, Equinox

(focused on Nagios, but not specifically limited to it)

Intro / Scope

  • what is availability monitoring?
  • what to monitor
  • when to monitor
  • recommendation: run your monitoring on a standalone server -- either virtual or physical
  • commonly used Nagios checks:
    • check_procs (checks which processes are running, and how many)
    • check_tcp (checks a service by making a TCP connection to it)
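
The idea behind check_tcp can be sketched in a few lines of Python. This is only an illustration, not the actual Nagios plugin: the function name and its parameters are made up for this example, and the only thing taken from Nagios is the plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, message).

    Illustrative helper, not the real check_tcp plugin.
    """
    try:
        # create_connection resolves the host and connects within `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - connected to {host}:{port}"
    except OSError as exc:
        # Covers refused connections, timeouts and name-resolution failures.
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios reads a plugin's result.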

What to Monitor

  • this presentation focuses on monitoring processes within the Evergreen stack

Presentation Layer

  • presentation layer: OPAC, Staff Client, SIP server, Z39.50
  • load balancer (ldirectord)
  • Apache instances
    • good idea to set both a minimum & a maximum process count for this one -- alert on runaway Apache forkbombs
  • in the case of bricks: are they in rotation?
  • SIP: check TCP
  • Z39.50: check TCP; port 210
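
The minimum-and-maximum idea for Apache processes could look roughly like the sketch below. It assumes a Linux host (it reads /proc), and the function names, process name and thresholds are placeholders for illustration; in practice check_procs would normally do this job.

```python
import glob

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_processes(name):
    """Count Linux processes whose command name equals `name` (reads /proc)."""
    count = 0
    for comm_path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(comm_path) as fh:
                if fh.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
    return count


def evaluate_count(count, minimum, maximum):
    """Alert when the count is outside the range [minimum, maximum]."""
    if count < minimum:
        return CRITICAL, f"PROCS CRITICAL - {count} running, expected at least {minimum}"
    if count > maximum:
        return CRITICAL, f"PROCS CRITICAL - {count} running, expected at most {maximum}"
    return OK, f"PROCS OK - {count} running"
```

For instance, `evaluate_count(count_processes("apache2"), 1, 150)` would flag both a dead web server and a runaway one; the name "apache2" and the limit of 150 are assumptions that depend on the distribution and on the Apache configuration.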

Logic Layer

  • OpenSRF, clark_kent
  • check the number of OpenSRF drones & listeners
  • is clark_kent running and is its lockfile in place?
  • are there action trigger events pending?
    • does /tmp/action-trigger-LOCK exist? if so, is the process running? for how long?
  • does /tmp/hold-targeter-LOCK exist? if so, is the process running? for how long?
  • does /tmp/fine-generator-LOCK exist? if so, is the process running? for how long?
  • note -- the standard plugin for checking a file's age will throw an error if the file doesn't exist, which is the normal state for a lockfile when the job isn't running
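
One way around the missing-file problem is a small custom check that treats an absent lockfile as healthy. The sketch below is an assumption about how such a check might be written, not the presenters' script; the function name and thresholds are hypothetical, and it does not verify that the owning process is still alive.

```python
import os
import time

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_lockfile(path, warn_age, crit_age, now=None):
    """Return (exit_code, message) based on the lockfile's age in seconds.

    A missing lockfile is reported as OK: it just means the job is not running.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return OK, f"LOCK OK - {path} absent (job not running)"
    if now is None:
        now = time.time()
    age = now - mtime
    if age >= crit_age:
        return CRITICAL, f"LOCK CRITICAL - {path} is {age:.0f}s old"
    if age >= warn_age:
        return WARNING, f"LOCK WARNING - {path} is {age:.0f}s old"
    return OK, f"LOCK OK - {path} is {age:.0f}s old"
```

A call such as `check_lockfile("/tmp/action-trigger-LOCK", 1800, 7200)` would warn after 30 minutes and go critical after two hours; those numbers are placeholders to be tuned to how long the job normally runs.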

Data Layer

  • postgres; pgpool & slony for multi-db-server setups
  • is postgres running & responding on its port?
  • is pgpool running & responding on its port?
  • how many db backends are available? (available connections)
  • is slony running? is there any replication lag?
  • are there any long-running queries?
  • are the write-ahead-log archives current?
  • is the nightly backup snapshot current?
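
Checking that the write-ahead-log archives or backup snapshots are current comes down to asking how old the newest file in a directory is. The following is a rough sketch under that assumption; the function names, the directory layout and the age limit are hypothetical and would need to match the local archiving setup.

```python
import os
import time

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def newest_file_age(directory, now=None):
    """Age in seconds of the most recently modified file, or None if there are no files."""
    mtimes = [entry.stat().st_mtime for entry in os.scandir(directory) if entry.is_file()]
    if not mtimes:
        return None
    if now is None:
        now = time.time()
    return now - max(mtimes)


def check_archive_current(directory, max_age, now=None):
    """CRITICAL if the newest file is older than max_age seconds or no file exists."""
    try:
        age = newest_file_age(directory, now)
    except OSError as exc:
        # Directory missing or unreadable: the check itself could not run.
        return UNKNOWN, f"ARCHIVE UNKNOWN - {exc}"
    if age is None:
        return CRITICAL, f"ARCHIVE CRITICAL - no files in {directory}"
    if age > max_age:
        return CRITICAL, f"ARCHIVE CRITICAL - newest file is {age:.0f}s old"
    return OK, f"ARCHIVE OK - newest file is {age:.0f}s old"
```

For a nightly snapshot, a `max_age` a little over 24 hours (say 26 * 3600) leaves room for normal variation in backup duration.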

Meta Layers

  • memcached, ejabberd
  • logs:
    • are there any NOT CONNECTED TO THE NETWORK!@#$%&* errors in the logs?
    • are there any "Returning NULL" errors in the logs?
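
Scanning logs for these two messages can be sketched as a simple substring count. The search strings come from the notes above; everything else (function names, the choice to raise only a WARNING) is an assumption made for illustration.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Substrings to look for, taken from the notes above.
PATTERNS = ("NOT CONNECTED TO THE NETWORK", "Returning NULL")


def scan_log(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern."""
    counts = {pattern: 0 for pattern in patterns}
    for line in lines:
        for pattern in patterns:
            if pattern in line:
                counts[pattern] += 1
    return counts


def evaluate(counts):
    """Return (exit_code, message); any match is treated as a WARNING here."""
    total = sum(counts.values())
    if total:
        detail = ", ".join(f"{p!r}: {n}" for p, n in counts.items() if n)
        return WARNING, f"LOG WARNING - {total} matching lines ({detail})"
    return OK, "LOG OK - no matching lines"
```

Because `scan_log` accepts any iterable of lines, it can be fed an open file object directly. A production check would also need to remember how far it had read, so that old errors are not reported again on every run.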

Platform Layers

  • nfs mounts in place (if they are being used)
  • how's the system load?
  • how's swap doing? how about memory? disk free space?
  • are there any messages from the OOM killer in the logs?
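
The load and disk questions map onto simple threshold checks. Below is a minimal sketch using only the Python standard library; it assumes a Unix-like host (os.getloadavg is not available on Windows), and the function names and thresholds are illustrative rather than taken from any Nagios plugin.

```python
import os
import shutil

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def threshold(value, warn, crit):
    """Map a 'higher is worse' value onto a Nagios exit code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_load(warn, crit):
    """Compare the 1-minute load average against the thresholds."""
    load1 = os.getloadavg()[0]
    return threshold(load1, warn, crit), f"load average (1 min): {load1:.2f}"


def check_disk(path, warn_pct, crit_pct):
    """Compare the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return threshold(used_pct, warn_pct, crit_pct), f"{path} is {used_pct:.1f}% used"
```

Something like `check_disk("/", 80, 90)` or `check_load(8, 16)` shows the intended use; sensible load thresholds depend on the number of CPU cores.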

When to Monitor?

  • watch for false positives -- your worst enemy
    • pager going off at 3am
    • eventually you start ignoring those alerts...and you get nailed when one of them is real
  • causes of false positives
    • thresholds set too low
    • known events: db snapshots, log housekeeping, intensive reports
  • monitoring vs alerting
    • check_period: defines when to monitor
    • notification_period: defines when to alert
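
The difference between check_period and notification_period can be modelled as two independent time windows. The sketch below is a toy model in Python, not Nagios configuration and not how Nagios evaluates its timeperiods; the function names and the example hours are hypothetical.

```python
from datetime import time


def in_period(now, start, end):
    """True if `now` falls in [start, end); handles windows that wrap past midnight.

    A window whose start equals its end is treated as covering the whole day.
    """
    if start == end:
        return True
    if start < end:
        return start <= now < end
    return now >= start or now < end


def decide(now, check_period, notification_period):
    """Return (should_check, should_notify) for the given time of day."""
    should_check = in_period(now, *check_period)
    should_notify = in_period(now, *notification_period)
    return should_check, should_notify
```

With checks running around the clock and notifications limited to 07:00-22:00, `decide(time(3, 0), (time(0, 0), time(0, 0)), (time(7, 0), time(22, 0)))` returns `(True, False)`: the service is still monitored at 3am, but nobody gets paged.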


Equinox will be publishing the scripts and tools they've written for Evergreen/Nagios very soon.

