Monitoring:
Using a combination of in-house
and purchased monitoring tools, the
PCA could identify even the smallest problems across
the network at any time. PCAs were required
to check alarm summaries and answer incoming alerts
from Big Brother, Sitescope and Platinum.
Monitoring tools were finely
tuned to eliminate false alarms, but at times
increased load or server updates would still cause alerts.
To expedite the process of resolving
these problems, I also developed spray tools and
put procedures in place to verify exactly where
the issues were occurring.
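The original spray tools are not described in detail here, so the following is only a minimal sketch of the general idea: run the same check against many servers at once and list the ones that fail. The names `tcp_check`, `spray` and `failing`, the default port and the timeout are all assumptions made for illustration, not the tools used in the NOC.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def tcp_check(host, port=80, timeout=3.0):
    """Hypothetical check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def spray(hosts, check=tcp_check, workers=20):
    """Run `check` against every host in parallel and return {host: passed}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with hosts.
        results = list(pool.map(check, hosts))
    return dict(zip(hosts, results))


def failing(results):
    """Return the sorted list of hosts whose check did not pass."""
    return sorted(host for host, ok in results.items() if not ok)
```

Calling `failing(spray(host_list))` would narrow an alert down to the specific machines that do not answer, which is the kind of verification described above. Passing a different `check` function lets the same loop test something other than a TCP port.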
Procedures:
The most important function
of the PCA, and the reason that the NOC was staffed
24/7, was to eliminate the impact of any server/network/application
outage on the end users.
With over 1,000 servers running
on almost every platform, the procedures section
was vital to Lycos' success. I specifically tailored
each procedure and cataloged them with the same
grouping headings that the monitoring systems
used. Within two clicks from the monitoring page,
PCAs could view the correct procedure to fix problems.
Escalation:
When first-level support could
not resolve the issue, the NOC was expected to
escalate the problem to senior-level system administrators.
Again, NOC documentation gave
PCAs the contact information they needed within
one click of the procedures.
As the number of servers increased
and escalation became more involved, with multiple
offices and overseas contacts, we developed a
tool to continue escalations once the PCA began
them. With two clicks and a brief explanation
of the issue, PCAs could page the on-call administrator.
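How the escalation tool worked internally is not covered here. As a rough sketch of one way such a tool could continue an escalation after the first page, the example below walks an ordered contact list until someone acknowledges. The `Contact` type, the `send_page` and `acknowledged` callbacks and the sample contact names are hypothetical stand-ins, not the original implementation.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    pager: str


def escalate(chain, summary, send_page, acknowledged):
    """Page each contact in order until one acknowledges.

    `send_page(contact, summary)` delivers the page and
    `acknowledged(contact)` reports whether that contact responded;
    both are supplied by the caller (assumed interfaces).
    Returns (contact_who_acknowledged_or_None, names_paged_so_far).
    """
    paged = []
    for contact in chain:
        send_page(contact, summary)
        paged.append(contact.name)
        if acknowledged(contact):
            return contact, paged
    # Nobody answered: the caller decides what happens next.
    return None, paged
```

Here the PCA's "brief explanation" corresponds to `summary`, and the ordered `chain` stands in for the on-call administrator followed by backup and overseas contacts.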