Dave Crandall Portfolio

 

Network Operations Center Documentation

When I began my career at Lycos as a Production Control Analyst (PCA), the limited existing documentation for first-level response procedures was kept in a three-ring binder. The same binder was shared by all NOC personnel and was updated only by handwritten notes on scrap paper. There were no universal monitoring tools, and PCAs watched screens of log reports looking for issues.

As I created procedures for the NOC to use in first-level problem resolution, I also created an all-encompassing living document containing all the documentation needed to keep Lycos and its properties running. I also worked with engineers and external vendors to identify and incorporate a set of monitoring tools to watch the performance of all properties under Lycos.

 

Format

The NOC documentation was divided into three major sections. Each section addressed one major activity required of the PCAs.

Monitoring:

Using a combination of in-house and purchased monitoring tools, the PCA could identify the smallest problems across the network at any time. PCAs were required to check alarm summaries and answer incoming alerts from Big Brother, Sitescope and Platinum.

Monitoring tools were finely tuned to eliminate false alarms, but at times increased load or server updates would cause alerts. To expedite the process of resolving these problems, I also developed spray tools and put procedures in place to verify exactly where the issues were occurring.

Procedures:

The most important function of the PCA, and the reason the NOC was staffed 24/7, was to eliminate the impact of any server, network or application outage on end users.

With over 1000 servers running on almost every platform, the procedures section was vital to Lycos' success. I tailored each procedure individually and cataloged the procedures under the same group headings that the monitoring systems used. Within two clicks from the monitoring page, PCAs could view the correct procedure to fix a problem.
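The idea of cataloging procedures under the monitoring tools' own group headings can be illustrated with a short sketch. This is not the original site code: the group names, alert names, procedure text and the `find_procedure` helper are all hypothetical, chosen only to show how a shared heading lets an alert lead straight to its procedure.

```python
# Hypothetical catalog. The outer keys stand in for the group headings
# shared by the monitoring page and the procedures section.
PROCEDURES = {
    "web": {
        "http_down": "Restart the web server process, then recheck the alarm summary.",
    },
    "mail": {
        "queue_backlog": "Inspect the mail queue and flush it if delivery has stalled.",
    },
}


def find_procedure(group, alert):
    """Return the procedure text for an alert, or None if none is cataloged."""
    return PROCEDURES.get(group, {}).get(alert)
```

Because the monitoring page and the catalog use the same headings, a lookup needs only the group and the alert name, which is roughly what a two-click path amounts to.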

Escalation:

When first-level support could not resolve the issue, the NOC was expected to escalate the problem to senior-level system administrators.

Again, the NOC documentation gave PCAs the contact information they needed within one click from the procedures.

As the number of servers increased and escalation became more involved, with multiple offices and overseas contacts, we developed a tool to continue escalations once the PCA began them. With two clicks and a brief explanation of the issue, PCAs could page the on-call administrator.
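A tool that continues an escalation once it has started might work along the following lines. This is a minimal sketch under stated assumptions, not the tool we built: the `Contact` type, the `escalate` function, and the `send_page` and `acknowledged` callbacks are hypothetical names introduced for illustration, and a real tool would also wait between pages.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    pager: str


def escalate(chain, summary, send_page, acknowledged):
    """Page each contact in order until one acknowledges.

    Returns the contact who acknowledged, or None if the chain is exhausted.
    """
    for contact in chain:
        send_page(contact, summary)   # hypothetical paging callback
        if acknowledged(contact):     # hypothetical acknowledgement check
            return contact
    return None
```

The PCA supplies only the brief explanation of the issue; the ordered chain of contacts determines who is paged next if the on-call administrator does not respond.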

Design Decisions

I originally coded the entire documentation site in HTML, but in 2001 I converted the NOC documentation to PHP for better performance and easier editing.

Because the site I developed is easy to use and quick to reference, PCAs still use the NOC documentation today.

 

 

Screen Shots

Monitoring Start Page

Spray Tool Page and Example

Procedures Start Page

Escalation Information and Procedures

 

Hosting and Design by Dave Crandall