What do I do for a Living? – Step 3: Fault/Event management
This is a follow-up in a series of posts titled “What I do for a Living.” That question led me to write a series of blog posts detailing my perspective on a network & systems management (NSM) maturity model. The NSM pecking order in terms of monitoring maturity:
1. Availability Monitoring (discussed here)
2. Performance Monitoring (discussed here)
3. Fault/Event Management
4. Correlation & Enrichment
5. Discovery & Mapping
6. Real Time IT Dashboarding
7. Service Level Management
This blog post will discuss Step 3 of my NSM pecking order – fault/event management. In my career I have worked for Cisco Systems and done a lot of network infrastructure consulting with clients. Many times I have been called upon to help a client troubleshoot a network outage. After getting the usual explanation (the network suddenly went down, nothing has changed, my rear is on the line here if this isn’t fixed right away, etc.), I generally ask to see their event console to try to identify what has happened.
Rarely do networks just mysteriously go down. One of the positives of network infrastructure gear is that it likes to communicate. There is so much that your devices want to tell you. All you have to do is listen. What do you need in order to listen? A centralized event & log management platform is ideal. Any organization that relies upon its network infrastructure to conduct business should have one. It is THE place that your devices send their traps and syslogs to.
Oftentimes we can find logs (events) in the event console that explain exactly what happened and when it happened. The good news is the logs generally exist. The bad news is your management isn’t too happy to find out that the outage would have been preventable had your organization been reviewing the events in your console. The network infrastructure vendors even go so far as to present these event messages with severity definitions. The messages are generally classified as critical, major, minor, warning, and information.
It seems kind of silly to ignore all of these important messages from the network infrastructure that your organization paid so much money for. Rarely do I see organizations save money on their network acquisition costs by purchasing ‘unmanageable’ devices, even though they do cost less. It is quite confusing, though, when organizations pay for ‘manageable’ devices but then do not do anything with those capabilities.
For those of you who do review these logs, I know you are probably thinking to yourselves, “Yes, Jeff is right that the devices do tell you a lot. What he fails to tell you is that they tell you too much.” I do agree with this assessment. If you are using a simple syslog server or an event console that does not support de-duplication of events, then the event volume can be overwhelming. That is why organizations use systems like Monolith’s Event Manager: to find the signal(s) in all the noise.
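To make de-duplication a little more concrete, here is a minimal Python sketch of the idea. It is not how Monolith’s Event Manager or any other product implements it. The names Event, Alarm and deduplicate are hypothetical, the sample device names and messages are made up, and treating (node, message) as the identity of an event is an assumption chosen for illustration. Repeated events collapse into one row that carries a count plus first-seen and last-seen timestamps.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One raw message received from a device (hypothetical shape)."""
    node: str
    message: str
    severity: str
    timestamp: float


@dataclass
class Alarm:
    """One de-duplicated row as an operator would see it in a console."""
    node: str
    message: str
    severity: str
    first_seen: float
    last_seen: float
    count: int = 1


def deduplicate(events):
    """Collapse repeated events into one Alarm per (node, message) pair.

    Assumption: two events are duplicates when the node and the message
    text are identical. Real consoles typically let you configure the key.
    """
    alarms = {}
    for ev in events:
        key = (ev.node, ev.message)
        alarm = alarms.get(key)
        if alarm is None:
            # First occurrence: open a new row.
            alarms[key] = Alarm(ev.node, ev.message, ev.severity,
                                ev.timestamp, ev.timestamp)
        else:
            # Repeat: bump the counter and widen the time window.
            alarm.count += 1
            alarm.first_seen = min(alarm.first_seen, ev.timestamp)
            alarm.last_seen = max(alarm.last_seen, ev.timestamp)
    return list(alarms.values())


if __name__ == "__main__":
    raw = [
        Event("core-sw1", "Interface Gi0/1 down", "major", 100.0),
        Event("core-sw1", "Interface Gi0/1 down", "major", 105.0),
        Event("edge-rtr2", "Fan failure", "critical", 101.0),
        Event("core-sw1", "Interface Gi0/1 down", "major", 110.0),
    ]
    for alarm in deduplicate(raw):
        print(alarm)
```

With a flapping interface sending the same message hundreds of times, the operator sees one row with a rising count instead of hundreds of identical lines.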
A good fault/event management system should have the ability to do the following:
- Process any type of event (trap, syslog, TL1, ASCII, etc.)
- Perform de-duplication of events
- Present events based on severity designations
- Allow for easy searching of events
- Allow creation of event filter views that show only relevant data based upon a person’s or group’s role
- Integrate with ticketing or service desk systems
- Present event-based dashboard views
- Support multi-tenancy and role-based access control
- Display information via a standard browser interface with no plug-ins required
- Scale to support extremely high event volumes
- Store data for historical purposes
- Support basic and advanced notification templates
- Support event enrichment
- Support various types of correlation (pre and post processing)
- Support rules logic to override standard event severities and to handle events in multiple ways
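As an illustration of the last item, severity override rules can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not any product’s rules engine. The names OverrideRule and apply_overrides, the regular expressions, and the sample messages are all hypothetical. The severity names are the five listed earlier in this post. The first matching rule wins, and the vendor-assigned severity is kept when nothing matches.

```python
import re
from dataclasses import dataclass

# Severity names as listed earlier in this post, lowest to highest.
SEVERITIES = ("information", "warning", "minor", "major", "critical")


@dataclass(frozen=True)
class OverrideRule:
    """Hypothetical rule: if `pattern` matches the message, use `severity`."""
    pattern: str
    severity: str

    def __post_init__(self):
        # Reject typos early rather than silently creating a new severity.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def apply_overrides(message, vendor_severity, rules):
    """Return the severity to display for one event.

    Rules are checked in order and the first match wins. When no rule
    matches, the severity assigned by the device vendor is left alone.
    """
    for rule in rules:
        if re.search(rule.pattern, message):
            return rule.severity
    return vendor_severity


# Made-up local policy: uplink failures matter more here than the vendor
# default suggests, and routine config-save messages matter less.
RULES = [
    OverrideRule(r"Interface Gi0/1 .*down", "critical"),
    OverrideRule(r"Configuration saved", "information"),
]

if __name__ == "__main__":
    print(apply_overrides("Interface Gi0/1 changed state to down", "minor", RULES))
    print(apply_overrides("Power supply 2 failed", "major", RULES))
```

Keeping the rules as ordered data rather than hard-coded branches means the operations team can tune severities to local priorities without touching the event-processing code.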
The next blog post will focus on correlation & enrichment.
Cheers!
