What do I do for a Living? – Step 4: Correlation & Enrichment

2009 July 17

This is a follow-up to a series of posts titled “What I do for a Living.”  That question led me to write a series of blog posts detailing my perspective on a network & systems management (NSM) maturity model.  The NSM pecking order in terms of monitoring maturity:

1.     Availability Monitoring (discussed here)

2.     Performance Monitoring (discussed here)

3.     Fault/Event Management (discussed here)

4.     Correlation & Enrichment

5.     Discovery & Mapping

6.     Real Time IT Dashboarding

7.     Service Level Management

This post will discuss Step 4 of my NSM pecking order – correlation & enrichment.  Hopefully readers have been able to see the logic behind the earlier steps.  Start with availability to know what is up and down.  Add performance monitoring so you can get proactive in your monitoring efforts.  Layer on fault/event management to capture the plethora of information from your devices.  Your gear wants to talk to you; you just have to listen!

Your next logical step is to add some advanced logic so that you can perform correlation & enrichment.  I have bundled the concepts of correlation & enrichment together because they both attempt to make better use of the information that you captured in Step 3 – fault/event management.  Receiving and displaying the events is a good first step.  It does, however, present users of event management platforms with an interesting challenge.  How do you make sense of the sea of information now in front of you?  Event management systems often seem to provide too much information.  It is easy to get lost in the noise.

Correlation & enrichment help to separate the signal from the noise.  Let’s start with correlation.  Everyone says they want it.  What is interesting is the range of descriptions you will get if you ask a bunch of folks from the industry what it means.  The fact is, there are many, many different types of correlation.  I will try to highlight some of the more useful features that you may want to consider if you are in the market for such technology.

  • Correlation can begin at the collection layer.  This is what Monolith calls our aggregators (Netcool refers to them as probes).  This is the first line of attack, where the management software first encounters the event.  At this stage you can listen for and collect events, process them, and perform pre-insertion filtering to dump garbage information.  With more advanced systems you can also use a rules engine to define deterministic actions for how you’d like to treat incoming events.  For example, you may receive certain events from your Cisco wireless controller blades that come in as critical.  After researching the event on Cisco’s site you determine that the event is of little importance and is really just informational.  Within the rules you can include logic to reduce the severity of the event from critical to informational or dump the event altogether.  This allows the system to become smarter and more tuned to your needs.
  • Enrichment can begin at the collection layer as well.  Within your collection layer you may want to use a hash or lookup file to enrich the event on the fly so that it contains more meaningful information once it hits the presentation layer.  With Monolith we support lookups not only via flat file but also via database, socket, SNMP, and command line.  This allows your organization to make more meaningful decisions at the event’s entry point.  I have seen customers enrich events with the store number in a retail environment, the customer name associated with the event, location information, maintenance contract information, etc.  This is also where logic like priority scoring can come into play, whereby certain devices or types of devices have higher priorities than others.  Instead of ranking an incident on event severity alone, you could rank it on event severity multiplied by device priority, for example.
  • The next level of correlation can occur at the database layer.  This occurs after the collection layer processes the event and inserts it into the database.  The database layer is where your real-time events go after they have been processed.  Technology has changed quite a bit over the last 15-20 years, so you may want to test the software’s ability to support high-volume event feeds.  We have seen other technologies where performance routinely degrades significantly after about 15,000 unique events/rows in the system.  The database layer is also where de-duplication of events occurs.  De-duplication should be considered a must-have in this day and age.  This is also the land of stored procedures, which Monolith calls mechanizations.  In Netcool land you would call these automations.
  • Post-collection event processing is where things get quite interesting.  Monolith has connectors which allow our event engine to connect to various other information sources (ticketing, provisioning, billing, RCA) in order to enrich events or apply more advanced logic when deciding incident impact.  Monolith also has agents that allow us to perform logic such as time-of-day correlation and checks for the absence or presence of an event.
  • State-based correlation is another post-collection type of correlation.  You old-timers might think of NerveCenter-like correlation logic.  This is the ability to build in if/then/else type logic.  This provides excellent automation capabilities for NOCs.
  • Other types of post-collection correlation include concepts such as heartbeating, stacked events, disparate event correlation (otherwise known as compound event correlation) and event thresholding (aka X in Y correlation).
  • An interesting niche in the correlation market has always been topology-based correlation.  This is where we have seen companies like Smarts and Riversoft come into existence.  The idea here is that if you understand a topology’s connectivity, then you can do unique types of correlation based upon that topology, for example downstream suppression of events.  Monolith delivers this capability with our Topology Manager.
  • Hierarchy-based correlation is something that we introduced in 3.3 of our software.  If you can do Probable Root Cause Analysis (PRCA) based upon a network connectivity model, then we reckoned we could perform PRCA based upon any hierarchy.  With that thought we created our Hierarchy Storage Engine (HSE).  The HSE allows us to perform advanced root cause analysis on any type of hierarchy.  This can be especially useful for service providers looking for correlation of their SONET or optical infrastructures.  It can also be used for application or service hierarchies.
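
To make a few of the ideas above concrete, here is a minimal Python sketch of collection-layer rules and enrichment (a severity override, a lookup, and severity multiplied by device priority), database-layer de-duplication, and X in Y thresholding.  Everything in it is an assumption made for illustration: the event fields, the lookup tables, the device names, and the function names are hypothetical, and none of it is Monolith's or Netcool's actual rules syntax or API.

```python
# Hypothetical severity scale; real products define their own.
SEVERITY = {"informational": 1, "warning": 2, "minor": 3, "major": 4, "critical": 5}

# Hypothetical lookup table (stands in for a flat file or database lookup).
DEVICE_INFO = {
    "wlc-01": {"store": "0042", "priority": 1},
    "core-rtr-01": {"store": "0042", "priority": 3},
}

# Hypothetical rules: (node, event name) -> new severity, or None to drop the event.
OVERRIDES = {
    ("wlc-01", "apRadioReset"): "informational",
    ("wlc-01", "debugTrace"): None,
}


def process(raw):
    """Collection layer: filter, re-grade, enrich, and score one raw event.

    Returns the enriched event, or None when a rule drops it.
    """
    key = (raw["node"], raw["name"])
    severity = raw["severity"]
    if key in OVERRIDES:
        if OVERRIDES[key] is None:
            return None  # pre-insertion filtering: dump the event
        severity = OVERRIDES[key]  # e.g. critical -> informational
    info = DEVICE_INFO.get(raw["node"], {"store": "unknown", "priority": 1})
    return {
        **raw,
        "severity": severity,
        "store": info["store"],  # enrichment from the lookup
        "score": SEVERITY[severity] * info["priority"],  # priority scoring
    }


def insert(table, event):
    """Database layer: de-duplicate on (node, name) and keep a tally."""
    key = (event["node"], event["name"])
    if key in table:
        table[key]["count"] += 1
    else:
        table[key] = {**event, "count": 1}
    return table[key]


def x_in_y(timestamps, x, y_seconds):
    """X in Y thresholding: True when at least x events fall within y seconds."""
    ts = sorted(timestamps)
    return any(ts[i + x - 1] - ts[i] <= y_seconds for i in range(len(ts) - x + 1))


# Usage: four raw events become two rows, one of them re-graded.
table = {}
feed = [
    {"node": "core-rtr-01", "name": "linkDown", "severity": "critical"},
    {"node": "core-rtr-01", "name": "linkDown", "severity": "critical"},
    {"node": "wlc-01", "name": "apRadioReset", "severity": "critical"},
    {"node": "wlc-01", "name": "debugTrace", "severity": "warning"},
]
for raw in feed:
    event = process(raw)
    if event is not None:
        insert(table, event)
```

In this sketch the duplicate linkDown collapses into a single row with a tally of two, the wireless controller event is downgraded before it ever reaches the table, and the debug event is dropped outright.  A real system would also track first and last occurrence times on each de-duplicated row; that is left out here to keep the example short.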

Today’s post was written in fits and starts between a plethora of phone calls and interruptions, so hopefully my message flowed reasonably well.  The key message of Step 4 is that correlation & enrichment can play a large role in making your events more useful and reducing your MTTR.  If you have not explored the benefits of such technology, then you are missing an opportunity to take your NOC capability to the next level.

Next post – Step 5 Discovery & Mapping.  For now, I bid you adieu.
