Best Practices, KPIs for Network Management
In the process of working with our customers, particularly during requirements analysis, we often get questions about “best practices.” In my experience, best practices are a great baseline, but they need to be augmented as new technology (like Virtualization/VoIP/VoD) and outage post-mortems reveal additional KPI needs. Which raises the question, “What is a KPI?” A KPI, or Key Performance Indicator, is a metric that provides the information needed to verify that a mission-critical service is monitored so that its availability, performance, and scalability are ensured. It’s also a metric that may be needed for troubleshooting. This three-part blog series discusses best-practice KPIs for three common monitoring disciplines: Network, Systems, and Applications. These best practices come from years of experience and are by no means the end-all-be-all, but they may provide some insight to anybody looking for a baseline to compare against. Your comments and stories are welcome!
The first part of this series focuses on Network Management. A company’s network is usually the lifeblood of its service offerings (and for some Service Providers, it is the offering). Some people prefer to break up the network by OSI layer: physical layer, data link layer, network layer, etc. In my opinion, a network is more easily understood if broken up into three key areas: the network devices (routers, switches, etc.), the links (Ethernet, SONET, ATM, Serial, etc.), and the services provided over them (mainly routing protocols). These areas need to be monitored closely, because if any one of them is down or degraded, the network services are impacted. Below are some of the KPIs I recommend for network performance monitoring.
Device Availability
- Source: Must be Agent-less, because if the device is down it will not typically tell you
- Method: Majority of people use ICMP Ping of the IP Address, but an SNMP poll also verifies that the device’s CPU can respond, which makes it a more accurate check
- Use: Knowing that a device is down or inaccessible is key to knowing an outage may be occurring. It’s also important for availability reporting in general.
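As a rough sketch of how the two methods above could be combined, assume the poller already holds a ping result and an SNMP result for each polling cycle. The function names and the "degraded" state are illustrative choices, not part of any particular product:

```python
def classify_device(ping_ok: bool, snmp_ok: bool) -> str:
    """Combine two agent-less checks into one device state."""
    if snmp_ok:
        return "up"        # the SNMP agent answered, so the device is working
    if ping_ok:
        return "degraded"  # reachable by ICMP, but the SNMP agent is silent
    return "down"          # neither check got a response


def availability_percent(states: list[str]) -> float:
    """Percentage of polls in which the device was classified as 'up'."""
    if not states:
        raise ValueError("no polls recorded")
    return 100.0 * states.count("up") / len(states)


# Hypothetical poll history: (ping_ok, snmp_ok) for four polling cycles.
history = [(True, True), (True, True), (True, False), (False, False)]
states = [classify_device(p, s) for p, s in history]
print(states, availability_percent(states))
```

Treating an SNMP answer as "up" even when ping fails is a deliberate assumption here, since ICMP is sometimes filtered along the path.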
Interface Availability
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use SNMP MIB2 because it’s so commonly supported
- Use: Knowing that an interface is down indicates that the link attached to the interface is being interrupted
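For illustration, the MIB2 interface table reports both an administrative status (ifAdminStatus) and an operational status (ifOperStatus) as integers. A minimal sketch of turning those two values into an alarm decision might look like this. The alarm policy itself is an assumption, and the values are taken as already polled:

```python
# Integer codes defined for ifOperStatus and ifAdminStatus in the IF-MIB.
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}
IF_ADMIN_STATUS = {1: "up", 2: "down", 3: "testing"}


def interface_alarm(admin_status: int, oper_status: int) -> bool:
    """Alarm only when an interface that should be up is reported down.

    Assumed policy: an administratively disabled port is not an outage,
    and only 'down' or 'lowerLayerDown' count as a failure.
    """
    wanted_up = IF_ADMIN_STATUS.get(admin_status) == "up"
    failed = IF_OPER_STATUS.get(oper_status) in ("down", "lowerLayerDown")
    return wanted_up and failed


print(interface_alarm(1, 2))  # admin up, oper down
print(interface_alarm(2, 2))  # admin down, oper down
```

Checking the administrative status first keeps intentionally shut-down ports from showing up as link outages.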
Link Availability
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use Cisco’s IPSLA or Juniper’s RPM – however, vendors like Brix Networks (now EXFO) have appliances that duplicate their functionality
- Use: Knowing that a link’s ability to pass traffic is interrupted is vital to knowing network health
Bandwidth Utilization
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use SNMP MIB2 because it’s so commonly supported
- Use: Knowing the quantity of traffic over a link is vital information for Network Capacity Planning and network infrastructure management
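Utilization is normally derived from two samples of an octet counter (for example ifInOctets) rather than read directly. A minimal sketch of that arithmetic follows, assuming the counter values, the polling interval, and the link speed have already been collected. The function names are illustrative:

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two counter samples, allowing for a single wrap."""
    return (current - previous) % (1 << bits)


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, speed_bps: float,
                        bits: int = 32) -> float:
    """Percent of link capacity used between two polls of an octet counter."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    delta_bits = counter_delta(prev_octets, curr_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * speed_bps)


# 37,500,000 octets in 300 seconds on a 100 Mbit/s link.
print(utilization_percent(0, 37_500_000, 300, 100_000_000))
```

On fast links a 32-bit counter can wrap more than once between polls, which this sketch cannot detect. Passing `bits=64` with the 64-bit counters (such as ifHCInOctets), where the device supports them, avoids that.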
Link Performance
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use Cisco’s IPSLA or Juniper’s RPM – however, vendors like Brix Networks (now EXFO) have appliances that duplicate their functionality
- Use: The quality of a link’s health, like its packet loss percentage and round trip time latency, is key to detecting potential problems with that link
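Whatever probe produces the measurements, the two numbers mentioned above reduce to simple arithmetic over a set of probe results. Here is a minimal sketch, assuming each probe yields a round trip time in milliseconds or `None` when it was lost. That input format is an assumption for illustration, not the output format of any vendor's probe:

```python
from statistics import mean
from typing import Optional


def summarize_probes(rtts_ms: list[Optional[float]]) -> tuple[float, Optional[float]]:
    """Return (packet loss percent, average RTT in ms) for one test run.

    A lost probe is represented as None; the average covers answered probes only.
    """
    if not rtts_ms:
        raise ValueError("no probes recorded")
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    avg_rtt = mean(answered) if answered else None
    return loss_pct, avg_rtt


print(summarize_probes([10.0, 12.0, None, 14.0]))
```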
Beyond these five performance indicators, there are a number of secondary KPIs that some network directors need to observe. For others, these metrics may just be useful information for troubleshooting, but I still recommend including these KPIs in your list.
Device Latency/Packet Loss
- Source: Must be Agent-less, because if the device is down it will not typically tell you
- Method: Majority of people use ICMP Ping of the IP Address, but accuracy requires multiple polling points, which is often too costly. IPSLA is a better choice
- Use: Device latency and packet loss are useful information, but sometimes very inaccurate because of the placement of the pollers. However, this is valuable information for your monitoring system because it provides scaling data your monitoring administrator may need.
Interface Statistics
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use SNMP MIB2 because it’s so commonly supported
- Use: Understanding the errors, discards, and packet throughput may greatly assist in catching and troubleshooting problems. This is usually only useful for those hard-to-find problems
Link Quality Statistics
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use Cisco’s IPSLA or Juniper’s RPM – however, vendors like Brix Networks (now EXFO) have appliances that duplicate their functionality
- Use: Traffic metrics like jitter and MOS/R-Factor scores are absolutely necessary for services like VoIP or VoD. Even for networks not running these services, they may still be useful metrics to monitor
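To make these metrics a little more concrete, here is a minimal sketch of two calculations. Jitter is approximated as the average change in delay between consecutive packets, which is a simplification of what real probes compute. The R-Factor to MOS conversion is the commonly cited one from the ITU-T E-model (G.107). The function names are illustrative:

```python
from statistics import mean


def mean_jitter_ms(delays_ms: list[float]) -> float:
    """Average absolute change in delay between consecutive packets."""
    if len(delays_ms) < 2:
        raise ValueError("need at least two delay samples")
    return mean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))


def r_factor_to_mos(r: float) -> float:
    """Convert an E-model R-Factor (0-100) to an estimated MOS (1.0-4.5)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


print(mean_jitter_ms([10.0, 12.0, 11.0]))
print(round(r_factor_to_mos(93.2), 2))
```

An R-Factor of 93.2, the E-model default with no impairments, maps to a MOS of roughly 4.4, which is why VoIP scores rarely reach 5.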
Device Health
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use vendor-specific SNMP MIBs, which can cause a lot of problems for your monitoring systems
- Use: Most network devices expose local resource metrics (CPU/Memory) that indicate their own health, even when a shortage is not directly service-impacting. Monitoring these resources is a good idea in general, as it may save your bacon one day
CoS/QoS Statistics
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use vendor-specific SNMP MIBs, which can cause a lot of problems for your monitoring systems
- Use: With the latest generation of network architectures (MPLS, for example), different classes of service can be deployed and thus need to be monitored. An example of this is bandwidth by CoS, which is vital capacity information. These metrics are not useful to most businesses that do not deploy a wide variety of services (VoIP, VoD, teleconferencing, etc.)
RMON/Netflow Statistics
- Source: Usually Agent (Probe) based, however some RMON/Flow statistics are available via SNMP on some devices
- Method: Majority of people use Cisco’s Netflow, but J-Flow from Juniper, C-Flow from Alcatel, and standards-based options like RMON, IPFIX, and sFlow are also available
- Use: These statistics describe packet/bandwidth counts by port/protocol on your network. They allow you to look inside the standard bandwidth stats and see who is using what, and where. Traditionally they carry a lot of overhead, which can outweigh their usefulness, but the data may replace the need to use sniffers to troubleshoot problems, which reduces your MTTR.
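The "who is using what, where" view is essentially an aggregation over flow records. Here is a minimal sketch, assuming the records have already been exported and decoded into dictionaries. The field names `src`, `dst`, and `bytes` are hypothetical, not the layout of any particular flow format:

```python
from collections import Counter


def top_talkers(flows: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n source addresses that sent the most bytes."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return totals.most_common(n)


# Hypothetical decoded flow records.
flows = [
    {"src": "10.0.0.1", "dst": "10.0.1.9", "bytes": 1000},
    {"src": "10.0.0.2", "dst": "10.0.1.9", "bytes": 300},
    {"src": "10.0.0.1", "dst": "10.0.2.4", "bytes": 500},
]
print(top_talkers(flows, n=2))
```

The same pattern works for grouping by destination, port, or protocol by changing the key used in the loop.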
The following optional KPIs are metrics that some people use because the environment they are monitoring requires it, either due to architecture or process requirements. Traditionally most people don’t use them, but they do have value.
System UpTime
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use SNMP MIB2; it’s a standard value (sysUpTime) available from most SNMP agents
- Use: System uptime shows how long it has been since the device was last rebooted. This is key for some devices, as they become unstable if they are not regularly rebooted. It’s also valuable for tracking instability in devices that are rebooting themselves
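One way to spot a device that is rebooting itself is to compare consecutive sysUpTime samples. The value is reported in hundredths of a second in a 32-bit field, so it wraps after roughly 497 days. Here is a minimal sketch that allows for that wrap, assuming the two samples and the time between polls are already known. The function name and tolerance are illustrative:

```python
TICKS_PER_SECOND = 100   # sysUpTime counts hundredths of a second
MAX_TICKS = 2 ** 32      # 32-bit value, wraps after roughly 497 days


def rebooted(prev_ticks: int, curr_ticks: int, elapsed_s: float,
             tolerance_s: float = 30.0) -> bool:
    """True if the uptime did not advance as expected between two polls."""
    expected = (prev_ticks + round(elapsed_s * TICKS_PER_SECOND)) % MAX_TICKS
    diff = abs(curr_ticks - expected)
    diff = min(diff, MAX_TICKS - diff)  # distance around the wrap point
    return diff > tolerance_s * TICKS_PER_SECOND


print(rebooted(1_000_000, 1_030_000, 300))  # uptime advanced normally
print(rebooted(1_000_000, 6_000, 300))      # uptime restarted from near zero
```

Comparing against the expected value, rather than only checking whether the uptime went backwards, also catches a reboot that happens shortly after a poll on a freshly started device.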
Routing Statistics
- Source: Agent-less is required because most network devices do not actively report metrics via onboard agents
- Method: Majority of people use vendor-specific SNMP MIBs, which can cause a lot of problems for your monitoring systems
- Use: Depending upon the architecture of your network, tracking things like OSPF/BGP neighbors may be extremely valuable. If the routing is constantly changing, it may be degrading your performance, but usually the devices will send faults indicating this as well.
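As one hedged example of what "constantly changing" could mean in practice, a poller can count how often a BGP neighbor drops out of the established state across a series of samples. The sketch below assumes the peer state has been polled as the integer codes used by the standard BGP4-MIB bgpPeerState object. The function name and the sample data are illustrative:

```python
# State codes defined for bgpPeerState in the BGP4-MIB.
BGP_PEER_STATE = {
    1: "idle", 2: "connect", 3: "active",
    4: "opensent", 5: "openconfirm", 6: "established",
}
ESTABLISHED = 6


def count_flaps(states: list[int]) -> int:
    """Count transitions from established to any other state."""
    return sum(
        1
        for before, after in zip(states, states[1:])
        if before == ESTABLISHED and after != ESTABLISHED
    )


# Hypothetical series of polled states for one neighbor.
samples = [6, 6, 3, 6, 1, 6]
print(count_flaps(samples))
```

A flap that starts and recovers entirely between two polls is invisible to this approach, which is one reason the fault notifications sent by the devices remain useful alongside polling.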
Your organization’s KPIs must be created and decided internally, but hopefully this list can be helpful when determining them. Here at Monolith, we know these KPIs well because we deliver the capability to monitor them for our customers. You can look over our network monitoring datasheet, where we list all of the functionality required to collect, display, and report these KPIs.
One other factor to consider is the rapid increase in complexity that’s challenging most CSPs and large enterprises. Expansion of products/services, consolidation and M&A, and increased customer demand for service assurance will continue to drive escalating needs for monitoring network performance. At Monolith, we utilize a unified approach to IT infrastructure management that greatly simplifies this process. Simplification through unification…that’s the ideal for KPIs.
Read Part Two of the Series: KPIs for Systems Management