KPIs Continued: Systems Management
KPIs for Systems Management
This three-part blog series discusses best-practice Key Performance Indicators (KPIs) for three common monitoring disciplines: Network, Systems, and Applications. These best practices come from years of experience and are by no means the end-all-be-all, but they may provide some insight to anybody looking for a baseline. Comments and stories are welcome!
The second part of this series focuses on Systems Management. Most businesses rely heavily upon the services provided by their systems group to accomplish everything from running vital revenue-generating applications to simple email hosting and messaging. This discipline is by far the most fragmented of the three we’re going to discuss, so I have included the following areas under this discipline: Server Hardware, Operating Systems/Platforms, Virtualization, System Services (Email, Web, DNS, DHCP, Active Directory, etc), and Middleware applications (Database, JMX, Web Services, etc).
Systems Management is always changing and increasing in complexity; additions like cloud computing present greater management and monitoring challenges than ever before. Below are some of the KPIs I recommend when monitoring systems.
These primary KPIs are, in my opinion, the absolute must-monitor KPIs.
Device Availability
- Source: Agent-less is required, because an agent running on a server that is down usually cannot tell you so
- Method: The majority of people use an ICMP ping of the IP address, but polling via SNMP or WMI verifies that the device can actually process a request rather than just answer on the network, which is more accurate
- Use: Knowing that a server is down or inaccessible is key to knowing an outage may be occurring. It’s also important for availability reporting in general
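For the availability reporting mentioned above, here is a minimal sketch of turning a history of poll results into a percentage. The function name `availability_percent` and the sample history are hypothetical; how each poll is actually performed (ping, SNMP, WMI) is left to your monitoring tool.

```python
from typing import Sequence


def availability_percent(poll_results: Sequence[bool]) -> float:
    """Return the share of successful polls as a percentage.

    Each entry is True if the device answered that poll and False if it
    did not. An empty history is reported as 0.0.
    """
    if not poll_results:
        return 0.0
    return 100.0 * sum(poll_results) / len(poll_results)


# Hypothetical history: 10 polls, 1 of them missed.
history = [True] * 9 + [False]
print(f"Availability: {availability_percent(history):.1f}%")
```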
Server/OS Health Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Various methods are available, from agent-less SNMP/WMI polling to proprietary agents
- Use: Basic information like total CPU and memory utilization and disk capacity is the “barebones” of OS monitoring. Without basic health stats, you are not accurately monitoring the resource
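As a rough illustration of the basic health stats above, the sketch below reads disk capacity with Python's standard library and compares a set of metrics against thresholds. The metric names, threshold values, and the CPU/memory numbers are all made up for the example; a real collector (SNMP, WMI, or an agent) would supply them.

```python
import shutil
from typing import Dict, List


def disk_used_percent(path: str = ".") -> float:
    """Return used capacity of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def breaches(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that meet or exceed their threshold."""
    return sorted(
        name for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )


# Hypothetical thresholds and placeholder CPU/memory readings.
thresholds = {"cpu_pct": 90.0, "mem_pct": 85.0, "disk_pct": 80.0}
metrics = {"cpu_pct": 42.0, "mem_pct": 91.5, "disk_pct": disk_used_percent(".")}
print(breaches(metrics, thresholds))
```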
Process/Service Availability
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Various methods are available, from SNMP for processes to synthetic transaction tests for services
- Use: Tracking process and service availability is key to making sure that the server is doing its job (being up/available is only half its job!). Every system service will use a layer 4 port and usually has a process running on the server, so it is extremely important that you know when they are not available
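Since every service listens on a layer 4 port, the simplest agent-less check is to try opening a TCP connection to it. This is only a sketch: `tcp_port_open` is a name I made up, and a successful connect shows that something is listening, not that the service behind it is answering correctly (that is what synthetic transactions are for).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs, and name resolution failures.
        return False


# Example call (hypothetical host name):
#   tcp_port_open("mail.example.com", 25)
```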
Some data centers require these secondary KPIs, while for others they may just be useful information for troubleshooting. Either way, I still recommend including them in your list.
Advanced CPU Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Various methods are available, from agent-less SNMP/WMI polling to proprietary agents, most of which should have this capability
- Use: Being able to track load averages, CPU distributions (Wait, Kernel, User, System %), and per-process CPU/memory usage may provide key information when troubleshooting. A process that is no longer using any memory/CPU may be hung, while one that is using too much may be leaking memory. There are lots of uses for this type of information, most of them in troubleshooting
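CPU distributions are normally derived from cumulative counters sampled twice, not read directly. Here is a minimal sketch assuming counters shaped like the per-state tick totals on the first line of /proc/stat on Linux; the function name and the sample values are hypothetical.

```python
from typing import Dict


def cpu_distribution(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, float]:
    """Return each CPU state's share of the elapsed time, in percent.

    `prev` and `curr` map a state name (user, system, iowait, idle, ...)
    to a cumulative tick counter taken at two points in time.
    """
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total <= 0:
        return {state: 0.0 for state in curr}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


# Hypothetical samples taken one polling interval apart.
before = {"user": 1000, "system": 400, "iowait": 100, "idle": 8500}
after = {"user": 1300, "system": 500, "iowait": 200, "idle": 9000}
print(cpu_distribution(before, after))
```

With these sample numbers the interval works out to 30% user, 10% system, 10% I/O wait, and 50% idle.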
Service Response Times
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Most are done via synthetic transaction tests, either locally or remotely
- Use: Being able to track service response times allows for trending to spot abnormal behavior or to support capacity management. Time-outs would be tracked via availability
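A synthetic transaction test boils down to "run a scripted request and time it." The sketch below wraps any callable in a timer; `timed_check` is a hypothetical helper, and the transaction itself (an HTTP GET, a DNS lookup, a mail login) is whatever you pass in.

```python
import time
from typing import Callable, Optional


def timed_check(transaction: Callable[[], object]) -> Optional[float]:
    """Run one synthetic transaction and return its duration in seconds.

    Returns None if the transaction raises, so failures and time-outs can
    be counted against availability instead of skewing response trends.
    """
    start = time.perf_counter()
    try:
        transaction()
    except Exception:
        return None
    return time.perf_counter() - start


# Stand-in transaction for the example; a real one would call the service.
elapsed = timed_check(lambda: time.sleep(0.05))
print(f"Response time: {elapsed:.3f}s")
```

Returning None on failure keeps the two KPIs separate: the durations feed trending, and the failures feed availability.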
General Server Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Various methods are available, from agent-less SNMP/WMI polling to proprietary agents, most of which should have this capability
- Use: Being able to track things like # of users/processes/connections may provide vital clues when troubleshooting or doing capacity management. These may be used for application management as well
Network Interface Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain (see a common theme here?)
- Method: The majority of people use SNMP MIB-II because it’s so commonly supported
- Use: Understanding the bandwidth, errors, discards, and packet throughput may greatly assist in catching network problems during troubleshooting. Most servers are dual attached, and I have seen my fair share of network problems between the server and the access switch. Usually only useful on those hard-to-find problems, though
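Bandwidth figures from MIB-II come from polling octet counters such as ifInOctets and dividing the change by the interface speed (ifSpeed, in bits per second). Here is a minimal sketch, assuming 32-bit counters that wrap at most once between polls; the function names and readings are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter readings, allowing for one wrap."""
    return (curr - prev) % modulus


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_seconds: float, if_speed_bps: int) -> float:
    """Percent of link capacity used in one direction over the interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_seconds * if_speed_bps)


# Hypothetical readings 300 seconds apart on a 100 Mbit/s interface.
# The counter wrapped between the two polls, so the second value is smaller.
print(utilization_percent(4_294_000_000, 374_032_704, 300, 100_000_000))
```

On fast links a 32-bit counter can wrap more than once per polling interval, which this sketch cannot detect. That is one reason to prefer 64-bit counters or shorter intervals where they are available.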
System UpTime
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: The majority of people use SNMP MIB-II; it’s a standard counter available on most SNMP agents
- Use: System uptime shows how long it has been since the device was last rebooted. This is key for some devices, as they become more unstable if they are not regularly rebooted. It’s also valuable for tracking instability in devices that are rebooting themselves
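SNMP reports uptime as TimeTicks, which are hundredths of a second. The sketch below converts a reading and flags a probable reboot when the value drops between polls; both function names are hypothetical. One caveat: MIB-II's sysUpTime measures time since the SNMP agent was re-initialized, which usually, but not always, matches the last system boot.

```python
from datetime import timedelta


def ticks_to_uptime(timeticks: int) -> timedelta:
    """Convert SNMP TimeTicks (hundredths of a second) to a timedelta."""
    return timedelta(seconds=timeticks / 100)


def rebooted_since_last_poll(prev_ticks: int, curr_ticks: int) -> bool:
    """Treat a drop in uptime between polls as a probable restart.

    Caveat: a 32-bit TimeTicks value wraps after roughly 497 days, which
    looks the same as a restart to this simple check.
    """
    return curr_ticks < prev_ticks


# Hypothetical readings from two consecutive polls.
print(ticks_to_uptime(8_640_000))
print(rebooted_since_last_poll(8_640_000, 30_000))
```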
RMON/Netflow Statistics
- Source: Usually Agent (Probe) based, however some RMON/Flow statistics are available via SNMP on some devices
- Method: The majority of people use Cisco’s NetFlow, but J-Flow from Juniper, C-Flow from Alcatel, and other standards-based options like RMON, IPFIX, and sFlow are available
- Use: These stats describe packet/bandwidth counts by port/protocol on your network, and can also be collected on hosts via VMware or an open source host agent. They let you see inside the standard bandwidth stats to find out who is using what, and where. Traditionally they have carried a lot of overhead, which can outweigh their usefulness, but the data may replace the need to use sniffers to troubleshoot problems, which reduces your MTTR.
Virtualization Statistics
- Source: Available via either VMware ESX SNMP or a proprietary host agent
- Method: Collection methods vary depending on the level of detail required. The VMware API, Windows WMI for Virtual Server, and a host of proprietary agents provide tons of good tracking data
- Use: Data like the number of guest hosts configured and the status/performance of those hosts provides valuable information for monitoring your virtualized environment
Disk I/O Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain
- Method: Various methods are available, from agent-less SNMP/WMI polling to proprietary agents, most of which should have this capability
- Use: Tracking I/O stats like disk read/write rates and the number of I/O operations during a polled interval will help find rogue processes before they start crashing hard drives or slowing applications
The optional KPIs listed below all depend on the types of services and middleware provided by the systems department. I count middleware as being everything from traditional databases to JMX brokers and web services. Your middleware vendor should provide you with a list of KPIs in addition to the ones listed below. I have listed only the most common middleware you might need to monitor, which is the database.
Database Statistics
- Source: Agent or Agent-less options are available. I usually prefer agent-less, as it’s easier to maintain, but for some database vendors this is not possible
- Method: The majority of people use custom vendor API agents and then integrate them into their monitoring solution
- Use: Database stats such as locks, connections, transactions, hit/cache ratios, and others are all valuable troubleshooting data. This should not be confused with availability/response time which is included above as a must
In conclusion, your KPIs must be created and decided internally, but hopefully this list can be helpful when determining them. Here at Monolith, we know these KPIs well because we deliver the capability to monitor them for our customers. Here is a link to our systems monitoring datasheet, where we list all of the functionality required to collect, display, and report these KPIs.
Continue to Part Three: KPIs for Application Management–>
<– Back to Part One: KPIs for Network Management