Linux Instrumentation issues under z/VM

The issues are simple: your Linux performance monitor reports bad CPU numbers, requires too much overhead to run, and will become your performance problem. And until you go to production, it won't matter...


Linux CPU Reporting under VM is WRONG! When running in a virtual environment such as z/VM, ALL Linux CPU accounting is wrong. This presentation has been given many times at SHARE and IBM technical conferences, yet people still want to depend on programs such as "top" or "rmfpm" to provide their CPU measurements. These tools just do NOT work.

ALL agents and performance programs in Linux get their data from the same /proc file system. It is easy to show that this data is wrong by design. Linux measures CPU using a time-of-day sampling technique that works well enough in a dedicated environment.
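To make the shared data source concrete, here is a minimal Python sketch of how a tool might derive a CPU busy percentage from two samples of /proc/stat. It is an illustration only, not the implementation of "top" or of any particular agent. The function names are hypothetical, and it deliberately looks at only the first four counters on the aggregate "cpu" line (user, nice, system, idle).

```python
import time


def parse_cpu_line(stat_text):
    """Return (user, nice, system, idle) ticks from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(v) for v in fields[1:5])
            return user, nice, system, idle
    raise ValueError("no aggregate 'cpu' line found")


def busy_percent(before, after):
    """Percent of elapsed ticks spent in user, nice, or system between two samples."""
    d_user, d_nice, d_system, d_idle = (b - a for a, b in zip(before, after))
    total = d_user + d_nice + d_system + d_idle
    if total <= 0:
        return 0.0
    return 100.0 * (d_user + d_nice + d_system) / total


def sample_busy_percent(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return busy percent."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return busy_percent(before, after)


if __name__ == "__main__":
    print(f"CPU busy over the last second: {sample_busy_percent():.1f}%")
```

Every number such a tool reports is only as good as the counters Linux itself maintains, which is why any error in those counters shows up in every agent that reads them.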

When running in a virtual environment, Linux accounts for processor time as if the processor were dedicated. In an idle system, where there is only one Linux server running under z/VM, the difference in reporting is not usually noticed. But under load, when there are two or more servers active, Linux does not know whether there are delays getting CPU or not. This results in Linux reporting more CPU utilization than was actually used.
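The effect can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of the actual kernel accounting code: assume Linux charges a process for the elapsed wall-clock time it appeared to be running, and assume the virtual processor was really dispatched by z/VM for only a fraction of that time. The function name and the parameter dispatch_fraction are hypothetical.

```python
def reported_cpu_percent(cpu_seconds_used, interval_seconds, dispatch_fraction):
    """Toy model of CPU percent as seen by Linux under contention.

    cpu_seconds_used:   real processor time the work needed
    interval_seconds:   length of the reporting interval
    dispatch_fraction:  assumed share of wall-clock time (0 < f <= 1) during
                        which z/VM really dispatched the virtual processor
    """
    if not 0 < dispatch_fraction <= 1:
        raise ValueError("dispatch_fraction must be in (0, 1]")
    # Linux is assumed to charge elapsed time, so waiting counts as "used".
    charged_seconds = min(cpu_seconds_used / dispatch_fraction, interval_seconds)
    return 100.0 * charged_seconds / interval_seconds


if __name__ == "__main__":
    # Same real requirement (0.1 s per second), increasing contention.
    for fraction in (1.0, 0.5, 0.25):
        pct = reported_cpu_percent(0.1, 1.0, fraction)
        print(f"dispatched {fraction:.0%} of the time: reported {pct:.0f}%")
```

In this model, work that really needs 10% of a processor is reported as about 40% when the virtual processor is dispatched only a quarter of the time. The requirement is unchanged; only the reported number grows.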

To test this, run "top" in one server, then log on multiple servers that are running loops. The more servers that are logged on, the more CPU "top" reports itself using - but its requirements are unchanged.

Imagine one of your servers goes into a loop. Any server that you log on to and run "top" in will show very high CPU utilization. How can you correctly determine the real problem? Or use this data for capacity planning? The VM data is required.

Velocity Software corrects the Linux CPU numbers using a unique data capture method that absolutely requires z/VM performance data for the same interval as the Linux data. This method is unique to ESALPS. No other vendor or product can collect the data concurrently and correct the data. Ask them and ask for proof....


How much CPU should your idle servers require for instrumentation? The worst performance monitor is the one that becomes the performance problem. Most of the agents that work today in Linux, Unix or NT environments are not efficient, but in those environments it does not matter. With all the cycles available on those platforms, an idle server measuring itself is not a problem.

In the shared resource environment with Linux running under z/VM, there are two issues to address.
The first issue is the cost of the instrumentation, usually in the form of an agent. If this cost is 5% of a processor, and you expect to run 100 servers, you have just allocated 5 processors to instrumentation. Not good. NETSNMP is a VERY low cost agent and readily available from SourceForge.
The second issue is the cost of measuring idle servers. Why wake up a server to see what it is doing when it is not doing anything? Waking up an idle server involves CPU, storage and paging, all of which are unnecessary, and all of which take resource away from the other servers.
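The arithmetic behind both issues is simple enough to sketch in a few lines of Python. The 5% agent cost and the 100 servers come from the example above. The count of 80 idle servers is a purely hypothetical figure chosen for illustration, and the function name is likewise an assumption.

```python
def instrumentation_processors(servers, agent_cpu_fraction, idle_servers_skipped=0):
    """Processors consumed by monitoring agents across all servers.

    servers:               total number of Linux servers under z/VM
    agent_cpu_fraction:    agent cost per measured server, as a fraction of one
                           processor (0.05 means 5%)
    idle_servers_skipped:  servers known to be idle that are never woken up
    """
    if not 0 <= idle_servers_skipped <= servers:
        raise ValueError("idle_servers_skipped must be between 0 and servers")
    return (servers - idle_servers_skipped) * agent_cpu_fraction


if __name__ == "__main__":
    # Every server measured, idle or not: 100 servers at 5% each.
    print(instrumentation_processors(100, 0.05))
    # Hypothetical: 80 of the 100 servers are idle and are left alone.
    print(instrumentation_processors(100, 0.05, idle_servers_skipped=80))
```

Measuring every server costs 5 processors in this example. Leaving the hypothetical 80 idle servers alone cuts that to 1, before counting the storage and paging saved by not waking them.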

Velocity Software's ESALPS is the Low Cost Performance Monitor! With ESALPS, the VM data tells us when a server is idle, so there is no need to request performance data from it. Because ESALPS uses NETSNMP, the agent is passive and only wakes up when data is requested - unlike other available agents in this environment.


More information about ESALPS can be found on the Velocity Software web site.