|
Performance and capacity are often discussed in a virtual environment, but there is frequently limited understanding of what can and should be measured and of how capacity planning should take place. There is an obvious inter-relationship between capacity and performance (and availability). It is quite common to see clients reporting performance issues while relying on the basic metrics of CPU and memory usage as the only key indicators. Ultimately, what is important is application delivery, productivity and usability for the customers, which is, after all, the reason computer systems exist. Applications are typically managed by application teams and systems by systems teams. It follows that service requests around performance are primarily initiated by the end users, who have visibility of application responsiveness and usability in general.

The reality is that capacity shortages cause a large proportion of all outages, and often have a direct relationship with performance. One non-IT example I've always remembered comes from my engineering days, visiting an aluminium smelter. The electricity supply (capacity) was so crucial to the business that a loss of supply for more than a few minutes meant the potlines would solidify, causing an outage lasting anywhere from months to years.

In the virtual environment, the upside is that the key metrics are much easier to see because of the shared nature of the technology, and visibility is of course key to management. The VMware client readily exposes fundamental items we are all familiar with, such as datastore used/free space and cluster/host/VM memory and CPU utilisation. Under the hood, the performance counters (and hence the APIs) expose many more metrics (around 150 in total). Some key ones are as follows:

CPU ready (cpu.ready.summation) – The amount of time spent waiting for a CPU (core) to become available. VMware presents ready time in milliseconds, whilst esxtop displays it as a percentage. This sometimes causes confusion, but the conversion is straightforward: divide the value (say 3,500) by the number of milliseconds in the interval (20 seconds at 1,000 ms per second) and multiply by 100: (3500 / (20 * 1000)) * 100 = 17.5%.

CPU usage (cpu.usage.average) – CPU utilisation expressed as a percentage of the total presented resources (i.e. for a 2 vCPU machine, 100% represents full utilisation of both vCPUs, but not necessarily of the same 2 physical host cores). This is what is visible in the VI client.

Memory swap-in (mem.swapin.average) – The rate at which VM memory is read back in from physical disk.

Memory swap-out (mem.swapout.average) – The rate at which VM memory is written out to disk. Both swap-in and swap-out are excellent indicators of insufficient host memory, more so than swap utilisation alone.

Memory usage (mem.usage.average) – This is what is displayed in the VI client, and is expressed as a percentage of granted (assigned) memory.

Disk read latency (disk.totalreadlatency.average) – The round trip time (in milliseconds) from ESX to the platter for a read request to be serviced.

Disk write latency (disk.totalwritelatency.average) – The round trip time (in milliseconds) from ESX to the platter for a write request to be serviced. Both read and write latency are good indicators of storage health, but neither should be used as the sole indicator, and the same holds true for every performance metric.

One important thing to note when looking at performance concerns clusters: the VI client and API both present CPU and memory objects for clusters as well as hosts. Cluster performance is simply an aggregate of the hosts currently in the cluster, so the figures will skew depending on which hosts are present in the cluster at the time. This will have a drastic impact on historical reporting of cluster performance if the cluster nodes are changed significantly or frequently.
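Returning to the CPU ready conversion described earlier, the arithmetic can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cpu_ready_percent` and its parameters are hypothetical, and the 20-second default interval is simply the figure used in the worked example, so it should be changed to match whatever sampling interval the value was actually collected over.

```python
def cpu_ready_percent(ready_ms, interval_seconds=20):
    """Convert a CPU ready summation (milliseconds) into a percentage.

    ready_ms         -- the cpu.ready.summation value for one sample
    interval_seconds -- length of the sampling interval (assumed 20 s here,
                        as in the worked example; adjust as needed)
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    interval_ms = interval_seconds * 1000
    # Multiply before dividing so the worked example stays exact.
    return ready_ms * 100.0 / interval_ms


# The worked example from the text: 3,500 ms over a 20-second interval.
print(cpu_ready_percent(3500))  # 17.5
```

Passing the interval explicitly keeps the same function usable if the ready value came from a longer rollup interval rather than the 20-second one assumed above.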
The VMware acquisition of B-Hive in 2008 was no doubt intended to provide a higher-level, more orchestrated approach to managing application performance, rather than simply systems performance, and to align those performance characteristics with SLAs. The big picture is a virtual world in which we have increased visibility and understanding of performance and its relationship to the physical hosting infrastructure, helping us to plan, manage, integrate and report better.

Written by Simon Price of Technical Architecture Solutions. Also published on the TAS blog. |