Programs often run out of memory or grind to a halt. Chris Oldwood demonstrates how to add instrumentation to your code to track its performance.
The days when an application was a simple process that ran entirely locally are long gone. Most systems these days are complex beasts that reach out to a variety of services which can be hosted both locally and remotely. As the number of links in the chain grows, so does the potential for a performance problem to surface. And as we delegate more and more work to 3rd party services instead of owning them, sometimes the best we can hope for is to keep a very close eye on how they’re behaving.
The internal components we rely on will also continually get upgraded with new features and bug fixes, so we need to make sure that any changes in their performance characteristics are still within our expected tolerances. Even during day-to-day operation there will be spikes in volumes and anomalous behaviour in both our own code and that of our upstream and downstream dependencies that will need to be investigated.
If you’ve ever had a support query from a user that just says “the system is running like a dog” and you have no performance data captured from within the bowels of your application then you’ve got an uphill battle. This article explores what it takes to add some basic instrumentation to your code so that you can get a good idea of what is happening in and around your system at any time.
System-scale monitoring
Your operations team will no doubt already be monitoring the infrastructure, so why isn’t that enough? Well, for starters they are almost certainly only monitoring your production real estate – your development and test environments are likely to be of little importance to them. And yet those are the environments where you want to start seeing any performance impact from your changes so that they can be dealt with swiftly before reaching production.
The second reason is that they only have a 1,000 foot view of the system. They can tell you about the peaks and troughs of any server, but when an anomaly occurs they can’t tell you about which service calls might have invoked that behaviour. Databases in particular have a habit of behaving weirdly as the load spikes and so knowing which query is being blocked can give you clues about what’s causing the spike.
What to measure
Sir Tony Hoare tells us that “premature optimisation is the root of all evil” [Hoare] and therefore the first rule of thumb is to measure everything we can afford to. In a modern system where there are umpteen network calls and file I/O there is very little for which the act of measuring will in any way significantly degrade the performance of the operation being measured.
Of course if you write an instrumentation message to a log file for every disk I/O request you’ll probably spend more time measuring than doing useful work and so the level of measurement needs to be granular enough to capture something useful without being too intrusive or verbose. This may itself depend on how and where you are capturing your data as you probably don’t want to be contending with the resource that you’re trying to measure.
As a starting point I would expect to measure every call out to a remote service, whether that is a web service, database query, etc. Every remote call has so many other moving parts in play that it’s almost certainly going to act up at some point. Reading and writing files to disk, especially when that disk is based remotely on a NAS/SAN/DFS is also a must, although the granularity would be to read/write the entire file. On top of those any intensive memory or CPU hogs should also get instrumented as a matter of course.
There is one scenario when I would consider discarding the measurements for a task and that’s when an exception has been thrown. If an error has occurred you cannot tell how much of the operation was performed, which for a lost connection could be none, and so having lots of dubious tiny points in your data will only succeed in distorting it.
Types of instrument
You only need a few basic instruments to help you capture most of the key metrics. Some of those instruments are ‘real’, whereas others may need to be synthesized by using APIs from the OS or language runtime.
Wall clock
By far the simplest instrument is the good old fashioned clock. The term ‘wall clock’ is often used to distinguish it from other internal clocks as it measures elapsed time as perceived by the user (or operator). These are probably the easiest to implement as virtually every runtime has the ability to report the current date and time, and so therefore you can calculate the difference.
The main thing you need to watch out for is the difference between the precision and the accuracy of the clock. Although you may be able to format the date and time down to 1 nanosecond, that doesn’t mean it’s captured at that resolution. For example, the .Net System.DateTime type has a precision measured in nanoseconds but is only accurate to around 16 ms [Lippert]. As things stand today I have found this good enough to capture the vast majority of ‘big picture’ measurements.
There are often other platform-specific alternatives to the classic wall clock. Some are old fashioned, like the original GetTickCount() API in Windows, which also only has a resolution of 10–16 ms on the NT lineage (it was a whopping 50 ms under 3.x/95). The Windows QueryPerformanceCounter() API in contrast has a much, much higher resolution, due to the hardware used to derive it, and it’s from this that the .Net Stopwatch type gains its sub-microsecond precision and accuracy [Lippert]. However, be warned that this impressive feat is heavily dependent on the hardware and is therefore more usable server-side than client-side.
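To make the distinction concrete, here is a minimal sketch that times the same operation with both DateTime and Stopwatch; the static Stopwatch.IsHighResolution field reveals whether the high-resolution counter is actually available on the machine it runs on.

using System;
using System.Diagnostics;
using System.Threading;

static class ClockComparison
{
  public static void Main()
  {
    // Coarse-grained: good enough for most 'big picture' timings.
    var start = DateTime.UtcNow;
    Thread.Sleep(50);
    var coarse = DateTime.UtcNow - start;

    // High-resolution: backed by QueryPerformanceCounter() on Windows.
    var stopwatch = Stopwatch.StartNew();
    Thread.Sleep(50);
    stopwatch.Stop();

    Console.WriteLine("DateTime:  {0} ms", coarse.TotalMilliseconds);
    Console.WriteLine("Stopwatch: {0} ms (IsHighResolution={1})",
                      stopwatch.Elapsed.TotalMilliseconds,
                      Stopwatch.IsHighResolution);
  }
}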
Specialised clocks
Internally operating systems often track metrics around process and thread scheduling. One such bit of data Windows provides is the amount of time spent in the OS kernel as opposed to user space. Coupled with the elapsed time you can discover some useful facts about your process.
If it’s a high-performance computing (HPC) style engine that is expected to be CPU bound, you might expect all its time to be spent in user space; if much of the elapsed time shows up in neither user nor kernel time then you’re probably doing more I/O than you expect. I have observed a high amount of time in the kernel when there are a large number of exceptions bouncing around (and being ignored) or when the process is doing an excessive amount of logging. The converse can also be true of an I/O bound task that seems to be consuming too much CPU, such as during marshalling.
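If you want to see those figures for yourself, the sketch below reads the process-wide user and kernel times via the .Net Process class; the same calls could back a CPU instrument in the framework described later.

using System;
using System.Diagnostics;

static class CpuTimeSnapshot
{
  public static void Main()
  {
    var process = Process.GetCurrentProcess();

    // Burn a little CPU so there is something to report.
    double total = 0;
    for (int i = 0; i < 10000000; ++i)
      total += Math.Sqrt(i);

    process.Refresh();
    Console.WriteLine("User:   {0:F0} ms", process.UserProcessorTime.TotalMilliseconds);
    Console.WriteLine("Kernel: {0:F0} ms", process.PrivilegedProcessorTime.TotalMilliseconds);
    Console.WriteLine("Total:  {0:F0} ms", process.TotalProcessorTime.TotalMilliseconds);
  }
}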
Network latency
Prior to Windows 2000 the clocks on Windows servers were notoriously bad, but with the addition of support for Kerberos the situation got much better (it had to) as servers now synchronised their clocks with a domain controller. With servers having a much better idea of what the current time is, you can begin to capture the network latency incurred by a request – this is the time it takes for the request to be sent and for the response to be received.
You can measure the latency by timing the operation from the client’s perspective (the remote end, server-wise) and also within the service handling the request (the local end, server-wise) and then calculating the difference. For example, if the client sees the request taking 10 seconds and the service knows it only spent 8 seconds processing it, the other 2 seconds must be spent in and around the infrastructure used to transport the request and response. Although you might not be able to correctly apportion the time to the request or response sides (as that relies on the absolute times) you can at least calculate the total latency which only relies on the relative differences.
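As a sketch of that arithmetic, assuming the client and the service each record their own elapsed time for the same request, the total transport latency simply falls out as the difference between the two relative durations:

using System;

static class LatencyExample
{
  // Total transport latency relies only on the two relative durations,
  // not on the client and server clocks agreeing on absolute time.
  public static TimeSpan TotalLatency(TimeSpan clientElapsed, TimeSpan serviceElapsed)
  {
    return clientElapsed - serviceElapsed;
  }

  public static void Main()
  {
    var clientElapsed = TimeSpan.FromSeconds(10);  // measured at the client
    var serviceElapsed = TimeSpan.FromSeconds(8);  // reported by the service

    Console.WriteLine("Latency: {0} s",
                      TotalLatency(clientElapsed, serviceElapsed).TotalSeconds);
  }
}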
This kind of metric is particularly useful anywhere you have a queue (or non-trivial buffering) in place as it will tell you how long a message has been stuck in the pending state. For example in a grid computing scenario there is usually a lot of infrastructure between your client code and your service code (which is executed by a remote engine process) that will queue your work and marshal any data to and from the client. However, if the queue has the ability to retry operations you will have to allow for the fact that the latency might also include numerous attempts. This is one reason why handling failures and retries client-side can be beneficial.
Memory footprint
Aside from the use or idleness of your CPUs, you’ll also want to keep track of the memory consumed by your processes. Sadly this is probably not tracked at a thread level like CPU times because a global heap is in use. However, single-threaded processes are often used for simplicity and resilience and so you might be in a position to track this accurately in some parts of the system.
On Windows you generally have to worry about two factors – virtual address consumption and committed pages. Due to the architecture of Windows’ heaps you can end up in the mysterious position of appearing to have oodles of free memory and yet still receive an out-of-memory condition. See my previous Overload article on breaking the 4 GB limit with 32-bit Windows processes for more details [Memory]. Suffice to say that process recycling is a common technique used with both managed and native code to work around a variety of heap problems.
For services with some form of caching built-in memory footprint metrics will help you discover if your cache eviction policy is working efficiently or not.
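A minimal sketch of capturing those figures from managed code, again via the Process class; PrivateMemorySize64 roughly corresponds to committed private bytes and VirtualMemorySize64 to virtual address space consumption.

using System;
using System.Diagnostics;

static class MemorySnapshot
{
  public static void Main()
  {
    var process = Process.GetCurrentProcess();

    // Committed (private) bytes versus reserved virtual address space.
    Console.WriteLine("Private bytes: {0:N0}", process.PrivateMemorySize64);
    Console.WriteLine("Virtual bytes: {0:N0}", process.VirtualMemorySize64);
    Console.WriteLine("Working set:   {0:N0}", process.WorkingSet64);

    // For managed code the GC heap is often the figure of interest too.
    Console.WriteLine("GC heap bytes: {0:N0}", GC.GetTotalMemory(false));
  }
}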
Custom measurements
One of the problems with any measurement taken in isolation is that you often need a little more knowledge about what the ‘size’ of the task actually was. If your workload varies greatly then you’ll not know if 1 second is fast, acceptable or excessive unless you also know that it had thousands of items to process. Then when you’re analysing your data you can scale it appropriately to account for the workload.
To help provide some context to the instrument measurements it’s worth accumulating some custom measurements about the task. These might include the number of items in the collection you’re reading/writing or the number of bytes you’re sending/receiving.
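Sticking with the key/value convention used throughout this article, recording the workload size might look something like the sketch below; the key names and the shape of the measurements collection are purely illustrative.

using System.Collections.Generic;

static class CustomMeasurementsExample
{
  // Record the 'size' of the task alongside the timing data so the
  // measurements can be scaled appropriately during later analysis.
  public static void RecordWorkload(
      List<KeyValuePair<string, string>> measurements,
      int customerCount, long bytesReceived)
  {
    measurements.Add(new KeyValuePair<string, string>
                       ("Customer:Count", customerCount.ToString()));
    measurements.Add(new KeyValuePair<string, string>
                       ("Payload:SizeInBytes", bytesReceived.ToString()));
  }
}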
Causality
Every similar set of measurements you capture needs to be related in some way to an operation, and so you also need to capture something about what that is and also, if possible, some additional context about the instance of that operation. Once you start slicing and dicing your data you’ll want to group together the measurements for the same operation so that you can see how it’s behaving over time.
Additionally, when you start to detect anomalies in your systems’ behaviour you’ll likely want to investigate them further to see what led to the scenario. Whilst the date and time the data was captured is a good start, it helps if you have some kind of ‘transaction identifier’ that you can use to correlate events across the whole system. I described one such approach myself in a previous edition of Overload [Causality].
Output sinks
Once you’ve started accumulating all this data you’ll want to persist it so that you can chew over it later. There are various options, none of which are mutually exclusive, and so it might make sense to publish to multiple targets for different reasons – usually based around active monitoring or passive offline investigation.
One thing you’ll definitely want to bear in mind is how you handle errors when writing to an output sink fails. You almost certainly don’t want the failure to cause the underlying operation to also fail. If the sink is remote you could try buffering the data for a short while, but then you run the risk of destabilising the process, and eventually the machine [Memory], if you buffer too much for too long.
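One way to enforce that policy is to wrap the real sink in a decorator that swallows write failures. Here is a minimal sketch against the ISink interface used later in the article; the swallow-everything-except-out-of-memory policy is an assumption you may want to refine, for example by counting dropped writes.

using System;
using System.Collections.Generic;

public class FailSafeSink : ISink
{
  private readonly ISink _inner;

  public FailSafeSink(ISink inner) { _inner = inner; }

  public void Write(string operation,
                    IEnumerable<KeyValuePair<string, string>> measurements)
  {
    try
    {
      _inner.Write(operation, measurements);
    }
    catch (OutOfMemoryException)
    {
      throw;  // the process is probably unstable anyway
    }
    catch (Exception)
    {
      // Instrumentation is best-effort: never let it fail the operation.
    }
  }
}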
Log file
The humble (text based) log file is still with us, and for good reason. Many of us have honed our skills at manipulating these kinds of files and as long as you keep the format simple, they can provide a very effective medium.
The single biggest benefit I see in storing your instrumentation data in your log files is the locality of reference – the data about the operation’s performance is right there alongside the diagnostic data about what it was doing. When it comes to support queries you’ve already got lots of useful information in one place.
One method to help with singling out the data is to use a custom severity level, such as ‘PRF’ (for performance), as this will use one of the key fields to reduce the need for any complex pattern matching in the initial filtering. The operation itself will need a unique name, which could be derived from the calling method’s name, so that you can aggregate values. And then you’ll need the data encoded in a simple way, such as key/value pairs. Here is an example log message using that format:
2013-06-26 17:54:05.345 PRF [Fetch Customers] Customer:Count=1000;WallClock:DurationInMs=123;
Here we have the operation name – Fetch Customers – and two measurements: the number of items returned followed by the elapsed time in milliseconds. Although the simpler key names Count and Duration may appear to suffice at first, it’s wise to qualify them with a namespace as you may have more than one value captured with the same name. Including the units also helps if you capture the same value in two different formats (e.g. milliseconds and TimeSpan). This might seem redundant, but when reading a log for support reasons it helps if you don’t have to keep doing arithmetic to work out whether the values are significant or not.
Classic database
Where better to store data than a database? Your system may already capture some significant timings in your main database about the work it’s doing, such as when a job was submitted, processed and completed. And so you might also want your instrumentation data captured alongside it to be in one place. Some people are more comfortable analysing data using SQL or by pulling portions of data into a spreadsheet as not everyone is blessed with elite command line skills and intricate knowledge of the Gnu toolset.
Of course when I say ‘classic’ database I don’t just mean your classic Enterprise-y RDBMS – the vast array of NoSQL options available today are equally valid and quite possibly a better fit.
Round Robin Database
For analysis of long term trends the previous two options probably provide the best storage, but when it comes to the system’s dashboard you’ll only be interested in a much shorter time window, the last 10 minutes for example. Here you’ll want something akin to a fixed-size circular buffer where older data, which is of less interest, is either overwritten or pruned to make room for newer values.
There are a few variations on this idea; one in particular is the Round Robin Database (or RRDtool [RRDtool]). This prunes data by using a consolidation function to produce aggregate values that lower the resolution of older data whilst still maintaining some useful data points.
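A minimal sketch of the underlying idea is shown below – a fixed-size window of recent samples with a crude consolidation function (a plain mean, chosen purely for illustration).

using System;
using System.Linq;

public class SlidingWindow
{
  private readonly double[] _samples;
  private int _next;
  private long _count;

  public SlidingWindow(int capacity)
  {
    _samples = new double[capacity];
  }

  public void Add(double value)
  {
    // Overwrite the oldest sample once the buffer is full.
    _samples[_next] = value;
    _next = (_next + 1) % _samples.Length;
    ++_count;
  }

  public double Average()
  {
    // Consolidate whatever samples we currently hold.
    var used = (int)Math.Min(_count, _samples.Length);
    return (used == 0) ? 0.0 : _samples.Take(used).Average();
  }
}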
Performance counters
Another variation of the circular buffer idea is to expose counters from the process itself. Here the process is responsible for capturing and aggregating data on a variety of topics and remote processes collect them directly instead. This has the advantage of not requiring storage when you don’t need it as you then log only what you need via a remote collector. However, back at the beginning I suggested you probably want to capture everything you can afford to, so this benefit seems marginal.
One word of caution about the Windows performance counter API – it’s not for the faint of heart. It is possible that it’s got better with recent editions (or via a 3rd party framework) but there are some interesting ‘issues’ that lie in wait for the unwary. That said the concept still remains valid, even if it’s implemented another way, such as using a message bus to broadcast a status packet.
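Stripped back to basics, a counter is just a thread-safe accumulator that the process updates and something else samples periodically, whether that collector is the OS, a monitoring agent or a status-packet broadcaster. The sketch below shows one possible shape; all the names are illustrative.

using System.Threading;

public class RequestCounter
{
  private long _count;
  private long _totalDurationInMs;

  public void Record(long durationInMs)
  {
    Interlocked.Increment(ref _count);
    Interlocked.Add(ref _totalDurationInMs, durationInMs);
  }

  public long Count
  {
    get { return Interlocked.Read(ref _count); }
  }

  public double AverageDurationInMs
  {
    get
    {
      long count = Interlocked.Read(ref _count);
      return (count == 0) ? 0.0
           : (double)Interlocked.Read(ref _totalDurationInMs) / count;
    }
  }
}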
A simple instrumentation framework
Anyway, enough about the theory – what does this look like in code? You shouldn’t be surprised to learn it’s not rocket science. Basically you just want to start one or more instruments before a piece of functionality is invoked, and stop them afterwards. Then write the results to one or more sinks.
If you’re working with C++ then RAII will no doubt have already featured in your thoughts, and for C# developers the Dispose pattern can give a similar low noise solution. The effect we want to achieve looks something like Listing 1 (I’m going to use C# for my examples).
public Customers FetchCustomers()
{
  const string operation = "Fetch Customers";

  var stopwatch = new Stopwatch();
  var cpuMonitor = new CpuMonitor();
  stopwatch.Start();
  cpuMonitor.Start();

  // Do the thing that fetches the customers.

  cpuMonitor.Stop();
  stopwatch.Stop();

  var measurements = new Measurements
  {
    stopwatch.Measurements,
    cpuMonitor.Measurements
  };
  sink.Write(operation, measurements);

  return customers;
}

Listing 1
This code, which is largely boilerplate, adds a lot of noise to the proceedings that obscures the business logic and so we should be able to introduce some sort of façade to hide it. What I’d be looking for is something a little more like Listing 2.
public Customers FetchCustomers()
{
  using (MeasureScope.With(Instruments.WallClock
                         | Instruments.CpuMonitor))
  {
    // Do the thing that fetches the customers.
  }
}

Listing 2
Hopefully this looks far less invasive. Through the use of reflection in C# (or pre-processor macros in C++) we should be able to determine a suitable name for the operation using the method name.
By switching to flags for the instrument types, we’ve factored out some noise but made customising the instruments harder. Luckily customisation is pretty rare and so it’s unlikely to be a problem in practice. The façade could also provide some common helper methods to simplify things further by wrapping up the common use cases, for example:
using (MeasureScope.WithWallClock())
or
using (Measure.CpuBoundTask())
The one other minor detail that I have so far glossed over is how the output sink is determined. For cases where it needs to be customised it can be passed in from above, but I’ve yet to ever need to do that in practice. Consequently I tend to configure it in main() and store it globally, then reach out to get it from behind the façade.
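Wiring that up might look like the sketch below, assuming the static Controller.DefaultSink property that appears later in Listing 7; the Program class is purely illustrative.

public static class Program
{
  public static void Main(string[] args)
  {
    // Configure the global sink once at start-up; the façade reaches
    // out to Controller.DefaultSink behind the scenes (see Listing 7).
    Controller.DefaultSink = new LogFileSink();

    // ... run the application ...
  }
}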
The façade
The façade used by the client code can be implemented like Listing 3.
public static class MeasureScope
{
  public static Controller With(Instruments instruments)
  {
    string operation = new StackFrame(1).GetMethod().Name;

    return With(operation, instruments);
  }

  public static Controller With(string operation, Instruments instruments)
  {
    return new Controller(operation, instruments);
  }
}

Listing 3
Retrieving the calling method name in C# (prior to the caller information attributes, such as [CallerMemberName], introduced in C# 5) relies on walking the stack back one frame. Of course if you write any wrapper methods you’ll need to put the stack walking in there as well or you’ll report the wrapper method name instead of the business logic method name (see Listing 4).
public static class MeasureScope
{
  . . .

  public static Controller WithWallClock()
  {
    string operation = new StackFrame(1).GetMethod().Name;

    return With(operation, Instruments.WallClock);
  }
}

Listing 4
The same rule applies if you’ve used closures to pass your invocation along to some internal mechanism that hides grungy infrastructure code, such as your Big Outer Try Block [Errors]. See Listing 5.
public Customers FetchCustomers()
{
  return Invoke(() =>
  {
    // Do the thing that fetches the customers.
  });
}

public T Invoke<T>(Func<T> operation)
{
  string operationName = new StackFrame(1).GetMethod().Name;

  using (MeasureScope.With(operationName, Instruments.WallClock))
  {
    return operation();
  }
}

Listing 5
Bitwise flags seem to be a bit ‘old school’ these days, but for a short simple set like this they seem about right, plus you have the ability to create aliases for common combinations (see Listing 6).
[Flags]
public enum Instruments
{
  WallClock = 0x01,
  CpuMonitor = 0x02,
  MemoryMonitor = 0x04,

  CpuBoundTaskMonitor = WallClock | CpuMonitor,
}

Listing 6
The instrument controller
The behind-the-scenes controller class that starts/stops the instruments and invokes the outputting is only a tiny bit more complicated than the façade (Listing 7).
public class Controller : IDisposable
{
  public static ISink DefaultSink { get; set; }

  public Controller(string operation, Instruments instrumentFlags)
  {
    var instruments = new List<IInstrument>();

    if ((instrumentFlags & Instruments.WallClock) != 0)
      instruments.Add(new WallClock());
    . . .

    _operation = operation;
    _instruments = instruments;
    _sink = DefaultSink;

    _instruments.ForEach((instrument) => { instrument.Start(); });
  }

  public void Dispose()
  {
    _instruments.ForEach((instrument) => { instrument.Stop(); });

    var measurements = new List<KeyValuePair<string, string>>();

    foreach (var instrument in _instruments)
      measurements.AddRange(instrument.Measurements);

    _sink.Write(_operation, measurements);
  }

  private readonly string _operation;
  private List<IInstrument> _instruments;
  private ISink _sink;
}

Listing 7
The sink
The measurements that are returned from the instruments are just string key/value pairs. Given their different nature – datetime, timespan, scalar, etc. – it seems the most flexible storage format. I’ve implemented the sink in Listing 8 as a simple wrapper that writes to a log file through some unspecified logging framework.
public class LogFileSink : ISink
{
  public void Write(string operation,
                    IEnumerable<KeyValuePair<string, string>> measurements)
  {
    var buffer = new StringBuilder();

    foreach (var measurement in measurements)
      buffer.AppendFormat("{0}={1};", measurement.Key, measurement.Value);

    Log.Performance("[{0}] {1}", operation, buffer);
  }
}

Listing 8
Here I’ve assumed that the Log façade will not allow an exception to propagate outside itself. That just leaves an out-of-memory condition as the only other sort of failure that might be generated and I’d let that propagate normally as it means the process will likely be unstable.
The instruments are equally simple – a pair of methods to control it and a property to retrieve the resulting metric(s). Like all ‘good’ authors I’ve chosen to elide some error handling in the Start() and Stop() methods to keep the code simple (Listing 9), but suffice to say you probably want to be considering what happens should they be called out of sequence.
public interface IInstrument
{
  void Start();
  void Stop();

  Measurements Measurements { get; }
}

public class WallClock : IInstrument
{
  public Measurements Measurements
  {
    get { return _measurements; }
  }

  public void Start()
  {
    _measurements.Clear();

    _startTime = DateTime.Now;
  }

  public void Stop()
  {
    var stopTime = DateTime.Now;
    var difference = stopTime - _startTime;

    _measurements.Add(new KeyValuePair<string, string>
                        ("WallClock:Duration", difference.ToString()));
  }

  private DateTime _startTime;
  private Measurements _measurements = new Measurements();
}

Listing 9
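The CpuMonitor instrument used in the earlier listings is not shown here, but one possible sketch, reusing the process-wide CPU times discussed under ‘Specialised clocks’, follows; note that these figures cover the whole process rather than just the current thread, so it is most meaningful in a single-threaded worker.

public class CpuMonitor : IInstrument
{
  public Measurements Measurements
  {
    get { return _measurements; }
  }

  public void Start()
  {
    _measurements.Clear();

    var process = Process.GetCurrentProcess();
    _startUser = process.UserProcessorTime;
    _startKernel = process.PrivilegedProcessorTime;
  }

  public void Stop()
  {
    var process = Process.GetCurrentProcess();

    _measurements.Add(new KeyValuePair<string, string>
      ("CpuMonitor:UserTime",
       (process.UserProcessorTime - _startUser).ToString()));
    _measurements.Add(new KeyValuePair<string, string>
      ("CpuMonitor:KernelTime",
       (process.PrivilegedProcessorTime - _startKernel).ToString()));
  }

  private TimeSpan _startUser;
  private TimeSpan _startKernel;
  private Measurements _measurements = new Measurements();
}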
Earlier I mentioned that I would avoid writing measurements taken whilst an exception was being handled; that can be achieved in C# with the following check [Exception]. See Listing 10.
public static bool IsHandlingException()
{
  return (Marshal.GetExceptionPointers() != IntPtr.Zero
       || Marshal.GetExceptionCode() != 0);
}
. . .
if (!IsHandlingException())
{
  var measurements = new List<KeyValuePair<string, string>>();

  foreach (var instrument in _instruments)
    measurements.AddRange(instrument.Measurements);

  _sink.Write(_operation, measurements);
}

Listing 10
Summary
This article has shown that it’s relatively simple to add some instrumentation calls within your own code to allow you to track how it’s performing on a continual basis. It explained what sort of measurements you might like to take to gain some useful insights into how your service calls and computations are behaving. It then followed up by showing how a simple framework can be written in C# that requires only out-of-the box techniques and yet still remains fairly unobtrusive.
Once in place (I recommend doing this as early as possible, as in right from the very start) you’ll be much better prepared for when the inevitable happens and the dashboard lights up like a Christmas tree and the phones start buzzing.
Acknowledgements
Thanks to Tim Barrass for introducing me to the Round Robin Database as a technique for persisting a sliding window of measurements and generally helping me refine this idea. Additional thanks as always to the Overload advisors for sparing my blushes by casting their critical eye and spotting my mistakes.
References
[Causality] Chris Oldwood, ‘Causality – relating distributed diagnostic contexts’, Overload 114
[Errors] Andy Longshaw and Eoin Woods, ‘The Generation, Management and Handling of Errors (Part 2)’, Overload 93
[Exception] Ricci Gian Maria, ‘detecting if finally block is executing for an manhandled exception’, http://www.codewrecks.com/blog/index.php/2008/07/25/detecting-if-finally-block-is-executing-for-an-manhandled-exception
[Hoare] http://en.wikipedia.org/wiki/Program_optimization
[Lippert] Eric Lippert, ‘Precision and accuracy of DateTime’, http://blogs.msdn.com/b/ericlippert/archive/2010/04/08/precision-and-accuracy-of-datetime.aspx
[Memory] Chris Oldwood, ‘Utilising more than 4GB of memory in 32-bit Windows process’, Overload 113
[RRDtool] http://oss.oetiker.ch/rrdtool