Multithreading can cause notoriously difficult bugs. Sergey Ignatchenko finds mitigating strategies for programs on servers.
As we saw in the previous article [ Ign10 ], which described the 1st half of the historical match between our 'No Multithreaded Bugs' Bunny and the Multithreaded Gorrillazz, in most cases our Bunny managed to reach the touchdown area without needing heavily multithreaded development. This means that on the client side the number of cases which really call for heavily multithreaded code (known to cause numerous problems) is rather limited, and so in most cases heavy multi-threading can be avoided. Now it is time to take a look at server-side programs to see if our hero can avoid heavy multithreading there too. By 'server-side' we mean programs intended to support many users simultaneously over the network. Web applications are one prominent example, but we will not limit our analysis to them.
One common feature of server-side applications is that they almost always depend on some server-side storage, normally a database; therefore, we will assume that some database (without specifying whether it is relational or not) is used by the application. Even if the application uses plain files for storage, we can still consider them a kind of (very primitive) database for the purposes of our analysis. There are a few exceptions to this database dependency, but as we will show below in the 'No database? No problem' section, the most practically important of them can be handled using the same techniques as described here and in part 1.
It should be noted that, as before, this is mostly a summary of practical experiences and cannot be used as strict instructions for 'how to avoid multithreading'. Still, we hope it will be useful as a bunch of 'rules of thumb'. We will also try to provide some very rough numbers to back up these ideas, but please take them with a good pinch of salt.
Quarter 3&4 line-up
As in the first half of the match, we have our 'No Multithreaded Bugs' Bunny standing on the left side of the field, with a touchdown being his only chance to win. He faces a team of extremely strong 'Multithreaded Gorrillazz', and any single Gorrilla is strong enough to stop him forever. Fortunately, they're rather slow, which leaves our hero a chance. As in the first half, we will make a distinction between heavily multithreaded code all over the place (which results in perpetual debugging and a maintenance nightmare) and isolated multithreaded pieces (which are not exactly a picnic either, but can be dealt with with a finite amount of effort; we will consider them acceptable if there are no better options).
To discuss server development, the very first thing we need to establish is whether our Bunny is writing a program based on standard interfaces and frameworks, or needs to develop his own framework. If a framework is already there (for example, he's developing a web application), it simplifies his task significantly. As the whole idea of a server-side application is to serve independent requests from many users, most existing frameworks (eg CGI/FastCGI, PHP, .NET, Java Servlets) do not require any interaction between requests. Sometimes avoiding interaction between threads within the framework requires some discipline (for example, static data in Java servlets can cause inadvertent interactions leading to problems [ CWE-567 ]), but overall it is usually not rocket science to avoid it (unless it is dictated by the application logic itself, which is discussed below).
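To make the servlet point more concrete, here is a minimal sketch (class and parameter names are invented purely for illustration) of the kind of inadvertent interaction between requests that [ CWE-567 ] warns about: a servlet instance is shared between concurrently running requests, so instance and static fields are shared state, while local variables are not.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class GreetingServlet extends HttpServlet {

        // BAD: one servlet instance serves many concurrent requests,
        // so this field is silently shared between all of them
        private String userName;

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            userName = req.getParameter("name");             // request A writes here...
            resp.getWriter().println("Hello, " + userName);  // ...and may read what request B wrote

            // GOOD: per-request data kept in a local variable cannot be
            // touched by any other request
            String localName = req.getParameter("name");
            resp.getWriter().println("Hello again, " + localName);
        }
    }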
Now, let us consider the scenario where standard interfaces are not good enough; while it is not so common, there are still several good reasons not to use standard web frameworks in some specific cases. Such reasons may include, for example, the inherently request-response nature of the HTTP protocol, which doesn't necessarily fit all application usage scenarios. The decision to write your own framework is always a tough one and obviously involves lots of work, and often such frameworks need to be multithreaded for performance reasons. But even when it is really necessary, the framework can still be written such that all the multithreading stuff is kept within the framework itself and is never exposed to the application. This means that even if our hero has the quite unusual case where existing frameworks don't work, he can still confine multithreading to a relatively small and (probably even more importantly) rarely changed area of the code.
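As a rough illustration of confining multithreading to the framework, here is a minimal sketch (all names are hypothetical): the worker threads live entirely inside the framework, and the application only supplies a handler which sees one request at a time and shares no state with other requests.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    public final class TinyFramework {

        /** Application-level code: plain single-threaded logic, no shared state. */
        public interface RequestHandler {
            String handle(String request);
        }

        // all threads are owned by the framework and never exposed to the application
        private final ExecutorService workers =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        private final RequestHandler handler;

        public TinyFramework(RequestHandler handler) {
            this.handler = handler;
        }

        /** Called by the framework's own I/O layer for each incoming request. */
        public void onRequestReceived(String request, Consumer<String> reply) {
            workers.submit(() -> reply.accept(handler.handle(request)));
        }
    }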
Wazzup, doc?
Now, one way or another, our 'No Multithreaded Bugs' Bunny has a framework which handles multiprocessing and multithreading itself, without requiring his application code to be multithreaded. It doesn't mean he will be able to avoid multithreading in the end; it merely means that he hasn't been grabbed by any of the Gorrillazz yet.
The next question for our 'No MT Bugs' Bunny is the very same 'Houston, do we have a problem?' question that he needed to answer for the client side. The main reason for multithreading is performance, so if there are no performance problems there is no real need to do anything about it (and if multi-threading exists for any other reason, our 'No Bugs' Bunny should think about it twice, especially if threads were added only because it's 'cool' or because without them the program is 'so 1990-ish'). If there are observable performance problems, the very first thing our Bunny should ask himself is 'Are you sure that the database has all the indexes it needs?' It is unbelievable how many cases can be drastically improved by simply adding one single index. In particular, developers and even DBAs often tend to forget that a 2-column index on columns A+B can be orders of magnitude faster than 2 separate indexes on column A and column B. The biggest improvement we've observed from adding a 2-column index was 1000x; not a bad incentive to take a closer look at indexes. So, we want to re-iterate: when using databases, indexing is of paramount importance, and is the very first thing to be considered if there are any performance problems. No amount of multithreading/multi-coring will save your program if the database lacks appropriate indexes. Another thing to address at this stage is eliminating outright inefficient requests (there are usually at least a few in any application, and basic profiling using database-provided tools should be able to help).
If the database indexes are fine and there are still performance problems (for Internet applications this usually won't happen until about 1M-10M requests/day¹), then the next question arises. Usually most applications of this kind can be divided into two broad categories. The first category of applications is 'data publishing', with mostly read-only requests (represented by any kind of site which publishes information, including serving search requests). The second category makes many updates, but these updates are usually trivial and, after the optimizations mentioned above, should take rather little time; reporting, however, can still be ugly, with heavy and very heavy requests (this is a typical pattern of 'Online Transaction Processing', or OLTP, applications). At this point our Bunny should work out which category his application belongs to.
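As an illustration of the composite-index point (the table and column names below are made up for the example), consider a query which filters on two columns at once; what helps it most is one 2-column index, not two separate single-column ones.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public final class IndexingExample {
        public static void createIndex(Connection conn) throws SQLException {
            try (Statement st = conn.createStatement()) {
                // The query we want to speed up:
                //   SELECT * FROM orders WHERE customer_id = ? AND order_date = ?

                // Often much less effective: two separate single-column indexes
                //   CREATE INDEX ix_orders_customer ON orders(customer_id)
                //   CREATE INDEX ix_orders_date     ON orders(order_date)

                // Usually far better for this query: one composite 2-column index
                st.executeUpdate(
                    "CREATE INDEX ix_orders_customer_date ON orders(customer_id, order_date)");
            }
        }
    }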
The clone bunny
For a 'data publishing' application, where updates are rare but the number of read requests is huge, the next step is usually to see if some kind of caching will help. Caching does introduce interactions between requests, but with a proper implementation (similar, for example, to memcached [ Facebook08 ]) it can easily be used in a way which has nothing to do with multithreading. For applications/sites which can cache all the information they need (for example, content management systems with updates a few times a day, or a 'widget' showing the weather to every user in its location), this usually means handling a virtually unlimited number of users without much further effort (in practice, the exact number will depend greatly on application specifics and the framework used, but our extremely rough estimate would be on the order of 10M-100M requests per day per typical 2-socket 8-core 'workhorse' server, with the option to add more such servers as necessary). If, on the other hand, there are essential requests which cannot be handled from the cache (for example, ad-hoc searches) and, even after caching everything we can, performance is still not good enough, then things become more complicated. At this stage, our 'No Multithreaded Bugs' Bunny should consider creating a 'master' database which will handle all the updates, and multiple 'replica' databases which will handle the read-only requests². This approach allows scalability for read-only requests (with an extremely rough estimate of the number of requests per single 'workhorse' server on the order of 1M-10M per day, though with proper optimization in some cases it can reach as high as 100M), so the only risk which remains is handling the update requests; usually this is not a problem for this kind of application, but if it is - our 'No MT Bugs' Bunny can approach them the same way as described below for typical OLTP applications.
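A minimal cache-aside sketch in the spirit of the memcached-style caching mentioned above; the Cache interface and loadArticleFromDb() below are our own inventions for illustration, standing in for a real distributed cache client and a real database query.

    public final class ArticleService {

        /** Stand-in for a real distributed cache client. */
        public interface Cache {
            String get(String key);                          // returns null on a cache miss
            void set(String key, String value, int ttlSeconds);
        }

        private final Cache cache;

        public ArticleService(Cache cache) {
            this.cache = cache;
        }

        public String getArticlePage(String articleId) {
            String key = "article:" + articleId;
            String html = cache.get(key);
            if (html == null) {                     // miss: hit the database once...
                html = loadArticleFromDb(articleId);
                cache.set(key, html, 300);          // ...then serve the next ~5 minutes from the cache
            }
            return html;
        }

        private String loadArticleFromDb(String articleId) {
            // placeholder for the real (and much more expensive) database query
            return "<html>...article " + articleId + "...</html>";
        }
    }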
Heavy-weights
So, what should our 'No MT Bugs' Bunny do if he faces an application which needs to handle lots of updates (more than a few million per day) and still experiences problems after all the necessary indexes are present and the outright inefficient requests have been eliminated? The next step is usually to optimize the database further, mostly at the physical level. This could include things like upgrading the server to a RAID controller with a battery-backed write cache (this alone can help a lot), moving DB logs to a completely separate set of HDDs (usually RAID-1), selecting an optimal RAID structure for tables (often a simple bunch of RAID-1 arrays works best, and RAID-5/RAID-6 are usually not a good idea for heavily updated tables), separating tables with different behavior into separate bufferpools and onto separate physical disks, and so on. Additionally, moving most (or all) reports to the 'uncommitted read' transaction isolation level could be considered; in some cases this simple optimization can work wonders. A related optimization can include separating a few frequently updated fields into a separate table, even if such a table has a 1:1 relation to the original one. Another application-related optimization which can happen at this stage is moving to prepared statements or stored procedures. It is worth noting that, despite the common perception, on a DBMS where prepared statements are properly supported (the last time we checked, this still didn't include MySQL) they tend to provide almost the same performance as stored procedures, while requiring less code rewriting and keeping more potential for switching the DBMS if necessary.
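Here is a hedged JDBC sketch (with an invented table) of two of the optimizations just mentioned: running a report at the 'uncommitted read' isolation level, and using a prepared statement instead of concatenating SQL strings.

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public final class ReportingExample {
        public static long countOrdersSince(Connection conn, Date since) throws SQLException {
            // Reports can usually tolerate slightly stale data, so dirty reads are
            // acceptable and locking is reduced (where the DBMS supports this level).
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);

            // A prepared statement can be re-executed with different parameters
            // without re-parsing and re-planning the SQL each time.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COUNT(*) FROM orders WHERE order_date >= ?")) {
                ps.setDate(1, since);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }
    }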
Half-gotchaed?
What happens if our 'No MT Bugs' Bunny has done all the above optimizations, but his system still doesn't perform well enough (which we estimate shouldn't normally happen until around 10M update requests/day)? It is no picnic, but it is still not as bad as heavy multithreading - yet. At this stage our hero can try to separate the operational updatable database from the read-only reporting database, making the reporting database a replica of the 'master' operational database, running on a separate server. The effect achieved by this step heavily depends on the DBMS in use and the types of load, but usually the heavier the load, the bigger the effect observed (removing inherently heavy reporting requests from an operational database reduces cache/bufferpool poisoning and disk contention, and these effects can hurt performance a lot).
If that doesn't help, our 'No Bugs' Bunny might need to take a closer look at inter-transaction interaction (including transaction isolation levels, SELECT FOR UPDATE statements and the order in which locks are obtained). We feel that if it goes as far as this, he is in quite big trouble. While inter-transaction communication is not exactly multithreading, it has many similar features. In particular, deadlocks or 'dirty reads' can easily occur, eliminating them can be really tricky, and debugging can become extremely unpleasant. If our 'No MT Bugs' Bunny finds himself in such a situation, we will consider him 'Half-Gotchaed'. One application-level option which might be useful at this point is to start postponing updates (storing them in a separate table, or some kind of queue) for certain frequently updated statistical fields (like a 'number of hits' field) to avoid the locking, and to move such postponed updates into the main table later, time permitting or in bulk, reducing locking.
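At the application level, the split can be as simple as routing connections by purpose; a minimal sketch (names are ours) might look like this, with all updates going to the operational master and heavy reports going to the reporting replica.

    import java.sql.Connection;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public final class DbRouter {
        private final DataSource operationalMaster;  // handles all updates
        private final DataSource reportingReplica;   // read-only copy, possibly slightly behind

        public DbRouter(DataSource operationalMaster, DataSource reportingReplica) {
            this.operationalMaster = operationalMaster;
            this.reportingReplica = reportingReplica;
        }

        public Connection forUpdates() throws SQLException {
            return operationalMaster.getConnection();
        }

        public Connection forReports() throws SQLException {
            // heavy reports no longer poison the operational cache/bufferpool
            return reportingReplica.getConnection();
        }
    }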
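A rough sketch of the 'postponed updates' idea follows (the schema is invented, and a production version would need a batch id or similar to avoid losing rows inserted between the two statements of flush()): each request appends a cheap row to a staging table instead of locking a hot counter row, and a periodic job folds the accumulated rows into the main table in bulk.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public final class PostponedHitUpdates {

        /** Called per request: a cheap append, no lock taken on the hot row in 'pages'. */
        public static void recordHit(Connection conn, long pageId) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO page_hits_staging(page_id, hits) VALUES (?, 1)")) {
                ps.setLong(1, pageId);
                ps.executeUpdate();
            }
        }

        /** Called periodically, time permitting: fold the staged hits into the main table. */
        public static void flush(Connection conn) throws SQLException {
            boolean oldAutoCommit = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.executeUpdate(
                    "UPDATE pages SET hits = hits + " +
                    " (SELECT COALESCE(SUM(s.hits), 0) FROM page_hits_staging s" +
                    "   WHERE s.page_id = pages.page_id)" +
                    " WHERE EXISTS (SELECT 1 FROM page_hits_staging s" +
                    "   WHERE s.page_id = pages.page_id)");
                st.executeUpdate("DELETE FROM page_hits_staging");
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            } finally {
                conn.setAutoCommit(oldAutoCommit);
            }
        }
    }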
Single connection: back to the future?
It is worth noting that there is an option, rarely considered, to avoid this kind of multithread-like problem altogether: it is sometimes possible to move all update statements into a single DB connection (essentially into a single thread). While such approaches are often ostracized for their lack of scalability, practice shows that in some environments (especially those where data integrity is paramount and there is no room for mistakes, for example in financial areas) it is a perfectly viable approach. The biggest load we have observed for such a single-update-connection architecture was on the order of 30M update transactions per day over a single synchronous DB connection, and when that became not enough, the system was (though with substantial effort) separated into several databases with a single update connection to each one, reaching 100M+ update transactions per day (with the option to go further if necessary).
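A minimal sketch of what such a single-update-connection architecture can look like inside (names are hypothetical, and a production version would need shutdown handling and a way to report errors back to callers): all updates are funnelled through one queue and executed, strictly one at a time, by one dedicated thread over one database connection.

    import java.sql.Connection;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    public final class SingleUpdateConnection {
        private final BlockingQueue<Consumer<Connection>> updates = new LinkedBlockingQueue<>();
        private final Connection conn;

        public SingleUpdateConnection(Connection conn) {
            this.conn = conn;
            Thread writer = new Thread(this::run, "db-single-writer");
            writer.setDaemon(true);
            writer.start();
        }

        /** Any part of the application may enqueue an update; only this class touches conn. */
        public void submitUpdate(Consumer<Connection> update) {
            updates.add(update);
        }

        private void run() {
            try {
                while (true) {
                    updates.take().accept(conn);   // updates are executed strictly one at a time
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }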
Divide et impera
If, after applying all the optimizations above, our 'No MT Bugs' Bunny still experiences performance problems, his next chance to escape the fierce 'Multithreaded Gorrillazz' is to find out whether the data he works with can be easily classified by certain criteria, so that the single monolithic database can be split into several partitioned ones. For example, if his application needs to handle transactions in a thousand stores worldwide, but most transactions are in-store with only a few interactions between stores (similar to the task defined in the [ TPC-C ] benchmark), he has a chance of getting away with partitioning the database by store (one or several stores per database), achieving scalability this way. Methods of separation can vary from DBMS-supported partitioning (for example, [ IBM06 ] and [ Oracle07 ]) to application-level separation. Application-level separation comes in many varieties (many of them extremely application-specific), and a detailed discussion of such separation could easily fill a few books, so we will not try to go into more detail here.
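For the application-level variant, the routing itself can be quite simple; here is a sketch with an invented, simplest-possible store-to-database mapping.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.List;
    import javax.sql.DataSource;

    public final class StorePartitioning {
        private final List<DataSource> partitions;   // one database per group of stores

        public StorePartitioning(List<DataSource> partitions) {
            this.partitions = partitions;
        }

        /** In-store transactions touch a single partition, so partitions scale horizontally. */
        public Connection forStore(int storeId) throws SQLException {
            int partition = storeId % partitions.size();   // simplest possible mapping
            return partitions.get(partition).getConnection();
        }
    }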
In-memory state: case of multiple sclerosis
If everything described above fails, and our Bunny really does have an application with 100M+ update transactions per day, he may need to resort to RAM to keep some parts of the system state, rather than keeping everything in the database. This is a fundamental change, and it won't be easy. One important implication is that any information held only in memory will be lost if the system goes down or reboots; in some cases (like caches) this doesn't matter, but when going beyond caches, the implications must be considered very carefully.
Still, even with in-memory state, multithreading is not always necessary; it can be avoided either by the techniques described in the previous article [ Ign10 ], or by separating the system into a series of logical objects, each having its own in-memory state and incoming queue, with each object's input limited to the processing of incoming messages, and with all interaction between objects restricted to sending messages to each other. One of us has seen such a system processing over 1 billion (yes, that's nine zeros) requests per day, still without any multithreading at the application level (all multithreading was confined to a few thousand lines of a specialized framework, which is 100% isolated from the application-level business logic and is therefore changed extremely rarely). If our Bunny is one of the few very lucky (or unlucky, depending on your point of view) ones who really need to process more than 1e9 requests per day, there is a chance he will be gotchaed, but honestly - how many of us are really working on such systems? To set some kind of benchmark: NASDAQ is currently able to process 2e8 transactions per day [ NASDAQ ], so we can reasonably expect that there are relatively few systems which need more than 1e9. Still, it can happen, and we have no choice other than to award a point to the Gorrillazz in this case.
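A minimal sketch of such a 'logical objects with incoming queues' arrangement (greatly simplified: here each object simply owns a thread, whereas a real framework would multiplex many objects over a small pool): each object keeps its own in-memory state, processes one message at a time from its own queue, and interacts with other objects only by posting messages to them.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public abstract class LogicalObject<M> {
        private final BlockingQueue<M> inbox = new LinkedBlockingQueue<>();

        protected LogicalObject(String name) {
            Thread t = new Thread(this::loop, name);
            t.setDaemon(true);
            t.start();
        }

        /** The only way other objects may interact with this one. */
        public final void post(M message) {
            inbox.add(message);
        }

        private void loop() {
            try {
                while (true) {
                    onMessage(inbox.take());   // this object's state is touched by this thread only
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Application-level logic: plain single-threaded code over the object's own state. */
        protected abstract void onMessage(M message);
    }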
No database? No problem
As promised at the very beginning of the article, we now come back to examples of server-side applications which don't use databases (or use them in a very limited way). One such example is music/video streaming servers. While such applications don't need to rely on a database, they can be scaled easily enough, similarly to any other 'data publishing' application (see 'The clone bunny' above); in extreme cases where top performance is necessary, non-blocking I/O techniques can be used to improve performance further.
Another prominent example of server-side applications which don't really need to depend on a database is game servers. While it is very difficult to generalize about a field as vast as games, massive server-side games usually seem to fit under 'In-memory state: case of multiple sclerosis' described above, and our 'No MT Bugs' Bunny can try to handle them using the very same techniques as described there and in the previous article.
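For completeness, here is a very small sketch of the non-blocking I/O idea for a streaming-like server (the port and payload are arbitrary, and a real streaming server would add backpressure, partial-write handling and so on): a single thread multiplexes many client connections through a selector instead of dedicating a blocked thread to each client.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public final class StreamingServerSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer chunk = ByteBuffer.wrap("media-chunk".getBytes());

            while (true) {
                selector.select();                 // wait until some client is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_WRITE);
                    } else if (key.isWritable()) {
                        // keep sending the same sample chunk to whichever client is ready
                        ((SocketChannel) key.channel()).write(chunk.duplicate());
                    }
                }
            }
        }
    }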
Quarter 3&4: 'No MT Bugs' Bunny: 4
Now that the match between the 'No MT Bugs' Bunny and the 'Multithreaded Gorrillazz' has come to an end, we can find out the final score of this magnificent game. As we've seen, similar to the client side, on the server side there aren't too many cases which genuinely call for multithreading either. Our 'No MT Bugs' Bunny managed to make 9 home runs on the server side of the field, while being gotchaed only once, and half-gotchaed once. Taking into account the relative weights of these runs, we conclude that quarters 3 & 4 have been completed with a score of 'No MT Bugs' Bunny: 4¾, 'Multithreaded Gorrillazz': 1¼, making the overall score for the whole game 'No MT Bugs' Bunny: 8¾, 'Multithreaded Gorrillazz': 2¼.
References
[CWE-567] 'CWE-567: Unsynchronized Access to Shared Data', Common Weakness Enumeration, http://cwe.mitre.org/data/definitions/567.html
[Facebook08] 'Scaling memcached at Facebook', Paul Saab, 2008, http://www.facebook.com/note.php?note_id=39391378919
[IBM06] 'Introducing DB2 9, Part 2: Table partitioning in DB2 9', Rav Ahuja, 2006, http://www.ibm.com/developerworks/db2/library/techarticle/dm-0605ahuja2/
[Ign10] 'Single-Threading: Back to the Future?', Sergey Ignatchenko, Overload #97, June 2010
[NASDAQ] 'Technology Fast Facts', NASDAQ, http://www.nasdaq.com/newsroom/presskit/reports/NASDAQ_Technology_Worksheet.pdf
[Oracle07] 'Partitioning in Oracle Database 11g', Hermann Baer, 2007, http://www.oracle.com/technology/products/bi/db/11g/pdf/partitioning-11g-whitepaper.pdf
[TPC-C] 'Overview of the TPC Benchmark C: The Order-Entry Benchmark', Francois Raab, Walt Kohler, Amitabh Shah, http://www.tpc.org/tpcc/detail.asp
¹ All numbers in the article are extremely rough estimates; your mileage may vary. We are also assuming a 'typical' Internet application with a 'typical' distribution of requests over the day, with the difference between the minimum hour and the peak hour not exceeding 2-5x. Still, while the numbers are extremely rough, we feel that even such rough numbers can be of some value in the initial stages of analysis.
² Unfortunately, way too many RDBMSs still experience problems under heavy load when replication is implemented using RDBMS-provided means. Heavy testing, with loads and data volumes comparable to production, is advised when trying to implement replication. As a workaround, custom application-level replication can be considered, but it is rather complicated and is beyond the scope of this article.