TCP/IP Explained. A Bit

Nowadays most programmers rely on network connectivity, often without really understanding the details. Sergey Ignatchenko compares and contrasts the two main protocols.

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Bunny, and do not necessarily coincide with the opinions of the translator or the Overload editor. Please also keep in mind that translation difficulties from Lapine (like those described in [ Loganberry04 ]) might have prevented providing an exact translation. In addition, both the translators and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

TCP/IP is widely used on one hand – most applications, apps and applets use some kind of connectivity these days – and yet paradoxically is not widely understood. While there is a strong temptation to just use TCP as a ‘magic box that works’, and most of the time it does indeed work as expected, there are still pitfalls in the cases when it doesn’t.

This article does not intend to discuss specific APIs in detail (those interested may refer to an appropriate book or reference; for example, for *nix APIs, [ Stevens98/04 ] provides a great read and reference); rather, it attempts to describe some common issues with TCP/IP that might not be obvious from API references.

It should also be mentioned that this area is still evolving and there may be recent developments which are not reflected here; as usual, please take everything you read (including this article) with a pinch of salt.

TCP vs UDP

Everything which travels over the Internet is represented by an IP (Internet Protocol) packet (for the purposes of this article, there is no difference between IPv4 and IPv6 packets). As IP packets travel across the Internet any router on the way may drop them; recovery from such dropped packets must be handled by the client and server computers involved in sending and receiving the data.

Both TCP and UDP are protocols which are implemented on top of the IP packet mechanism, so technically TCP and UDP are in the same ‘Transport layer’ of the ‘Internet Protocol Suite’. However, when looking at UDP we find that it is a basic IP packet with only simple additional information (like UDP port), and without any built-in mechanism to detect dropped packets. This means that if you’re using UDP you’re on your own with regards to detecting dropped packets and recovering from them – this is exactly why UDP is often referred to as an ‘unreliable’ protocol. In practice, the use of UDP is usually limited to scenarios when the delay of data is more harmful than the partial loss of data; one specific example is VoIP/video delivery protocols such as Real-time Transport Protocol (RTP) (while RTP may work over TCP, in practice UDP is usually used).

One feature which is present in UDP (but is not present in TCP) is multi-casting – when the same packet may be delivered to multiple locations, though these locations will still be identified by a single IP address. But while it is the case that one-to-many delivery looks interesting, a word of caution is necessary: the last time I checked, multi-cast wasn’t generally supported by Internet routers, and it didn’t look likely that this was going to change. This means that if you want to use multi-cast on an Intranet (with full control over routers and network administrators willing to help you, possibly including VPN-based network connections) it has a reasonably good chance of working, but if you need multi-cast over the public Internet you’re likely to be out of luck. If you’re desperate for multi-cast over the Internet by all means try it (things might have changed), but make sure that you’ve tested it in a real-world environment before committing to any large-scale development.

Unlike UDP, TCP is a ‘reliable’ protocol; this means that TCP detects IP packets that have been lost, has them re-transmitted, and eventually either the missing data arrives or the TCP connection is reported as broken – and all of this happens almost invisibly to the developer. ‘Almost’ refers to the fact that nothing comes for free, and one needs to pay for the reliability with potentially increased delays, which can be an observable effect at the application level.

One common misconception in ‘TCP vs UDP’ discussions is the argument that ‘UDP is faster’. This is not really a statement which can be argued to be right or wrong without further clarification – knowing what kind of ‘faster’ is needed. On the one hand UDP does provide better control over delays, but on the other hand, from the point of view of pure throughput, it is extremely difficult to build a UDP-based protocol which is able to compete with TCP over the Internet.

Overall, for applications which do not care about delays too much, TCP is usually a much better choice. However, there are still some caveats.

TCP caveat – reliability

While TCP is a ‘reliable’ protocol, its reliability is not absolute: as TCP checksums are only 16 bits long, if an IP packet is randomly corrupted on the way there is a 1 in 65536 (or ~0.0015%) chance that the checksum will be the same and the corruption will not be detected. In practice this has two implications. The first one is: ‘never ever rely on the reliability of bare TCP transfers’; if one needs to transfer an important file it is necessary to do an extra check that the file has been transferred correctly (for example, by using SHA-1 or a similar checksum on the whole file). While the guarantees provided by SHA-1 are also not absolute, the probability of a randomly corrupted file going undetected by SHA-1 is 1 in 2^160, which can be roughly translated as ‘not in your lifetime’ (even the long lifetime of an ithé ¹ such as yourself). It should be noted that if SSL-over-TCP (or TLS-over-TCP) is used, additional integrity checks are applied and the reliability of the transfer can usually be assumed. The second implication is that if, for example, one needs to transfer over a not-so-good link (and all links involving the ‘last mile’ to a home user should be deemed as potentially unreliable) a multi-gigabyte file with a SHA-1 checksum on the whole file (to guarantee integrity), it might be prudent to transfer the file in chunks with a checksum on each chunk; this way if TCP did allow a corrupted packet through, one will be able to re-transmit only the offending chunk instead of re-transmitting the whole multi-gigabyte file.
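The per-chunk idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `chunk_digests` and `corrupted_chunks` and the 1 MiB chunk size are made up for this sketch, and how the digests travel from sender to receiver is left open.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for this sketch


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Return one SHA-1 hex digest for every chunk_size-byte chunk of data."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def corrupted_chunks(received: bytes, expected: list,
                     chunk_size: int = CHUNK_SIZE) -> list:
    """Return the indices of chunks that are missing or whose digest differs."""
    actual = chunk_digests(received, chunk_size)
    return [
        i for i, digest in enumerate(expected)
        if i >= len(actual) or actual[i] != digest
    ]
```

The receiver would then ask for only the chunks listed by `corrupted_chunks()` to be sent again, rather than the whole file.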

TCP caveat – interactivity

In general TCP has not been built for interactive communications, but mostly for long and steady file transfer; delays on the order of minutes have never been considered a problem for TCP. This means that a delay in the order of minutes is not a fault, it is a feature. The question is what to do when you need an interactive communication. While writing your own reliable protocol over UDP might sound like a good idea, it rarely is. On the one hand, any reliable protocol is highly complicated so it is very easy to make a costly mistake; on the other hand, TCP has at least some means to help with interactivity.

The first thing which is usually mentioned as a way to improve the interactivity of TCP connection is the TCP_NODELAY socket option. This might indeed help a bit, but one needs to keep in mind several issues:

TCP_NODELAY behaviour varies significantly from one platform to another, so testing on all potential platforms is highly desirable; in particular, on some platforms it has been reportedly observed it affects the timing of re-transmissions in the case of dropped packets.
It has also been reported that TCP_NODELAY can affect the ‘PSH flag’, which might improve interactivity too, but the exact effects again need to be tested on all platforms.
Usually TCP_NODELAY forces a packet to be sent immediately after send() is called. This means that if your code is written in a manner that calls send() for each single byte, then your code would work ok without TCP_NODELAY (as the TCP stack will wait before actually sending a packet, combining several send() calls together using Nagle’s algorithm), but with TCP_NODELAY enabled you’ll end up sending a 40-byte TCP+IP header for each call, leading to up to a 40x overhead! Ouch! On the other hand, if your code already combines all the available data before calling send() (which is often a good idea anyway), then TCP_NODELAY may indeed improve interactivity.

Hence if you don’t have problems with interactivity then don’t bother with TCP_NODELAY ; it is a rather risky option which, unless carefully tested, may cause more problems than it solves.
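For those who do decide to try it, a minimal Python sketch of the ‘combine first, then send’ approach might look as follows. The helper names `enable_nodelay` and `send_batched` are hypothetical; the socket option itself is the standard TCP_NODELAY, and its exact effects still need to be tested per platform as described above.

```python
import socket


def enable_nodelay(sock: socket.socket) -> None:
    """Disable Nagle's algorithm on a TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def send_batched(sock: socket.socket, parts: list) -> int:
    """Join small pieces into one buffer and pass it to the stack in one call.

    With TCP_NODELAY set, one call per piece could mean one packet (and one
    40-byte TCP+IP header) per piece; joining first avoids that.
    """
    payload = b"".join(parts)
    sock.sendall(payload)
    return len(payload)
```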

Another thing which is not often mentioned, but is at least as important for interactivity, is the handling of ‘hung’ TCP connections. Have you ever seen a web page which has stalled in the middle of being loaded, just to press ‘reload’ and voila – the page is there in no time? Chances are it was a ‘hung’ TCP connection. To make things worse, it might not be technically ‘hung’ from a TCP point of view (as mentioned above, TCP isn’t intended to care about a delay of a few minutes), but from the end-user’s point of view it certainly feels like it. For example, a compliant TCP stack is required to double the retry time each time, which means that if the first retry is 1 second (the default in the TCP standard), and then 7 subsequent retries fail (and if we have packet loss of a mere 0.01% then this will happen sooner or later given the number of packets in use nowadays), we’re already in the 2-minute delay range; from TCP’s point of view the connection is still alive and kicking, but for the end-user it is not so clear, and the user would probably just prefer it if the application detected the problem, cancelled the old transfer and established a new connection to retrieve the data (which is a heresy from the network point of view, but forcing the user to hit ‘reload’ to solve a purely technical problem is an even worse heresy from the user interface point of view).
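The arithmetic behind the ‘2-minute range’ is easy to check. The sketch below simply adds up doubling timeouts; it assumes a 1-second starting value and ignores any minimum or maximum limits that a real TCP stack applies, and the function name is made up for illustration.

```python
def total_stall_seconds(first_timeout: float, timeouts: int) -> float:
    """Add up `timeouts` retransmission waits, each double the previous one."""
    return sum(first_timeout * 2 ** n for n in range(timeouts))
```

Seven doubling timeouts starting at 1 second (1 + 2 + 4 + 8 + 16 + 32 + 64) add up to 127 seconds, i.e. `total_stall_seconds(1.0, 7)` returns 127.0; one more doubling takes the total past four minutes.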

In addition, TCP as such does not really provide the means to detect connections which are really ‘hung’ even from the TCP point of view, e.g. when the other side is not reachable at all, when the socket was closed by the server but the RST response got lost on the way back, or when the server has been power-cycled. When I first saw the socket SO_KEEPALIVE option I thought ‘hey, this is exactly what I need!’; however, my excitement soon faded when I realized that the default SO_KEEPALIVE timeout is 2 hours (!), and while on Windows it can be changed in the registry there is no way to change it programmatically. On Linux there are non-standard options such as TCP_KEEPIDLE and so on, but as many clients are on Windows it won’t help us much.
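For platforms where the per-socket options are available, a hedged Python sketch could look like this. The function name `enable_keepalive` and the timing values are assumptions for illustration; the sketch applies TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT only if Python's socket module exposes them on the current platform, and otherwise falls back to the system-wide defaults.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> list:
    """Turn on SO_KEEPALIVE and tune it where the platform allows.

    Returns the names of the tuning options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = []
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
    return applied
```

An empty return value means the connection is still subject to the platform's default keep-alive timing.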

All of the above may easily result in the need to design your own keep-alive subprotocol over a TCP connection, and doing it is quite an effort. Still, it is much less time-consuming and error-prone than writing your own reliable protocol over UDP (and if you don’t need reliability – you may want to think about using UDP directly).
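The bookkeeping side of such a keep-alive subprotocol can be sketched independently of the socket code. Everything here is hypothetical: the class name `HeartbeatMonitor`, its method names, and the idea that the application sends some ‘ping’ message of its own design when the connection has been quiet for a while.

```python
import time


class HeartbeatMonitor:
    """Tracks how long a connection has been silent.

    ping_interval: silence (in seconds) after which a ping should be sent.
    timeout: silence (in seconds) after which the connection is treated as hung.
    """

    def __init__(self, ping_interval: float, timeout: float,
                 clock=time.monotonic):
        self._ping_interval = ping_interval
        self._timeout = timeout
        self._clock = clock
        self._last_received = clock()

    def on_data_received(self) -> None:
        """Call whenever any data (including a ping reply) arrives."""
        self._last_received = self._clock()

    def _silence(self) -> float:
        return self._clock() - self._last_received

    def should_send_ping(self) -> bool:
        return self._silence() >= self._ping_interval

    def is_hung(self) -> bool:
        return self._silence() >= self._timeout
```

An application loop would call `should_send_ping()` and `is_hung()` periodically, and on the latter drop the connection and establish a new one.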

TCP caveat – single-channel throughput

While the original TCP (as specified in RFC 793) works over a transatlantic link with its signal delays (a round-trip time of ~100ms; satellite links are even worse, but these are rare in practice), there is a well-known problem that the maximum throughput of a single TCP channel is limited; namely, the Bandwidth-Delay Product [ BW-D P ] of TCP is limited to 64 KB, which with the RTT above corresponds to approx. 5MBit/s. This means that over a single transatlantic TCP connection it is not possible to obtain more throughput than that, even if all the paths between hosts are multi-gigabit. To deal with this, ‘TCP window scaling’ was introduced in RFC 1323 to increase this 64 KB limit. It does help, but there are still a few things to know: first, for TCP window scaling to work both the client and server must support it; second, TCP window scaling is not enabled by default in pre-Vista Windows, so XP clients are usually still limited. Also, I know of people who were trying to establish a transfer in the gigabit/s range over a single transatlantic TCP link (they had both servers close to the backbone, both servers had TCP window scaling enabled, window scaling was in use according to Wireshark, etc.), but they found that a single TCP link was still limited to a speed in the order of a few hundred Mbit/s. They didn’t manage to find out what the underlying reason was, but as a workaround ended up using multiple connections, which solved the problem. My guess would be that at such speeds there was another bottleneck (perhaps the application wasn’t able to write data with sufficient speed, and if encryption was used it would explain a lot). The lesson of this is that such bottlenecks are easy to run into, and if very high throughput is needed it must be carefully tested.
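The 5MBit/s figure follows directly from the window size and the round-trip time: at most one window of data can be in flight per round trip. A small sketch of the arithmetic (function names are made up for illustration):

```python
def max_throughput_bits_per_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bits_per_s: float, rtt_seconds: float) -> float:
    """Window size (the bandwidth-delay product) needed for a target rate."""
    return target_bits_per_s * rtt_seconds / 8
```

With the unscaled maximum window of 65535 bytes and a 100ms round trip the bound is roughly 5.2 Mbit/s, while filling a 1 Gbit/s path at the same round-trip time needs a window of about 12.5 megabytes, which is why window scaling is required at all.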

At one time so-called ‘download accelerators’ were quite popular; these were (and still are) quite efficient and often do improve download speeds in practice. Almost all of them simply establish multiple connections to the server, which apparently works well. The reason for the effectiveness of download accelerators has only a weak relation to the TCP window limit described above: while multiple connections from a client may indeed help to bypass the 5MBit/s limit, another issue is usually much more important. Namely, if the server channel is limited, packets from all TCP connections are usually dropped and/or delayed in the same manner, and therefore all TCP connections are effectively throttled down proportionally. This means that during throttling a client having two TCP connections will get roughly twice as much data as a client having only one TCP connection, at the expense of the other clients.

The bottom line about throughput – in most cases, you can get away with a single TCP channel, but if getting the highest possible throughput is an issue you need to be ready to investigate problems, and in extreme cases may still need to use multiple TCP connections.

TCP caveat – packet loss resilience

One thing which should be noted about TCP is that, as a rule of thumb, it becomes virtually unusable when packet loss exceeds a certain percentage, in many cases somewhere within the 5–10% range. Such a packet loss rate is usually considered abnormal (normal values even for the last mile should be within 0.01–0.1%), though I’ve personally dealt with ISP support who told me “hey, 10% loss is ok, you still have 90% of the stuff going through, so we won’t do anything about it”. Loss this high is unlikely to become a problem in practice, and it is not clear if anything can be done about it, except for developing our own reliable protocol over UDP, which is unlikely to be worth it for all but very special applications.

TCP caveat – developers without a clue

One very common bug in TCP programs (probably the most common for beginners) is related to the incorrect use of streaming APIs. By its very nature TCP is a stream, so if on the sending side there is a single call to send(), on the receiving side there is absolutely no guarantee that there will be exactly one successful call to recv(). In general the boundaries between send() calls are not seen on the receiving side at all, so for any number of send() calls there can be any number of recv() calls, i.e. there is no 1–1 correspondence.

To make matters worse, when testing a program on the same computer or in a LAN the 1-to-1 relation between send() and recv() calls may happen to be observed, but when going into a WAN, things can start to fail from time to time. The only way to avoid it is to remember that TCP is always a stream, and if one needs boundaries between messages within this stream they must be introduced on top of TCP by the developer.
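One common way to introduce such boundaries is to prefix every message with its length. The Python sketch below is one possible convention, not the only one: the 4-byte big-endian prefix and the function names are assumptions made for this example.

```python
import socket
import struct

# 4-byte unsigned length in network byte order (a convention of this sketch)
HEADER = struct.Struct("!I")


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send one message as a length prefix followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)  # may return fewer bytes than asked for
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one complete message, however the stream was split up."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

The loop in `recv_exact()` is the important part: it makes no assumption about how many recv() calls a message takes.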

Another common problem with network programs (which applies both to TCP and UDP) is developers sending C/C++ structures over the network without marshaling. While this might work at first, in a project which aims to live for more than a few days it is a time bomb. If you send/receive C/C++ structures without marshaling, you do not really have a well-defined protocol. Instead, you’re implicitly relying not only on the specific platform (because of little-endian/big-endian stuff), but also on the way a specific compiler applies alignment rules, and on stuff like #pragma pack in the specific place where the header which defines the structure was included. If you don’t use marshaling, and then, at any point down the road, you decide to go cross-platform or even to use a different compiler for the same platform, the scale of the resulting problems due to the lack of marshaling might easily prevent you from doing it. Think more than twice before deciding not to marshal your data over the network.
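The same principle can be illustrated in Python, where the struct module makes byte order and padding explicit. The message layout below (a 16-bit type, a 32-bit sequence number and a 64-bit float) is invented purely for this example.

```python
import struct

# "!" selects network (big-endian) byte order, standard sizes and no padding,
# so the layout is the same on every platform and with every compiler.
WIRE_FORMAT = struct.Struct("!HId")  # u16 type, u32 sequence, f64 value


def marshal(msg_type: int, sequence: int, value: float) -> bytes:
    """Encode the fields into exactly WIRE_FORMAT.size (14) bytes."""
    return WIRE_FORMAT.pack(msg_type, sequence, value)


def unmarshal(data: bytes) -> tuple:
    """Decode bytes produced by marshal() back into (type, sequence, value)."""
    return WIRE_FORMAT.unpack(data)
```

By contrast, the native-layout format "@HId" inserts platform-dependent padding between the fields, which is exactly the kind of implicit dependency described above.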

TCP caveat – PMTUD

One of the very many features of TCP is ‘Path MTU Discovery’, or PMTUD for short. It is a nice feature which aims to detect the maximum packet size over the connection between client and server, and then to use this information to improve throughput. Unfortunately, one misconfigured router or firewall on the way may break it easily, leading to TCP connections which work normally when packets are small, and hang forever when a large packet is seen. This was a big problem 10 years ago; it is less of an issue now, but it still happens from time to time. Usually it is considered a misconfiguration issue, but if it becomes a real problem (in other words too many customers are complaining), there is a chance to resolve it by using TCP_NODELAY and ensuring that all calls to send() are limited to at most 512 bytes in size (disclaimer: this is a guess from my side, and I’ve never tried such a way of handling PMTUD myself; also note that, strictly speaking, it is formally guaranteed to fix the issue only if the size is at most 28 bytes, but in practice 512 bytes should do nicely).

Troubleshooting, testing and Wireshark

If you have problems with a TCP connection, or if you’re using any of the not-so-common TCP options (and this includes TCP_NODELAY ), it is highly recommended to use a network analyzer to see what exactly is going on. If your application is used over the public Internet, it is highly recommended to test it with the server being as close to a real world one as possible and with all the likely clients (the behaviour of different TCP features may vary greatly from platform to platform). Testing over a link with a high delay is highly desirable even if the application is expected to be deployed over an Intranet. It should be noted that testing with high-delay links does not necessarily require special hardware or servers on the other side of the Atlantic. For example, for a low-traffic but highly-critical application, we’ve ended up purchasing a dial-up connection with the hope that if we can make it work reliably over that, it will work reliably under all realistic scenarios; it turned out that we’ve indeed chosen a very good way of testing.

When testing and analyzing TCP connectivity, a packet analyzer can be of great help. One I can recommend is Wireshark; it is free, and does its job wonderfully. One of the features I like the most is the ability to analyze tcpdump logs. This means that if I have a Linux or BSD server and a real-life problem with one of the clients, I can, without installing Wireshark on the server, run tcpdump on the server (it’s usually part of a default installation), filtering by the IP of the client I’m interested in using tcpdump’s options, then download tcpdump’s log to my desktop where Wireshark is installed, and then see what was going on using Wireshark’s GUI. This allows us to use full-scale analysis for real-world problems. One thing to remember when using tcpdump for this purpose is to use the -s 65535 option, otherwise on some platforms captured packets may be truncated, which might complicate analysis by Wireshark.

Firewall considerations

If one tries to build an application for the public Internet, a rabbit needs to think about the entire path all the way from the server to the client. This is likely to include firewalls at least for some of the clients. Usually, firewalls out there are statistically very friendly to TCP connections, and statistically a bit less friendly to UDP connections; in addition, UDP may cause issues when a client is behind certain types of NAT.

On the other hand, if a developer tries to use port 80 (the usual port for HTTP) for non-HTTP traffic in a naive attempt to bypass over-eager firewalls, there is another potential issue – many ISPs (especially in 3rd-world countries) use ‘transparent caching proxies’ on port 80, parsing requests in an attempt to save on network traffic; what happens with such proxies when a non-standard request comes in over port 80 is not defined (in practice the result may vary from forwarding the request ‘as is’ to hanging the whole proxy), so using port 80 for non-HTTP traffic cannot be regarded as safe.

On HTTP

One protocol implemented on top of TCP is HTTP. In general, HTTP is even more firewall-friendly than bare TCP, and if your conversation over TCP is limited to a request-response pattern, in some cases it might be worth considering using HTTP instead (usually over port 80). In simple cases you may limit your HTTP to HTTP 1.0, which is trivial to implement at least on the client side. On the other hand, if you need multiple requests over the same TCP connection (in order to avoid the penalties of re-establishing TCP, which might be quite large in the case of multiple small requests), you might need to implement HTTP 1.1, which is doable but is a little bit more tricky. Alternatively, you may want to use HTTP APIs which are already available on many platforms. When using APIs, it is important to realize that as HTTP is implemented on top of TCP, most of the TCP caveats also apply to HTTP connections.
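To give a feeling for how little is needed for a client-side HTTP 1.0 request, here is a bare-bones Python sketch. The function names are invented for this example, and it deliberately ignores redirects, chunked responses, header parsing and error handling; in HTTP 1.0 the end of the response is simply signalled by the server closing the connection.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request."""
    # HTTP/1.0 does not require a Host header, but many servers expect one.
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_status_code(response: bytes) -> int:
    """Extract the numeric status code from the first line of a response."""
    status_line = response.split(b"\r\n", 1)[0]
    return int(status_line.split(b" ")[1])


def http10_get(host: str, port: int = 80, path: str = "/",
               timeout: float = 10.0) -> bytes:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # connection closed: the response is complete
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Note the explicit timeout: as discussed above, a stalled connection would otherwise leave the caller waiting for as long as TCP sees fit.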

Epilogue

With TCP and UDP being cornerstones of the Internet, lots of developers are bound to use them, either explicitly or implicitly. In many cases these protocols (especially TCP) do their job marvelously without the need for the developer to understand how they work. However, as there are only these two protocols to cover the myriad of usage scenarios on the Internet, sometimes they’re used under conditions for which they were not designed; in such cases it may become necessary to understand how this low-level stuff works under the hood. I hope that this article is a good starting point.

References

[BW-D P] http://en.wikipedia.org/wiki/Bandwidth-delay_product

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[Stevens98/04] W. Richard Stevens, UNIX Network Programming, Volume 1: Networking APIs: Sockets and XTI

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

¹ ithé – n. man, human in general; from [ Loganberry04 ]