In a recent presentation at a standards event, someone cited Vint Cerf as saying that binding IP to TCP with the pseudo-header had been a mistake. Supposedly, not doing this would make it easier to solve some of the current problems facing the Internet, such as mobility. Unfortunately, this isn’t the case. In fact, the evidence points in the opposite direction.

1. The pseudo-header

For those unfamiliar with it: when IP was separated from TCP in the late 1970s, the TCP pseudo-header was invented to protect against modified and mis-delivered packets in transit. When the sender computes the TCP checksum, the following fields (called the pseudo-header) are included in the computation: the source and destination IP addresses, a byte of zeros, the IP protocol-id field, and the length of the TCP segment. The receiver knows the values the pseudo-header should have without consulting either the TCP or IP headers and includes the pseudo-header in the checksum calculation of an incoming TCP packet, constituting an independent check that the packet has not been modified. If one of the addresses changes during the lifetime of the connection, the checksum calculation will fail and the connection is aborted.
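
To make the mechanism concrete, here is a minimal sketch in Python of the computation just described (IPv4 form; the checksum field inside the segment is assumed to be zeroed before computing):

    import struct

    def internet_checksum(data: bytes) -> int:
        # One's-complement sum of 16-bit words, carries folded back in.
        if len(data) % 2:
            data += b"\x00"                        # pad odd-length input
        total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
        # The pseudo-header: source and destination addresses, a byte of
        # zeros, the protocol-id (6 for TCP), and the segment length.
        pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
        return internet_checksum(pseudo + segment)

If either address changes in transit, the receiver’s independently constructed pseudo-header no longer matches and the checksum fails.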

Because the IP address names the interface, delivering packets to the same node over different interfaces requires routing to different addresses.[1] For both multihoming and mobility, the addresses of the interfaces must change. Hence the pseudo-header data will change, the checksum will fail, and both multihoming and mobility are thwarted. One might think that simply removing the pseudo-header would eliminate the problem. Unfortunately, it isn’t that simple.

Contrary to popular belief, the semantics of the IP address are not overloaded; it simply names the wrong object.[2] However, the semantics of the port-ids are overloaded. (The TCP port-id is a 16-bit field.) The port-id is a local identifier with the same role as a file descriptor, used by TCP (or another transport protocol) and the application to refer to the connection. In TCP, the port-ids also act as connection-endpoint-ids.[3] The convention of concatenating connection-endpoint-identifiers (CEPIs) to form a connection-id has a long history in data communications and networking. However, TCP takes it a step further and overloads half of the connection-id (the destination port-id) yet again as a well-known (local) identifier for a registered application.[4] So in TCP, the port-id is a port-id, a connection-endpoint-id, and an identifier of the path to a specific application.

In addition, TCP supports “fan-in,” which the ARPANET Host-to-Host Protocol did not: it allows multiple connections to the same well-known port at the same time. (The server distinguishes individual connections by the source address and source port-id, since the destination address and destination port-id (the well-known port) are the same for all connections to this server.) This creates a security problem, because the server must rely on the source IP address and source port-id, values that it did not create, to distinguish the connection. This is the source of many spoofing attacks.
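
A minimal sketch of that demultiplexing, with illustrative names (not any real stack’s code):

    connections = {}   # (src addr, src port, dst addr, dst port) -> segments

    def demux(src_ip, src_port, dst_ip, dst_port, segment):
        # The destination half of the key is the same for every connection
        # to the well-known port; only the source half, supplied by the
        # remote peer, tells the connections apart.
        key = (src_ip, src_port, dst_ip, dst_port)
        connections.setdefault(key, []).append(segment)

    demux("10.0.0.1", 40001, "10.0.0.2", 80, b"request")
    demux("10.0.0.3", 40001, "10.0.0.2", 80, b"request")  # same port, new host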

The situation is worse for mobile devices. Suppose the pseudo-header were eliminated in the belief that applications on mobile devices could then change the source address, because the checksum would no longer fail. The checksum would indeed not fail, but the situation would be worse. Now the server sees TCP packets arriving with a different IP address and has only the port-id to distinguish connections, and the port-id is not unique across the entire network. Not only might it be spoofed, but more than one system may have a connection with the same source port-id! While 2^16 may seem a large number, given the number of TCP connections a server can see within the average lifetime of a TCP connection plus 120 seconds (the time a port-id must remain unused after a TCP connection is closed), 2^16 begins to look pretty small.
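
A back-of-the-envelope calculation, with an assumed (illustrative) connection lifetime, shows how quickly the space is consumed:

    PORT_SPACE = 2 ** 16            # 65,536 port-ids
    HOLD_TIME = 30 + 120            # assumed 30 s connection + 120 s quiet time
    rate = PORT_SPACE / HOLD_TIME   # ~437 new connections/s exhaust the space
    print(f"{rate:.0f} connections/s exhaust the port space")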

For those now thinking, “see, a network-wide endpoint identifier is needed”: not so fast. That would be yet another patch for that specific problem, and it doesn’t address the others. It would not be an application name, and it would solve neither the security problems nor the mobility problems.[5] As Richard Watson showed in 1978, the solution is to decouple port allocation from synchronization[6] and eliminate well-known ports.[7] This has two beneficial effects: 1) no identifier shared with the application is carried in the protocol, and 2) the server has values it generated itself to distinguish connections. Done correctly, this would also eliminate the need for the protocol-id field.[8]
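
A sketch of the timer bound at the heart of Watson’s result (the constants and the quiet-time composition here are assumptions for illustration, not Watson’s exact formulation): once the Maximum Packet Lifetime (MPL), the maximum time a sender will keep retransmitting (R), and the maximum time a receiver may delay an ack (A) are all bounded, connection state can simply be discarded after a quiet time, with no handshake needed:

    MPL = 60.0   # assumed maximum packet lifetime, seconds
    R = 30.0     # assumed bound on how long a sender keeps retransmitting
    A = 5.0      # assumed bound on how long an ack may be delayed

    QUIET_TIME = MPL + R + A   # after this, nothing of the connection survives

    def may_discard_state(now: float, last_activity: float) -> bool:
        # True once no packet belonging to this connection can still arrive.
        return now - last_activity >= QUIET_TIME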

2. We Aren’t Out of the Woods Yet

This isn’t the only IP problem. IP fragmentation has never worked. (For a protocol with so few fields, it certainly has a lot of problems.) Fragmentation is done with an offset relative to the packet-identifier field in the IP header,[9] which allows the fragments to be reassembled at the other end. Sounds like it works! But suppose a fragment is discarded en route. IP does not do retransmission, so how will the missing fragment(s) be recovered?

TCP does do retransmission, but a retransmitted TCP packet is handed to IP, which assigns it a different packet-id; the new packet is sent through the network and may also lose a fragment. Consequently, the destination can hold multiple copies of the same partially reassembled TCP packet and not know it.[10] Therefore, for an IP packet to be useful, either all of its fragments must arrive or it should never be fragmented.
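
A sketch of why this happens (field and function names are illustrative): IP reassembly is keyed on the packet-id, and a TCP retransmission arrives under a new one:

    partial = {}   # (src, dst, protocol, packet_id) -> {offset: fragment}

    def on_fragment(src, dst, proto, packet_id, offset, data):
        # Fragments can only complete the copy that shares their packet-id.
        # A retransmitted TCP segment is handed to IP as a new packet with a
        # new packet-id, so its fragments start a second partial copy rather
        # than completing the first; the first just waits out its timer.
        partial.setdefault((src, dst, proto, packet_id), {})[offset] = data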

Hence, the MTU Discovery patch is used to determine the Maximum Transmission Unit on the path and thus avoid fragmentation altogether. However, MTU Discovery isn’t always possible, because it uses ICMP to report packets that are too large, and many systems block ICMP as an often-used denial-of-service attack vector. MTU Discovery was also used very effectively in the Heartbleed attack.

But let’s go back to the original problem. Before IP was separated from TCP, there was no problem with fragmentation. In fact, the reason TCP has byte sequence numbers was to make fragmentation easy at the boundary between networks. So why is there a problem after the separation?

There are basically two ways to handle fragmentation: the way IP does it, with a packet-identifier (sequence number) and an offset, or the way TCP does it, with a byte sequence number, so that if there is a need to fragment, a fragment’s sequence number is simply the byte number of its first byte. Elegant and simple! This would be a very fast method for fragmenting in a router. But once IP was separated from TCP, byte sequencing was no longer available for fragmentation in the routers; the packet-id and offset in IP are used instead.
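
A sketch of the byte-sequence-number approach (names are illustrative): fragmenting is just slicing, because each fragment’s sequence number is the byte number of its first byte:

    def fragment(seq, data, mtu):
        # Slice the data; each fragment is itself fragmentable the same way.
        return [(seq + off, data[off:off + mtu])
                for off in range(0, len(data), mtu)]

    # fragment(1000, b"x" * 2500, 1000)
    # -> [(1000, 1000 bytes), (2000, 1000 bytes), (3000, 500 bytes)]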

Had IP not been split from TCP, there would be no fragmentation problem, because retransmissions use the same sequence number (packet-id) as the original. The receiver knows when it has multiple copies of the same packet, can use fragments from different copies to complete the original, and partially reassembled packets do not pile up.

Splitting TCP from IP is not a clean separation. Layers are supposed to be independent of each other, but there is a dependency between IP and TCP: they are not well-formed layers.

All of this indicates that IP should never have been separated from TCP. 

However, byte sequence numbers create a couple of problems. They can make reassembly much more complex. Because TCP is a byte stream, retransmissions don’t have to be on the same byte boundaries, nor is there any guarantee that fragments will be on the same boundaries. The receiver may be confronted with fragments that overlap other fragments in weird and wonderful ways.

This is especially complex if arriving packets are stored as a linked list (which is common), but fairly easy if the arriving data is stored contiguously.

If the buffer is a contiguous block of memory the size of the flow-control window, the sequence number provides not only the order but also the location: it indicates where in the buffer the packet goes, and the gaps show how large the holes are. Reassembly then becomes as simple as completing the pieces of a puzzle. If a retransmitted packet fills a hole but overlaps existing data, one can simply overwrite the existing data with the same data, filling in the holes along the way.
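A minimal sketch of that contiguous-buffer scheme, assuming a fixed window and illustrative names; bounds checks and window advance are omitted:

    WINDOW = 64 * 1024
    buffer = bytearray(WINDOW)
    received = bytearray(WINDOW)      # 1 where a byte has arrived

    def accept(seq, data, window_base):
        # The sequence number is the buffer position; overlapping
        # retransmissions simply rewrite the same bytes.
        off = seq - window_base
        buffer[off:off + len(data)] = data
        received[off:off + len(data)] = b"\x01" * len(data)

    def deliverable():
        # In-order bytes at the left edge of the window, ready to deliver.
        n = 0
        while n < WINDOW and received[n]:
            n += 1
        return bytes(buffer[:n])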

Pretty simple.[11]

But that wasn’t too much of a problem for the original TCP, because those big contiguous buffers would be held for at most a few retransmission cycles; packets would be completed and delivered to the application, freeing TCP buffer space for more incoming data (and allowing the window to be advanced). However, with IP separated from TCP and MTU Discovery not always available, the hold time goes from perhaps less than 0.5 seconds to often 5 seconds. That requires large amounts of buffer space for both IP and TCP, or limiting the data rate, or both.
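
A back-of-the-envelope calculation with an assumed link rate suggests the scale of the buffering:

    RATE_BPS = 10e9                     # assumed 10 Gb/s link
    HOLD_S = 5.0                        # IP reassembly hold time
    bytes_held = RATE_BPS * HOLD_S / 8  # ~6.25 GB (5.8 GiB) to keep the pipe full
    print(f"{bytes_held / 2**30:.1f} GiB of buffer for one such pipe")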

In addition, byte sequence numbers roll over much faster (a concern even at the time). This can limit bandwidth, depending on the Maximum Packet Lifetime (MPL) of the internet, since a sequence number cannot be re-used unless it is known not to be on a packet still in the network; in other words, the packet carrying it was either acked or the MPL has expired. This is why the Extended Sequence Number option is used on most connections today. Another patch.
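
The arithmetic is straightforward (assuming an MPL of 120 seconds for illustration): the 32-bit byte sequence space must not wrap within one MPL, which caps the sustainable rate:

    SEQ_SPACE = 2 ** 32          # bytes before the sequence numbers wrap
    MPL = 120.0                  # assumed maximum packet lifetime, seconds
    max_rate = SEQ_SPACE / MPL   # ~36 MB/s
    print(f"{max_rate * 8 / 1e6:.0f} Mb/s before sequence numbers must be re-used")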

The reader should be getting a sense of how what may have seemed like a good idea at the time has turned into a cascading series of patches that increase complexity and generally introduce new problems.

In fact, there is even stronger evidence from our research into the fundamental nature of error- and flow-control protocols like TCP [see Chapter 3, Patterns in Network Architecture]. An analysis of this class of protocols shows that rather than splitting the multiplexing/relaying functions from the reliable-transfer functions, the protocol naturally cleaves between the simple data transfer functions (delimiting, fragmentation/reassembly, sequencing, multiplexing), which must travel with the data, and the data transfer control functions (retransmission control (acks) and flow control), feedback that does not have to be associated with the data: the traditional separation of control and data.[12] The two are coordinated through a state vector, called a TCB (Transmission Control Block) in TCP. The data transfer functions write the state vector, and the data transfer control functions read it to determine what acks and credits to send. The only action of the control side on the data transfer side is to turn off a queue once in a while, when buffers are tight. Otherwise they are independent.
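
A minimal sketch of that structure (the field names are assumptions for illustration, not TCP’s actual TCB layout): the data transfer side writes the state vector, and the control side reads it to generate acks and credit:

    from dataclasses import dataclass

    @dataclass
    class StateVector:                    # the TCB-like shared record
        highest_byte_received: int = 0
        bytes_buffered: int = 0
        buffer_limit: int = 64 * 1024

    def on_data(sv: StateVector, seq: int, length: int):
        # Data transfer side: writes the state vector as data arrives.
        sv.highest_byte_received = max(sv.highest_byte_received, seq + length)
        sv.bytes_buffered += length

    def control_pdu(sv: StateVector):
        # Control side: reads the state vector to decide ack and credit.
        ack = sv.highest_byte_received
        credit = max(sv.buffer_limit - sv.bytes_buffered, 0)
        return ack, credit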

TCP should have been split vertically, not horizontally: the data transfer side is basically IP+UDP, and the control side generates control PDUs carrying acks and credit. The single header format made that hard to see (and splitting it would have made TCP look too much like its nearest competitor).

But we still don’t have a solution to the mobility problem.

3. What Is Needed?

From what this note covers, it is clear that at the very least the following characteristics are needed for a robust architecture that supports mobility:

  1. The addresses used by a layer should name the entity in the layer that does the relaying, and routing should be done on those addresses, not on the interfaces, i.e., the addresses in the layer below.
  2. Replace the use of well-known ports with application names.
  3. Decouple the port-id from the connection-endpoint-id.
  4. Base synchronization on the explicit bounds of the 3 timers Watson describes.
  5. Fragmentation, where necessary, should work.

If done right (and there are some subtleties buried here), this would be a good foundation for mobility. It would get the internal structure of the layer correct and would not require cumbersome constructs like home routers, foreign routers, and tunnels.

Beyond getting the internal structure of the layer correct, the layers will need to repeat to ensure scalability and to ensure that each layer is configured to the conditions of its environment. The addressing and scaling issues are beyond the scope of this note but may be taken up later.

RINA satisfies these requirements, and the concepts have been tested in implementations. We have based the error- and flow-control protocol on Watson’s work and separated mechanism from policy so that it can be configured for the full range of data transfer behaviors. By decoupling port allocation from synchronization, not only is the layer more secure, but the only identifiers the layer above sees are the local port-id and the destination application name. And fragmentation works.

4. References

  1. G. Boddapati, et al., “Assessing the Security of a Clean-slate Internet Architecture,” 20th IEEE International Conference on Network Protocols (ICNP), Austin, Texas, 2012.
  2. J. Day, “Patterns in Network Architecture: A Return to Fundamentals,” Prentice Hall, ISBN 978-0132252423, 2008.
  3. G. Gursun, I. Matta, and K. Mattar, “On the Performance and Robustness of Managing Reliable Transport Connections,” Proceedings of the 8th International Workshop on Protocols for Future, Large-Scale and Diverse Network Transports (PFLDNeT), Lancaster, PA, November 2010.
  4. J. Saltzer, “Name Binding in Computer Systems,” 1977.
  5. J. Shoch, “Internetwork Naming, Addressing, and Routing,” IEEE Proceedings COMPCON, Fall 1978: 72–79.
  6. R. Watson, “Timer-Based Mechanisms in Reliable Transport Protocol Connection Management,” Computer Networks 5, 1981: 47–56.

[1] If the IP address named the node rather than the interface, as we have known it should since 1972, this would not be a problem. (No other network architecture, e.g., CYCLADES, XNS, DECNET, or OSI, made this error.) There was an attempt to rectify the problem in 1992, but it was soundly rejected by the IETF during the IPng temper tantrum, because OSI named the node. While beyond the scope of this note, routing must route to the end of the path, which is not the device or even the interface, but the node, i.e., all applications reachable at that address. So traumatic was the rejection that no one has dared raise the possibility again. Routing on the node address solves multihoming and mobility, and reduces router table size by at least a factor of 3 to 4, at no additional cost or complexity. With all of these benefits, the IETF’s action is unfathomable.

[2] Loc/id split is the post-IPng-trauma reaction, born of the belief that there must be a workaround. There isn’t. Loc/id split is also flawed and has been found not to scale, because it routes not to the end of the path but to a point on the path, and it fails to recognize that the distinction is ultimately false: in computing, all identifiers are locators and vice versa. In fact, to resolve an identifier is to locate an object in a given context. [Saltzer, 1977]

[3] Hence, connections are between port-ids, rather than between protocol machines.

[4] This is analogous to a practice in early (and very resource-constrained) operating systems of using low memory locations as jump points to registered applications. When it was first used in the early 1970s on the ARPANET, it was a known kludge that, it was assumed, would be removed at the first opportunity. Needless to say, it hasn’t been. Note that the well-known port identifies a path to the application, not the application itself.

[5] Perhaps in a later note we can go through the various cases as to why this doesn’t address the problem.  A note at www.pouzinsociety.org on loc/id split also answers this question.

[6] Of course, Watson’s results go much further, proving that the necessary and sufficient condition for synchronization is to bound 3 timers. (The 3-way handshake has nothing to do with why synchronization is achieved; it is achieved because the 3 timers are bounded, either explicitly or by assumption.)

[7] Applications should be identified by application names. [Shoch, 1978]

[8] This creates a much more secure and robust protocol. [Matta, 2009] [Boddapati, 2012]

[9] Although the receiving IP protocol machine does not order IP packets, the packet-id is usually assigned sequentially. It is assumed that the TTL will expire before the 16-bit field rolls over, which is not necessarily a good assumption.

[10] Given that IP has to hold these partially reassembled packets for 5 seconds, this can consume a lot of buffer space; megabytes are easily consumed.  There can be a lot of retransmissions in 5 seconds. (This problem has been known since IP was separated from TCP in the late 70s.)

[11] Has the reader calculated what that would be at today’s data rates to keep the pipe full?

[12] In fact, one realizes that synchronization is only needed for the feedback side of the protocol.