New data transfer protocol (NTCP)

Semenov Yu.A. (ITEP, Moscow)


A new experimental transport protocol (NTCP) is proposed. It can provide higher channel-utilization efficiency, minimizes the effective RTT, and excludes buffer overflows. The protocol has three advantages:

  1. Local retransmission in case of a segment loss.
  2. Buffer status information along the path is transmitted in every response, giving a full picture of buffer filling. This excludes any buffer overflow, as the sender can easily forecast the buffer situation.
  3. Recovery from overload is much faster, as status information is sent upstream immediately from the overloaded node.

1. Introduction

The modern version of TCP (RFC-793, -1323) has a number of disadvantages (see RFC-2525 or [1, 2]). Some of them are connected with congestion avoidance, others with protocol security.

Recently many attempts have been made to improve the existing TCP. There are different models of TCP implementation: Tahoe, Reno (see RFC-2582), Vegas, T/TCP (RFC-1644), SACK (RFC-2018, -2883, -3517) and others (see, e.g., RFC-2309, -3042, -3168).

None of these algorithms can provide a perfect solution, as the state of the buffers along the path is assumed unknown. Note that a radical change of any protocol in use is rather dramatic: any innovation has to coexist with existing equipment and software.

In this document a new transport protocol, NTCP, is proposed. The header of the initial SYN request must carry a specific option code (e.g., 20). Any transit node supporting NTCP, on receiving such a segment, must respond by sending to its upstream NTCP-supporting neighbor a segment with the same option code.

Let us consider what makes traditional TCP inefficient in high-speed channels. The key condition is window > RTT·B/MSS, where B is the channel bandwidth in bps, MSS the maximum segment size in bits, and RTT the round-trip time (the time the reply is received minus the time the request was sent). By default MSS = 1460·8 = 11680 bits. The performance of TCP for networks with high bandwidth-delay products is described in detail in [3].
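The condition above can be checked numerically. A minimal Python sketch (the function name min_window_segments is illustrative, not part of the protocol) computes the minimum window, in MSS-sized segments, needed to keep a channel full:

```python
def min_window_segments(bandwidth_bps: float, rtt_s: float,
                        mss_bits: int = 1460 * 8) -> float:
    """Minimum window (in MSS-sized segments) needed to keep the
    pipe full: window > RTT * B / MSS."""
    return rtt_s * bandwidth_bps / mss_bits

# A 1 Gbps channel with RTT = 100 ms needs thousands of segments
# in flight, far beyond what the basic 16-bit window can express.
w = min_window_segments(1e9, 0.100)
```

For these example numbers w is about 8562 segments, which illustrates why long-RTT, high-bandwidth paths are the hard case for traditional TCP.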

The RTT value is composed of the propagation delay in the physical medium and the queuing and processing time in network equipment. The second part predominates (unless the channel is a satellite link). The propagation time over a 5000 km cable is about 25 ms; the delay of a modern router is 1 to 20 ms. The number of transit nodes (routers or switches) may reach 10-15 or even more.

By minimizing RTT we may relax the requirements on buffer size. In this proposal the effective RTT value is minimized.

All transit network objects supporting NTCP must analyze and modify the headers of TCP response segments. A full-scale NTCP implementation requires updating network routers (at present they work only with IP headers), and it is also desirable to adapt L2 switches for NTCP support.

2. Basic ideas

Every transit node that has received a segment acknowledges it by sending to its upstream neighbor a response containing the information it has about the free space in downstream buffers. It thereby takes responsibility for segment delivery to the downstream node (and so on, up to the destination node).

In this proposal the point-to-point virtual connection is replaced by a chain of channels via which neighboring nodes communicate with each other. In this way RTT may be reduced by an order of magnitude, as only the delays between neighboring nodes have to be taken into account (with N hops from source to destination, by a factor of N). The usual virtual TCP path is converted into a chain of channels connecting NTCP-supporting nodes.
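The effect of cutting the path into per-hop channels can be illustrated with a small sketch (the function name and the delay figures are made up for the example):

```python
def effective_rtt(per_hop_delays_ms):
    """End-to-end RTT is the sum of per-hop round-trip delays.
    With hop-by-hop acknowledgment each NTCP channel sees only its
    own hop's delay, so the governing RTT is the worst single hop."""
    end_to_end = sum(per_hop_delays_ms)
    per_channel = max(per_hop_delays_ms)
    return end_to_end, per_channel

# A five-hop path: classic TCP timers are governed by the 51 ms sum,
# NTCP timers only by the slowest 20 ms hop.
e2e, hop = effective_rtt([5, 12, 8, 20, 6])
```

With roughly uniform hops this reduces the governing RTT by about a factor of N, as stated above.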

Examples of such channels are shown in figure 1 between nodes S and I, and also between III and IV. For each channel RTT becomes equal to the single-hop value, i.e., 1-5 ms or even less. As a consequence the retransmission timeout (RTO) also goes down. At RTT ~5 ms the 16-bit window header field is sufficient up to rates of ~20 Gbps. One may expect that 100 Gbps equipment will have even smaller delays, which in many cases will make the TCP window options unnecessary. Frame losses may then be settled locally on the path segment where the loss took place. The volume of retransmitted data in this protocol modification is minimal, and the RTO value decreases proportionally.

Figure 1

Contrary to standard TCP, in this mode the source and destination do not interact directly: the sender communicates with the nearest NTCP-supporting router, the routers with their neighbors, and the terminating router with the destination node (see fig. 2 for the case when all routers support NTCP).

Figure 2

It is assumed that every NTCP participant can detect whether its neighbor supports NTCP. Clearly, in this case routers must be able to analyze L4 headers (now they work at L3) and to distinguish sessions between the same terminal agents (by analyzing port numbers or socket identifiers).

Cutting the source-destination virtual connection into a sequence of segments between neighboring NTCP-supporting nodes decreases transit traffic at retransmissions, as lost frames are retransmitted only over the path segment where the loss took place, not over the whole source-destination route.

NTCP segment processing is carried out not only in routers but also in L2 switches. Every active NTCP node intercepts acknowledgment segments and puts into their data field (or option field) its identifier and the number of MSS-sized segments it can accept into its buffer (or the free buffer space in bytes). This should pose no problem, as the data field of the responses is empty and there is quite enough space for any number of such records. The source, on getting such a response, records these data, compares them with those of the previous acknowledgment, estimates the probability of buffer overflow, and decides when to send the next frame.
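The sender's decision rule described here might look as follows in Python. BufferStatus and segments_safe_to_send are hypothetical names, and the policy shown (send no more than the most congested buffer can still accept) is one possible reading of the text:

```python
from dataclasses import dataclass

@dataclass
class BufferStatus:
    node_id: str      # e.g. the node's IP address
    free_mss: int     # free input-buffer space, in MSS units

def segments_safe_to_send(status_records, in_flight: int) -> int:
    """The sender may release only as many segments as the most
    congested node on the path can still accept, minus what is
    already in flight."""
    bottleneck = min(r.free_mss for r in status_records)
    return max(0, bottleneck - in_flight)

# Status records gleaned from the latest acknowledgment:
path = [BufferStatus("10.0.0.1", 40),
        BufferStatus("10.0.0.2", 7),    # near-full transit buffer
        BufferStatus("10.0.0.3", 25)]
allowed = segments_safe_to_send(path, in_flight=3)
```

Here the near-full node 10.0.0.2 limits the sender to 4 more segments, regardless of the larger buffers elsewhere on the path.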

This algorithm fully excludes buffer overflow, though segments may still be lost due to frame corruption. The status data of the nodes along the path are at the disposal of all exchange participants. If some sender, in spite of an unfavorable forecast, sends a segment anyway, it will be rejected by the nearest router, and there will be no problems for the downstream nodes.

Thus, such behavior will not be encouraged. A sender with full information about the state of the intermediate buffers may forecast the future situation more precisely and decide how many frames it can send without any buffer overflow. From the point of view of CWND this algorithm is ideal, but the technique requires a serious equipment modification. If some part of the equipment does not follow this protocol, the system will still work, but the probability of losing a frame because of buffer overflow becomes higher and the algorithm's efficiency lower.

The record format in the response data field is shown in fig. 3. The number of records in a response segment depends on the number of hops from the sender to the destination.

Figure 3

In the “identifier” field one may use the IP address of the network object (in this version the field is 4 octets long); for an L2 device, its identifier or sequence number along the path.

The field "Free space (in segments or bytes) in the input buffer" contains the number of segments that can be written there without overflow. Records for both input and output buffers are necessary, as the traffic may be quite different in opposite directions.

One byte may be used to record the free buffer volume for each direction. The problem becomes dramatically more complicated if the forward and return routes are different; in that case the status record of the output buffer is meaningless. The necessary data can then be obtained during the data-frame transfer from source to destination, using the “options” field or another special header field.
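For illustration only, one possible on-the-wire encoding of such a per-node record (the exact layout of fig. 3 is not reproduced here and is an assumption) is a 4-octet identifier plus one free-space byte per direction, as the text suggests:

```python
import struct
import socket

# Assumed record layout: 4-octet node identifier (its IP address)
# followed by one byte of free buffer space per direction.
RECORD = struct.Struct("!4sBB")  # network byte order: id, free_in, free_out

def pack_record(ip: str, free_in: int, free_out: int) -> bytes:
    """Encode one status record (6 bytes) for the response data field."""
    return RECORD.pack(socket.inet_aton(ip), free_in, free_out)

def unpack_record(data: bytes):
    """Decode a 6-byte status record back into its fields."""
    raw_ip, free_in, free_out = RECORD.unpack(data)
    return socket.inet_ntoa(raw_ip), free_in, free_out

rec = pack_record("192.0.2.7", 12, 200)   # 6 bytes per record
```

At 6 bytes per node even a 15-hop path costs only 90 bytes of response payload, consistent with the earlier remark that the empty data field has room for any realistic number of records.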

The algorithm can be made more universal by broadening the header options field. This field would carry the buffer-filling level of every network agent along the source-destination path; the receiver then transfers these data into the status records of the response. We should take into account the possibility of overflow due to the response traffic. But as responses are sent only for received segments, and this procedure is conditioned on the buffer state along the path, a correct choice of the algorithm parameters makes such an overflow highly improbable.

A response to a received segment is generated by the nearest router (or any other network object). Its task is to send the segment to the next router and to get an acknowledgment from the latter. Router “N” sends an acknowledgment to router “N-1” immediately.

By accepting and acknowledging a segment, a router takes responsibility for further delivery of the segment to the next node, and so on up to the very destination. What happens if the segment cannot be delivered? The buffers of intermediate nodes can be overloaded, but how will the source learn of the problem?

As all exchange participants know the source IP address, after a predetermined number of attempts the source will be informed of the failure (e.g., by an ICMP message). If the source continues sending, this ICMP message will be sent again. The same result can be achieved by stopping the acknowledgments to the upstream neighbor: the absence of acknowledgments from the blocked hop will spread in the direction of the source. This helps to prevent, or at least localize, a possible overload.

There may also be a dedicated response informing an upstream node of route blocking. In case of frame losses in equipment not supporting NTCP, the slow start procedure is initiated.

If a frame is damaged, it will be retransmitted after a timeout (RTO = RTTm + 4D, as in usual TCP, but in the considered variant RTT and its dispersion D are several times smaller; see also RFC-2988).
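The RTO computation referred to here follows the classic smoothed estimator of RFC 2988 (RTO = SRTT + 4·RTTVAR); a minimal sketch, with illustrative class and method names:

```python
class RtoEstimator:
    """Smoothed RTT/variance estimator in the style of RFC 2988:
    RTO = SRTT + 4 * RTTVAR.  With per-hop RTTs of a few ms both
    terms shrink, so the local RTO is far below the end-to-end one."""
    ALPHA, BETA = 1 / 8, 1 / 4   # gains from RFC 2988

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, r: float) -> float:
        """Feed one RTT sample r (seconds); return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R/2.
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar \
                          + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        return self.srtt + 4 * self.rttvar

est = RtoEstimator()
rto = est.update(0.005)   # a single 5 ms per-hop sample
```

A 5 ms per-hop sample yields an RTO of 15 ms, versus hundreds of milliseconds for a long end-to-end path, which is the point made in the text.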

The proposed protocol may be realized in point-to-point mode or over a part of the virtual path, e.g., within the backbone network of a service provider. In the latter case the chain of NTCP connections is formed at the border routers of the backbone, as in the MPLS protocol.

At connection setup one needs to check whether particular equipment supports NTCP. The algorithm described in RFC-1644 may be applied at connection setup.

At setup, the source (S) sends a datagram to the destination IP address (D). The NTCP-supporting node nearest to the sender intercepts the frame, puts it into a buffer, and immediately sends an acknowledgment. This acknowledgment is intercepted by the nearest upstream NTCP router, which learns that the segment has been successfully delivered; at the same time it receives data about the free space in the downstream input buffer. The process continues until the datagram is received by the destination node. As a result, a chain of connections is formed with numbers i = 1,...,N, each of which contains from 1 to k hops (see fig. 1). The first connection in the figure corresponds to the joint S-I, which contains k hops (in the figure k = 2).

The gain in bandwidth-usage efficiency grows with the number of NTCP-supporting nodes along the path (in the ideal case k = 1). If there are no such nodes, there is no gain.

If the buffer of an intermediate node is near overflow, it sends a response informing its neighbor-sender of its buffer status (this holds for the hops (I-II), (II-III), (III-IV), … and (N-1 - N)). The source then stops sending. In its turn, its upstream neighbor (toward S) may find itself near overflow and will be forced to inform its own upstream neighbor.

Unfortunately, some nodes between S and D may be L2 switches that do not support NTCP.

In this protocol, as in all other TCP models, there is no mechanism to suppress frame losses caused by corruption. If the probability of such corruption is significant, we should try to optimize the MTU value (see, e.g., RFC-2923). However, having full information about the status of all buffers along the path, the sender can easily identify the reason for a loss (if all nodes support NTCP). That is why NTCP is most attractive for homogeneous service-provider networks, where NTCP support is reliably known.

If k = 1, a frame loss will not cause a slow start, as the loss is evidently accidental and a retransmission suffices. At k > 1 this strategy is not admissible and one of the standard TCP models should be used.

The rate of buffer filling is characterized by the derivative db/dt, where b is the current buffer-filling level. If the buffer is filled up to the level Bmax, the next segment will be lost. Every network object should keep track of its buffer-filling level, and if after the reception of the next segment it turns out that

b(t) + (db/dt)·RTT + d > Bmax,

then a response with window = 0 (a signal to stop transmission) must be sent to all sender-neighbors that use this node to transport data. Here d is a configuration margin. Fulfillment of the condition b < (Bmax - d) makes buffer overflow practically improbable. It is essential that a response with window = 0 immediately stops transmission in all nodes on the way to the source. However, sending the notification with window = 0 is normally redundant, as all nodes on the path and the sender itself must forecast and prevent such an overflow. The source has to look through the buffer statuses of all nodes and determine the sliding-window value from the node with the worst status. If db/dt is zero or negative and the buffer-filling level is not high, the sliding window may be made equal to window (the receiver's parameter). The value window > 0 is sent to senders when the condition b(t) < (Bmax - d) is fulfilled.
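The forecast rule above can be expressed directly. The function name advertised_window is illustrative, and the numeric values in the example are made up:

```python
def advertised_window(b: float, db_dt: float, rtt: float,
                      b_max: float, d: float, window: int) -> int:
    """Forecast the buffer filling one RTT ahead.  Announce
    window = 0 (stop transmission) if b + (db/dt)*RTT + d > Bmax
    predicts an overflow; otherwise pass the receiver's window
    through.  d is the configured guard margin."""
    if b + db_dt * rtt + d > b_max:
        return 0
    return window

# Buffer 80% full and filling at 1000 segments/s with RTT = 5 ms:
# the forecast crosses Bmax, so transmission must stop.
stop = advertised_window(b=80, db_dt=1000, rtt=0.005,
                         b_max=100, d=20, window=64)
# A draining buffer (db/dt < 0) keeps the normal window open.
keep = advertised_window(b=40, db_dt=-500, rtt=0.005,
                         b_max=100, d=20, window=64)
```

Because the check runs one RTT ahead of the actual overflow, upstream nodes are throttled before any segment has to be dropped.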

In traditional TCP the slow start algorithm is used to set an optimal sliding-window value, via the CWND control mechanism. For optimal usage of the channel bandwidth the condition sliding_window > RTT·B/MSS has to be fulfilled. At any time sliding_window = min(window, CWND), i.e., practically always below the optimum (the window value is determined by the destination node).

At slow start CWND changes from 1 to the possible maximum, which is less than or equal to the window value. This is a consequence of the fact that neither the source nor the destination node knows anything about the buffer statuses along the path, so the search for a feasible sliding-window value is performed blindly.

In the case k = 1 (when all network devices along the path provide data about their buffer status), the situation changes radically. If the buffer is free (b = 0, which is typical at start), or b < (Bmax - d), then sliding_window = CWND = window, and slow start is thereby excluded. The decision when to send the next frame is taken after every received response.
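A possible sketch of this start-up decision (initial_cwnd and its parameters are illustrative names, not part of the protocol):

```python
def initial_cwnd(k: int, window: int, statuses,
                 b_max: float, d: float) -> int:
    """When every node on the path reports its buffer (k = 1) and all
    reported filling levels satisfy b < Bmax - d, slow start can be
    skipped and CWND opened to the full receiver window; otherwise
    start at 1 MSS, as in classic slow start."""
    if k == 1 and statuses and all(b < b_max - d for b in statuses):
        return window
    return 1

# All three path buffers nearly empty: open the full window at once.
fast = initial_cwnd(k=1, window=64, statuses=[10, 5, 0],
                    b_max=100, d=20)
# Opaque hops (k > 1): fall back to classic slow start from 1 MSS.
slow = initial_cwnd(k=2, window=64, statuses=[10],
                    b_max=100, d=20)
```

The fallback branch mirrors the earlier remark that with k > 1 one of the standard TCP models must be used.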

3. Security considerations

As this protocol changes the transfer algorithm dramatically, it may include additional modifications that improve its security, e.g., certificate usage, excluding "man-in-the-middle" attacks or ISN guessing.

4. Conclusion

NTCP implementation is easier for IPv6, as there is no problem with a new header format (at connection setup the participants may decide what type of header to use).

The most serious problem is the desirable modification of equipment working at L2 (LAN). But it is clear that similar L2 modifications are required for Diffserv services to provide the necessary QoS level for computers in a LAN.

5. References

  1. W. Richard Stevens, "TCP/IP Illustrated, Vol. 1: The Protocols", Addison-Wesley Longman, Inc.
  2. W. Xu, A. G. Qureshi and K. W. Sarkies, "Novel TCP congestion control scheme and its performance evaluation", IEE Proc. Commun., Vol. 149, No. 4, August 2002.
  3. T. V. Lakshman, Upamanyu Madhow, "The performance of TCP/IP for networks with high bandwidth-delay products and random loss", Infocom '97, April 1997; IEEE/ACM Trans. Networking, Vol. 5, No. 3, pp. 336-350, June 1997.