lwIP With Proxy PDF
T2001:20 ISRN:SICS-T–2001/20-SE
Adam Dunkels
[email protected]
February 2001
Abstract
Over the last years, interest in connecting small devices such as sensors to an
existing network infrastructure such as the global Internet has steadily increased. Such
devices often have very limited CPU and memory resources and may not be able to run
an instance of the TCP/IP protocol suite.
In this thesis, techniques for reducing the resource usage in a TCP/IP implementation
are presented. A generic mechanism for offloading the TCP/IP stack in a small
device is described. The principle of the mechanism is to move much of the resource-demanding
work from the client to an intermediate agent known as a proxy. In particular,
this pertains to the buffering needed by TCP. The proxy does not require any
modifications to TCP and may be used with any TCP/IP implementation. The proxy
works at the transport level and keeps some of the end-to-end semantics of TCP.
Apart from the proxy mechanism, a TCP/IP stack that is small enough in terms
of dynamic memory usage and code footprint to be used in a minimal system has
been developed. The TCP/IP stack does not require help from a proxy, but may be
configured to take advantage of a supporting proxy.
1 Introduction 1
1.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Methodology and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 The TCP/IP protocol suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 The Internet Protocol — IP . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Internet Control Message Protocol — ICMP . . . . . . . . . . . . . . . . . 5
2.1.3 The simple datagram protocol — UDP . . . . . . . . . . . . . . . . . . . . 6
2.1.4 Reliable byte stream — TCP . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 The BSD implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Buffer and memory management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Application Program Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Performance bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Data touching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Small TCP/IP stacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.6.2 Sending packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6.3 Forwarding packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6.4 ICMP processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.7 UDP processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.8 TCP processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8.2 Data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8.3 Sequence number calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.8.4 Queuing and transmitting data . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.8.5 Receiving segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.8.6 Accepting new connections . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8.7 Fast retransmit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8.8 Timers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8.9 Round-trip time estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8.10 Congestion control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.9 Interfacing the stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.10 Application Program Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.10.2 Implementation of the API . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.11 Statistical code analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.11.1 Lines of code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.11.2 Object code size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.12 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Summary 46
5.1 The small TCP/IP stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 The API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 The proxy scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
A API reference 48
A.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.1.1 Netbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.2 Buffer functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.3 Network connection functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
D Glossary 71
Bibliography 73
Chapter 1
Introduction
Over the last few years, interest in connecting computers and computer supported devices
to wireless networks has steadily increased. Computers are becoming more and more seamlessly
integrated with everyday equipment, and prices are dropping. At the same time, wireless networking
technologies, such as Bluetooth [HNI+ 98] and IEEE 802.11b WLAN [BIG+ 97], are emerging. This
gives rise to many new fascinating scenarios in areas such as health care, safety and security,
transportation, and processing industry. Small devices such as sensors can be connected to an
existing network infrastructure such as the global Internet, and monitored from anywhere.
The Internet technology has proven itself flexible enough to incorporate the changing network
environments of the past few decades. While originally developed for low speed networks such as
the ARPANET, the Internet technology today runs over a large spectrum of link technologies with
vastly different characteristics in terms of bandwidth and bit error rate. It is highly advantageous
to use the existing Internet technology in the wireless networks of tomorrow, since a large number
of applications using the Internet technology have already been developed. Also, the broad connectivity of
the global Internet is a strong incentive.
Since small devices such as sensors are often required to be physically small and inexpensive, an
implementation of the Internet protocols will have to deal with having limited computing resources
and memory. Despite the fact that there are numerous TCP/IP implementations for embedded
and minimal systems, little research has been conducted in the area. Implementing a minimal
TCP/IP stack is most often considered an engineering activity, and has thus not received much
research attention.
In this thesis, techniques for reducing the resources needed for an implementation of the Internet
protocol stack in a small device with scarce computing and memory resources are presented. The
principle of the mechanism is to move as much as possible of the resource-demanding work from
the small device to an intermediate agent known as a proxy, while still keeping as much of the
end-to-end semantics of TCP as possible. The proxy typically has orders of magnitude more
computing and memory resources than the small device.
1.1 Goals
There are two goals with this work:
• Designing and implementing a small TCP/IP stack that uses very little resources. The stack
should have support for TCP, UDP, ICMP and IP with rudimentary routing.
• The development of a proxy scheme for offloading the small TCP/IP stack.
In order to minimize the TCP/IP implementation, the proxy should implement parts of the
standards. The proxy should also offload the memory and CPU of the small system in which the
stack runs. The TCP/IP implementation should be sufficiently small in terms of code size and
resource demands to be used in minimal systems.
Chapter 4 describes the design and implementation of lwIP, the TCP/IP stack in the small
client system. It does not go into details at the code level, such as which parameters are
used in function calls, but rather presents the data structures, algorithms, and mechanisms used,
and the general flow of control.
Appendix B is an implementation of the BSD socket API using the lwIP API.
Appendix C shows some code examples of how to use the lwIP API for applications.
Of those, all but Chapter 2 present original work conducted within the scope of this thesis.
Chapter 2
Background
The TCP/IP protocol stack consists of four layers, as seen in Figure 2.1. However, the layering
is not kept as strictly as in other protocol stacks, and it is possible for, e.g., application layer
functions to access the internetwork layer directly. The functionality provided by each layer is
(from bottom to top):
The network interface layer is responsible for transporting data over the physical (directly
connected) network. It takes care of low-level addressing and address mapping;
The internetwork layer provides abstract addressing and routing of datagrams between differ-
ent physical networks. It provides an unreliable datagram service to the upper layers;
The transport layer takes care of addressing processes at each host. UDP is used for pure
process-addressed datagrams, whereas TCP provides reliable stream transmission for the
application layer protocols;
The application layer utilizes the lower layers to provide functionality to the end user. Appli-
cations include email (SMTP), world wide web page transfer (HTTP), file transfer (FTP),
etc.
Each layer adds a protocol header to the data as shown in Figure 2.2. The figure shows
application data encapsulated in a TCP segment, which in turn is included in an IP packet. The
IP packet is then encapsulated in a link level frame. Each protocol layer has added a header that
keeps protocol specific information. The link layer has also added a trailer.
Figure 2.2. Encapsulation: link level header | IP header | TCP header | application data | link level trailer
IP options
IP options carry control information appended to the standard IP header. The IP options
may contain time stamps, or information for routers about which forwarding decisions to make.
In normal communication IP options are unnecessary, but in special cases they can be useful. In
today's Internet, packets carrying IP options are very rare [Pax97].
IP routing
The infrastructure of any internet, such as the global Internet, is built up by interconnected
routers. The routers are responsible for forwarding IP packets in the general direction of the final
recipient of the packet. Figure 2.3 shows an example internet with a number of hosts (boxes)
connected to a few routers (circles). If host H sends data to host I, it will send an IP packet
towards router R, which will inspect the destination address of the IP packet, and conclude that
router S (as opposed to router T ) is in the general direction of the final recipient, and will forward
the IP datagram to router S. Router S will find that the final recipient is directly connected, and
will forward the packet on the local network to host I.
The IP header contains a field called the time to live (TTL) field in IPv4 and HopLimit in
IPv6. Each time an IP packet is forwarded by a router, this field is decremented, and when it reaches
zero the packet is dropped. This ensures that IP packets eventually leave the network, and
prevents packets from circling forever.
In order to gather information about the topology of the network the routers communicate
with routing protocols. In the routing protocol messages, the routers report on the reachability
of networks and hosts and each router gathers knowledge of the general direction of hosts on the
network. In case of a network failure the router can find new working paths through the network
by using information excanged in the routing protocol.
Figure 2.3. An example internet: hosts H and I (boxes) connected to routers R, S, and T (circles).
Congestion
IP routers work on the so-called store-and-forward principle, where incoming IP packets are
buffered in an internal queue if they cannot be forwarded immediately. The memory available
for buffered packets is not unlimited, however, and any packets arriving when the buffer is full
are dropped. Most often, no notification is given to either the sender or the receiver of the
packet. When the queue in a router is full and packets are being dropped, the router is said to be
congested.
Even though ICMP uses IP as its delivery mechanism, ICMP is considered an integral part of
IP and is often implemented as such.
UDP Lite
UDP Lite [LDP99] is an extension to UDP which allows the checksum to cover only part of
the UDP datagram, most commonly the UDP header and any application level header directly
following it. This is useful for applications which send and receive data that is insensitive to
spurious bit errors, such as real-time audio or video. Wireless links are prone to errors, and with
UDP Lite, datagrams that would otherwise be discarded due to a failing UDP checksum
can still be used. UDP Lite exploits the fact that the length field in the UDP header is redundant,
since the length of the datagram can be obtained from IP. Instead, the length field specifies how
much of the datagram is covered by the checksum. In a low-end system, checksumming only parts
of the datagrams can also be a performance win.
TCP options
TCP options provide control information in addition to that in the TCP header. TCP
options reside between the TCP header and the data of a segment. Since the original TCP
specification [Pos81c], a number of additions to TCP have been defined as TCP options. These
include TCP selective acknowledgments (SACK) [MMFR96] and the TCP extensions for high
speed networks [JBB92], which define the TCP time-stamp and window scaling options.
The only TCP option defined in the original TCP specification was the Maximum Segment
Size (MSS) option, which specifies how large the largest TCP segment of a connection may be.
The MSS option is sent by both parties during the opening of a connection.
Each byte in the byte stream is assigned a sequence number, starting at some arbitrary value.
The stream is partitioned into arbitrarily sized segments. The TCP sender will, however, try to fill
each segment with enough data that the segment is as large as the maximum segment size of
the connection. This is shown in Figure 2.4 (refer to the paragraphs on opening and closing a
connection later in this section for a description of the SYN and FIN segments). Each segment
is prepended with a TCP header and transmitted in a separate IP packet. In theory, the
receiver produces an ACK for each received segment. In practice, however, most TCP implementations
send an ACK only on every other incoming segment in order to reduce ACK traffic. ACKs are also
piggybacked on outgoing TCP segments. The ACK contains the next sequence number expected
in the continuous stream of bytes. Thus, the ACKs do not acknowledge the reception of any
individual segment, but rather the reception of a continuous range of bytes.
Consider a TCP receiver that has received all bytes up to and including sequence number x,
as well as the bytes x + 20 to x + 40, with a gap between x + 1 and x + 19, as in the top figure of
Figure 2.5. The ACK will contain the sequence number x + 1, which is the next sequence number
expected in the continuous stream. When the segment containing bytes x + 1 to x + 19 arrives,
the next ACK will contain the sequence number x + 41. This is shown in the bottom figure of
Figure 2.5.
Figure 2.5. TCP byte stream with a gap and corresponding ACKs
The sending side of a TCP connection keeps track of all segments sent that have not yet
been ACKed by the receiver. If an ACK is not received within a certain time, the segment is
retransmitted. This process is referred to as a time-out and is depicted in Figure 2.6. Here we
see a TCP sender sending segments to a TCP receiver. Segment 3 is lost in the network and
the receiver will continue to reply with ACKs for the highest sequence number of the continuous
stream of bytes that ended with segment 2. Eventually, the sender will conclude that segment 3
was lost since no ACK has been received for this segment, and will retransmit segment 3. The
receiver has now received all bytes up to and including segment 5, and will thus reply with an
ACK for segment 5. (Even though TCP ACKs are not for individual segments it is sometimes
convenient to discuss ACKs as belonging to specific segments.)
Figure 2.6. TCP retransmission: segments 1 and 2 are ACKed, segment 3 is lost, segments 4 and 5 elicit duplicate ACKs for segment 2, and after a time-out for segment 3 it is retransmitted and answered with an ACK for segment 5.
Flow control
The flow control mechanism in TCP ensures that the sender does not overwhelm the receiver with
data that the receiver is not ready to accept. Each outgoing TCP segment includes an indication
of the size of the available buffer space, and the sender must not send more data than the receiver
can accommodate. The available buffer space for a connection is referred to as the window of
the connection. The window principle ensures proper operation even between two hosts with
drastically different memory resources.
The TCP sender tries to have one receiver window’s worth of data in the network at any
given time provided that the application wishes to send data at the appropriate rate (this is
not entirely true; see the next section on congestion control). It does this by keeping track of the
highest sequence number s ACKed by the receiver, and makes sure not to send data with sequence
number larger than s + r, where r is the size of the receiver’s window.
Returning to Figure 2.6, we see that the TCP sender stopped sending segments after segment
5 had been sent. If we assume that the receiver's window was 1000 bytes in this case, and that the
combined size of segments 3, 4 and 5 was exactly 1000 bytes, we can see that since the sender
had not received any ACK for segments 3, 4 and 5, it refrained from sending any more
segments. This is because the sequence number of segment 6 would in this case be equal to the
sum of the highest ACKed sequence number and the receiver's window.
Congestion control
While flow control tries to prevent buffer overrun at the end points, the congestion
control mechanisms [Jac88, APS99] try to prevent overrun of router buffer space. To
achieve this, TCP uses two separate methods:
• slow start, which probes the available bandwidth when starting to send over a connection,
and
• congestion avoidance, which constantly adapts the sending rate to the perceived bandwidth
of the path between the sender and the receiver.
The congestion control mechanism adds another constraint on the maximum number of outstanding
(unacknowledged) bytes in the network. It does this by adding another state variable,
called the congestion window, to the per-connection state. The minimum of the congestion window
and the receiver's window is used when determining the maximum number of unacknowledged
bytes in the network.
TCP uses packet drops as a sign of congestion. This is because TCP was designed for wired
networks, where the vast majority of packet drops (> 99%) are due to buffer overruns in routers.
There are two ways for TCP to conclude that a packet was dropped: by waiting for a
time-out, or by counting the number of duplicate ACKs that are received. If two ACKs for the same
sequence number are received, this could mean that the packet was duplicated within the network
(not an unlikely event under certain conditions [Pax97]). It could also mean that segments were
reordered on their way to the receiver. However, if three duplicate ACKs are received for the
same sequence number, there is a good chance that this indicates a lost segment. Three duplicate
ACKs trigger a mechanism known as fast retransmit, and the lost segment is retransmitted without
waiting for its time-out.
During slow start, the congestion window is increased by one maximum segment size per
received ACK, which leads to an exponential increase in the size of the congestion window1. When
the congestion window reaches a threshold, known as the slow start threshold, the congestion
avoidance phase is entered.
When in the congestion avoidance phase, the congestion window is increased linearly until a
packet is dropped. The drop causes the congestion window to be reset to one segment, the slow
start threshold is set to half of the current window, and slow start is initiated. If the drop was
indicated by three duplicate ACKs, the fast recovery mechanism is triggered instead. Fast recovery
halves the congestion window and keeps TCP in the congestion avoidance phase, rather
than falling back to slow start.
Increasing the congestion window linearly is in fact harder than increasing it exponentially,
since a linear increase requires an increase of one segment per round-trip time, rather
than one segment per received ACK. Instead of using the round-trip time estimate and a
timer to increase the congestion window, many TCP implementations, including the BSD implementations,
increase the congestion window by a fraction of a segment per received ACK.
1 Before the introduction of slow start, TCP senders started with sending a whole receiver's window worth of data.
Figure 2.7. The TCP state diagram, showing the states CLOSED, LISTEN, SYN-SENT, SYN-RCVD, ESTABLISHED, FIN-WAIT-1, CLOSING, CLOSE-WAIT, and LAST-ACK, together with the events (open, close, and received segments) that trigger each transition and the segments (SYN, ACK, FIN) sent in response.
Opening a connection
In order for a connection to be established, one of the participating sides must act as a server and
the other as a client. The server enters the LISTEN state and waits for an incoming connection
request from a client. The client, being in the CLOSED state, issues an open, which results in a
TCP segment with the SYN flag set to be sent to the server and the client enters the SYN-SENT
state. The server enters the SYN-RCVD state and responds to the client with a TCP segment
with both the SYN and ACK flags set. As the client responds with an ACK, both sides will be in
the ESTABLISHED state and can begin sending data.
This process is known as the three way handshake (Figure 2.8), and will not only have the
effect of setting both sides of the connection in the ESTABLISHED state, but also synchronizes
the sequence numbers for the connection.
Client (SYN-SENT) → Server (LISTEN):  SYN, seqno = x
Server (SYN-RCVD) → Client:  SYN, ACK, seqno = y, ackno = x + 1
Client (ESTABLISHED) → Server (ESTABLISHED):  ACK, ackno = y + 1
Figure 2.8. The TCP three way handshake with sequence numbers and state transitions
Both the SYN and FIN segments occupy one byte position in the byte stream (refer back to
Figure 2.4) and will therefore be reliably delivered to the other end point of the connection through
the use of the retransmission mechanism.
Closing a connection
The process of closing a connection is rather more complicated than the opening process since all
segments must be reliably delivered before the connection can be fully closed. Also, the TCP close
function will only close one end of the connection, meaning that both ends of the connection will
have to close before the connection is completely terminated.
When a connection end point issues a close on the connection, the connection state on the
closing side traverses the FIN-WAIT-1 and FIN-WAIT-2 states, optionally passing the
CLOSING state, after which it ends up in the TIME-WAIT state. The
connection is required to stay in the TIME-WAIT state for twice the maximum segment lifetime
(MSL) in order to account for duplicate copies of segments that might still be in the network (see
the discussion in Section 3.3.3 on page 20). The remote end goes from the ESTABLISHED state
to the CLOSE-WAIT state in which it stays until the connection is closed by both sides. When
the remote end issues a close, the connection passes the LAST-ACK state and the connection
will be removed at the remote end.
suite is the de facto reference implementation, and is the most well documented implementation
(see for example [WS95]).
Since the first release in 1984, the BSD TCP/IP implementation has evolved and many dif-
ferent versions have been released. The first release to incorporate the TCP congestion control
mechanisms described above was called TCP Tahoe. The Tahoe release still forms the basis of
many TCP/IP implementations found in modern operating systems. The Tahoe release did not
implement the fast retransmit and fast recovery algorithms, which were developed after the re-
lease. The BSD TCP/IP release which incorporated those algorithms, as well as many other
performance related optimizations, was called TCP Reno. TCP Reno has been improved with
better retransmission behavior and those TCP modifications are known as NewReno [FH99].
complete in the shortest possible amount of time. Moreover, when communication protocols
reside in a process, the protocol might have to compete with other processes for CPU resources
and might have to wait for a scheduling quantum before servicing a request.
Another key point in protocol processing is that the processing time of a packet depends
on factors such as CPU scheduling and interrupt handling. The conclusion drawn from this
observation is that the protocol should send large packets and that unneeded packets should be
avoided. An unneeded packet generally requires almost the same amount of processing as a
useful packet, but accomplishes nothing useful.
The design of a protocol stack implementation, where protocols are layered on top of each
other, can be done in different ways. Depending on how the layering is implemented, the
efficiency of the implementation varies. The key issue is the communication overhead
between the protocol layers.
In a small client system that is to operate in a wireless network, there are essentially four quantities
worth optimizing:
• power consumption,
• code size,
• code efficiency, and
• memory utilization.
Power consumption can be reduced by, e.g., tailoring the network protocols or engineering
the physical network device, and is not covered in this work. The efficiency of the code will,
however, affect the power consumption, in that more efficient code requires less electrical CPU
power than less efficient code. Code efficiency requires careful engineering, especially in order to
reduce the amount of data copying. The size of the code can be reduced by careful design and
implementation of the TCP/IP stack, in terms of both the protocol processing and the API. Since
a typical embedded system has more ROM than RAM available, the most profitable optimization
is to reduce the RAM utilization in the client system. This can be done by letting the proxy do
much of the buffering that otherwise would have to be done by the client.
Most of the basic protocols in the TCP/IP suite, such as IP, ICMP, and UDP are fairly simple
by design and it is easy to make a small implementation of these protocols. Additionally, since
they are not designed to be reliable they do not require that end-hosts buffer data. TCP, on the
other hand, is more expensive both in terms of code size and memory consumption mostly due to
the reliability of TCP, which requires it to buffer and retransmit data that is lost in the network.
Figure 3.1. The proxy environment: wireless clients communicate through a wireless router with the proxy, which connects them to the Internet.
The proxy is designed to operate in an environment as shown in Figure 3.1, where one side
of the proxy is connected to the Internet through a wired link, and the other side to a wireless
network with zero or more routers and possibly different wireless link technologies. The fact that
there may be routers in the wireless network means that not all packet losses behind the proxy can
be assumed to stem from bit errors on the wireless links, since packets can also be dropped if the
routers are congested. Although routers may appear in the wireless network, the design of the
proxy does not depend on their existence, and the proxy may be used in an environment with
directly connected clients as well.
In an environment as in Figure 3.1, the wireless clients and the router, which are situated quite
near each other, can communicate using a short range wireless technology such as Bluetooth.
Communication between the router and the proxy can use a longer range, more power consuming
technology such as IEEE 802.11b.
An example of this infrastructure is the Arena project [ARN] conducted at Luleå University
of Technology. In this project, ice hockey players of the local hockey team will be equipped
with sensors for measuring pulse rate, blood pressure, and breathing rate, as well as a camera for
capturing full motion video. Both the sensors and the camera will carry an implementation of
the TCP/IP protocol suite, and information from the sensors and the camera will be transmitted
to receivers on the Internet. The sensors, which correspond to the wireless clients in Figure 3.1,
communicate using Bluetooth technology with the camera, which acts as the wireless router. The
camera is connected to a gateway, which runs the proxy software, using IEEE 802.11b wireless
LAN technology.
Apart from this very concrete example, other examples of this environment are easily imagined.
In an office environment, people have equipment such as hand-held computers, and at
each desk a wireless router enables them to use the hand-held devices on the corporate network.
In an industrial environment, the machines might be equipped with sensors for measurement and
control. Each machine also has one sensor through which the others communicate. The sensors
might run some distributed control algorithm for controlling the machine, and the process can be
monitored from a remote location via a local IP network or over the Internet.
The proxy does not require any modifications to TCP in either the wireless clients or the fixed
hosts in the Internet. This is advantageous since any TCP/IP implementation may be used in the
wireless clients, and also simplifies communication between clients in the wireless network behind
the proxy.
3.1 Architecture
The proxy operates as an IP router in the sense that it forwards IP packets between the two
networks to which it is connected, but also captures TCP segments coming from and going to the
wireless clients. Those TCP segments are not necessarily directly forwarded to the wireless hosts,
but may be queued for later delivery if necessary. IP packets carrying protocols other than TCP
are forwarded immediately. The path of the packets can be seen in Figure 3.2. The proxy does
both per-packet processing and per-connection processing in order to offload the client system.
Per-connection processing pertains only to TCP connections.
Figure 3.2. The path of packets through the proxy: TCP segments are passed up to TCP level processing, while other IP packets are forwarded directly at the IP level.
long for all fragments. Rather, each IP packet which lacks one or more fragments is associated
with a lifetime, and if the missing fragments have not been received within the lifetime, the packet
is discarded. This means that if one or more fragments were lost on their way to the receiver, the
other fragments are kept in memory for their full lifetime in vain.
Since reassembly of IP fragments might use useful memory for doing useless work in the case
of lost fragments, this process can be moved to the proxy. Also, since the loss of a fragment of an
IP packet implies the loss of all fragments of the packet, IP fragmentation does not work well with
lossy links, such as wireless links. Therefore, by performing the reassembly of fragmented IP packets
at the proxy, the wireless links are better utilized.
The problem with reassembling, potentially large, IP packets at the proxy is that the reassem-
bled packet might be too large for the wireless links behind the proxy. No suitable solution to this
problem has been found, and finding a better solution has been postponed to future work
(Section 5.4).
• acknowledging data sent by the client so that it will not need to wait for an entire round-trip
time (or more) for outstanding data to be acknowledged,
• reordering TCP segments so that the client need not buffer out-of-sequence data, and
• handling connections in the TIME-WAIT state on behalf of the client.
Of these, the first is most useful in connections where the wireless client acts mostly as a
TCP sender, e.g., when the wireless client hosts an HTTP server. The second is most useful
when the client is the primary receiver of data, e.g., when downloading email to the client, and
the third when the client is the first end-point to close down connections, such as when doing
HTTP/1.0 [BLFF96] transfers.
3.3 Per-connection processing
For every active TCP connection being forwarded, the proxy has a Protocol Control Block
(PCB) entry which contains state of the connection. This includes variables such as the IP
addresses and port numbers of the endpoints, the TCP sequence numbers, etc. The PCB entries
themselves are soft-state entities in that each PCB entry has an associated lifetime, which is
updated every time a TCP segment belonging to the connection arrives. If no segments arrive
within the lifetime, the PCB will be completely removed. This ensures that PCBs for inactive
connections and connections that have terminated because of end host reboots will not linger in
the proxy indefinitely. The lifetime depends on the state of the connection; if the proxy holds
cached data for the connection, the lifetime is prolonged.
When a TCP segment arrives, the proxy tries to find a PCB with the exact same IP addresses
and port numbers as the TCP segment. This is similar to the process of finding a PCB match in
an end host, but differs in the way that both connection endpoints are completely specified, i.e.,
there are no wild-card entries in the list of PCBs. A new PCB is created if no match is found.
If a PCB match is found, the proxy will process the TCP segment as described in the following
sections.
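The fully specified four-tuple lookup described above might be sketched as follows. This is an illustrative sketch, not the proxy's actual code; the structure and function names (struct proxy_pcb, pcb_lookup) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PCB entry: both endpoints are always
   fully specified -- the proxy keeps no wild-card (listening) entries. */
struct proxy_pcb {
  struct proxy_pcb *next;
  uint32_t local_ip, remote_ip;
  uint16_t local_port, remote_port;
};

/* Return the PCB that exactly matches the segment's four-tuple,
   or NULL so that the caller can create a fresh PCB. */
struct proxy_pcb *
pcb_lookup(struct proxy_pcb *list,
           uint32_t src_ip, uint16_t src_port,
           uint32_t dst_ip, uint16_t dst_port)
{
  for (struct proxy_pcb *p = list; p != NULL; p = p->next) {
    if (p->local_ip == src_ip && p->local_port == src_port &&
        p->remote_ip == dst_ip && p->remote_port == dst_port) {
      return p;
    }
  }
  return NULL; /* no match: caller creates a new PCB */
}
```

Since there are no wild-card entries, a single exact comparison per list entry suffices; an end host would additionally have to consider partially bound PCBs.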
remote host. It does know, however, that the data has been successfully received by someone.
[Figure: TCP segment exchange through the proxy, showing data and ACK segments followed by the FIN, FIN+ACK exchange that closes the connection.]
time-out. This will lead to faster retransmission of segments that are lost due to bit errors over
the wireless links behind the proxy and higher overall throughput.
Congestion control
Since the proxy is responsible for the retransmission of prematurely acknowledged segments, the
wireless client is unaware of any congestion in the wired internet and is therefore unable to respond
to it. One approach to solve this problem would be to let the proxy sense the congestion, and
use Explicit Congestion Notification [RF99] (ECN) to inform the client of the congestion. The
client would then reduce its sending rate appropriately. The disadvantage of this approach is that
the client is forced to buffer data that the proxy could have assumed responsibility for. Also, it
contradicts the idea of having the data moved to the proxy as fast as possible.
Instead, the proxy assumes responsibility for the congestion control over the wired Internet.
Since the proxy has the responsibility for retransmitting segments that it has acknowledged, the
same congestion control mechanisms that are used in ordinary TCP can be used by the proxy.
When the congestion window at the proxy does not allow the proxy to send more segments,
any segments coming from the client are acknowledged, and the advertised receiver's window is
artificially closed. To the wireless client it appears as if the application at the remote host does
not read data at the same rate that the wireless client is sending. When doing this, the congestion
control problem of the wired links is mirrored as a flow control problem in the wireless network
behind the proxy.
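Artificially closing the advertised window means rewriting the window field of an ACK forwarded to the client, which in turn requires patching the TCP checksum. A sketch of this, using the incremental checksum update from RFC 1624, might look as follows; the function names are hypothetical and byte-order handling is omitted for clarity.

```c
#include <stdint.h>

/* Incremental Internet-checksum update (RFC 1624) for one 16-bit
   field changing from old_val to new_val. Values in host order. */
uint16_t chksum_adjust(uint16_t chksum, uint16_t old_val, uint16_t new_val)
{
  uint32_t sum = (uint16_t)~chksum;   /* back to one's-complement sum */
  sum += (uint16_t)~old_val;          /* subtract the old field value */
  sum += new_val;                     /* add the new field value */
  sum = (sum & 0xffff) + (sum >> 16); /* fold carries */
  sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t)~sum;
}

/* When the proxy's congestion window is full, the ACK forwarded to
   the wireless client advertises a zero window; the TCP checksum is
   patched incrementally rather than recomputed over the segment. */
void clamp_advertised_window(uint16_t *wnd_field, uint16_t *tcp_chksum,
                             int cwnd_full)
{
  if (cwnd_full && *wnd_field != 0) {
    *tcp_chksum = chksum_adjust(*tcp_chksum, *wnd_field, 0);
    *wnd_field = 0;
  }
}
```

The incremental update touches only the changed 16-bit word, so the proxy need not reread the whole segment to keep the checksum valid.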
forwards all previously queued out-of-order segments to the client, while trying not to congest any
wireless routers. If the proxy is installed to operate in an environment without wireless routers,
the congestion control features can be switched off.
Using this mechanism, the clients are likely to receive all TCP segments in order. This will
not only relieve the memory burden, but also work well with Van Jacobson's header prediction
optimization [Jac90], which makes processing of in-order segments more efficient than processing
of out-of-order segments.
Since the client receives most of its segments in order, it can refrain from buffering out-of-
order segments. If an out-of-order segment does arrive at the client, it will produce an immediate
ACK. This duplicate ACK will be able to trigger a fast retransmit from the proxy.
[Figure: TCP sequence numbers over time for a connection passing through the proxy.]
the proxy started with sending one segment and has now doubled its congestion window, there-
fore sending twice as many segments. Notice that even if the proxy has buffered the out-of-order
segments, they have not yet been acknowledged to the sender, and therefore still are buffered in
the sender.
[Figure: hosts A and B close connection c with a FIN, FIN+ACK, ACK exchange, after which a new connection c' between the same end-points is opened with a SYN, SYN+ACK, ACK handshake.]
clients often open many simultaneous connections to the server, the memory consumed by the
TIME-WAIT connections can be a significant amount. Also, since every TIME-WAIT connection
occupies a PCB, the time for finding a PCB match when demultiplexing incoming packets will
increase with the number of TIME-WAIT connections.
The naive approach to solving the TIME-WAIT problem is to shorten the time a connection is
in TIME-WAIT. While this reduces memory costs, it can be dangerous due to reasons described
above. Other approaches include keeping TIME-WAIT connections in a smaller data structure
than other connections, modifying TCP so that the client keeps the connection in TIME-WAIT
instead of the server [FTY99], or modifying HTTP so that the client does an active close before
the server [FTY99].
While the above approaches are promising in quite specialized cases, none of them are directly
applicable here. Keeping TIME-WAIT connections in a smaller data structure will still involve
using valuable memory. Modifying TCP contradicts the purpose of this work in that it
produces a solution that does not match the standards and, more importantly, requires changing
TCP in every Internet host. Since a general solution is sought, modifying HTTP is not a plausible
solution either.
The approach taken in this work is to let the proxy handle connections in TIME-WAIT on
behalf of the wireless hosts. Here, the wireless hosts can remove the PCB and reclaim all memory
associated with the connection when entering TIME-WAIT. The relative cost of keeping a TIME-
WAIT connection in the proxy is very small compared to the cost of keeping it in the wireless
host.
When the proxy sees that the wireless client has entered the TIME-WAIT state, it sends
an RST to the client, which kills the connection in the client. The proxy then refrains from
forwarding any TCP segments in that particular connection to the client.
connection. Also, since packets may be lost on their way from the proxy to the wireless host, there
is some uncertainty about which state transitions are actually made in the wireless host.
For example, consider a connection running over the proxy in which the wireless host has closed
the connection and is in FIN-WAIT-1, and the other host is in CLOSE-WAIT. When the wireless
client receives a FINACK segment acknowledging the FIN it sent, it should enter the TIME-WAIT
state (see Figure 2.7). Even if the proxy has seen the FIN segment, we cannot be sure that the
wireless host has entered TIME-WAIT until we know that the FIN has been successfully received.
Thus we cannot conclude that the wireless host is in TIME-WAIT until an acknowledgment for
the FIN has arrived at the proxy.
The state diagram describing the state transitions in the proxy is seen in Figure 3.6. The
abbreviation c stands for “the wireless client” and the abbreviation h stands for “the remote
host”. The remote host is a host on the Internet. The notation SYN + 1 means “the next data
byte in the sequence after the SYN segment”.
This state diagram is similar to the TCP state diagram in Figure 2.7, but with more states.
Notice that there is no LISTEN state in Figure 3.6. This is because there is no way for the proxy
to know that a connection has gone into LISTEN at the wireless host since no segments are sent
when doing the transition from CLOSED to LISTEN.
Explanations for the states are as follows.
SYN-RCVD-1 The remote host has sent a SYN, but the wireless client has not responded.
SYN-RCVD-2 The wireless client has responded with a SYNACK to a SYN from the remote
host.
SYN-RCVD-3 An ACK has been sent by the remote host for the SYNACK sent by the wireless
client. It is uncertain whether the wireless client has entered ESTABLISHED or not.
SYN-SENT-2 The remote host has sent a SYNACK in response to the SYN, but it is uncertain
whether the wireless client has entered ESTABLISHED or not.
ESTABLISHED The wireless client is known to have entered the ESTABLISHED state.
FIN-WAIT-1 The wireless client has sent a FIN and is thus in FIN-WAIT-1.
FIN-WAIT-2 The remote host has acknowledged the FIN, but we do not know if the wireless
client is in FIN-WAIT-1 or FIN-WAIT-2.
FIN-WAIT-3 The remote host has sent a FIN, but we do not know if the wireless client is in
FIN-WAIT-2 or TIME-WAIT.
CLOSING-1 The remote host has sent a FIN but it is uncertain whether the wireless client is
in FIN-WAIT-1 or in CLOSING.
CLOSING-2 The wireless client has acknowledged the FIN, and is in CLOSING.
Since the proxy does not prematurely acknowledge the SYN or FIN segments, the proxy will
acknowledge segments from the wireless client only in the states SYN-RCVD-3, ESTABLISHED
and CLOSE-WAIT. Segments from the remote host will be acknowledged in the states ESTAB-
LISHED, FIN-WAIT-1 and FIN-WAIT-2.
[Figure 3.6: the proxy's TCP state diagram, with transitions between the SYN-RCVD, SYN-SENT, FIN-WAIT, and CLOSING states driven by SYN, ACK, and FIN segments from the wireless client (c) and the remote host (h).]
• The end to end semantics are totally gone. In a split connection approach the proxy termi-
nates the connection and thus acknowledges both the SYN and the FIN segments as well as
all data segments.
• The asymmetry of the approach described here could not be exploited as easily. With the
proxy scheme described here, out of sequence segments going from the wireless client to the
remote Internet host are not queued but rather forwarded immediately. A split connection
scheme would queue out of sequence segments from the wireless client until an in-sequence
segment arrived.
• Soft state cannot be used for the connections in the proxy. A split connection approach
needs to have state even for inactive connections.
3.5 Reliability
By buffering and acknowledging segments at the proxy, the clients are led to believe that the data
has been successfully delivered, even though that may not be the case. If a PCB with cached TCP
data times out, the data will be discarded. Since this data is not kept in any of the end hosts,
discarding it would kill the connection. With a sufficiently long lifetime, however, a time-out of
a PCB with cached data means that either one or both of the end hosts have powered off, or a
permanent network failure has occurred. In either case the connection has timed out at the end
hosts, and cannot be used any longer.
Due to the fact that data is acknowledged at the proxy, a crash of the proxy would have
a severe effect on all active TCP connections over the proxy, since any buffered data would be
lost. It would not be a viable solution to save the cached data to stable storage, such as a hard
disk, due to the enormous overhead involved. Instead, the proxy should be configured to refrain
from caching data for those connections that require higher reliability. Such connections could be
identified by the port numbers or IP addresses of the end-points. The filter could be implemented
by adding a filter PCB in the PCB list, which would carry a flag indicating that the connection
should not be processed by the proxy. Since the proxy already searches the PCB list for each
incoming TCP segment, this solution would not add any complexity to the proxy. This solution is
not general enough to be universally applicable, however, since it can in some cases be hard to know
in advance which port numbers will be used for a connection (FTP data being an example of
this). Finding a better solution to the reliability problem has been postponed as future work.
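The filter PCB idea described above might be sketched as follows. The names (struct filter_pcb, PF_NOPROXY, filter_match) are hypothetical; the point is only that a flagged entry in the ordinary PCB list can mark a connection as one the proxy must pass through without caching.

```c
#include <stddef.h>
#include <stdint.h>

#define PF_NOPROXY 0x01  /* pass segments through without caching */

/* Hypothetical filter entry: a PCB carrying only a port and a flag.
   In practice such entries would come from configuration listing the
   ports of connections that need full end-to-end reliability. */
struct filter_pcb {
  struct filter_pcb *next;
  uint16_t local_port;
  uint8_t flags;
};

/* Returns nonzero if segments to dst_port must not be cached. */
int filter_match(const struct filter_pcb *list, uint16_t dst_port)
{
  for (; list != NULL; list = list->next) {
    if (list->local_port == dst_port && (list->flags & PF_NOPROXY)) {
      return 1;
    }
  }
  return 0;
}
```

Since the proxy already walks the PCB list for every incoming segment, consulting such entries adds no extra pass over the list.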
scope of this thesis (Section 4.8), much of the same code has been used in the implementation of
the proxy.
The proxy has been implemented as a user process running under FreeBSD 4.1. Implementing
the proxy in user space rather than in the operating system kernel has numerous advantages:
• Deployment of the proxy is much easier since it does not involve rebuilding the kernel of the
machine on which the proxy will run.
• Failure of the proxy due to bugs in the code will not compromise the entire system.
One disadvantage is that packets have to be copied multiple times; from kernel space to user
space, and back again to kernel space, after having been processed by the proxy. Also, context has
to be switched between the kernel and the proxy twice per packet. This substantially increases
the delay of packets going through the proxy.
[Figure: the proxy runs as a user process, exchanging packets with the kernel's IP layer through the tunnel devices /dev/tun0 and /dev/tun1.]
Packets destined to the wireless network are forwarded to the tunnel interface tun1, and packets
to the Internet, from the wireless network, are forwarded to tun0. When using tunnel interfaces
to capture and send packets the network sniffer program tcpdump can easily be used to inspect
traffic through the proxy.
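The forwarding core of such a user-space proxy can be sketched as a read/write loop over the two tunnel descriptors. This is a sketch under the assumption (as on FreeBSD tunnel devices) that each read() returns exactly one IP packet; forward_packet is a hypothetical name, and the per-packet processing is elided.

```c
#include <unistd.h>
#include <stddef.h>

/* in_fd and out_fd would be open descriptors for /dev/tun0 and
   /dev/tun1; each read() on a tunnel device yields one IP packet. */
ssize_t forward_packet(int in_fd, int out_fd, char *buf, size_t buflen)
{
  ssize_t n = read(in_fd, buf, buflen);   /* one IP packet per read */
  if (n <= 0) {
    return n;                             /* error or end of stream */
  }
  /* ... per-packet and per-connection processing happens here ... */
  return write(out_fd, buf, (size_t)n);   /* hand the packet back to IP */
}
```

Each traversal thus costs two kernel/user copies and two context switches, which is the overhead discussed above.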
Chapter 4
Design and implementation of the TCP/IP stack
The protocols in the TCP/IP suite are designed in a layered fashion, where each protocol layer
solves a separate part of the communication problem. This layering can serve as a guide for
designing the implementation of the protocols, in that each protocol can be implemented separately
from the other. Implementing the protocols in a strictly layered way can, however, lead to a
situation where the communication overhead between the protocol layers degrades the overall
performance [Cla82a]. To overcome these problems, certain internal aspects of a protocol can be
made known to other protocols. Care must be taken so that only the important information is
shared among the layers.
Most TCP/IP implementations keep a strict division between the application layer and the
lower protocol layers, whereas the lower layers can be more or less interleaved. In most operating
systems, the lower layer protocols are implemented as a part of the operating system kernel with
entry points for communication with the application layer process. The application program is
presented with an abstract view of the TCP/IP implementation, where network communication
differs only very little from inter-process communication or file I/O. The implication of this is
that, since the application program is unaware of the buffer mechanisms used by the lower layers,
it cannot utilize this information to, e.g., reuse buffers with frequently used data. Also, when the
application sends data, this data has to be copied from the application process’ memory space
into internal buffers before being processed by the network code.
The operating systems used in minimal systems such as the target system of lwIP most often
do not maintain a strict protection barrier between the kernel and the application processes. This
allows using a more relaxed scheme for communication between the application and the lower
layer protocols by the means of shared memory. In particular, the application layer can be made
aware of the buffer handling mechanisms used by the lower layers. Therefore, the application can
more efficiently reuse buffers. Also, since the application process can use the same memory as the
networking code the application can read and write directly to the internal buffers, thus saving
the expense of performing a copy.
4.1 Overview
As in many other TCP/IP implementations, the layered protocol design has served as a guide
for the design of the implementation of lwIP. Each protocol is implemented as its own module,
with a few functions acting as entry points into each protocol. Even though the protocols are
implemented separately, some layer violations are made, as discussed above, in order to improve
performance both in terms of processing speed and memory usage. For example, when verifying
the checksum of an incoming TCP segment and when demultiplexing a segment, the source and
destination IP addresses of the segment have to be known by the TCP module. Instead of passing
these addresses to TCP by means of a function call, the TCP module is aware of the structure
of the IP header, and can therefore extract this information by itself.
lwIP consists of several modules. Apart from the modules implementing the TCP/IP protocols
(IP, ICMP, UDP, and TCP), a number of support modules are implemented. The support
modules consist of the operating system emulation layer (described in Section 4.3), the buffer
and memory management subsystems (described in Section 4.4), network interface functions
(described in Section 4.5), and functions for computing the Internet checksum. lwIP also includes
an abstract API, which is described in Section 4.10.
The only process synchronization mechanism provided is semaphores. Even if semaphores are
not available in the underlying operating system, they can be emulated by other synchronization
primitives such as condition variables or locks.
The message passing is done through a simple mechanism which uses an abstraction called
mailboxes. A mailbox has two operations: post and fetch. The post operation will not block the
process; rather, messages posted to a mailbox are queued by the operating system emulation layer
until another process fetches them. Even if the underlying operating system does not have native
support for the mailbox mechanism, mailboxes are easily implemented using semaphores.
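A mailbox built from semaphores might look like the following sketch, assuming POSIX semaphores; the names and the fixed queue size are illustrative, not lwIP's actual emulation-layer code. A counting semaphore tracks queued messages so that fetch blocks when the mailbox is empty, while a binary semaphore protects the ring buffer.

```c
#include <semaphore.h>

#define MBOX_SIZE 8

/* A minimal mailbox: 'mutex' protects the ring buffer, 'msgs'
   counts queued messages so that fetch blocks when empty. */
struct mbox {
  void *queue[MBOX_SIZE];
  int head, tail;
  sem_t mutex, msgs;
};

void mbox_init(struct mbox *m) {
  m->head = m->tail = 0;
  sem_init(&m->mutex, 0, 1);
  sem_init(&m->msgs, 0, 0);
}

/* Post never blocks the caller (the queue is assumed large enough). */
void mbox_post(struct mbox *m, void *msg) {
  sem_wait(&m->mutex);
  m->queue[m->tail] = msg;
  m->tail = (m->tail + 1) % MBOX_SIZE;
  sem_post(&m->mutex);
  sem_post(&m->msgs);          /* signal one more queued message */
}

/* Fetch blocks until a message is available. */
void *mbox_fetch(struct mbox *m) {
  void *msg;
  sem_wait(&m->msgs);          /* wait for a message */
  sem_wait(&m->mutex);
  msg = m->queue[m->head];
  m->head = (m->head + 1) % MBOX_SIZE;
  sem_post(&m->mutex);
  return msg;
}
```

Messages are delivered in FIFO order, matching the queueing behavior described above.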
Figure 4.1. A PBUF RAM pbuf with data in memory managed by the pbuf subsystem.
Pbufs are of three types, PBUF RAM, PBUF ROM, and PBUF POOL. The pbuf shown in
Figure 4.1 represents the PBUF RAM type, and has the packet data stored in memory managed
by the pbuf subsystem. The pbuf in Figure 4.2 is an example of a chained pbuf, where the first
pbuf in the chain is of the PBUF RAM type, and the second is of the PBUF ROM type, which
means that it has the data located in memory not managed by the pbuf system. The third type of
pbuf, PBUF POOL, is shown in Figure 4.3 and consists of fixed size pbufs allocated from a pool
of fixed size pbufs. A pbuf chain may consist of multiple types of pbufs.
The three types have different uses. Pbufs of type PBUF POOL are mainly used by network
device drivers since the operation of allocating a single pbuf is fast and is therefore suitable for
Figure 4.2. A PBUF RAM pbuf chained with a PBUF ROM pbuf that has data in external
memory.
Figure 4.3. Chained PBUF POOL pbufs from the pbuf pool.
use in an interrupt handler. PBUF ROM pbufs are used when an application sends data that is
located in memory managed by the application. This data may not be modified after the pbuf
has been handed over to the TCP/IP stack, and therefore this pbuf type's main use is when the
data is located in ROM (hence the name PBUF ROM). Headers that are prepended to the data
in a PBUF ROM pbuf are stored in a PBUF RAM pbuf that is chained to the front of the
PBUF ROM pbuf, as in Figure 4.2.
Pbufs of the PBUF RAM type are also used when an application sends data that is dynamically
generated. In this case, the pbuf system allocates memory not only for the application data, but
also for the headers that will be prepended to the data. This is seen in Figure 4.1. The pbuf
system cannot know in advance what headers will be prepended to the data and assumes the
worst case. The size of the headers is configurable at compile time.
In essence, incoming pbufs are of type PBUF POOL and outgoing pbufs are of the PBUF ROM
or PBUF RAM types.
The internal structure of a pbuf can be seen in the Figures 4.1 through 4.3. The pbuf structure
consists of two pointers, two length fields, a flags field, and a reference count. The next field is a
pointer to the next pbuf in case of a pbuf chain. The payload pointer points to the start of the
data in the pbuf. The len field contains the length of the data contents of the pbuf. The tot_len
field contains the sum of the length of the current pbuf and all len fields of following pbufs in
the pbuf chain. In other words, the tot_len field is the sum of the len field and the value of the
tot_len field in the following pbuf in the pbuf chain. The flags field indicates the type of the
pbuf and the ref field contains a reference count. The next and payload fields are native pointers
and the size of those varies depending on the processor architecture used. The two length fields
are 16 bit unsigned integers and the flags and ref fields are 4 bit wide. The total size of the
pbuf structure depends on the size of a pointer in the processor architecture being used and on the
smallest alignment possible for the processor architecture. On an architecture with 32 bit pointers
and 4 byte alignment, the total size is 16 bytes and on an architecture with 16 bit pointers and 1
byte alignment, the size is 9 bytes.
The pbuf module provides functions for manipulation of pbufs. Allocation of a pbuf is done
by the function pbuf_alloc() which can allocate pbufs of any of the three types described above.
The function pbuf_ref() increases the reference count. Deallocation is made by the function
pbuf_free(), which first decreases the reference count of the pbuf. If the reference count reaches
zero the pbuf is deallocated. The function pbuf_realloc() shrinks the pbuf so that it occupies
just enough memory to cover the size of the data. The function pbuf_header() adjusts the
payload pointer and the length fields so that a header can be prepended to the data in the pbuf.
The functions pbuf_chain() and pbuf_dechain() are used for chaining pbufs.
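The invariants behind tot_len and pbuf_header() can be illustrated with a simplified sketch. This is not the real lwIP declaration (which packs flags and ref into 4 bits each) and the _sketch functions are hypothetical stand-ins for pbuf_chain() and pbuf_header().

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified version of the pbuf structure described above. */
struct pbuf {
  struct pbuf *next;
  void *payload;
  uint16_t len, tot_len;
  uint8_t flags, ref;
};

/* Maintain the invariant: tot_len = len + tot_len of the next pbuf. */
void pbuf_chain_sketch(struct pbuf *h, struct pbuf *t) {
  h->next = t;
  h->tot_len = h->len + t->tot_len;
}

/* pbuf_header() analogue: move the payload pointer back by 'size'
   bytes so that a protocol header can be prepended in place. This
   relies on the allocator having reserved header space in front of
   the payload, as described for PBUF RAM pbufs above. */
void pbuf_header_sketch(struct pbuf *p, uint16_t size) {
  p->payload = (char *)p->payload - size;
  p->len += size;
  p->tot_len += size;
}
```

Because header space is preallocated, prepending a TCP or IP header is a pointer adjustment rather than a copy.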
[Figure: allocation blocks in memory, each with next and prev pointers and a used flag indicating whether the block is in use.]
Memory is allocated by searching the memory for an unused allocation block that is large
enough for the requested allocation. The first-fit principle is used so that the first block that is
large enough is used. When an allocation block is deallocated, its used flag is set to zero. In order
to prevent fragmentation, the used flags of the next and previous allocation blocks are checked. If
any of them are unused, the blocks are combined into one larger unused block.
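A first-fit allocator of this style can be sketched as follows. This is an illustrative sketch, not lwIP's mem.c: block headers store only a size and a used flag, and for brevity only the merge with the following block is shown (the real scheme also merges with the previous block).

```c
#include <stddef.h>

#define HEAP_SIZE 1024

/* Each allocation block starts with a header; blocks are laid out
   contiguously in a static heap. */
struct block {
  size_t size;   /* payload bytes in this block */
  int used;
};

static size_t heap_mem[HEAP_SIZE / sizeof(size_t)]; /* aligned heap */
static unsigned char * const heap = (unsigned char *)heap_mem;

static struct block *first_block(void) { return (struct block *)heap; }
static struct block *next_block(struct block *b) {
  unsigned char *p = (unsigned char *)(b + 1) + b->size;
  return (p >= heap + HEAP_SIZE) ? NULL : (struct block *)p;
}

void mem_init_sketch(void) {
  struct block *b = first_block();
  b->size = HEAP_SIZE - sizeof(struct block);
  b->used = 0;
}

/* First fit: take the first free block large enough, split off the rest. */
void *mem_malloc_sketch(size_t size) {
  size = (size + 7) & ~(size_t)7;             /* keep blocks aligned */
  for (struct block *b = first_block(); b != NULL; b = next_block(b)) {
    if (!b->used && b->size >= size + sizeof(struct block)) {
      struct block *rest = (struct block *)((unsigned char *)(b + 1) + size);
      rest->size = b->size - size - sizeof(struct block);
      rest->used = 0;
      b->size = size;
      b->used = 1;
      return b + 1;
    }
  }
  return NULL;
}

void mem_free_sketch(void *ptr) {
  struct block *b = (struct block *)ptr - 1;
  b->used = 0;
  struct block *n = next_block(b);
  if (n != NULL && !n->used) {   /* coalesce with following free block */
    b->size += sizeof(struct block) + n->size;
  }
}
```

Coalescing on free keeps long-running systems from fragmenting the small static heap into unusably small blocks.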
network interfaces are kept on a global linked list, which is linked by the next pointer in the
structure.
struct netif {
struct netif *next;
char name[2];
int num;
struct ip_addr ip_addr;
struct ip_addr netmask;
struct ip_addr gw;
void (* input)(struct pbuf *p, struct netif *inp);
int (* output)(struct netif *netif, struct pbuf *p,
struct ip_addr *ipaddr);
void *state;
};
Each network interface has a name, stored in the name field in Figure 4.5. This two letter
name identifies the kind of device driver used for the network interface and is only used when the
interface is configured by a human operator at runtime. The name is set by the device driver and
should reflect the kind of hardware that is represented by the network interface. For example, a
network interface for a Bluetooth driver might have the name bt and a network interface for IEEE
802.11b WLAN hardware could have the name wl. Since the names are not necessarily unique,
the num field is used to distinguish different network interfaces of the same kind.
The three IP addresses ip_addr, netmask and gw are used by the IP layer when sending and
receiving packets, and their use is described in the next section. It is not possible to configure a
network interface with more than one IP address. Rather, one network interface would have to be
created for each IP address.
The input pointer points to the function the device driver should call when a packet has been
received.
A network interface is connected to a device driver through the output pointer. This pointer
points to a function in the device driver that transmits a packet on the physical network and it is
called by the IP layer when a packet is to be sent. This field is filled by the initialization function
of the device driver. The third argument to the output function, ipaddr, is the IP address of
the host that should receive the actual link layer frame. It does not have to be the same as the
destination address of the IP packet. In particular, when sending an IP packet to a host that is
not on the local network, the link level frame will be sent to a router on the network. In this case,
the IP address given to the output function will be the IP address of the router.
Finally, the state pointer points to device driver specific state for the network interface and
is set by the device driver.
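How a device driver fills in the netif fields can be sketched as follows, reusing the struct netif declaration shown above (re-declared here so the sketch stands alone). The Bluetooth driver names btif_init and btif_output are hypothetical examples in the spirit of the bt interface name mentioned above.

```c
#include <stddef.h>

/* Minimal re-declarations of the structures shown above. */
struct pbuf;
struct ip_addr { unsigned int addr; };
struct netif {
  struct netif *next;
  char name[2];
  int num;
  struct ip_addr ip_addr, netmask, gw;
  void (* input)(struct pbuf *p, struct netif *inp);
  int (* output)(struct netif *netif, struct pbuf *p,
                 struct ip_addr *ipaddr);
  void *state;
};

/* Hypothetical Bluetooth driver transmit function; ipaddr is the
   link-level next hop, not necessarily the packet's destination. */
static int btif_output(struct netif *netif, struct pbuf *p,
                       struct ip_addr *ipaddr) {
  (void)netif; (void)p; (void)ipaddr;
  /* ...hand the frame to the hardware here... */
  return 0;
}

/* Driver initialization: the driver sets name and output, as
   described above; IP addresses come from operator configuration. */
void btif_init(struct netif *netif) {
  netif->name[0] = 'b';
  netif->name[1] = 't';
  netif->output = btif_output;
  netif->state = NULL;
}
```

The IP layer never calls the hardware directly; it only ever goes through the output pointer installed here.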
4.6 IP processing
lwIP implements only the most basic functionality of IP. It can send, receive and forward packets,
but cannot send or receive fragmented IP packets nor handle packets with IP options. For most
applications this does not pose any problems.
done, as well as computing and checking the header checksum. It is expected that the stack will
not receive any IP fragments since the proxy described in Chapter 3 is assumed to reassemble any
fragmented packets, thus any packet that is an IP fragment is silently discarded. Packets carrying
IP options are also assumed to be handled by the proxy, and are dropped.
Next, the function compares the destination address with the IP addresses of the network interfaces
to determine whether the packet is destined for this host. The network interfaces are kept on a linked
list, which is searched linearly. The number of network interfaces is expected to be small, so a
more sophisticated search strategy than a linear search has not been implemented.
If the incoming packet is found to be destined for this host, the protocol field is used to decide
to which higher level protocol the packet should be passed.
[Figure: ICMP processing; icmp_input() and icmp_dest_unreach() in the transport layer exchange packets with ip_input() and ip_output() in the internetwork layer, which in turn uses netif->output() in the network interface layer.]
ICMP ECHO messages are widely used to probe a network, and therefore ICMP echo
processing is optimized for performance. The actual processing takes place in icmp_input(), and
consists of swapping the IP destination and source addresses of the incoming packet, changing the
ICMP type to echo reply, and adjusting the ICMP checksum. The packet is then passed back to the
IP layer for transmission.
struct udp_pcb {
struct udp_pcb *next;
struct ip_addr local_ip, dest_ip;
u16_t local_port, dest_port;
u8_t flags;
u16_t chksum_len;
void (* recv)(void *arg, struct udp_pcb *pcb, struct pbuf *p);
void *recv_arg;
};
The UDP PCB structure contains a pointer to the next PCB in the global linked list of UDP
PCBs. A UDP session is defined by the IP addresses and port numbers of the end-points and
these are stored in the local_ip, dest_ip, local_port and dest_port fields. The flags field
indicates which UDP checksum policy should be used for this session. This can be either to
switch UDP checksumming off completely, or to use UDP Lite [LDP99] in which the checksum
covers only parts of the datagram. If UDP Lite is used, the chksum_len field specifies how much
of the datagram should be checksummed.
The last two fields, recv and recv_arg, are used when a datagram is received in the
session specified by the PCB. The function pointed to by recv is called when a datagram is
received.
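The callback convention can be illustrated with a small sketch. The delivery helper udp_deliver and the example callback are hypothetical; they only mimic what udp_input() does once the matching PCB has been found.

```c
#include <stddef.h>

struct pbuf;   /* opaque here */

/* Simplified udp_pcb with just the fields used by delivery. */
struct udp_pcb {
  void (* recv)(void *arg, struct udp_pcb *pcb, struct pbuf *p);
  void *recv_arg;
};

/* What udp_input() does after demultiplexing: invoke the callback,
   passing the application's recv_arg back to it. */
void udp_deliver(struct udp_pcb *pcb, struct pbuf *p) {
  if (pcb->recv != NULL) {
    pcb->recv(pcb->recv_arg, pcb, p);
  }
}

/* Example application callback: counts datagrams via recv_arg. */
void count_datagram(void *arg, struct udp_pcb *pcb, struct pbuf *p) {
  (void)pcb; (void)p;
  (*(int *)arg)++;
}
```

The recv_arg pointer lets one callback function serve many sessions, each with its own per-session state.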
Due to the simplicity of UDP, the input and output processing is equally simple and follows
a fairly straight line (Figure 4.8). To send data, the application program calls udp_send(), which
calls upon udp_output(). Here the necessary checksumming is done and the UDP header fields are
filled in. Since the checksum includes the IP source address of the IP packet, the function ip_route()
is in some cases called to find the network interface to which the packet is to be transmitted. The
outgoing network interface could be cached in the PCB, but this is currently not done. The IP
address of this network interface is used as the source IP address of the packet. Finally, the packet
is turned over to ip_output_if() for transmission.
[Figure 4.8: UDP processing; udp_send() and udp_output() handle output and udp_input() handles input in the transport layer, above netif->output() in the network interface layer.]
When a UDP datagram arrives, the IP layer calls the udp_input() function. Here, if check-
summing is to be used in the session, the UDP checksum is checked, and the datagram is demul-
tiplexed. When the corresponding UDP PCB is found, the recv function is called.
4.8.1 Overview
The basic TCP processing (Figure 4.9) is divided into six functions: tcp_input(),
tcp_process(), and tcp_receive(), which are related to TCP input processing, and tcp_write(),
tcp_enqueue(), and tcp_output(), which deal with output processing.
When an application wants to send TCP data, tcp_write() is called. The function tcp_write()
passes control to tcp_enqueue(), which will break the data into appropriately sized TCP segments
if necessary and put the segments on the transmission queue for the connection. The function
tcp_output() will then check if it is possible to send the data, i.e., if there is enough space in the
receiver's window and if the congestion window is large enough, and if so, sends the data using
ip_route() and ip_output_if().
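The segmentation step performed by tcp_enqueue() amounts to ceiling division by the connection's MSS. The following helper functions are hypothetical illustrations of that arithmetic, not lwIP's actual code.

```c
#include <stdint.h>

/* Number of segments needed to carry len bytes with a given MSS. */
int tcp_segment_count(uint32_t len, uint32_t mss) {
  return (int)((len + mss - 1) / mss);   /* ceiling division */
}

/* Length of segment i (0-based) in the split: every segment is a
   full MSS except possibly the last one. */
uint32_t tcp_segment_len(uint32_t len, uint32_t mss, int i) {
  uint32_t off = (uint32_t)i * mss;
  return (len - off < mss) ? (len - off) : mss;
}
```

For example, 3000 bytes with an MSS of 1460 become two full segments and one 80-byte tail segment.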
Input processing begins when ip_input(), after verifying the IP header, hands over a TCP
segment to tcp_input(). In this function the initial sanity checks (i.e., checksumming and TCP
options parsing) are done, as well as deciding to which TCP connection the segment belongs. The
segment is then processed by tcp_process(), which implements the TCP state machine, and any
necessary state transitions are made. The function tcp_receive() will be called if the connection
is in a state to accept data from the network. If so, tcp_receive() will pass the segment up to
an application program. If the segment constitutes an ACK for unacknowledged (thus previously
buffered) data, the data is removed from the buffers and its memory is reclaimed. Also, if an ACK
for data was received, the receiver might be willing to accept more data and therefore tcp_output()
is called.
struct tcp_pcb {
struct tcp_pcb *next;
enum tcp_state state; /* TCP state */
void (* accept)(void *arg, struct tcp_pcb *newpcb);
void *accept_arg;
struct ip_addr local_ip;
u16_t local_port;
struct ip_addr dest_ip;
u16_t dest_port;
u32_t rcv_nxt, rcv_wnd; /* receiver variables */
u16_t tmr;
u32_t mss; /* maximum segment size */
u8_t flags;
u16_t rttest; /* rtt estimation */
u32_t rtseq; /* sequence no for rtt estimation */
s32_t sa, sv; /* rtt average and variance */
u32_t rto; /* retransmission time-out */
u32_t lastack; /* last ACK received */
u8_t dupacks; /* number of duplicate ACKs */
u32_t cwnd, ssthresh; /* congestion control variables */
u32_t snd_ack, snd_nxt, /* sender variables */
snd_wnd, snd_wl1, snd_wl2, snd_lbb;
void (* recv)(void *arg, struct tcp_pcb *pcb, struct pbuf *p);
void *recv_arg;
struct tcp_seg *unsent, *unacked, /* queues */
*ooseq;
};
The fields rttest, rtseq, sa, and sv are used for the round-trip time estimation. The sequence
number of the segment that is used for estimating the round-trip time is stored in rtseq and the
time this segment was sent is stored in rttest. The average round-trip time and the round-trip
time variance are stored in sa and sv. These variables are used when calculating the retransmission
time-out, which is stored in the rto field.
The two fields lastack and dupacks are used in the implementation of fast retransmit and
fast recovery. The lastack field contains the sequence number acknowledged by the last ACK
received and dupacks contains a count of how many duplicate ACKs have been received for the
sequence number in lastack. The current congestion window for the connection is stored in the
cwnd field and the slow start threshold is kept in ssthresh.
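The bookkeeping of lastack and dupacks can be sketched as follows. This is an illustrative fragment, not the actual lwIP code; the structure and function names, as well as the threshold of three duplicate ACKs, are assumptions for the example (three is the conventional fast retransmit trigger).

```c
#include <stdint.h>

/* Illustrative sketch of fast retransmit bookkeeping: an ACK that
   repeats lastack increments the duplicate counter; the third
   duplicate signals that the oldest unacked segment should be
   retransmitted.  Returns 1 when fast retransmit should fire. */
struct fr_state {
  uint32_t lastack;   /* sequence number acked by the last ACK */
  uint8_t  dupacks;   /* number of duplicates of lastack seen */
};

static int
fr_ack_received(struct fr_state *s, uint32_t ackno)
{
  if (ackno == s->lastack) {
    if (++s->dupacks == 3)     /* third duplicate ACK */
      return 1;
  } else {
    s->lastack = ackno;        /* new data acknowledged */
    s->dupacks = 0;
  }
  return 0;
}
```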
The six fields snd_ack, snd_nxt, snd_wnd, snd_wl1, snd_wl2 and snd_lbb are used when sending
data. The highest sequence number acknowledged by the receiver is stored in snd_ack and the
next sequence number to send is kept in snd_nxt. The receiver's advertised window is held in
snd_wnd and the two fields snd_wl1 and snd_wl2 are used when updating snd_wnd. The snd_lbb
field contains the sequence number of the last byte queued for transmission.
The function pointer recv and its argument recv_arg are used when passing received data to the
application layer. The three queues unsent, unacked and ooseq are used when sending and receiving data.
Data that has been received from the application but has not been sent is queued in unsent, and
data that has been sent but not yet acknowledged by the remote host is held in unacked. Received
data that is out of sequence is buffered in ooseq.
The tcp_seg structure in Figure 4.11 is the internal representation of a TCP segment. This
structure starts with a next pointer, which is used for linking when queuing segments. The len
struct tcp_seg {
struct tcp_seg *next;
u16_t len;
struct pbuf *p;
struct tcp_hdr *tcphdr;
void *data;
u16_t rtime;
};
field contains the length of the segment in TCP terms. This means that the len field for a data
segment will contain the length of the data in the segment, and the len field for an empty segment
with the SYN or FIN flags set will be 1. The pbuf p is the buffer containing the actual segment, and
the tcphdr and data pointers point to the TCP header and the data in the segment, respectively.
For outgoing segments, the rtime field is used for the retransmission time-out of this segment.
Since incoming segments will not need to be retransmitted, this field is not needed and memory
for this field is not allocated for incoming segments.
When a segment is on the unacked list, it is also timed for retransmission as described in
Section 4.8.8. When a segment is retransmitted, the TCP and IP headers of the original segment
are kept and only very small changes have to be made to the TCP header. The ackno and wnd fields
of the TCP header are set to the current values since we could have received data during the time
between the original transmission of the segment and the retransmission. This changes only two
16-bit words in the header and the whole TCP checksum does not have to be recomputed since
simple arithmetic [Rij94] can be used to update the checksum. The IP layer has already added
the IP header when the segment was originally transmitted and there is no reason to change it.
Thus a retransmission does not require any recomputation of the IP header checksum.
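The incremental update follows the standard one's-complement identity: if a 16-bit header word m changes to m', the new checksum HC' equals ~(~HC + ~m + m'). The sketch below illustrates this against a full recomputation; it is not taken from the lwIP source, and the function names are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Standard Internet checksum over an array of 16-bit words. */
static uint16_t
chksum(const uint16_t *data, size_t n)
{
  uint32_t sum = 0;
  size_t i;
  for (i = 0; i < n; i++)
    sum += data[i];
  while (sum >> 16)                 /* fold end-around carries */
    sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t)~sum;
}

/* Incremental update: recompute the checksum after one 16-bit word
   changes from oldw to neww, without touching the rest of the
   header (HC' = ~(~HC + ~m + m')). */
static uint16_t
chksum_update(uint16_t old_sum, uint16_t oldw, uint16_t neww)
{
  uint32_t sum = (uint16_t)~old_sum + (uint16_t)~oldw + neww;
  while (sum >> 16)
    sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t)~sum;
}
```

Updating the ackno and wnd fields thus costs two such word updates instead of a pass over the whole segment.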
The Silly Window Syndrome [Cla82b] (SWS) is a TCP phenomenon that can lead to very bad
performance. SWS occurs when a TCP receiver advertises a small window and the TCP sender
immediately sends data to fill the window. When this small segment is acknowledged, the window
is opened again by a small amount and the sender will again send a small segment to fill the window.
This leads to a situation where the TCP stream consists of very small segments. In order to avoid
SWS, both the sender and the receiver must try to avoid this situation. The receiver must not
advertise small window updates and the sender must not send small segments when only a small
window is offered.
In lwIP SWS is naturally avoided at the sender since TCP segments are constructed and
queued without knowledge of the advertised receiver’s window. In a large transfer the output
queue will consist of maximum sized segments. This means that if a TCP receiver advertises a
small window, the sender will not send the first segment on the queue since it is larger than the
advertised window. Instead, it will wait until the window is large enough for a maximum sized
segment.
When acting as a TCP receiver, lwIP will not advertise a receiver’s window that is smaller
than the maximum segment size of the connection.
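The receiver-side rule can be illustrated by the following sketch. This is not the actual lwIP code; the function name is an assumption, and in this simplification a window smaller than one maximum segment size is reported as zero rather than withheld.

```c
#include <stdint.h>

/* Illustrative sketch of receiver-side SWS avoidance: never
   advertise a window smaller than one maximum segment size. */
static uint32_t
rcv_wnd_to_advertise(uint32_t free_space, uint32_t mss)
{
  return free_space < mss ? 0 : free_space;
}
```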
When TCP segments arrive at the tcp_input() function, they are demultiplexed between the
TCP PCBs. The demultiplexing key is the source and destination IP addresses and the TCP
port numbers. There are two types of PCBs that must be distinguished when demultiplexing a
segment: those that correspond to open connections and those that correspond to connections that
are half open. Half open connections are those that are in the LISTEN state and only have the
local TCP port number specified, and optionally the local IP address, whereas open connections
have both IP addresses and both port numbers specified.
Many TCP implementations, such as the early BSD implementations, use a linked list of PCBs
together with a single-entry cache. The rationale behind this is that most TCP
connections constitute bulk transfers which typically show a large amount of locality [Mog92],
resulting in a high cache hit ratio. Other caching schemes include keeping two one-entry caches,
one for the PCB corresponding to the last packet that was sent and one for the PCB of the last
packet received [PP93]. An alternative scheme to exploit locality is to move the most
recently used PCB to the front of the list. Both methods have been shown [MD92] to outperform
the single-entry cache scheme.
In lwIP, whenever a PCB match is found when demultiplexing a segment, the PCB is moved
to the front of the list of PCBs. PCBs for connections in the LISTEN state are not moved to the
front however, since such connections are not expected to receive segments as often as connections
that are in a state in which they receive data.
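The move-to-front scheme can be sketched with an ordinary singly linked list. The pcb structure and the integer key below are stand-ins for the real tcp_pcb and the address/port tuple; this is not the actual lwIP code.

```c
#include <stddef.h>

/* Illustrative move-to-front demultiplexing: when a PCB matches,
   unlink it and put it first on the list, so that the next lookup
   for the same connection terminates immediately. */
struct pcb {
  struct pcb *next;
  int key;              /* stands in for addresses and ports */
};

static struct pcb *
pcb_demux(struct pcb **list, int key)
{
  struct pcb *pcb, *prev = NULL;
  for (pcb = *list; pcb != NULL; prev = pcb, pcb = pcb->next) {
    if (pcb->key == key) {
      if (prev != NULL) {          /* not already at the front */
        prev->next = pcb->next;    /* unlink */
        pcb->next = *list;         /* move to front */
        *list = pcb;
      }
      return pcb;
    }
  }
  return NULL;
}
```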
Receiving data
The actual processing of incoming segments is done in the function tcp_receive(). The ac-
knowledgment number of the segment is compared with the segments on the unacked queue of
the connection. If the acknowledgment number is higher than the sequence number of a segment
on the unacked queue, that segment is removed from the queue and the allocated memory for the
segment is deallocated.
An incoming segment is out of sequence if the sequence number of the segment is higher than
the rcv_nxt variable in the PCB. Out of sequence segments are queued on the ooseq queue in
the PCB. If the sequence number of the incoming segment is equal to rcv_nxt, the segment is
delivered to the upper layer by calling the recv function in the PCB and rcv_nxt is increased by
the length of the incoming segment. Since the reception of an in-sequence segment might mean
that a previously received out of sequence segment now is the next segment expected, the ooseq
queue is checked. If it contains a segment with sequence number equal to rcv_nxt, this segment
is delivered to the application by a call to the recv function and rcv_nxt is updated. This process
continues until either the ooseq queue is empty or the next segment on ooseq is out of sequence.
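The delivery loop over the ooseq queue can be sketched as follows. The seg structure and the deliver() callback are stand-ins for lwIP's tcp_seg and the recv upcall; this is an illustrative fragment, not the lwIP source, and the ooseq list is assumed sorted by sequence number.

```c
#include <stdint.h>
#include <stddef.h>

struct seg {
  struct seg *next;
  uint32_t seqno;
  uint32_t len;
};

/* Returns the new rcv_nxt after delivering every segment on the
   ooseq list that has become in-sequence; delivered segments are
   unlinked from the queue. */
static uint32_t
deliver_in_sequence(struct seg **ooseq, uint32_t rcv_nxt,
                    void (*deliver)(struct seg *))
{
  while (*ooseq != NULL && (*ooseq)->seqno == rcv_nxt) {
    struct seg *s = *ooseq;
    *ooseq = s->next;            /* unlink from the queue */
    rcv_nxt += s->len;           /* advance the next expected seqno */
    if (deliver != NULL)
      deliver(s);                /* pass up to the application */
  }
  return rcv_nxt;
}
```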
If a supporting proxy is used, the proxy mechanism for ordering TCP segments described in
Section 3.3.2 will lessen the need for the client to buffer out of sequence segments. Therefore,
lwIP may be configured to refrain from buffering such segments.
4.8.8 Timers
As in the BSD TCP implementation, lwIP uses two periodic timers that go off every 200
ms and 500 ms. These two timers are then used to implement more complex logical timers such
as the retransmission timers, the TIME-WAIT timer and the delayed ACK timer.
The fine grained timer, tcp_timer_fine(), goes through every TCP PCB checking if there are
any delayed ACKs that should be sent, as indicated by the flags field in the tcp_pcb structure
(Figure 4.10). If the delayed ACK flag is set, an empty TCP acknowledgment segment is sent and
the flag is cleared.
The coarse grained timer, implemented in tcp_timer_coarse(), also scans the PCB list. For
every PCB, the list of unacknowledged segments (the unacked pointer in the tcp_pcb structure
in Figure 4.10) is traversed, and the rtime variable is increased. If rtime becomes larger than
the current retransmission time-out as given by the rto variable in the PCB, the segment is
retransmitted and the retransmission time-out is doubled. A segment is retransmitted only if
allowed by the values of the congestion window and the advertised receiver's window. After
retransmission, the congestion window is set to one maximum segment size, the slow start threshold
is set to half of the effective window size, and slow start is initiated on the connection.
For connections that are in TIME-WAIT, the coarse grained timer also increases the tmr field
in the PCB structure. When this timer reaches the 2×MSL threshold, the connection is removed.
The coarse grained timer also increases a global TCP clock, tcp_ticks. This clock is used for
round-trip time estimation and retransmission time-outs.
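The actions taken on a retransmission time-out can be summarized in the following sketch. The structure, the function name, and the 2×MSS lower bound on ssthresh are illustrative assumptions (the lower bound is the conventional one), not the actual lwIP code.

```c
#include <stdint.h>

/* Illustrative sketch of retransmission time-out handling: double
   the RTO (exponential backoff), set the slow start threshold to
   half the effective window, and restart slow start with a
   congestion window of one maximum segment size. */
struct cc_state {
  uint32_t rto;       /* retransmission time-out */
  uint32_t cwnd;      /* congestion window */
  uint32_t ssthresh;  /* slow start threshold */
};

static void
retransmit_timeout(struct cc_state *s, uint32_t mss,
                   uint32_t snd_wnd)
{
  uint32_t eff_wnd = s->cwnd < snd_wnd ? s->cwnd : snd_wnd;
  s->rto *= 2;                 /* exponential backoff */
  s->ssthresh = eff_wnd / 2;   /* half the effective window */
  if (s->ssthresh < 2 * mss)   /* conventional lower bound */
    s->ssthresh = 2 * mss;
  s->cwnd = mss;               /* restart slow start */
}
```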
parts, one part dealing with the communication and one part dealing with the computation. The
part doing the communication would then reside in the TCP/IP process and the computationally
heavy part would be a separate process. The lwIP API presented in the next section provides a
structured way to divide the application in such a way.
using the interprocess communication (IPC) mechanisms provided by the operating system emu-
lation layer. The current implementation uses the following three IPC mechanisms:
• message passing,
• shared memory, and
• semaphores.
While these IPC types are supported by the operating system layer, they need not be directly
supported by the underlying operating system. For operating systems that do not natively support
them, the operating system emulation layer emulates them.
The general design principle used is to let as much work as possible be done within the ap-
plication process rather than in the TCP/IP process. This is important since all processes use
the TCP/IP process for their TCP/IP communication. Keeping down the code footprint of the
part of the API that is linked with the applications is not as important. This code can be shared
among the processes, and even if shared libraries are not supported by the operating system, the
code is stored in ROM. Embedded systems usually carry fairly large amounts of ROM, whereas
processing power is scarce.
The buffer management is located in the library part of the API implementation. Buffers are
created, copied and deallocated in the application process. Shared memory is used to pass the
buffers between the application process and the TCP/IP process. The buffer data type used in
communication with the application program is an abstraction of the pbuf data type.
Buffers carrying referenced memory, as opposed to allocated memory, are also passed using
shared memory. For this to work, it has to be possible to share the referenced memory between
the processes. The operating systems used in embedded systems for which lwIP is intended
usually do not implement any form of memory protection, so this will not be a problem.
The functions that handle network connections are implemented in the part of the API im-
plementation that resides in the TCP/IP process. The API functions in the part of the API that
runs in the application process will pass a message using a simple communication protocol to the
API implementation in the TCP/IP process. The message includes the type of operation that
should be carried out and any arguments for the operation. The operation is carried out by the
API implementation in the TCP/IP process and the return value is sent to the application process
by message passing.
• The Intel Pentium III processor, henceforth referred to as the Intel x86 processor. The code
was compiled with gcc 2.95.2 under FreeBSD 4.1 with compiler optimizations turned on.
• The 6502 processor [Nab, Zak83]. The code was compiled with cc65 2.5.5 [vB] with compiler
optimizations turned on.
The Intel x86 has seven 32-bit registers and uses 32-bit pointers. The 6502, whose main use
today is in embedded systems, has one 8-bit accumulator as well as two 8-bit index registers, and
uses 16-bit pointers.
(Figure 4.13: relative number of lines of code for TCP, IP, UDP, ICMP, the API, and the support functions.)
Table 4.1 summarizes the number of lines of source code in lwIP and Figure 4.13 shows the relative
number of lines of code. The category "Support functions" includes buffer and memory manage-
ment functions as well as the functions for computing the Internet checksum. The checksumming
functions are generic C implementations of the algorithm that should be replaced with processor
specific implementations when actually deployed. The category “API” includes both the part of
the API that is linked with the applications and the part that is linked with the TCP/IP stack.
The operating system emulation layer is not included in this analysis since its size varies heavily
with the underlying operating system and is therefore not interesting to compare.
For the purpose of this comparison all comments and blank lines have been removed from the
source files. Also, no header files were included in the comparison since those files mostly contain
declarations that are repeated in the source code. We see that TCP is vastly larger than the other
protocol implementations and that the API and the support functions taken together are as large
as TCP.
Table 4.2. lwIP object code size when compiled for the Intel x86.
Module Size (bytes) Relative size
TCP 6584 48%
API 2556 18%
Support functions 2281 16%
IP 1173 8%
UDP 731 5%
ICMP 505 4%
Total 13830 100%
Figure 4.14. lwIP object code size when compiled for the x86.
Table 4.3. lwIP object code size when compiled for the 6502.
Module Size (bytes) Relative size
TCP 11461 51%
Support functions 4149 18%
API 3847 17%
IP 1264 6%
UDP 1211 5%
ICMP 714 3%
Total 22646 100%
Figure 4.15. lwIP object code size when compiled for the 6502.
Table 4.3 shows the sizes of the object code when compiled for the 6502 and in Figure 4.15
the relative sizes are shown. We see that the TCP, the API, and the support functions are nearly
twice as large as when compiled for the Intel x86, whereas IP, UDP and ICMP are approximately
the same size. We also see that the support functions category is larger than the API, contrary
to Table 4.2. The difference in size between the API and the support functions category is small
though.
The reason for the increase in size of the TCP module is that the 6502 does not natively
support 32-bit integers. Therefore, each 32-bit operation is expanded by the compiler into many
lines of assembler code. TCP sequence numbers are 32-bit integers and the TCP module performs
numerous sequence number computations.
The size of the TCP code can be compared to the size of TCP in other TCP/IP stacks, such
as the popular BSD TCP/IP stack for FreeBSD 4.1 and the independently derived TCP/IP stack
for Linux 2.2.10. Both are compiled for the Intel x86 with gcc and compiler optimizations turned
on. The size of the TCP implementation in lwIP is almost 6600 bytes. The object code size
of the TCP implementation in FreeBSD 4.1 is roughly 27000 bytes, which is four times as large
as in lwIP. In Linux 2.2.10, the object code size of the TCP implementation is even larger and
consists of 39000 bytes, roughly six times as much as in lwIP. The large difference in code size
between lwIP and the two other implementations arises from the fact that both the FreeBSD and
the Linux implementations contain more TCP features such as SACK [MMFR96], as well as parts
of the implementation of the BSD socket API.
The reason for not comparing the sizes of the IP implementations is that there are vastly
more features in the IP implementations of FreeBSD and Linux. For instance, both FreeBSD
and Linux include support for firewalling and tunneling in their IP implementations. Also, those
implementations support dynamic routing tables, which lwIP does not implement.
The lwIP API constitutes roughly one sixth of the size of lwIP. Since lwIP can be used
without inclusion of the API, this part can be left out when deploying lwIP in a system with very
little code memory.
Summary
• The design and implementation of a small TCP/IP stack, lwIP, that uses very little RAM
and that has a very small code footprint. The TCP/IP stack is written from scratch and
has been designed with the restrictions of a minimal client system in mind.
• The design and implementation of an API for the lwIP stack that utilizes knowledge of the
internal structure of the TCP/IP stack to reduce data copying.
• The design and implementation of a proxy based scheme for offloading a TCP implementation
in a small client system. The scheme does not require any modifications of either the
TCP sender or the TCP receiver thus making it possible to use the proxy with any TCP
implementation.
Care has been taken so that some of the end-to-end semantics of TCP are kept.
API reference
Each data type is represented as a pointer to a C struct. Knowledge of the internal structure of
the struct should not be used in application programs. Instead, the API provides functions for
modifying and extracting necessary fields.
A.1.1 Netbufs
Netbufs are buffers that are used for sending and receiving data. Internally, a netbuf is associated
with a pbuf as presented in Section 4.4.1. Netbufs can, just as pbufs, accommodate both allocated
memory and referenced memory. Allocated memory is RAM that is explicitly allocated for holding
network data, whereas referenced memory might be either application managed RAM or external
ROM. Referenced memory is useful for sending data that is not modified, such as static web pages
or images.
The data in a netbuf can be fragmented into differently sized blocks. This means that an
application must be prepared to accept fragmented data. Internally, a netbuf has a pointer to one
of the fragments in the netbuf. Two functions, netbuf_next() and netbuf_first(), are used to
manipulate this pointer.
Netbufs that have been received from the network also contain the IP address and port number
of the originator of the packet. Functions for extracting those values exist.
A.2 Buffer functions
netbuf_new()
Synopsis
Description
Allocates a netbuf structure. No buffer space is allocated when doing this, only the top level
structure. After use, the netbuf must be deallocated with netbuf_delete().
netbuf_delete()
Synopsis
Description
Deallocates a netbuf structure previously allocated by a call to the netbuf_new() function. Any
buffer memory allocated to the netbuf by calls to netbuf_alloc() is also deallocated.
Example
This example shows the basic mechanisms for using netbufs.
int
main()
{
  struct netbuf *buf;
  buf = netbuf_new();    /* allocate the netbuf structure */
  netbuf_delete(buf);    /* deallocate it again */
}
netbuf_alloc()
Synopsis
Description
Allocates buffer memory with size number of bytes for the netbuf buf. The function returns a
pointer to the allocated memory. Any memory previously allocated to the netbuf buf is deallo-
cated. The allocated memory can later be deallocated with the netbuf_free() function. Since
protocol headers are expected to precede the data when it should be sent, the function allocates
memory for protocol headers as well as for the actual data.
netbuf_free()
Synopsis
Description
Deallocates the buffer memory associated with the netbuf buf. If no buffer memory has been
allocated for the netbuf, this function does nothing.
netbuf_ref()
Synopsis
Description
Associates the external memory pointed to by the data pointer with the netbuf buf. The size of the
external memory is given by size. Any memory previously allocated to the netbuf is deallocated.
The difference between allocating memory for the netbuf with netbuf_alloc() and allocating
memory using, e.g., malloc() and referencing it with netbuf_ref() is that in the former case,
space for protocol headers is allocated as well which makes processing and sending the buffer
faster.
Example
This example shows a simple use of the netbuf_ref() function.
int
main()
{
  struct netbuf *buf;
  char string[] = "A string";

  /* create a netbuf and let it reference the string */
  buf = netbuf_new();
  netbuf_ref(buf, string, sizeof(string));

  /* deallocate netbuf */
  netbuf_delete(buf);
}
netbuf_len()
Synopsis
int netbuf_len(struct netbuf *buf)
Description
Returns the total length of the data in the netbuf buf, even if the netbuf is fragmented. For a
fragmented netbuf, the value obtained by calling this function is not the same as the size of the
first fragment in the netbuf.
netbuf_data()
Synopsis
Description
This function is used to obtain a pointer to and the length of a block of data in the netbuf buf.
The arguments data and len are result parameters that will be filled with a pointer to the data
and the length of the data pointed to. If the netbuf is fragmented, this function gives a pointer
to one of the fragments in the netbuf. The application program must use the fragment handling
functions netbuf_first() and netbuf_next() in order to reach all data in the netbuf.
See the example under netbuf_next() for how to use netbuf_data().
netbuf_next()
Synopsis
int netbuf_next(struct netbuf *buf)
Description
This function updates the internal fragment pointer in the netbuf buf so that it points to the next
fragment in the netbuf. The return value is zero if there are more fragments in the netbuf, > 0
if the fragment pointer now points to the last fragment in the netbuf, and < 0 if the fragment
pointer already pointed to the last fragment.
Example
This example shows how to use the netbuf_next() function. We assume that this is in the middle
of a function and that the variable buf is a netbuf.
/* [...] */
do {
  char *data;
  int len;
  netbuf_data(buf, &data, &len);  /* get the current fragment */
  do_something(data, len);        /* do_something() is application code */
} while(netbuf_next(buf) >= 0);
netbuf_first()
Synopsis
void netbuf_first(struct netbuf *buf)
Description
Resets the fragment pointer in the netbuf buf so that it points to the first fragment.
netbuf_copy()
Synopsis
Description
Copies all of the data from the netbuf buf into the memory pointed to by data even if the netbuf
buf is fragmented. The len parameter is an upper bound of how much data that will be copied
into the memory pointed to by data.
Example
This example shows a simple use of netbuf_copy(). Here, 200 bytes of memory is allocated on
the stack to hold data. Even if the netbuf buf has more data than 200 bytes, only 200 bytes are
copied into data.
void
example_function(struct netbuf *buf)
{
  char data[200];
  netbuf_copy(buf, data, 200);
  /* do something with the data */
}
netbuf_chain()
Synopsis
void netbuf_chain(struct netbuf *head, struct netbuf *tail)
Description
Chains the two netbufs head and tail together so that the data in tail will become the last
fragment(s) in head. The netbuf tail is deallocated and should not be used after the call to this
function.
netbuf_fromaddr()
Synopsis
struct ip_addr * netbuf_fromaddr(struct netbuf *buf)
Description
Returns the IP address of the host the netbuf buf was received from. If the netbuf has not
been received from the network, the return value of this function is undefined. The function
netbuf_fromport() can be used to obtain the port number of the remote host.
netbuf_fromport()
Synopsis
unsigned short netbuf_fromport(struct netbuf *buf)
Description
Returns the port number of the host the netbuf buf was received from. If the netbuf has not
been received from the network, the return value of this function is undefined. The function
netbuf_fromaddr() can be used to obtain the IP address of the remote host.
A.3 Network connection functions
netconn_new()
Synopsis
Description
Creates a new connection abstraction structure. The argument can be one of NETCONN_TCP or
NETCONN_UDP, yielding either a TCP or a UDP connection. No connection is established by the
call to this function and no data is sent over the network.
netconn_delete()
Synopsis
void netconn_delete(struct netconn *conn)
Description
Deallocates the netconn conn. If the connection is open, it is closed as a result of this call.
netconn_type()
Synopsis
enum netconn_type netconn_type(struct netconn *conn)
Description
Returns the type of the connection conn. This is the same type that is given as an argument to
netconn_new() and can be either NETCONN_TCP or NETCONN_UDP.
netconn_peer()
Synopsis
int netconn_peer(struct netconn *conn,
                 struct ip_addr **addr, unsigned short *port)
Description
The function netconn_peer() is used to obtain the IP address and port of the remote end of a
connection. The parameters addr and port are result parameters that are set by the function. If
the connection conn is not connected to any remote host, the results are undefined.
netconn_addr()
Synopsis
int netconn_addr(struct netconn *conn,
                 struct ip_addr **addr, unsigned short *port)
Description
This function is used to obtain the local IP address and port number of the connection conn.
54 APPENDIX A. API REFERENCE
netconn_bind()
Synopsis
int netconn_bind(struct netconn *conn,
                 struct ip_addr *addr, unsigned short port)
Description
Binds the connection conn to the local IP address addr and TCP or UDP port port. If addr is
NULL the local IP address is determined by the networking system.
netconn_connect()
Synopsis
int netconn_connect(struct netconn *conn,
                    struct ip_addr *remote_addr, unsigned short remote_port)
Description
In case of UDP, sets the remote receiver as given by remote_addr and remote_port of UDP
messages sent over the connection. For TCP, netconn_connect() opens a connection with the
remote host.
netconn_listen()
Synopsis
int netconn_listen(struct netconn *conn)
Description
Puts the TCP connection conn into the TCP LISTEN state.
netconn_accept()
Synopsis
struct netconn * netconn_accept(struct netconn *conn)
Description
Blocks the process until a connection request from a remote host arrives on the TCP connection
conn. The connection must be in the LISTEN state so netconn_listen() must be called prior
to netconn_accept(). When a connection is established with the remote host, a new connection
structure is returned.
Example
This example shows how to open a TCP server on port 2000.
int
main()
{
  struct netconn *conn, *newconn;
  conn = netconn_new(NETCONN_TCP);
  netconn_bind(conn, NULL, 2000);   /* bind to TCP port 2000 */
  netconn_listen(conn);
  newconn = netconn_accept(conn);   /* block until a connection arrives */
}
netconn_recv()
Synopsis
struct netbuf * netconn_recv(struct netconn *conn)
Description
Blocks the process while waiting for data to arrive on the connection conn. If the connection has
been closed by the remote host, NULL is returned, otherwise a netbuf containing the received data
is returned.
Example
This is a small example that shows a suggested use of the netconn_recv() function. We assume
that a connection has been established before the call to example_function().
void
example_function(struct netconn *conn)
{
  struct netbuf *buf;
  buf = netconn_recv(conn);  /* block until data has been received */
  if(buf != NULL)
    netbuf_delete(buf);      /* process and deallocate the data */
}
netconn_write()
Synopsis
int netconn_write(struct netconn *conn, void *data,
                  int len, unsigned int flags)
Description
This function is only used for TCP connections. It puts the data pointed to by data on the output
queue for the TCP connection conn. The length of the data is given by len. There is no restriction
on the length of the data. This function does not require the application to explicitly allocate
buffers, as this is taken care of by the stack. The flags parameter has two possible states, as
shown below.
#define NETCONN_NOCOPY 0x00
#define NETCONN_COPY 0x01
When passed the flag NETCONN_COPY the data is copied into internal buffers which is allocated
for the data. This allows the data to be modified directly after the call, but is inefficient both in
terms of execution time and memory usage. If the flag NETCONN_NOCOPY is used, the data is not
copied but rather referenced. The data must not be modified after the call, since the data can be
put on the retransmission queue for the connection, and stay there for an indeterminate amount
of time. This is useful when sending data that is located in ROM and therefore is immutable.
If greater control over the modifiability of the data is needed, a combination of copied and
non-copied data can be used, as seen in the example below.
Example
This example shows the basic usage of netconn_write(). Here, the variable data is assumed to
be modified later in the program, and is therefore copied into the internal buffers by passing the
flag NETCONN_COPY to netconn_write(). The text variable contains a string that will not be
modified and can therefore be sent using references instead of copying.
int
main()
{
  struct netconn *conn;
  char data[10];
  char text[] = "Static text";
  int i;

  /* set up the connection conn (details omitted) and fill data */
  for(i = 0; i < 10; i++)
    data[i] = i;
  netconn_write(conn, data, 10, NETCONN_COPY);             /* data may change later */
  netconn_write(conn, text, sizeof(text), NETCONN_NOCOPY); /* static text */
}
netconn_send()
Synopsis
int netconn_send(struct netconn *conn, struct netbuf *buf)
Description
Sends the data in the netbuf buf on the UDP connection conn. The data in the netbuf should not
be too large since IP fragmentation is not used. The data should not be larger than the maximum
transmission unit (MTU) of the outgoing network interface. Since there currently is no way of
obtaining this value, a careful approach should be taken, and the netbuf should not contain data
that is larger than some 1000 bytes.
No checking is made whether the data is sufficiently small and sending very large netbufs might
give undefined results.
Example
This example shows how to send some UDP data to UDP port 7000 on a remote host with IP
address 10.0.0.1.
int
main()
{
  struct netconn *conn;
  struct netbuf *buf;
  struct ip_addr addr;
  char *data;
  char text[] = "A static text";
  int i;
  conn = netconn_new(NETCONN_UDP);
  addr.addr = htonl(0x0a000001);          /* IP address 10.0.0.1 */
  netconn_connect(conn, &addr, 7000);
  buf = netbuf_new();
  data = netbuf_alloc(buf, sizeof(text)); /* allocate buffer memory */
  for(i = 0; i < sizeof(text); i++)
    data[i] = text[i];                    /* copy the text into the netbuf */
  netconn_send(conn, buf);
  netbuf_delete(buf);
}
netconn_close()
Synopsis
int netconn_close(struct netconn *conn)
Description
Closes the connection conn.
Appendix B
BSD socket library
This appendix provides a simple implementation of the BSD socket API using the lwIP API. The
implementation is provided as a reference only, and is not intended for use in actual programs.
There is for example no error handling.
Also, this implementation does not support the select() and poll() functions of the BSD
socket API, since the lwIP API does not have any functions that can be used to implement those. In
order to implement those functions, the BSD socket implementation would have to communicate
directly with the lwIP stack and not use the API.
int
socket(int domain, int type, int protocol)
{
  struct netconn *conn;
  int i;

  /* create a netconn */
  switch(type) {
  case SOCK_DGRAM:
    conn = netconn_new(NETCONN_UDP);
    break;
  case SOCK_STREAM:
    conn = netconn_new(NETCONN_TCP);
    break;
  }
  /* store the netconn in the sockets table and use the table index
     as the socket descriptor; sockets[] and NUM_SOCKETS are assumed
     to be defined by the library */
  for(i = 0; i < NUM_SOCKETS; i++) {
    if(sockets[i] == NULL) {
      sockets[i] = conn;
      return i;
    }
  }
  return -1;
}
int
bind(int s, struct sockaddr *name, int namelen)
{
  struct netconn *conn;
  struct ip_addr *remote_addr;
  unsigned short remote_port;

  /* pick out the IP address and port from the sockaddr, which is
     assumed to be a struct sockaddr_in */
  remote_addr = (struct ip_addr *)&((struct sockaddr_in *)name)->sin_addr;
  remote_port = ((struct sockaddr_in *)name)->sin_port;

  conn = sockets[s];
  netconn_bind(conn, remote_addr, remote_port);
  return 0;
}
int
connect(int s, struct sockaddr *name, int namelen)
{
  struct netconn *conn;
  struct ip_addr *remote_addr;
  unsigned short remote_port;

  /* pick out the IP address and port from the sockaddr, which is
     assumed to be a struct sockaddr_in */
  remote_addr = (struct ip_addr *)&((struct sockaddr_in *)name)->sin_addr;
  remote_port = ((struct sockaddr_in *)name)->sin_port;

  conn = sockets[s];
  netconn_connect(conn, remote_addr, remote_port);
  return 0;
}
int
listen(int s, int backlog)
{
netconn_listen(sockets[s]);
return 0;
}
int
accept(int s, struct sockaddr *addr, int *addrlen)
{
   struct netconn *conn, *newconn;
   struct ip_addr *naddr;
   unsigned short port;
   int i;

   conn = sockets[s];
   newconn = netconn_accept(conn);

   /* get the IP address and port of the remote host */
   netconn_peer(newconn, &naddr, &port);
   ((struct sockaddr_in *)addr)->sin_addr.s_addr = naddr->addr;
   ((struct sockaddr_in *)addr)->sin_port = htons(port);
   *addrlen = sizeof(struct sockaddr_in);

   /* put the new connection in the first free slot in the sockets
      table and return the index as the new socket descriptor */
   for(i = 0; i < (int)(sizeof(sockets) / sizeof(sockets[0])); i++) {
      if(sockets[i] == NULL) {
         sockets[i] = newconn;
         return i;
      }
   }
   return -1;
}
int
send(int s, void *data, int size, unsigned int flags)
{
   struct netconn *conn;
   struct netbuf *buf;

   conn = sockets[s];
   switch(netconn_type(conn)) {
   case NETCONN_UDP:
      /* create a buffer that references the data and send it */
      buf = netbuf_new();
      netbuf_ref(buf, data, size);
      netconn_send(conn, buf);
      netbuf_delete(buf);
      break;
   case NETCONN_TCP:
      /* for TCP, write the data to the connection */
      netconn_write(conn, data, size, NETCONN_COPY);
      break;
   }
   return size;
}
int
sendto(int s, void *data, int size, unsigned int flags,
       struct sockaddr *to, int tolen)
{
   struct netconn *conn;
   struct ip_addr *remote_addr, *addr;
   unsigned short remote_port, port;
   int ret;

   conn = sockets[s];

   /* remember the current peer of the connection */
   netconn_peer(conn, &addr, &port);

   /* connect to the remote host given in the to argument and send
      the data over the connected netconn */
   remote_addr = (struct ip_addr *)&((struct sockaddr_in *)to)->sin_addr;
   remote_port = ((struct sockaddr_in *)to)->sin_port;
   netconn_connect(conn, remote_addr, ntohs(remote_port));
   ret = send(s, data, size, flags);

   /* reconnect the netconn to the previous peer */
   netconn_connect(conn, addr, port);
   return ret;
}
int
write(int s, void *data, int size)
{
   struct netconn *conn;

   conn = sockets[s];
   switch(netconn_type(conn)) {
   case NETCONN_UDP:
      /* for UDP, writing is the same as sending */
      send(s, data, size, 0);
      break;
   case NETCONN_TCP:
      netconn_write(conn, data, size, NETCONN_COPY);
      break;
   }
   return size;
}
int
recv(int s, void *mem, int len, unsigned int flags)
{
   struct netconn *conn;
   struct netbuf *buf;
   int buflen;

   conn = sockets[s];
   buf = netconn_recv(conn);
   buflen = netbuf_len(buf);

   /* copy the contents of the received buffer into the supplied
      memory pointer mem */
   netbuf_copy(buf, mem, len);
   netbuf_delete(buf);
   return buflen < len ? buflen : len;
}
int
read(int s, void *mem, int len)
{
return recv(s, mem, len, 0);
}
int
recvfrom(int s, void *mem, int len, unsigned int flags,
         struct sockaddr *from, int *fromlen)
{
   struct netconn *conn;
   struct netbuf *buf;
   struct ip_addr *addr;
   unsigned short port;
   int buflen;

   conn = sockets[s];
   buf = netconn_recv(conn);
   buflen = netbuf_len(buf);

   /* copy the contents of the received buffer into the supplied
      memory pointer mem */
   netbuf_copy(buf, mem, len);

   /* fill in the sender's address and port */
   addr = netbuf_fromaddr(buf);
   port = netbuf_fromport(buf);
   ((struct sockaddr_in *)from)->sin_addr.s_addr = addr->addr;
   ((struct sockaddr_in *)from)->sin_port = htons(port);
   *fromlen = sizeof(struct sockaddr_in);

   netbuf_delete(buf);
   return buflen < len ? buflen : len;
}
Appendix C: Code examples
C.1 Using the API
This example shows a simple web server written using the lwIP API. It listens for connections
on TCP port 80 and sends a small static web page to each client.

#include "api.h"

/* The HTML page to serve. */
static char indexdata[] =
   "<html> \
    <body> \
    This is a small test page. \
    </body> \
    </html>";

int
main()
{
   struct netconn *conn, *newconn;

   /* Create a new TCP connection handle, bind it to port 80 on any
      local IP address, and put it into the listening state. */
   conn = netconn_new(NETCONN_TCP);
   netconn_bind(conn, NULL, 80);
   netconn_listen(conn);

   /* Loop forever. */
   while(1) {
      /* Accept a new connection. */
      newconn = netconn_accept(conn);

      /* Discard the HTTP request, send the page, and close and
         deallocate the connection. */
      netbuf_delete(netconn_recv(newconn));
      netconn_write(newconn, indexdata, sizeof(indexdata), NETCONN_NOCOPY);
      netconn_close(newconn);
      netconn_delete(newconn);
   }
}
C.2 Using the low-level TCP interface
This fragment shows the shape of a receive callback for the low-level TCP interface.

#include "tcp.h"

static err_t
http_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
{
   /* If we got a NULL pbuf in p, the remote host has closed
      the connection. */
   if(p != NULL) {
      /* process the received data in p */
   }
   return ERR_OK;
}
Appendix D: Glossary
ACK
The acknowledgment signal used by TCP.
API
Application Program Interface. A set of functions that specifies the communication between
an application program and a system service.
checksum
A value computed over the contents of a packet, essentially by summing its bytes. Used
for detection of data corruption.
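As an illustration, the basic Internet checksum computation can be sketched as below. This is the generic algorithm specified in RFC 1071, not lwIP's optimized implementation:

```c
#include <stdint.h>

/* Compute the 16-bit ones' complement Internet checksum over a
   buffer (RFC 1071): sum the data as 16-bit big-endian words, fold
   the carries back into the low 16 bits, and invert the result. */
static uint16_t
chksum(const uint8_t *data, int len)
{
   uint32_t sum = 0;
   int i;

   for(i = 0; i < len - 1; i += 2) {
      sum += ((uint32_t)data[i] << 8) | data[i + 1];
   }
   if(len & 1) {
      /* pad an odd trailing byte with a zero byte */
      sum += (uint32_t)data[len - 1] << 8;
   }
   /* fold the carries back into the low 16 bits */
   while(sum >> 16) {
      sum = (sum & 0xffff) + (sum >> 16);
   }
   return (uint16_t)~sum;
}
```

With the worked example from RFC 1071 (the bytes 00 01 f2 03 f4 f5 f6 f7), this yields the checksum 0x220d.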
congestion
Congestion occurs when a router drops packets due to full buffers, i.e., when the network is
overloaded.
datagram
A chunk of information. Analogous to a packet.
demultiplexing
The opposite of multiplexing. Extracting one of the information streams from a combined
stream of information streams.
header
Control information for a packet located at the beginning of the packet.
ICMP
Internet Control Message Protocol. An unreliable signaling protocol used together with IP.
internet
An interconnected set of networks using IP for addressing.
Internet
The global Internet.
IP
Internet Protocol. The protocol used for addressing packets in an internet.
IPv4
Internet Protocol version 4. The version of IP mainly used in the global Internet.
IPv6
Internet Protocol version 6. The next generation IP. Expands the address space from 2^32
addresses to 2^128 and also supports auto-configuration.
multiplexing
A technique that enables two or more information streams to use the same link. In the
TCP/IP case this refers to the process of, e.g., using the IP layer for many different protocols
such as UDP or TCP.
packet
A chunk of information. Analogous to a datagram.
PCB
Protocol Control Block. The data structure holding state related information of a (possibly
half-open) UDP or TCP connection.
proxy
An intermediate agent that utilizes knowledge of the transportation mechanism to enhance
performance.
RFC
Request For Comments. A paper specifying a standard or discussing various mechanisms in
the Internet.
round-trip time
The total time it takes for a packet to travel from the sender to the receiver, and for the
reply to travel back to the sender.
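TCP uses measured round-trip times to set its retransmission timeout. A sketch of the conventional estimator (Jacobson's algorithm, cited as [Jac88] in the bibliography) follows; the variable and function names are illustrative and not taken from lwIP:

```c
#include <stdlib.h>   /* for abs() */

/* Derive the retransmission timeout (RTO) from round-trip time
   samples: a smoothed average of the samples plus four times their
   mean deviation, using integer gains of 1/8 and 1/4. */
static int srtt;      /* smoothed round-trip time estimate */
static int rttvar;    /* mean deviation of the samples */

static int
update_rto(int sample)
{
   int err;

   if(srtt == 0) {
      /* the first sample initializes both estimators */
      srtt = sample;
      rttvar = sample / 2;
   } else {
      err = sample - srtt;
      srtt = srtt + err / 8;                       /* gain 1/8 on the average */
      rttvar = rttvar + (abs(err) - rttvar) / 4;   /* gain 1/4 on the deviation */
   }
   return srtt + 4 * rttvar;
}
```

The fixed-point gains allow the estimator to be computed with shifts and adds only, which matters on the small systems this thesis targets.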
router
A node in an internet. Connects two or more networks and forwards IP packets across the
connection point.
UDP
User Datagram Protocol. An unreliable datagram protocol on top of IP. Mainly used for
delay-sensitive applications such as real-time audio and video.
TCP
Transmission Control Protocol. Provides a reliable byte stream on top of IP. The most
commonly used transport protocol in today's Internet. Used for email as well as file and
web services.
TCP/IP
The internet protocol suite which includes the basic delivery protocols such as IP, UDP and
TCP as well as some application level protocols such as the email transfer protocol SMTP
and the file transfer protocol FTP.
Bibliography
[ABM95] B. Ahlgren, M. Björkman, and K. Moldeklev. The performance of a no-copy api for
communication (extended abstract). In IEEE Workshop on the Architecture and Im-
plementation of High Performance Communication Subsystems, Mystic, Connecticut,
USA, August 1995.
[APS99] M. Allman, V. Paxson, and W. Stevens. TCP congestion control. RFC 2581, Internet
Engineering Task Force, April 1999.
[BB95] A. Bakre and B. R. Badrinath. I-TCP: Indirect TCP for mobile hosts. In Proceedings
of the 15th International Conference on Distributed Computing Systems, May 1995.
[BIG+ 97] B. P. Crow, I. Widjaja, J. G. Kim, and P. T. Sakai. IEEE 802.11 wireless local
area networks. IEEE Communications Magazine, 35(9):116–126, September 1997.
[Bra89] R. Braden. Requirements for internet hosts – communication layers. RFC 1122,
Internet Engineering Task Force, October 1989.
[Bra92] R. Braden. TIME-WAIT assassination hazards in TCP. RFC 1337, Internet Engi-
neering Task Force, May 1992.
[BS97] K. Brown and S. Singh. M-TCP: TCP for mobile cellular networks. ACM Computer
Communications Review, 27(5):19–43, October 1997.
[Car96] B. Carpenter. Architectural principles of the Internet. RFC 1958, Internet Engineering
Task Force, June 1996.
[Cla82a] D. D. Clark. Modularity and efficiency in protocol implementation. RFC 817, Internet
Engineering Task Force, July 1982.
[Cla82b] D. D. Clark. Window and acknowledgement strategy in TCP. RFC 813, Internet
Engineering Task Force, July 1982.
[FH99] S. Floyd and T. Henderson. The NewReno modifications to TCP’s fast recovery
algorithm. RFC 2582, Internet Engineering Task Force, April 1999.
[FTY99] T. Faber, J. Touch, and W. Yue. The TIME-WAIT state in TCP and its effect on
busy servers. In Proceedings of IEEE INFOCOM ’99, New York, March 1999.
[HNI+ 98] J. Haartsen, M. Naghshineh, J. Inouye, O. Joeressen, and W. Allen. Bluetooth: Vision,
goals, and architecture. Mobile Computing and Communications Review, 2(4):38–45,
October 1998.
[Jac88] V. Jacobson. Congestion avoidance and control. In Proceedings of the SIGCOMM ’88
Conference, Stanford, California, August 1988.
[Jac90] V. Jacobson. 4.3BSD TCP header prediction. ACM Computer Communications Re-
view, 20(2):13–15, April 1990.
[JBB92] V. Jacobson, R. Braden, and D. Borman. TCP extensions for high performance. RFC
1323, Internet Engineering Task Force, May 1992.
[KP87] P. Karn and C. Partridge. Improving round-trip time estimates in reliable transport
protocols. In Proceedings of the SIGCOMM ’87 Conference, Stowe, Vermont, August
1987.
[KP96] J. Kay and J. Pasquale. Profiling and reducing processing overheads in TCP/IP.
IEEE/ACM Transactions on Networking, 4(6):817–828, December 1996.
[LDP99] L. Larzon, M. Degermark, and S. Pink. UDP Lite for real-time multimedia appli-
cations. In Proceedings of the IEEE International Conference of Communications,
Vancouver, British Columbia, Canada, June 1999.
[MD92] Paul E. McKenney and Ken F. Dove. Efficient demultiplexing of incoming TCP
packets. In Proceedings of the SIGCOMM ’92 Conference, pages 269–279, Baltimore,
Maryland, August 1992.
[MK90] T. Mallory and A. Kullberg. Incremental updating of the internet checksum. RFC
1141, Internet Engineering Task Force, January 1990.
[Mog92] J. Mogul. Network locality at the scale of processes. ACM Transactions on Computer
Systems, 10(2):81–109, May 1992.
[Pax97] Vern Paxson. End-to-end internet packet dynamics. In Proceedings of the SIGCOMM
’97 Conference, Cannes, France, September 1997.
[Pos80] J. Postel. User datagram protocol. RFC 768, Internet Engineering Task Force, August
1980.
[Pos81a] J. Postel. Internet control message protocol. RFC 792, Internet Engineering Task
Force, September 1981.
[Pos81b] J. Postel. Internet protocol. RFC 791, Internet Engineering Task Force, September
1981.
[Pos81c] J. Postel. Transmission control protocol. RFC 793, Internet Engineering Task Force,
September 1981.
[PS98] S. Parker and C. Schmechel. Some testing tools for TCP implementors. RFC 2398,
Internet Engineering Task Force, August 1998.
[Rij94] A. Rijsinghani. Computation of the internet checksum via incremental update. RFC
1624, Internet Engineering Task Force, May 1994.
[Shr] H. Shrikumar. IPic - a match-head sized web server. Web page, visited 2000-11-24.
URL: http://www-ccs.cs.umass.edu/~shri/iPic.html
[vB] U. von Bassewitz. cc65 - a freeware C compiler for 6502 based systems. Web page,
visited 2000-11-30.
URL: http://www.cc65.org/