
646 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003
characteristics of key applications for IP-based video, the currently used protocol infrastructure and its characteristics are introduced.
A. Applications
Before discussing the transmission of video over IP, it is necessary to take a closer look at its intended applications, whose nature determines the constraints and the protocol environment with which the video source coding has to cope.
Using IP as a transport, three major applications can currently
be identified.
• Conversational applications, such as videotelephony and
videoconferencing. Such applications are characterized
by very strict delay constraints—significantly less than
one second end-to-end latency, with less than 100 ms as
the (so far unreachable) goal. They are also limited to
point-to-point or small multipoint transmissions. Finally,
they imply the use of real-time video encoders and de-
coders, which allow the tuning of the coding parameters
in real-time, including the adaptive use of error-resilience
tools appropriate to the actual network conditions, and
often the use of feedback-based source coding tools.
However, the use of real-time encoders also limits the
maximum computational complexity, especially in the
encoder. Low delay constraints further prevent the use
of some coding tools that are optimized for high-latency
applications, such as bipredicted slices.
• The download of complete, pre-coded video streams. Here,
the bit string is transmitted as a whole, using reliable pro-
tocols such as ftp [3] or http [4]. The video coder can op-
timize the bit stream for the highest possible coding effi-
ciency, and does not have to obey restrictions in terms of
delay and error resilience. Furthermore, the video coding
process is normally not a real-time process; hence, com-
putational complexity of the encoder is also a less crit-
ical subject. Most of the traditional video coding research
somewhat implies this type of application.
• IP-based streaming. With respect to its delay characteristics, this technology lies between download and conversational applications.
There is no generally accepted definition for the term
“streaming”. Most people associate it with a transmission
service that allows the start of video playback before the
whole video bit stream has been transmitted, with an
initial delay of only a few seconds, and in a near real-time
fashion. The video stream is either pre-recorded and transmitted on demand, or a live session is compressed in real time (often in more than one representation with different bit rates) and sent over one or more multicast channels to a multitude of users. Due to the relaxed delay
constraints when compared to conversational services,
some high-delay video coding tools, such as bipredicted
slices, can be used. However, under normal conditions,
streaming services use unreliable transmission protocols,
making error control in the source and/or the channel
coding a necessity. The encoder has only limited—if
any—knowledge of the network conditions and has to
adapt the error resilience tools to a level that most users
would find acceptable. Streaming video is sent from a
single server, but may be distributed in a point-to-point,
multipoint, or even broadcast fashion. The group size
determines the possibility of the use of feedback-based
transport and coding tools.
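To give a feel for why the sub-100-ms goal for conversational services has so far been unreachable, the following sketch adds up a hypothetical end-to-end delay budget; every stage and every number is an illustrative assumption, not a measurement from any real system:

```python
# Hypothetical end-to-end latency budget for conversational video.
# All component values are illustrative assumptions.
budget_ms = {
    "camera capture":         33,  # one frame interval at 30 fps
    "encoding":               30,
    "packetization/send":      5,
    "network transit":        50,
    "receiver jitter buffer": 60,
    "decoding":               15,
    "display":                17,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"  {stage:>22}: {ms:3d} ms")
print(f"total end-to-end delay: {total} ms")  # 210 ms, well above 100 ms
```

Even with these optimistic per-stage assumptions, the sum lands around 210 ms, which illustrates why well under one second is achievable but 100 ms is not.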
This paper is mostly concerned with conversational services,
because here techniques from both the source coding and the
channel coding must be employed, and their interaction can
be shown. In addition, most research within JVT with respect
to IP-transport was performed assuming such an application.
Many of the discussions also apply to a streaming environment.
Readers primarily interested in download-type applications
should refer to papers that are concerned with coding efficiency
in this special issue [5].
IP networks can currently be found in two flavors: unmanaged IP networks, with the Internet as the most prominent example, and managed IP networks such as the wide-area networks of some long-distance telephony companies. A third category is emerging: wireless IP networks based on third-generation mobile networks. (Please see [2] in this Special Issue for an in-depth discussion.)
All three network types have somewhat different characteris-
tics in terms of the maximum transfer unit size (MTU size), the
probability for bit errors in packets, and the need to obey the Transmission Control Protocol (TCP) traffic paradigm.
1) MTU Size: The MTU size is the largest size of a packet
that can be transmitted without being split/recombined on the
transport and network layer. It is generally advisable to keep coded slice sizes as close as possible to, but never larger than, the MTU size, because doing so: 1) optimizes the payload/header overhead ratio and 2) minimizes the loss probability of a coded slice. If a slice is fragmented on the network/transport layer, the loss of a single fragment causes those layers to discard all other fragments belonging to that slice. The end-to-end MTU size
of a transmission path between two IP nodes is very difficult
to identify, and may change dynamically during a connection.
However, most research assumes MTU sizes of around 1500
bytes for wireline IP links (because of the maximum size of an
Ethernet packet). In a wireless environment, the MTU size is typically considerably smaller; most research, including JVT's wireless common conditions, assumes an MTU size of around 100 bytes.
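The two arguments above can be made concrete with a small sketch. It assumes a fixed 40-byte IP/UDP/RTP header per packet (a common rule-of-thumb estimate); the slice sizes are illustrative, not taken from any particular codec:

```python
# Sketch: why coded-slice sizes should approach, but never exceed, the MTU.
# Assumes a fixed 40-byte IP/UDP/RTP header per packet (a common estimate).

HEADER_BYTES = 40

def overhead_ratio(payload_bytes: int) -> float:
    """Header overhead as a fraction of the whole packet."""
    return HEADER_BYTES / (HEADER_BYTES + payload_bytes)

def fragments_needed(slice_bytes: int, mtu: int) -> int:
    """Fragments required if a coded slice exceeds the per-packet payload."""
    payload_per_packet = mtu - HEADER_BYTES
    return -(-slice_bytes // payload_per_packet)  # ceiling division

# Wireline case: ~1500-byte MTU; a 1460-byte slice fits in one packet.
print(f"wireline overhead: {overhead_ratio(1460):.1%}")            # ~2.7%
print(f"fragments, 1460 B at MTU 1500: {fragments_needed(1460, 1500)}")  # 1

# Wireless case: ~100-byte MTU; the same slice needs many fragments,
# and losing any single fragment discards the whole slice.
print(f"wireless overhead: {overhead_ratio(60):.1%}")              # 40.0%
print(f"fragments, 1460 B at MTU 100: {fragments_needed(1460, 100)}")    # 25
```

The wireless numbers show both effects at once: per-packet overhead grows from under 3% to 40%, and a slice sized for the wireline MTU would be split into 25 fragments, any one of whose losses destroys the slice.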
2) Bit Errors: Bit-error probabilities of today’s wireline net-
works are so low that, within the scope of this work, they can
be safely ignored. (Please see [2] for a discussion on how the
H.264 test model handles the significantly higher bit error rates
found in wireless networks.)
3) Rate Control and TCP Traffic Paradigm: Since the big
Internet Meltdown of the late 1980s, the transport protocol TCP
[6], which is used to carry most Internet content such as email
and Web traffic, obeys the so-called TCP traffic paradigm [7].
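The essential behavior of that paradigm, halving the sending rate on loss and growing it slowly otherwise, can be sketched as a simplified additive-increase/multiplicative-decrease (AIMD) loop; the constants and loss observations below are illustrative assumptions, not TCP's actual parameters:

```python
# Simplified AIMD sketch of the TCP traffic paradigm: halve the rate when
# the observed loss rate exceeds a threshold, otherwise grow it linearly.
# All constants are illustrative assumptions.

def next_rate(rate_kbps: float, loss_rate: float,
              loss_threshold: float = 0.01,
              increase_kbps: float = 50.0) -> float:
    if loss_rate > loss_threshold:
        return rate_kbps / 2          # multiplicative decrease on loss
    return rate_kbps + increase_kbps  # additive increase otherwise

rate = 1000.0
for loss in [0.0, 0.0, 0.05, 0.0, 0.0]:  # hypothetical loss observations
    rate = next_rate(rate, loss)
    print(f"loss={loss:.2f} -> rate={rate:.0f} kbps")
```

The resulting sawtooth (1050, 1100, then a drop to 550 kbps after the loss event) is exactly the behavior a video sender sharing links with TCP flows is expected to mimic.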
It would be beyond the scope of this paper to discuss it in detail
but, in short, the TCP traffic paradigm mandates that a sender reduce its sending bit rate by half (as a result of an adjustment of the TCP congestion window) as soon as it observes a packet loss rate
above a certain threshold. Once the packet loss rate drops below