iSCSI: The Universal Storage Connection
Copyright
Credits and Disclaimer
Preface
Organization
Chapter 1. The Background of SCSI
SCSI Bus Interconnect
Fibre Channel Interconnect
iSCSI Interconnect
File Servers and NAS
Chapter Summary
Chapter 2. The Value and Position of iSCSI
To the Reader
The Home Office
The Small Office
The Midrange
The High End
FC and iSCSI
Chapter Summary
Chapter 3. The History of iSCSI
To the Reader
SCSI over TCP/IP
Cisco and IBM's Joint Effort
iSCSI and IETF
The End of the Story
Chapter Summary
Chapter 4. An Overview of iSCSI
To the Reader
TCP/IP
iSCSI-Related Protocol Layers
Sessions
Protocol Data Unit (PDU) Structure
iSCSI and TOE Integration on a Chip or HBA
Checksums and CRC (Digests)
Naming and Addressing
Chapter Summary
Chapter 5. Session Establishment
To the Reader
Introduction to the Login Process
Login and Session Establishment
Login PDUs
iSCSI Sessions
Login Keywords
Discovery Session
Chapter Summary
Chapter 6. Text Commands and Keyword Processing
To the Reader
Text Requests and Responses
Text Keywords and Negotiation
Chapter Summary
Chapter 7. Session Management
To the Reader
Initiator Session ID
Connection Establishment
Data Travel Direction
Sequencing
Resending Data or Status
Chapter Summary
Chapter 8. Command and Data Ordering and Flow
To the Reader
Command Ordering
Command Windowing
Initiator Task Tag
Data Ordering
Target Transfer Tag
Data Placement (a Form of RDMA)
Chapter Summary
Chapter 9. Structure of iSCSI and Relationship to SCSI
To the Reader
iSCSI Structure and SCSI Relationship
SCSI Nexus
Chapter Summary
Chapter 10. Task Management
To the Reader
Copyright
The authors and publisher have taken care in the preparation of this book, but
make no expressed or implied warranty of any kind and assume no
responsibility for errors or omissions. No liability is assumed for incidental or
consequential damages in connection with or arising out of the use of the
information or programs contained herein.
The publisher offers discounts on this book when ordered in quantity for bulk purchases and special sales. For more information, please contact:
(800) 382-3419
International Sales: (317) 581-3793
For information on obtaining permission for use of material from this work,
please submit a written request to:
Boston, MA 02116
Dedication
This book is dedicated to my family, who had to put up with me during its writing: my wife Cathy and my children Jared, Jeffrey, and Joanne.
Credits and Disclaimer
Much of the information in this book has been obtained from the IETF drafts
relating to iSCSI. That information has been edited and interpreted; however,
any discrepancies between this book and the IETF drafts/standards should be
resolved in favor of the IETF drafts/standards. This book is written against the
Internet draft draft-ietf-ips-iscsi-18.
The IETF IPS Internet drafts for iSCSI, and related drafts, from which
information has been extracted and referenced, have the following copyright
statement:
Copyright © The Internet Society 2001, 2002. All Rights Reserved. This
document [The various IETF IPS Internet Drafts for iSCSI, and related
drafts] and translations of it may be copied and furnished to others, and
derivative works that comment on or otherwise explain it or assist in its
implementation may be prepared, copied, published and distributed, in
whole or in part, without restriction of any kind, provided that the above
copyright notice and this paragraph are included on all such copies and
derivative works. However, this document itself [The IETF IPS Internet
Drafts for iSCSI, and related drafts] may not be modified in any way,
such as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the purpose
of developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be followed, or
as required to translate it into languages other than English.
Note: The above is the copyright statement found within the various IETF
documents from which much of the information in this book was obtained. It is
not the copyright statement for this book.
Credit should go to the many individuals who make up the IETF ips workgroup
(iSCSI track) for the many contributions that have been made on the
"reflector," which has permitted the iSCSI related drafts to reach their current
state.
The IETF drafts are always in a state of refinement. Thus, what was correct
when this book was written may be out of date when the book is read. The
reader is cautioned to use the information here as important background
material and to use only the current IETF iSCSI drafts as the correct version.
A very big thank you is also sent to Julian Satran not only for the countless
hours he has spent being the editor and primary author of the main iSCSI
draft, but also for the hours we have spent together working out details and
implementation approaches of the iSCSI protocols. Julian led the IBM Haifa
Research team that did so much of the early work on iSCSI. He is a man with a
large intellect and the willingness to share it with others in a most gracious
manner. Thank you, Julian.
Also, my thanks go to the following folks for their major contributions to the
main iSCSI draft and several related drafts, as well as to some key people
from the early days when the effort was still called "SCSI over TCP/IP."
Mark Bakke, for his work on SCSI and the iSCSI MIB draft, the Naming
and Discovery draft, the SLP draft, and the NamePrep and StringPrep
drafts and their relationship to iSCSI naming.
Marjorie Krueger, for her work on the SCSI and iSCSI MIB draft, the
Naming and Discovery draft, and the iSCSI Requirements draft.
Jim Hafner, for his work on the Naming and Discovery draft and for
keeping all our work consistent with the SCSI model.
Prasenjit Sarkar, for his leadership on the iSCSI Boot draft and his
leading-edge iSCSI implementation work in the initial SCSI-over-TCP/IP
prototyping and measurements, as well as for building the IBM 200i
prototype.
Kaladhar Voruganti, for his leadership on the iSCSI Naming and Discovery
draft and his leading-edge iSCSI implementation work in the initial SCSI-
over-TCP/IP prototyping and measurements in the IBM 200i prototype.
Joshua Tseng, for his work on the Naming and Discovery draft and for his
leadership on the iSNS draft.
Bernard Aboba, for his leading effort and work on the IP Storage Security
draft.
John Sondeno, for his careful reviews and edits of this manuscript.
Steven Hetzler, for his leading effort in getting attention within IBM for
pursuit of SCSI over TCP/IP and for leading an IBM Almaden Research
team that built some of the initial demonstration projects that proved the
concept.
Daniel Smith, for his effort in some of the initial work on the IBM
prototype of SCSI over TCP/IP.
Bill Kabelac, for his effort in some of the initial work on the IBM
prototype of SCSI over TCP/IP.
Jai Menon, for his work with Clod Barrera to ensure that the SCSI-over-
TCP/IP project was continued within IBM and for being instrumental in
the agreement with Cisco for co-authoring the original draft.
Clod Barrera, for his work with Jai Menon and for ensuring that the
SCSI-over-TCP/IP project was continued within IBM Haifa and Almaden
Research, and for being instrumental in the agreement with Cisco for co-
authoring the original draft. Additional thanks are appropriate for Clod
because of the continued support that he gave me as the IBM iSCSI
projects were being defined.
Manoj Naik, for his work in creating one of the very first working SCSI-
over-TCP/IP prototypes.
Andy Bechtolsheim, for his vision of bringing Cisco into the iSCSI
venture with IBM.
Costa Sapuntzakis, for his initial work with IBM in developing the
Cisco/IBM proposal for the iSCSI protocol.
Tom Clark, for his careful review of this manuscript and his very useful
suggestions.
Elizabeth Rodriguez, for her effort as co-chair of the IETF ips workgroup
and for her efforts in editing this manuscript.
John Kuhn, for his management of the 200i project and for all the
techniques he used through the difficult process of bringing out a
paradigm-changing product within IBM.
John Dowdy, for his key planning and coordination in getting the IBM
200i target storage controller shipped and updated through all the
different versions of the specification.
Efri Zeidner, for building an early primitive target and initiator test bed
using a connection-per-LU model that helped to put together the case for
TCP.
Kalman Meth, for his help exploring different variants of the early
protocol and for his heavy involvement in writing the version that was
first submitted to IETF.
Ofer Biran, for his expertise in building a good security story and for his
work on the main iSCSI drafts.
Micky Rodeh and Joseph Raviv, for their support of the whole project
from the outset, their agreement to fund it, and the energy they spent
convincing everyone that iSCSI is good business.
Preface
This book is a guide to understanding Internet SCSI (iSCSI) and where it fits
in the world. It contains discussions of the marketplace where appropriate and
of some technology competitors, such as Fibre Channel. However, mostly there will be positioning of the various technologies to emphasize their appropriate strengths. iSCSI is based on such a ubiquitous network technology (TCP/IP) that it seems to play in many different areas that are currently dominated by other technologies. Therefore, readers need to view all of iSCSI's capabilities and determine its applicability to the areas in which they are interested.
Since iSCSI is only a transport, that is, a carrier of the SCSI protocol, there is
no involved discussion of SCSI itself. Many parts of the book are general
enough that a thorough knowledge of SCSI is not needed. There are, however,
more detailed parts of the book where SCSI knowledge would be helpful.
I wrote this book to provide both the manager and the technician with a useful
understanding of the technology. Product marketing and strategy professionals
should also find the information useful and meaningful. The technician should
view this book as a primer, in which the iSCSI technology is discussed with
enough depth that the IETF iSCSI documents should be readily
understandable. Those who want to understand and build a product based on
iSCSI should find this book to be a must-read, especially if they plan to dive
down into the details of the IETF iSCSI drafts/standards documents.
The book begins with a general background of the market and an answer to
why iSCSI is of interest. A taxonomy of the various markets is given, along
with an explanation of how iSCSI fits into them. This is followed by a short
history of iSCSI so that the reader can get a sense of what propelled its
development.
Next the book heads into the technology itself, with an overview that includes
iSCSI layering. This shows the use of the underpinning TCP/IP technology, the
concept of a session, and the structure of the message units. Various other key
concepts are introduced here to ensure that the reader knows not only the
importance of data integrity to storage technology, but also that new hardware
is being introduced specifically to address bandwidth and latency issues. A few
pages are spent explaining the iSCSI naming conventions, because of their
major significance to the use of the technology.
Following the discussion of iSCSI naming conventions, the book takes the
reader through the login process and the identification and option negotiation
process. These processes are key in the establishment of a communication
path between the host system and the storage controller. The process of
sequencing the commands and data, as well as controlling the flow of
commands and data, is reviewed.
The various forms of task and error management are explained in a very
technical discussion. The detail and technical depth build from that point to the
end of the book. Finally the reader is taken through the various companion
technologies that iSCSI uses to complete its suite of capabilities.
The main part of the book concludes with an explanation of what hardware
vendors are doing to permit direct memory placement of iSCSI messages
without additional main processor involvement.
Appendix A contains most of the truly technical details of the iSCSI protocol.
The message units are presented in alphabetical order for ease of reference.
Appendix B contains a compact listing of the various negotiation keywords and
values.
Readers may forget from time to time the meanings of various iSCSI and SCSI
terms, so a glossary is presented in Appendix E. As a further aid I have
included in Appendix F the various acronyms used throughout this book and
many of the referenced documents, especially the base IETF iSCSI drafts.
Finally, Appendix G contains the various reference sources, along with their
Web page locators (in most cases). Speaking of references, bracketed citations,
such as [SAM2], are fully referenced in this appendix.
Chapter 1. The Background of SCSI
SCSI Bus Interconnect
A SCSI bus permits hard disks, tape drives, tape libraries, printers, scanners,
CD-ROMs, DVDs, and the like to be connected to server systems. It can be
considered a general interconnection technique that permits devices of many
different types to interoperate with computer systems. (See Figure 1-1.)
The protocol used on the SCSI bus is the SCSI protocol. It defines how a SCSI device can be addressed, commanded to perform some operation, and made to give or take data to or from the (host) computing system. The operational commands are defined by a data structure called a command descriptor block (CDB). For example, a read command would have a CDB that contained an "opcode" defined by the protocol to mean "read." It would also contain information about where to get the data (e.g., the block location on the disk) and miscellaneous flags to further define the operation.
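To make the CDB idea concrete, here is a minimal sketch (mine, not from the book or any SCSI library) of building a 10-byte READ(10) CDB in Python. The 0x28 opcode and the field layout follow the SCSI block commands; the helper name and the zeroed flag, group, and control bytes are simplifications for illustration.

import struct

def build_read10_cdb(lba: int, num_blocks: int) -> bytes:
    """Build a simplified 10-byte SCSI READ(10) CDB.

    Byte 0 is the opcode (0x28 means READ(10)), bytes 2-5 carry the
    logical block address, and bytes 7-8 carry the transfer length in
    blocks.  Flag, group, and control bytes are left at zero here.
    """
    return struct.pack(">BBIBHB",
                       0x28,        # opcode: READ(10)
                       0,           # flag bits (e.g., DPO/FUA), zeroed
                       lba,         # starting logical block address
                       0,           # group number, zeroed
                       num_blocks,  # transfer length in blocks
                       0)           # control byte

cdb = build_read10_cdb(lba=2048, num_blocks=8)
assert len(cdb) == 10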
The protocol that defines how a SCSI bus is operated also defines how to
address the various units to which the CDB will be delivered. Generally, the addressing is performed by presenting the address on the hardware lines of the SCSI bus. This addressing technique selects a particular SCSI device, which
may then be subdivided into one or more logical units (LUs). An LU is an
abstract concept that can represent various real objects such as tapes,
printers, and scanners.
Each LU is given an address. This is a simple number called the logical unit
number (LUN). Thus, the SCSI protocol handles the addressing of both the
SCSI device and the LU. (Note: "LUN," though technically incorrect, will often
be used when "LU" is meant.) Servers may connect to many SCSI buses; in
turn the SCSI buses can each connect to a number of SCSI devices, and each
SCSI device can contain a number of LUs (8, 16, 32, etc.). Therefore, the total
number of SCSI entities (LUs) attached to a system can be very large. (See
Figure 1-2.)
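As a toy illustration of the addressing hierarchy just described, the following Python fragment uses made-up counts of buses, devices, and LUs per device to show how quickly the total LU count grows; none of the specific numbers come from the book.

from dataclasses import dataclass

@dataclass(frozen=True)
class ScsiAddress:
    bus: int        # which SCSI bus on the host
    target_id: int  # SCSI device (target) selected on that bus
    lun: int        # logical unit number within the device

buses, devices_per_bus, lus_per_device = 4, 15, 16
all_lus = [ScsiAddress(b, t, l)
           for b in range(buses)
           for t in range(devices_per_bus)
           for l in range(lus_per_device)]
print(len(all_lus))  # 960 LUs visible to one server in this hypothetical setup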
The next thing to consider is what happens when many computers are in the
same location. If there are numerous disks (LUs) for each system, this
configuration creates a very large grouping of storage units. Many installations
group their servers and storage separately and put appropriately trained
personnel in each area. These people are usually skilled in handling issues
with either the computer system or the storage.
One of the most prevalent issues for the storage specialist is supplying the
proper amount of storage to the appropriate systems. As systems are actually
used, the amount of storage originally planned for them can vary: there may be either too much or too little. Taking storage from one system's SCSI bus and moving it to another system's SCSI bus can be a major disruptive problem, often requiring the various systems to be rebooted. Users want a pool of storage that can be assigned to the servers in a nondisruptive manner as the need arises.
Another issue with the SCSI bus is that it has distance limitations varying from
1.5 to 25 meters, depending on the bus type (yes, there are multiple types).
The bus type has to be matched with the requirements of the host and the
SCSI (storage) devices (often called storage controllers), which seriously limits
the amount of pooling a SCSI bus can provide.
Further, many SCSI bus storage devices can have no more than one bus
connected to them, and unless high-end storage devices are used, one
generally has at most two SCSI bus connections per storage device. In that
case the storage devices have at most two different host systems that might
share the various LUs within the SCSI devices. (See Figure 1-3.)
Often the critical host systems want a primary and a secondary connection to
the storage devices so that they have an alternate path in case of connection
or bus failure. This results in additional problems for systems that want
alternate paths to the storage and, at the same time, share the storage
controllers with other hosts (which might be part of a failover-capable cluster).
Often an installation requires a cluster made up of more than two hosts, and it
uses a process called file sharing via a shared file system (e.g., Veritas
Clustered File System) or a shared database system (e.g., Oracle Cluster
Database). Often this is not possible without the expense of a mainframe/
enterprise-class storage controller, which usually permits many SCSI bus
connections but brings the installation into a whole new price range. (See
Figure 1-4.)
The term "logical connection" is used because Fibre Channel (FC) components
can be interconnected via hubs and switches. These interconnections make up
a network and thus have many of the characteristics found in any network.
The FC network is referred to as an FC storage area network (SAN). However,
unlike in an Internet Protocol (IP) network, basic management capability is
missing in Fibre Channel. This is being rectified, but the administrator of an IP
network cannot now, and probably never will be able to, use the same network
management tools on an FC network that are used on an IP network. This
requires duplicate training costs for the FC network administrator and the IP
network administrator. These costs are in addition to the costs associated with
the actual storage management duties of the storage administrator.
[*]There is at least one important exception: the University of New Hampshire, which has become
an important center for interoperability testing for Fibre Channel (and recently for iSCSI).
1. Fibre Channel does not yet replace any other curriculum item.
That the main university servers are not Fibre Channel connected is a problem
currently being addressed. However, the professors' local systems, which have
significant budget issues, will probably be the last to be updated.
There is another solution to the problem of training, and that is the hiring of
service companies that plan and install the FC networks. These companies also
train customers to take over the day-to-day operations, but remain on call
whenever needed to do fault isolation or to expand the network. Service
companies such as IBM Global Services (IGS) and Electronic Data Systems
(EDS) are also offering ongoing operation services.
The total cost of ownership (TCO) with Fibre Channel is very high compared to
that with IP networks. This applies not only to the price of FC components,
which are significantly more expensive than corresponding IP components, but
also to operation and maintenance. The cost of training personnel internally or
hiring a service company to operate and maintain the FC network is a
significant addition to the TCO.
iSCSI Interconnect
The iSCSI (Internet SCSI) protocol was created in order to reduce the TCO of
shared storage solutions by reducing the initial outlay for networking, training,
and fabric management software. To this end a working group within the IETF
(Internet Engineering Task Force) Standards Group was established.
iSCSI has the capability to tie together a company's systems and storage,
which may be spread across a campus-wide environment, using the company's
interconnected local area networks (LANs), also known as intranets. This
applies not only to the company's collection of servers but also to their desktop
and laptop systems.
[*]
"Sawing" is a term used to describe the action of the voice coil on a disk drive that moves the
recording heads back and forth across the surface of the disk. The resultant noise often sounds like
sawing.
Data suggest that 500MHz Pentium systems can operate the normal host
TCP/IP (Transmission Control Protocol over Internet Protocol) stacks at 100
Mb/s using less than 10% of CPU resources. These resources will hardly be
missed if the I/O arrives in a timely manner. Likewise we can expect the
desktop systems shipping in the coming year and beyond to be on the order of
1.5 to 3 GHz. This means that, for 30 megabyte-per-second (MB/s) I/O
requirements (approximately 300 Mb/s), desktop systems will use about the
same, or less, processor time as they previously consumed on 500MHz desktop
systems using 100Mb/s links (less than 10%). Most users would be very happy
if their desktops could sustain an I/O rate of 30 MB/s. (Currently desktops
average less than 10 MB/s.)
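As a rough check of this scaling argument, the following back-of-envelope calculation assumes (as a simplification of my own, not a statement from the book) that TCP/IP processing cost grows linearly with link rate and shrinks linearly with CPU clock; only the 10%, 500MHz, and 100Mb/s baseline comes from the text above.

def cpu_fraction(link_mbps, clock_ghz,
                 base_fraction=0.10, base_link_mbps=100, base_clock_ghz=0.5):
    # Scale the baseline CPU fraction by link rate and inverse clock speed.
    return base_fraction * (link_mbps / base_link_mbps) * (base_clock_ghz / clock_ghz)

print(cpu_fraction(300, 1.5))  # about 0.10: same overhead as the old desktop
print(cpu_fraction(300, 3.0))  # 0.05: half the old overhead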
The important point here is that iSCSI for desktops and laptops makes sense
even if no special hardware is dedicated to its use. This is a significant plus for
iSCSI versus Fibre Channel, since Fibre Channel requires special hardware and
is therefore unlikely to be deployed on desktop and laptop systems. (See
Figure 1-6.)
The real competition between Fibre Channel and iSCSI will occur on server-
class systems. These systems are able to move data (read and write) at up to
2Gb/s speeds. These FC connections require special FC chips and host bus
adapters (HBAs). As a rule, these HBAs are very expensive (compared to
NICs), but they permit servers to send their SCSI CDBs to SCSI target devices
and LUs at very high speed and at very low processor overhead. Therefore, if
iSCSI is to be competitive in the server environment, it too will need specially
built chips and HBAs. Moreover, these chips and HBAs will need to have TCP/IP
offload engines (TOEs) along with the iSCSI function. The iSCSI function can
be located in the device driver, the HBA, or the chip, and, in one way or
another, it will need to interface directly with the TOE and thereby perform all
the TCP/IP processing on the chip or HBA, not on the host system.
Some people believe that the price of FC networks will fall to match that of IP
networks. I believe that will not occur for quite a while, since most FC sales are at the very high end of the market, where FC vendors are very entrenched. It therefore seems foolish for them to sacrifice their current profit margins by fighting for customers in the middle to low end of the market (against iSCSI), where there are no FC-trained personnel anyway. I believe that FC prices will go down significantly when iSCSI becomes a threat at the high end of the market, which
won't happen for some time.
Studies conducted by IBM and a number of other vendors have concluded that
iSCSI can perform at gigabit line speed, with overheads as low as those of
Fibre Channel, as long as it has iSCSI and TCP/IP hardware assist in HBAs or
chips. It is expected that the price of gigabit-speed iSCSI HBAs will be
significantly lower than that of FC HBAs. It is also felt that two 1Gb iSCSI
HBAs will have a significantly lower combined price than current 2Gb FC HBAs.
Even though iSCSI HBAs and chips will be able to operate at link speed, it is
expected that their latency will be slightly higher than that of Fibre Channel.
This difference is considered to be less than 10 microseconds, which, when
compared to the time for I/O processing, is negligible. iSCSI's greater latency
is caused by the greater amount of processing to be done within the iSCSI chip
to support TCP. Thus, there is some impact from the additional work needed,
even if supported by a chip. A key future vendor value-add will be how well a chip is able to run its processes in parallel and thus reduce the latency. This is not to
say that the latency of iSCSI chips will be unacceptable. In fact, it is believed
that it will be small enough not to be noticeable in most normal operations.
An odd thing about tape is that almost everyone wants to be able to use it
(usually for backup) but almost no one wants the tape library nearby. iSCSI
provides interconnection to tape libraries at a great distance from the hosts that are writing data to them. This permits customers to place their tape libraries in
secure backup centers, such as "Iron Mountain." A number of people have said
that this "at distance" tape backup will be iSCSI's killer app.
At the bottom line, iSCSI is all about giving the customer the type of
interconnect to storage that they have been requesting: a network-connected
storage configuration made up of components that the customer can buy from
many different places, whose purchase price is low, and whose operation is
familiar to many people (especially computer science graduates). They also get
a network they can configure and operate via standard network management
tools, thereby keeping the TCO low. Customers do not have to invest in a
totally new wiring installation, and they appreciate the fact that they can use
Cat. 5 cable, which is already installed. They like the way that iSCSI can
seamlessly operate, not only from server to local storage devices but also
across campuses as well as remotely via WANs.
These customers can use iSCSI to interconnect remote sites, which permits
mirrored backup and recovery capability, as well as a remote connection to
their tape libraries. (See Figure 1-7.) On top of all that, iSCSI will be operating
on low-end systems and on high-end systems with performance as good as
what FC networks can provide. If that is not enough, it also comes with built-in
Internet Protocol security (IPsec), which the customer can enable whenever
using unsecured networks.
File Servers and NAS
For over a decade now, there has been the concept of file serving. It begins with the idea that a host can obtain its file storage from a system remote from the host itself. Sun defined a protocol called Network File System (NFS) that was
designed to operate on the IP network. IBM and Microsoft together defined a
protocol based on something they called Server Message Block (SMB).
Microsoft called its version LAN Manager; IBM called its version LAN Server.
The original SMB protocol ran only on small local networks. It was unable to
operate seamlessly with the Internet and hence was generally limited to small
LANs. Microsoft updated SMB to make it capable of operating on IP networks.
It is now called the Common Internet File System (CIFS).
It should be noted that Novell created a file server protocol to compete with
IBM and Microsoft.
A file server protocol places a file system "stub" on each host, which acts as a
client of the target file server. Like a normal file system, the file system stub is
given control by the OS; however, it simply forwards the host's file system
request to the remote file server for handling. The actual storage is at the file
server.
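The following toy sketch (not NFS, CIFS, or any real protocol) illustrates the stub idea: the local stub looks like a file system to the OS but owns no storage and simply forwards each request to a remote file server object. All class names and the sample file are invented for illustration.

class RemoteFileServer:
    def __init__(self):
        # Stands in for the file server's own storage.
        self.files = {"/shared/report.txt": b"quarterly numbers"}

    def read(self, path: str) -> bytes:
        return self.files[path]

class FileSystemStub:
    def __init__(self, server: RemoteFileServer):
        self.server = server  # stands in for the network connection

    def read(self, path: str) -> bytes:
        # No local storage: the host's request is simply forwarded for handling.
        return self.server.read(path)

stub = FileSystemStub(RemoteFileServer())
print(stub.read("/shared/report.txt"))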
File serving began as a means to share files between peer computer systems,
but users soon started dedicating systems to file serving only. This was the
beginning of what we now call a network attached storage (NAS) appliance.
Various vendors started specializing in NAS appliances, and today this is a very
hot market. These appliances generally support NFS protocols, CIFS protocols,
Novell protocols, or some combination. Since NAS appliances operate on IP networks,
many people see them as an alternative to iSCSI (or vice versa). In some
ways they are, but they are significantly different, which makes one better
than the other in various environments. We will cover these areas later in this
book.
Chapter Summary
In this chapter we discussed the various types of hard drives and the types of interconnect they have with the host systems. We also discussed their
applicable environment and their limitations. This information is highlighted
below.
SCSI drives are connected to a host via a SCSI bus and use the SCSI
protocol.
The SCSI command descriptor block (CDB) is a key element of the SCSI
protocol.
The real or logical disk drive that the host talks to is a logical unit (LU).
SCSI bus distance limitations vary from 1.5 to 25 meters depending on the
type of cable needed by the host or drive.
Non-enterprise storage controllers usually have only one or two SCSI bus
connections.
Enterprise storage controllers usually have more than two SCSI bus
connections.
Fibre Channel requires its own fabric management software and cannot
use the standard IP network management tools.
Personnel trained in Fibre Channel are scarce, and companies are pirating
employees from each other.
iSCSI can use much of the storage management software that was
developed for Fibre Channel.
iSCSI will work not only with server systems but also with desktop and
laptop systems via currently installed Cat. 5 cables and 10/100BaseT as
well as 10/100/1000BaseT NICs.
Desktop and laptop systems will probably be very happy even if they
utilize only up to 300 Mb/s on the 1000Mb/s-capable Cat. 5 cable.
The prices of iSCSI HBAs are currently significantly less than those of FC
HBAs.
FC prices won't fall significantly until iSCSI becomes a threat at the high
end of the market.
Tape backup is likely to become the killer app for iSCSI in the future.
NAS and iSCSI storage can be located on the same network; though they
have overlapping capabilities, they also each have capabilities that the
other does not (which will be discussed later in this book).
Chapter 2. The Value and Position of iSCSI
To the Reader
The Home Office
The Small Office
The Midrange
The High End
FC and iSCSI
Chapter Summary
To the Reader
This chapter will take you through the different environments in which
network-connected storage may be appropriate. Because market planners and
engineers may have different views on the potential market for iSCSI
products, we will discuss what the market looks like, in hopes of bridging the
divergent views.
Small installations are called SoHo (Small office, Home office) environments.
In such environments customers will have one or more desktop systems, which
they are tired of constantly opening up to install storage devices. They want
easy interconnects that permit them to operate with external storage as fast
as they can operate with internal storage.
The Home Office
Generally the home office will have computers that are connected on small
locally attached Ethernet 100Mb/s links. These systems typically have more
processing power than they have storage access capability. These installations
will find value in placing their storage in a central location, dynamically adding
it to their personal computer systems with a simple plug-and-play
configuration, thereby obtaining additional storage without having to open up
their systems.
For the home office, vendors are bringing to market a simple disk controller
that can attach from one to four (or more) low-cost ATA desktop-class drives.
These controllers and drives will be purchased at local computer superstores
for very low prices. Moreover, the customer will be able to buy one or two
drives initially with the basic controller and then add drives whenever they
wish.
Customers can now purchase 10/100/1000 Ethernet cards that can operate
over the inexpensive Cat. 5 Ethernet cable already installed for their existing
10/100Mb/s network.
Prices for 10/100/1000 Ethernet NICs and switches are dropping rapidly. In
early 2002, 10/100/1000 Ethernet NICs cost $60. The then current 10/100
Ethernet NICs cost only $30, down from $60 just a year previously. It seems
reasonable to assume that the 10/100/1000 Ethernet NIC will soon be the
default in most desktop systems, which means that the home office will have
gigabit capability and processors fast enough to at least utilize 300 Mb/s. This
will give desktop home office systems as much storage access as they can use.
Usually home offices obtain all their software either from the OS that came
with the unit or from a local computer store. There is almost no software in
this environment that knows how to work with shared files, and it is very rare
to see a file server in this environment.
Home offices use peer-to-peer file-sharing functions that come with the OS to
permit one user to operate on a file created by another (serial sharing).
However, the storage is considered to belong to the system to which it is
attached. As a rule, when one home system runs out of space, owners do not
use the space on another system but instead upgrade the storage on each
system independently as needed.
iSCSI will permit home office users to set up their own external storage pool
connected via a LAN. The owner will then assign new logical or physical hard
disk drives (HDDs) to each host system without needing to open or replace
them. (See Figure 2-1.)
We have been talking as if the home office had more than one host system.
This is because owners of home offices tend to keep their old systems, which
have old data and applications that still work. Also, the home/family use dynamic (the multiple-computer family) is often at work. Many times there are two, three, or more computers in the same family (one for each adult and one for the children), but only one person is responsible for maintaining them all. In
these environments a shared pool of storage is valuable for ease of both
access and administration.
At least one individual has made the claim that iSCSI's real competition is the new serial ATA (S-ATA), a cabling protocol that travels from the controller chips on the motherboard directly to the HDD. This is not a true consideration today, since an S-ATA cable is currently limited to 1 meter in length and has no sharing capability. Therefore, it is unlikely that S-ATA will be used for a cabling interface that hangs
out of a home PC for a general storage interconnect. There is a proposal for
increasing the length, but this is for rack mount versions and it is not clear if
that will ever affect the desktop or home market. Further, the S-ATA
specification has not yet defined a technique for permitting the same storage
device to be attached to more than one system at household distances. Within
the iSCSI target, however, one may find iSCSI coming into a small box and
then interconnecting to the ATA disks with S-ATA cables. (See Figure 2-2.)
The real problem starts when the user has to migrate the data from an old file
server to a new unit, a difficult and very disruptive process. There are
technical solutions, of course. One of them is to buy another file server or a
network attached storage (NAS) appliance. Thanks to iSCSI there is another,
generally less costly approach: pooled storage. Pooled storage may consist of
simple iSCSI JBODs (just a bunch of disks) or RAID controllers, all connected
to the same network via iSCSI. With iSCSI, users can also begin small and add
storage as needed, placing it wherever they have room and a network
connection. Regardless of how many units they add, all of them are logically
pooled and yet any of the individual systems in the office can have its own
private storage portion.
With iSCSI all the major storage placement decisions are performed by the
various host systems as if the storage were directly connected to them.
Because of this fact, iSCSI is fairly simple compared to NAS and this results in
low processing requirements in the iSCSI storage controllers. It is therefore
expected that iSCSI JBODs and RAID controllers will be significantly less costly,
and support more systems, than the same storage in a NAS appliance. Using
iSCSI units in this way is not always the right answer; however, it does give
the customer another option that may meet their needs for flexibility and
price. In other words, for the same processor power an iSCSI appliance can
support more clients and more storage than is possible with a NAS appliance.
Small offices are similar to home offices, except that they have more links and
switches. In most installations, a switch can be used to attach either a new
NAS appliance or an iSCSI storage controller. (See Figure 2-3.)
Like home office systems, small office systems are bound together with
100Mb/s links. As a rule, it will be possible for them to upgrade to 1000Mb/s
links whenever their 100Mb/s links become congested. They can usually ease
into iSCSI storage controllers with 100Mb/s links, but over time these links
will be upgraded to 1000Mb/s. They will operate on the same (Cat. 5) Ethernet
cable and provide all the bandwidth needed to support any normal demand for
iSCSI access, NAS access, normal interactive traffic, and Web traffic.
In a small office that has NAS, iSCSI storage can be added to the existing
network with no more network congestion than would exist if its current file
server were updated or if a new NAS server unit were added.
The clear trend in the industry is to tie applications to databases. When this is
done, for the most part the question of sharing files becomes academic since
almost all major databases[*] use "Shared Nothing." In this model the storage
is not shared with other systems. This is true even in larger "clustered"
installations, where the database query is "parsed" and divided up among the
different database systems in the cluster that "own" the data. In this way the
more database servers an installation has, the less it needs to share files and
the more useful iSCSI becomes.
[*]Oracle offers a special version of its database that uses shared disks (LUs) in clustered systems.
This is called Oracle Real Application Clusters, but it is usually found on high-end enterprise database
servers and is not normally used with NAS servers.
We can find a similar situation with chip engineers, who need to share macro
files and their own design files with other designers and the simulation system.
Software engineers are still another example. They share their development
libraries, their resultant components, and even their documentation library.
The engineering environment is the primary environment where files are
shared. Much of the rest of the world shares by sending things around in e-
mails. In general, they usually do not have a file sharing need.
When incidental serial file sharing occurs, the small business office, like the
home office, is more apt to use peer-to-peer sharing and not put in a NAS
server. However, this probability varies with the size of the office; the larger it
is, the higher the possibility that it will devote a system to file serving or
install a NAS appliance. Even when file sharing is not very important, the office often finds use for the file server/NAS as a backup device.
One other solution is the single unit that supports both NAS and iSCSI
functions, which I call a "dual dialect" storage server. With this type of system
the installation does not need much precision in its NAS/iSCSI planning. As an
example, let's suppose the installation only needs 10% of its clients, or
applications, to have file sharing. In this case, one dual dialect storage server
can devote 70% of its processing capability to that 10% and use the remaining
30% for the 90% that have no file-sharing requirements. (See Figure 2-4.)
The pile-on HBA approach will have higher latency than an HBA that has
TCP/IP and iSCSI processing integrated onto a single chip. Even if the pile-on
HBA can operate at line speed (1 Gb/s), the latency caused by this type of
adapter is unlikely to permit its ongoing success in the market. That is because
HBAs with full iSCSI and TOE chips will permit not only operation at line speed
but also very low latency. We should consider the pile-on approach to be a
time-to-market product that vendors will replace over time with faster and
cheaper HBAs using iSCSI and TOE integrated chips.
The goal of iSCSI HBAs is to keep latency as close to that of Fibre Channel as
possible (and it should be close when using integrated chips) while keeping
costs significantly under those of Fibre Channel.
Some people have argued the price issue, saying that Fibre Channel can easily
lower its prices to match iSCSI's because an FC chip will have less silicon than
an iSCSI TOE chip. This is of course an important consideration, but sales
volume is the key, and iSCSI has the potential for high volume with a
technology that operates in an Ethernet environment. This includes operating
at gigabit speeds with normal Cat. 5 Ethernet cable attachments so that the
customer doesn't have to install and manage a new cable type.
As stated previously, I do not believe that FC vendors will give up their high
margins in the high-end market in order to fight iSCSI in the low-end and
midrange markets. This will only occur when iSCSI is considered a threat in
the high end, but by then iSCSI will have large volumes in the rest of the
market and will be able to push the price envelope against Fibre Channel. Also
remember that TCP/IP (and Ethernet) connections will always be needed on
these systems anyway. Therefore, since FC is always a "total cost adder,"
whereas iSCSI will have much of its cost supported by other host requirements
for IP interconnect, price advantage will clearly go to iSCSI.
There has been talk that FC vendors will attempt to move their 1Gb offerings
into the midrange while keeping their 2Gb offerings at the high end. However,
the total cost of ownership (TCO) to the midrange customer will still be higher
than iSCSI because of the shortage of FC-trained personnel, the use of new
special cables, and, as mentioned above, the fact that Fibre Channel is always
a total cost adder.
The goal is for the midrange environment to be able to obtain iSCSI-block I/O
pooled storage, with performance as good as that of Fibre Channel but at lower
cost. However, the midrange customer will still face the dilemma of iSCSI
versus NAS. The same consideration and planning should be done in this
environment as in the small office environment. The only difference is in the
capabilities and price of the competing offerings.
In addition to the normal NAS and iSCSI offerings in this environment, there
will be dual dialect offerings also. The difference is that the iSCSI-offload HBAs
and chips can be employed to reduce the iSCSI host overhead to a point where
they are competitive with Fibre Channel and direct-attached storage. This is
not currently possible with NAS.
With the new copper 1000Mb/s Ethernet adapters, users can have both a high-
speed interactive network and a high-speed storage network, all without
changing the Cat. 5 Ethernet cable already installed throughout their
company. iSCSI storage controllers can supply the needs of both servers and
client desktops and laptops.
Still, the argument is often made that a NAS solution can address the needs of
desktops and laptops. This is true, but at a higher cost. As pointed out earlier,
in the small office environment many applications are being written to use
databases. They generally use a "shared nothing" approach and therefore
provide an information-sharing environment in which NAS is not required.
Again, if files need to be shared, NAS is appropriate; otherwise, a block I/O
interface best meets the requirements. iSCSI is the most cost-effective
approach for non-shared pooled storage.
Many of these midrange companies will be building iSANs. These are logically
the same as FC SANs but are made up of the less expensive iSCSI
equipment: less expensive because the entire Ethernet and IP equipment
market is relatively low priced (at least low priced when compared to Fibre
Channel). Even iSCSI HBAs are cheaper than current FC components. Intel,
for example, has declared that its HBA will be available at a street price of
under $500. It is further expected that iSCSI HBAs and chips will have even
lower prices as sales volumes go up.
One significant difference between the midrange and small office computing
environments is that the I/O requirements of the various servers can be as
demanding as that found in many high-end servers. Therefore, in the
midrange one tends to see more use of iSCSI HBAs and chips in various
servers and storage controllers, and a smaller reliance on software versions of
iSCSI. (See Figure 2-5.)
High-end environments will have the same processor offload and performance
requirements that midrange environments have. However, they will probably
be more sensitive to latency, so it is expected that the pile-on type of HBA will
not be very popular. Because of the never-ending throughput demand from
high-end servers, it is in this environment that HBAs with multiple 1Gb
Ethernet connections and 10Gb implementations will eventually find their
most fertile ground.
The campus is the area adjacent to the central computing site, within a few kilometers, where private LANs interconnect the buildings containing local
department servers as well as the desktops and laptops that are also spread
throughout. The different department areas are analogous to the midrange
and small office environments. Their general difference is that, with the use of
iSCSI, they can exploit the general campus IP backbone to access the data,
which may be located at the central computing location.
Often these department areas have policy or political differences with the
organization that runs the central computing complex, and so they want their
own independent server collections. Generally they want the flexibility that a
storage area network (SAN) can provide (such as device pooling and failover
capability), but they do not want to get into the business of managing an FC
network.
Today, even if the FC cables could be pulled to the various campus locations,
since Fibre Channel has no security in its protocols, the access control
demands of the central computing location may be more than the departments
want to put up with. iSCSI, on the other hand, has security built into the basic
protocol (both at the TCP/IP layer and at the iSCSI layer), which permits fewer
invasive manual processes from anyone, including the disk storage
administrator at the central location. iSCSI also permits the department
servers to be booted as often as necessary, while still getting at central
storage, something that probably would not be done if located within the main
computing center.
Because of their needs and desires, campus departments are very likely to
view iSCSI as key to their strategic computing direction. However, the campus
environment is made up of more than just the department servers. It also has
individual desktops and laptops distributed throughout that look like home
office systems. A major difference, however, is that their users are not
encouraged to modify them. Instead, every time they need additional storage,
they have to justify it to either the department or the central computing
location. Then the central or department "guru" who handles the system must
schedule time to come out and do an upgrade. Since these guru types handle
many different users, they take approaches that can be unpleasant for the end
user, often causing the loss of data or carefully constructed "desktop screens."
Gurus are in a no-win scenario. They do not like taking end users' systems
apart, especially since users can be abusive about procedures, scheduling, and
so forth.
Some installations have been known to just upgrade the entire system
whenever new storage or processor power is needed. In the past this was often
a reasonable approach since the need for processing power was keeping pace
with the creation of storage. Now, as a rule, this is not the case. The 1-to-2
gigahertz (GHz) processors seem to have reached a plateau where the
productivity of the office worker does not benefit from the additional speed of
the laptop or desktop. However, one can still generate a lot of storage
requirements with these processors, and it is beginning to occur to many
companies that replacing systems just to upgrade the storage is a waste of
time and money. Further, it greatly disturbs employees when they lose their
data, their settings, or their visual desktop. Even when things go right in the
backup and restore stages of bringing data from one system to another, the
process is lengthy and tedious. Companies that believe time and productivity
are money dislike these disruptions.
Both the end user and the guru will love iSCSI. To get additional iSCSI storage
the end user just has to be authorized to use a new logical volume, and the
issue is done. Often this can be accomplished over the phone.
Over time, desktops will be "rolled over" for newer versions, which will come
equipped with the 10/100/1000BaseT (gigabit copper) IP adapter cards.
1000BaseT-capable adapter cards permit desktop performance of up to 1 Gb/s,
which will greatly improve the performance of iSCSI storage. Note that most
installations use Cat. 5 copper cables for 10/100Mb/s Ethernet connections,
and these same Cat. 5 cables are adequate for gigabit speeds. Therefore,
installations do not have to rewire in order to get gigabit iSCSI storage access
for their ubiquitous desktop systems.
Since iSCSI also supports remote boot, one can expect many future desktop systems to support only storage connected via iSCSI. The desktops can
then be upgraded as needed independently of the data.
The Satellite
A remotely located office, known as a satellite, will have an environment
similar to that of a campus. It often functions like a department or small office.
Satellites have their own desktop systems and sometimes their own servers.
They generally suffer from the lack of adequate "remote support," which often
means slow response to their needs.
As in the small office, satellite users do not usually touch the system but
instead get a guru to come to the remote location to fix things. With the use of
iSCSI many satellite installations can have their storage-related needs handled
via the phone. As they need more storage, they can call in the storage
administrator, who enables more logical volumes for their use. This is possible
since with iSCSI they are connected to the central location via a virtual private
network (VPN).
When the various satellite offices are located in a metropolitan area, a VPN
becomes very attractive, since there will not be a large problem with "speed of
light" latency issues. These network types are called metropolitan area
networks (MANs). However, the greater the distance, the more local (iSCSI)
storage will be deployed at the satellite location and the less central storage
will be used for normal operations. These more remote locations will like the
feature of local pooled storage that they get with iSCSI, without having to
learn Fibre Channel.
When metropolitan area satellite offices need more iSCSI-based storage, they
just ask the storage administrator at the central installation to logically attach
more virtual volumes to the user's iSCSI access list. All this is possible without
significant effort at the satellite location, assuming, of course, that adequate
bandwidth exists between the central location and the satellite office.
In the past, satellite office connections required private or leased phone lines,
but it is now becoming prevalent in many areas for carriers to offer "IP tone"
at a much lower cost than leased lines. Thus, the customer is now more likely
than before to have high-speed connections between the satellite office and
the central office.
Satellite locations may also have local servers and storage requirements, and
will want the flexibility offered by a SAN. They will find iSCSI a more cost-
effective solution than Fibre Channel, especially since the network
management can still be handled at the central location.
The satellite installation, like the campus environment, will also want to be
able to use centralized tape units for backup without having them located at
the satellite location. This also is an ideal exploitation of the capabilities of
iSCSI. (See Figure 2-8.)
The remote equipment can range from small RAID arrays to large disk RAID
arrays, tapes, and tape libraries. Even the central location prefers to have the
tape library "at distance" from the central site. Often this is a tape vaulting
area (such as "Iron Mountain"). iSCSI permits "natural" access to such remote
units, without undue gateways, routing, or conversions from one technology to
another. (See Figure 2-9.)
Figure 2-9. The at-distance environment.
Part of this "natural" access is the ability of either the servers, the storage
controller, or some third-party equipment to create dynamic mirrors at remote
locations, which can be "spun off" at any time and then backed up to tape. This
permits remote backup without impacting online applications. The remote
mirror can be located at the tape-vaulting site, or iSCSI can be used to send it
to another remote location. This type of process, though possible at a local
site, seems to be very valuable when located at a secure remote site.
Today this remote storage access is done with proprietary protocols, devices,
and often expensive leased lines. In the future it will be done with standard IP
protocols, primarily iSCSI, often utilizing carrier-provided IP tone
interconnects.
The central site will receive iSCSI storage requests from campus department
servers, from desktops and laptops, and from satellite locations. Likewise it will
issue storage requests to remote locations for backup and disaster recovery.
(See Figure 2-10.)
In the central site iSCSI must be able to perform as well as Fibre Channel and
meet the same RAS (reliability, availability, serviceability) requirements. These difficult but attainable requirements
dictate that hosts use top-of-the-line iSCSI HBAs and that these HBAs be
configured to operate in tandem such that failover is possible. Also, they need
to be usable in a parallel manner such that any degree of throughput can be
accomplished. The iSCSI protocol has factored in these requirements and
supports a parallel technique known as Multiple Connections per Session
(MC/S). This permits multiple host iSCSI HBAs to work as a team, not only for
availability but also for maximum bandwidth. The same set of capabilities
within the iSCSI protocol also permits iSCSI target devices to perform similar
bandwidth and availability functions.
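The following conceptual sketch (invented class and method names, not a real iSCSI implementation) illustrates the MC/S idea of one logical session spreading work across several TCP connections and surviving the loss of any single connection.

import itertools
import socket

class MultiConnectionSession:
    def __init__(self, target_addr: str, ports: list[int]):
        # One TCP connection per port (for example, one per HBA); all of them
        # belong to the same logical session.
        self.connections = [socket.create_connection((target_addr, p)) for p in ports]
        self._next = itertools.cycle(range(len(self.connections)))

    def send_pdu(self, pdu: bytes) -> None:
        # Spread PDUs over the connections for aggregate bandwidth; in real
        # MC/S a command and its data keep allegiance to one connection.
        self.connections[next(self._next)].sendall(pdu)

    def drop_connection(self, index: int) -> None:
        # Failover: the session continues on the remaining connections.
        self.connections.pop(index).close()
        self._next = itertools.cycle(range(len(self.connections)))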
The high-end environment will have both the FC and the iSCSI storage
controllers needed to service it. And since Fibre Channel is already there and
can't be ignored, the installation must be able to interconnect the two storage
networking technologies and expect them to "work and play" well together. The
installation will have the problem of how to begin and how to integrate the two
networks. Customers will want to invest in iSCSI storage controllers and yet
continue to capitalize on the FC SAN investments they already have.
Various vendors offer "bridge boxes" that convert iSCSI host connections to FC
storage connections. Some boxes convert FC host connections to iSCSI storage
connections. Both of these functions are accomplished via routers, gateways,
and switches (switch routers).
The thing that will actually make all this interconnection capability work is the
management software. Probably there will be storage network management
software that can operate with all FC networks and similar software that can
control the iSCSI network. Clearly, though, there is a need for storage
management software that can manage a network made up of both FC and
iSCSI.
Since iSCSI and Fibre Channel share the SCSI command set, most existing
LUN discovery and management software will continue to operate as it does in
SCSI and Fibre Channel today. Therefore, there should not be significant
changes to the SAN LUN management software.
The key problem is that the combined "SCSI device" discovery processes need
to be carried out when there are both FC and iSCSI connections. It needs to be
done both to and from the hosts and to and from the storage controllers. When
an FC network manager performs its discovery process and detects an FC
device, which just happens to be available to an iSCSI host (via a gateway
device of some kind), it is important that the iSCSI network manager also
know about the device. Therefore, an FC/iSCSI network manager needs to
combine the results of the FC discovery process with its iSCSI discovery
process so that all appropriate devices can be offered to the host systems as
valid targets.
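A hypothetical fragment (invented function and parameter names, not any real management API) showing the kind of merge described here: FC-discovered devices that a gateway exposes to iSCSI hosts are folded into the iSCSI discovery results so that every valid target can be offered.

def merge_discovery(fc_devices, iscsi_devices, gateway_map):
    """fc_devices: WWNs seen on the FC fabric.
    iscsi_devices: iSCSI target names found by the iSCSI discovery process.
    gateway_map: WWN -> iSCSI name assigned by an FC/iSCSI gateway."""
    combined = set(iscsi_devices)
    for wwn in fc_devices:
        if wwn in gateway_map:            # FC device reachable by iSCSI hosts
            combined.add(gateway_map[wwn])
    return combined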
To sum up, the high-end environment contains all aspects of the low-end
(SoHo) and midrange environments, plus additional requirements for high
availability and large bandwidth, along with campus and WAN (intranet)
connections. It also requires seamless interconnect between FC and iSCSI
networks.
FC and iSCSI
Many have asked, "Will the integration of iSCSI and Fibre Channel continue
indefinitely?" Since that question has an open-ended timeframe, the answer
has to be no. A more important question, however, is, "What is the timeframe
for when the customer will more likely buy iSCSI products than Fibre Channel
products?" Said another way, "When will the sales of iSCSI equipment surpass
the sales of Fibre Channel equipment?" My guess is 2006-2007. Other analysts
have said 2005 and even 2004, but this seems to me to be wishful thinking.
I believe that the volumes will tip in favor of iSCSI (regardless of the
timeframe) because iSCSI can perform all the functions that Fibre Channel
can. In addition, iSCSI will operate on the wide area network (WAN), the
Internet, and campus LANs up to the individual desktop and laptop level. Thus,
when an installation is considering what to purchase next, it is probably going
to choose the most flexible technology, especially since that seems to be the
technology with the lowest projected price. This applies to both the initial costs
to purchase and, as addressed earlier, the ongoing cost of management,
training, and the like.
I did not say that Fibre Channel was going to go away in 2006-2007, just that
iSCSI would begin to outsell it. That is because iSCSI will make significant
gains by playing to a larger market, not just by displacing current Fibre
Channel. I think Fibre Channel will continue to evolve for a while and continue
to support its customer base. In other words, if I could project myself to 2010,
I would still see sales of FC equipment. This means that not only will there be
iSANs (SANs made up of only iSCSI network entities), but there will also be
long-term business in the area of integrating iSCSI and FC SANs via various
gateways and switching routers.
Chapter Summary
Home offices:
Find iSCSI solutions useful for easily adding storage without opening
systems.
Small offices:
Midrange environments:
Have desktop and laptop systems that use software iSCSI device drivers.
The size of the midrange market will be so large that the price of iSCSI
adapters and chips will bring significant cost reductions to the iSAN market.
iSCSI will tend to dominate the markets at the midrange.
Well-trained personnel.
Tape libraries.
Many small fiefdoms of key managers who need to own their own
systems.
At-distance installations.
They can have their own local iSCSI SANs with local storage.
They can have all their disk storage requirements coordinated with a
central location.
The bigger the satellite office, the more it will function independently, but even
a small office will have its logical SAN extend across VPNs to the central
location. This will be especially true in metropolitan area networks (MANs).
MANs will permit central iSCSI storage to be used as if it were local to the
satellite location.
These environments want access to devices such that tape can be located
in some remote area (called tape vaulting).
The central location holds all the big servers and big storage controllers.
The key requirement is to ensure that iSCSI and Fibre Channel "work and
play" well together.
Perform discovery.
iSCSI will be considered a very successful transport if it can meet all the
needs of SoHo and midrange environments.
Chapter 3. The History of iSCSI
To the Reader
This chapter will describe the events that brought iSCSI to the storage
industry. Key individuals involved in these events are acknowledged in the
Credits and Disclaimer section at the beginning of this book.
The chapter will focus on the early days of iSCSI development. Included is an
important set of measurements that helped IBM set its iSCSI direction with
respect to NFS, Fibre Channel, and the need for a TCP/IP offload engine.
Readers not interested in history should at least read the Measurements
section and then pick up with the Chapter Summary.
SCSI over TCP/IP
By October 1998 the researchers had the first working prototype of SCSI over
TCP/IP. A number of performance measurements were taken comparing the
overheads of Fibre Channel, of SCSI over TCP/IP, and of SCSI over raw
gigabit Ethernet. The bottom line was that the TCP overhead was not as great
as feared, and it was felt that processor power, as it kept increasing, would
make the differences negligible.
This was not a sufficient condition to set new directions within IBM
development, however. There were more questions to answer, key among them
whether SCSI over TCP/IP could be made to perform well in "enterprise"
environments or would remain only a niche technology. Several things needed
to be measured and understood, to arrive at an answer:
Was there a future for the technology in areas where Fibre Channel would
not go?
Was there a definable path to get to the technology promises?
To get a handle on these questions, measurements were taken where data was
sent from client to target systems. In some cases the connection was Fibre
Channel and in some it was TCP/IP. The measurements showed very clearly
that TCP/IP had a long way to go before it could be competitive with Fibre
Channel in the enterprise machine room. It also became very clear that,
unless TCP/IP could be offloaded onto a chip or an HBA, as was done with the
networking overhead of Fibre Channel, there would never be a sufficient
reduction of CPU overhead to convince customers to use TCP/IP instead of
Fibre Channel in fully loaded servers, especially with high-end systems. In
addition, some key measurements also needed to be run to answer the
following questions:
Will SCSI over TCP/IP perform better or worse at the target than a
Network File System (NFS) server?
These were critical questions, especially since most pundits at that time held
the position that a reasonable TOE could not be created.
Note: During the second half of 1999, I met with a number of TCP/IP and NIC
vendors to discuss the possibility of a TOE. They said that a TOE on a chip was
not possible or reasonable.
Measurements
The question of NFS versus SCSI over TCP/IP was considered to be a key
consideration regarding the future of iSCSI. If NFS could operate as well as
SCSI over TCP/IP, why would one ever want to use SCSI over TCP/IP? A
number of measurements were taken in 1998-1999 to get a basis for
answering this question.
By early 2000 the measurements had led to the creation of an IBM internal
white paper. A key set of measurements showed how a file of a given size
could be transferred from a client to a target/server using TCP/IP NFS and
using SCSI over normal TCP/IP, and how the CPU utilization compared in these
two approaches. To ensure that they were measuring apples to apples, the
researchers put everything on a RAM disk at the target to eliminate some of
the variability.
The results can be seen in Figure 3-1, which shows a significant difference
between the CPU utilization in the NFS server and the SCSI-over-TCP/IP
server. The client side of the operation is not shown, but the client-side
overhead was about a wash between SCSI and NFS.
[*]
The base performance numbers used in this graph were provided courtesy of the IBM
Corporation [Performance].
Figure 3-1 shows that SCSI over unmodified TCP/IP consumed only 26% to
31.4% of the CPU cycles consumed by an NFS server with unmodified TCP/IP.
The single-buffer-copy version of SCSI over TCP/IP reduced those numbers to
9% when sending and 14% when receiving. A zero-copy version consumed
only 6% of the CPU cycles that were consumed by the NFS server when
sending and 4% when receiving.
Assuming that the total NFS TCP/IP overhead for data transfers is about the
same as that of SCSI over TCP/IP, this meant that one could handle around 12
to 16 times as much I/O workload with SCSI over TCP/IP as with NFS, even if
both offloaded the TCP/IP data copy overhead. (The factor of 12 to 16 is
roughly the reciprocal of the 6% to 8% of NFS CPU utilization that SCSI over
TCP/IP would still consume.)
Cisco and IBM's Joint Effort
In the fall of 1999 IBM and Cisco met to discuss the possibility of combining
their SCSI-over-TCP/IP efforts. After Cisco saw IBM's demonstration of SCSI
over TCP/IP, the two companies agreed to develop a proposal that would be
taken to the IETF for standardization.
The combined team from Cisco and IBM developed a joint iSCSI draft during
the fourth quarter of 1999. They had an initial external draft ready by
February 2000, when a meeting was held in San Jose attended by HP, Adaptec,
EMC, Quantum, Sun, Agilent, and 3Com, among others, to solicit support for
presentation of the draft to the IETF. At this meeting several proposals were
talked about that used SCSI over Ethernet. At least one suggested not using
TCP/IP; however, the general consensus of the group was for SCSI-over-TCP/IP
support. With backing from this group, the draft was taken to the IETF meeting
held in Adelaide, Australia (March 2000).
iSCSI and IETF
At Adelaide there was a BOF (birds of a feather) meeting at which the draft
was presented, and it was agreed that a group would meet in April 2000 in
Haifa, Israel, to do additional work on it. The goal was to enlarge the working
team, secure consensus, and prepare to take the proposal to the next IETF
meeting so that a new workgroup for iSCSI could be started. (By this time we
had coined the name iSCSI to represent the SCSI-over-TCP/IP proposal being
developed.)
The next meeting of the IETF was in Pittsburgh in August 2000. At that
meeting the draft was presented and a new workgroup was started. This group
was called the IP Storage (ips) workgroup, and it included not only iSCSI but also
a proposal for bridging FC SANs across IP networks (FCIP). Subsequently a
similar draft from Nishan Systems Corporation, called iFCP, was added to the
workgroup. David Black and Elizabeth Rodriguez were chosen to be the co-
chairs of the IETF ips workgroup, and Julian Satran was made the primary
author and editor of the iSCSI working draft. Subsequently I was chosen by
David Black to be the technical coordinator of the iSCSI track.
The process moved the draft through several iterations until it was agreed that
all outstanding issues had been resolved.
It should be noted that parallel efforts were under way within Adaptec and
Nishan Systems. Adaptec was focusing on SCSI over Ethernet, and Nishan was
focused on Fibre Channel over UDP. These efforts were not accepted by the
IETF ips workgroup, but the Adaptec and Nishan efforts, as they joined the
iSCSI effort, have given additional depth to the project.
At about the same time Cisco started shipping its 5420 bridge/router
(iSCSI to Fibre Channel).
Adaptec is shipping iSCSI HBAs and chips with iSCSI and a TOE.
Most of the current FC chip vendors are shipping, or planning to ship soon,
iSCSI chips and HBAs.
IBM has stated its intention to make iSCSI part of its main line fabric
and not just a connection to its 200i product.
Membership in the IETF ips workgroup has grown to over 650 people
representing over 250 different companies. The SNIA IP Storage Forum (which
requires an additional membership fee above the fee to join SNIA) has 55
active companies contributing to joint marketing efforts for IP storage.
Chapter Summary
IBM performed some key measurements that showed that SCSI over
normal TCP/IP uses only about a third to a quarter of the CPU processing
power used by NFS with normal TCP/IP.
The performance analysis showed that, if all the overhead for TCP/IP data
movement was removed from both iSCSI and NFS, the remaining iSCSI
CPU utilization would be only 6% to 8% of the remaining NFS CPU
utilization.
IBM and Cisco developed an initial draft proposal for presentation to the
IETF.
A team of companies came together in Haifa to refine the draft for the
IETF (draft level 0).
IBM, Cisco, and the new team of supporting companies brought the draft to
the IETF, where it was part of the new IETF working group called IP
Storage (ips).
Adaptec and Nishan had parallel efforts but over time joined the iSCSI
bandwagon.
Many vendors are now shipping iSCSI products, and full gigabit line speed
has been demonstrated between an iSCSI initiator on the East Coast and a
target FC controller on the West Coast.
Chapter 4. An Overview of iSCSI
To the Reader
TCP/IP
Sessions
Chapter Summary
To the Reader
The following text is at a fairly high level. However, readers can skip to the
various summaries within the chapter and then pick up with the overall
chapter summary.
The host system is made up of applications that communicate with the SCSI-
connected devices via one of the following to a SCSI-class driver:
A file system.
There is a SCSI class driver for each type of SCSI device (tape, disk, etc.).
The SCSI class driver invokes the appropriate hardware device driver.
The hardware device driver interacts with the HBA via a vendor-specific
interface.
The HBA sends the SCSI commands (CDBs) and data to the remote SCSI
device's HBA and device driver.
The actual SCSI communication is thus from the SCSI class driver to the SCSI
process in the target device. The SCSI class driver gets its instructions from
the application, and the SCSI target process gives the instructions to the LU.
Communication is from application to LU via SCSI protocols. The section on
iSCSI protocol layers further on discusses this process.
In Chapter 1 we talked about the SCSI parallel bus protocol that delivers the
SCSI commands to the SCSI device. To observe the semantics of SCSI we must
deliver SCSI commands in order to the LU and deliver data to or from the LU
as that LU requires. We also need to understand that, even with the SCSI
parallel bus protocol, commands for different LUs can be multiplexed with each
other, as can the data.
TCP/IP
As will be seen later, error-free in-order delivery has great value, but storage
needs an even higher degree of error detection than is normally available to
TCP/IP. Therefore, iSCSI designers added optional additional enhancements
that can be used in some environments to ensure better end-to-end data
protection. These additional features will be described later.
iSCSI's basic unit of transfer is called a protocol data unit (PDU). A PDU is
designed to carry, along with other information, the SCSI CDBs and the
appropriate data components from the initiator to the target, and to receive
the required data or reply in response. However, a CDB only describes the
function that the initiator wants the target to perform. It does not carry any
information about the address of the SCSI target or the LU to which it is
directed, because SCSI follows a procedure call model, in which the CDB is
only one argument. The other arguments, such as the LUN, are encapsulated
by iSCSI in the same PDU with the corresponding CDB, and the IP address of
the target is encapsulated in the TCP and IP portions of the protocol packet.
Let's say the above in a different way. Assume that the initiator discovers the
location of an iSCSI device, which it is authorized to access. This device will
have an IP address. The iSCSI device driver, or protocol handler within the
HBA, will build a PDU with the appropriate CDB and LUN. It will then hand that
packet over to a TCP/IP socket. TCP will encapsulate the PDU further with TCP
headers, then turn the resulting packet over to IP. IP will determine the detailed
routing address, place it in a packet header, and send it to the Ethernet
component, which will add the physical-link-level headers and trailers and
send the final packet on its way. (See the encapsulations shown in Figure 4-2.)
That, in a nutshell, describes the main processing path for the iSCSI transport
protocol. The next section will describe how the various layers of the iSCSI-
related protocol stack exploit the TCP/IP message structure.
TCP/IP Summary
Protocol data units (PDUs) are the basic form of message unit exchange
between hosts and storage controllers.
Note that TCP looks at IP as its transport for sending packets of information to
its remote counterpart. Further, IP looks to a set of link-level wire protocols to
transport its packets across the network. These protocols can follow the
Ethernet standard or one of the OC (optical carrier) standards. We will
focus on the Ethernet links here.
As you can see, a definitive layering is associated with the transport of SCSI
commands and data across the network.
Figure 4-3 illustrates the layering structure. Notice that the objects at the top
of the layering are the application and the LU to which the application is trying
to execute an I/O operation. To do that, it sends an I/O request to the OS
kernel which is directed in turn to the appropriate SCSI class driver.
Sometimes this is done indirectly through a file system. The SCSI class driver
forms the I/O request into an appropriate SCSI command, which is placed in a
data packet called a CDB and sent to the SCSI device via the appropriate local
device driver, indicating the required LU.
The sending TCP/IP layer passes packets to the remote TCP/IP layer by
delivering them to the data link layer below it (for example, the Ethernet data
link layer). As was shown in Figure 4-2, the data link layer places a header
and a trailer on the physical link along with the TCP/IP packet. It then passes
the total Ethernet frame to the remote data link layer, which passes it to the
TCP/IP layers above it.
Note that the PDU might span more than one TCP segment and that each of
the TCP segments will be encapsulated within an IP datagram which is further
encapsulated by the data link layer into an Ethernet frame. It is the job of
TCP/IP to reassemble the TCP segments in the right order on the target side
and deliver them to the iSCSI layer in the same bytewise order in which they
were sent. iSCSI will then extract the CDB from the PDU and deliver it to the
SCSI target device code, indicating the appropriate LUN. Then the SCSI target
code will deliver the CDB to the actual LU. When the SCSI layer needs data for
the command (e.g., a write command) it requests it from the iSCSI layer and
passes it through to the LU. The target iSCSI layer extracts the requested data
from PDUs sent from the initiator, which can be one of the following:
PDUs that were unsolicited but sent by the initiator and contain only data
(called "unsolicited" data)
Data PDUs that the iSCSI target explicitly requested on behalf of a direct
SCSI layer solicitation
Protocol Summary
iSCSI uses TCP/IP as the "network transport" to carry the basic elements of
the SCSI Protocol. It carries all the SCSI functions that would otherwise take
place on a physical SCSI bus. TCP/IP provides iSCSI with inherent reliability
along with byte-by-byte in-order delivery. iSCSI added the additional capacity
and reliability of multiple physical/logical links (TCP/IP connections) while
ensuring that the commands and data, which are spread across the multiple
connections, arrive in order at the target SCSI device.
The API will deliver the "write" request to the SCSI layer (the SCSI class
driver).
The SCSI class driver will build a CDB for the "write" request and pass it to
a device driver in the iSCSI protocol layer.
The iSCSI protocol layer will place the CDB (and other parameters) in an
iSCSI PDU and invoke TCP/IP.
TCP will divide the iSCSI PDU into segments and place a TCP header on
them; the IP processing will additionally place an IP header before the TCP
header.
TCP/IP will deliver the segment to the Ethernet data link layer, which will
frame the segment with Ethernet headers and trailers.
At the target, the data link layer will check and strip off the Ethernet
framing and pass the remaining segment to TCP/IP.
TCP and IP will each check and strip off their headers, leaving only the
iSCSI PDU, which they will pass to the iSCSI layer.
The iSCSI layer will extract the "write" CDB from the iSCSI PDU and send
it along with related parameters and data to the SCSI device.
The SCSI device will send the SCSI "write" request and the data to the LU.
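To make the encapsulation order in the steps above concrete, the following small Python sketch is purely illustrative: the "headers" are placeholder strings, not the real TCP/IP or Ethernet wire formats, and the helper names are ours.

def build_iscsi_pdu(cdb, lun, data):
    # Illustrative stand-in for the real 48-byte BHS plus data segment.
    return b"[BHS opcode/LUN=" + lun + b"/CDB=" + cdb + b"]" + data

def tcp_segment(payload):
    return b"[TCP header]" + payload          # TCP encapsulates the iSCSI PDU

def ip_datagram(payload):
    return b"[IP header]" + payload           # IP encapsulates the TCP segment

def ethernet_frame(payload):
    return b"[Eth header]" + payload + b"[Eth trailer]"   # link-level framing

pdu = build_iscsi_pdu(cdb=b"WRITE", lun=b"0", data=b"user data")
print(ethernet_frame(ip_datagram(tcp_segment(pdu))))

At the target the layers are stripped in the reverse order, exactly as the last steps of the list describe.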
Sessions
The logical link that carries the commands and data to and from TCP/IP
endpoints is called an iSCSI session. A session is made up of (at least) one
TCP/IP connection from an initiator (host) to a target (storage device).
In general this concept is simple and straightforward. There are several things,
however, that conspire to make the protocol more sophisticated and involved.
The first is the need to send out more I/O commands and data than can be
accommodated with a single TCP/IP connection. In Fibre Channel this is
handled by the host system adding more sessions between the initiator and
the target. This can also be done for iSCSI.
The only problem with this approach is the lack of a SCSI-defined method for
sending commands and data across multiple links. As a solution each target
vendor has created what the industry calls an initiator wedge driver to
balance the workload over the multiple FC links. This means that storage
vendors need to include their own code in the host system. The more types of
storage controllers the customer has, the more vendor-specific wedge drivers
are needed around the SCSI class drivers. This can cause a lot of operating
system software conflicts. In order to prevent this confusion in the iSCSI
space, and to allow any small system in the network to get to the appropriate
storage controller without having to add vendor-specific wedge drivers, iSCSI
added some additional functions (complexity) in its protocol.
One such function is the concept of multiple connections per session (MC/S).
(See Figure 4-4.) This is the ability for several TCP/IP connections to make up
a session and be used as if they were parts of a single conduit between the
initiator and the target. It allows commands and data to be transported across
the different links (connections) and to arrive at the ultimate SCSI layer target
in the same order as if they had been transported over a single connection. To
enable this, the iSCSI PDU must contain, in addition to the CDB and LUN,
additional information. This is somewhat ironic, since TCP/IP was chosen for its
ability to deliver data in order, but now we have to add information, counters,
and additional flow control to ensure the appropriate order across multiple
connections.
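As a rough sketch of that idea (illustrative only; iSCSI's real ordering machinery, based on session-wide command sequence numbers, is covered in later chapters, and the class and helper names here are ours), an initiator could stamp each command with a session-wide number and spread the commands over the session's connections, letting the target restore the original issue order:

import itertools

class IllustrativeSession:
    # Illustrative only: numbers each command with a session-wide sequence
    # number and hands commands to the session's connections round-robin.
    def __init__(self, connections):
        self.connections = connections
        self.next_seq = 1
        self._pick = itertools.cycle(connections)

    def send_command(self, cdb):
        seq = self.next_seq
        self.next_seq += 1
        next(self._pick).append((seq, cdb))    # stand-in for sending a PDU

conn_a, conn_b = [], []
session = IllustrativeSession([conn_a, conn_b])
for cdb in ("READ 1", "READ 2", "WRITE 3", "READ 4"):
    session.send_command(cdb)

# The target merges the connections and sorts by the session-wide number,
# recovering the order in which the commands were issued.
print(sorted(conn_a + conn_b))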
Session Summary
These links carry TCP/IP protocols and the iSCSI PDUs, which in turn carry
the commands and data.
The iSCSI protocol defines how the commands and data can be spread over
all of the session links yet to be delivered in order to the target SCSI
device.
Protocol Data Unit (PDU) Structure
Figure 4-5 shows the overall format of a PDU. Notice that it is made up of
several segments. It has a 48-byte basic header segment (BHS), which
contains the CDB, LUN, and so forth. This BHS will be studied in more detail a
little further on. Descriptions of the additional header segments (AHS) will be
shown in detail in Appendix A.
As one can see in Figure 4-5, there are three other extension types. One of
those is the data segment, which contains the data being sent to the target
device. The other two extensions are digests. "Digest" is a fancy term for an
error check word. It works similarly to the TCP/IP exclusive OR checksums, but
with 32 bits (instead of 16) and a more reliable algorithm, called CRC-32c
(cyclical redundancy check, 32-bit wide, as proposed by G. Castagnioli et al.
[Castagnioli]). The reason for this CRC will be discussed later.
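For readers who want to see the arithmetic, here is a minimal bit-at-a-time Python sketch of CRC-32c (the Castagnoli polynomial in its reflected form, 0x82F63B78, with an initial value and final XOR of 0xFFFFFFFF). Real implementations are table driven or done in hardware, and the iSCSI specification remains the authoritative definition of the digest rules.

def crc32c(data: bytes) -> int:
    # CRC-32c, bit at a time: reflected polynomial 0x82F63B78,
    # initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Well-known CRC-32c check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283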
Figure 4-6 shows the basic header in some detail. We can see that it has some
flags (discussed later), an iSCSI opcode (which specifies the type of iSCSI
operation being performed), and the lengths of the additional header segment
and of the data segment, if any. It may also contain, depending on the opcode,
the LUN, and the SCSI CDB. For request PDUs, if the I bit is set to 1, the
request is "immediate."
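As an illustration of the fields just listed, the sketch below packs a 48-byte BHS for a SCSI command PDU in Python. The offsets and flag bits follow our reading of the iSCSI draft's SCSI Command PDU layout, and the function name and parameter names are ours; treat Appendix A and the IETF specification, not this sketch, as authoritative.

import struct

def build_scsi_command_bhs(lun: bytes, itt: int, cdb: bytes,
                           data_segment_length: int = 0,
                           expected_data_length: int = 0,
                           cmdsn: int = 0, exp_statsn: int = 0,
                           read: bool = True) -> bytes:
    # 48-byte Basic Header Segment for a SCSI Command PDU.
    assert len(lun) == 8 and len(cdb) <= 16
    bhs = bytearray(48)
    bhs[0] = 0x01                        # opcode: SCSI Command (I bit 0x40 not set)
    bhs[1] = 0x80 | (0x40 if read else 0x20)      # F bit plus R or W bit
    bhs[5:8] = data_segment_length.to_bytes(3, "big")   # byte 4 = AHS length (0)
    bhs[8:16] = lun
    bhs[16:20] = struct.pack(">I", itt)                 # initiator task tag
    bhs[20:24] = struct.pack(">I", expected_data_length)
    bhs[24:28] = struct.pack(">I", cmdsn)
    bhs[28:32] = struct.pack(">I", exp_statsn)
    bhs[32:32 + len(cdb)] = cdb          # CDB, zero padded to 16 bytes
    return bytes(bhs)

# Example: a READ(10) CDB asking for one block at logical block address 0.
read10 = bytes([0x28, 0, 0, 0, 0, 0, 0, 0, 1, 0])
bhs = build_scsi_command_bhs(lun=bytes(8), itt=0x1234, cdb=read10,
                             expected_data_length=512)
print(len(bhs), bhs.hex())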
We have examined the relatively straightforward main path, the use of TCP/IP
connections, and the format of the PDU that carries the SCSI commands. Now
we will make things a bit more complicated. As mentioned in Chapters 1 and 2,
the market requires the building of iSCSI HBAs. Many vendors building them
in order to increase performance and reduce cost have decided to embed their
iSCSI processing with their TOE. This permits them to move the TCP/IP
packets directly to the host system without additional moves within the main
host processor memory. However, to do this the iSCSI processor needs to see
the TCP/IP segments as they arrive so that the iSCSI PDU header can supply
the information that permits the placement of the data directly in the host
system memory, at the appropriate locations, with no additional data
movement. This will be covered in more detail in Chapter 8.
The protocol data unit (PDU), which is the basic message packet that travels
between the host and the storage device, is made up of
The BHS has the opcode for the PDU and various flags; some PDUs have the
CDBs, the LUN, and a series of counters.
iSCSI and TOE Integration on a Chip or HBA
If the PDU header arrives on the HBA before the rest of the PDU, then all parts
of the PDU can be directly placed in their ultimate system memory location.
However, if the header arrives later than the other packets, the TOE will need
to buffer the packets until the header's arrival (we call that the "reassembly
buffer"). When the header arrives, the iSCSI processor will store the packets
directly in the correct target system memory location.
In normal situations the reassembly buffer is not very large. This permits a
very-low-cost HBA, since the RAM memory on it is minimal. It is only when
error processing causes retransmission that there needs to be a considerable
amount of RAM to hold all the fragments that come in before the missing PDU
header.
HBA vendors, as a result of combining the iSCSI processing and the TOE on
the HBA or chip, are offering a product that can assist greatly in the
performance of the normal operation path. However, the amount of on-HBA
RAM needed for the reassembly buffer, in the presence of an error, can be
quite high. This is especially true if the distances between endpoints are great
(say from Los Angeles to New York). Therefore, iSCSI has added an optional
additional protocol, called Fixed Interval Marking (FIM). FIM will permit the
integrated iSCSI/TOE HBAs to locate the next iSCSI PDU header in the stream
without waiting for any missing iSCSI headers to arrive. This will enable the
vendor to build a highly functional HBA but still limit the amount of additional
RAM required on it. Thus, HBA vendors not only can offload the TCP/IP
overhead but also can keep their HBA RAM memory requirement low and
thereby keep their cost low.
Various vendors are building TOEs so that they can offload the CPU
processing of TCP/IP.
The TOE allows rapid processing and low host CPU cycles.
Vendors are combining iSCSI processing with TOEs to create an iSCSI HBA
that will be functionally similar to the FC HBA.
iSCSI and TOE HBAs can work together to provide direct data placement
into the final host memory location.
A protocol called FIM will help the iSCSI and TOE HBA vendors to perform
direct memory placement without stalling the data flow when a BHS
arrives out of order at the TCP/IP buffers.
Checksums and CRC (Digests)
Even though we chose TCP/IP for its ability to deliver data in error-free byte
order, the fact is that the TCP/IP 16-bit one's complement checksum is not
always strong enough to ensure that all transmitted data is in fact received
correctly. This should not be viewed as a frequent problem, but it is frequent
enough that major enterprise environments cannot ignore the possibility of
undetected errors.
Before we panic, let's see what the TCP/IP network has going for it. The
Ethernet links themselves are protected by a 32-bit cyclic redundancy check
(CRC-32) calculation that travels with the packet across the Ethernet links.
The weakness of this is that it offers no protection for the packets as they pass
through routers, gateways, and switches, many of which leave their circuits
susceptible to data corruption. Therefore, it is possible for the data to be
delivered error free to a switch or router, yet have the switch or router cause
some errors before the outgoing packet is sent. Since the packet gets a new
Ethernet CRC as it is sent on to the target, the resultant corrupted data packet
is delivered to the target without an Ethernet error. And since the TCP/IP 16-
bit checksum is a weak detector of errors, the corrupted data can be delivered
to the application. When I say a "weak detector of errors," I mean weak
compared to a CRC-32 error checker. Looking at the combination of Ethernet
CRC-32 error checking on the links and TCP/IP checksum end to end, we can
conclude that the data is reasonably protected.
The usual approach to handling the issue of undetected error loss is to use
better routers and switches, perhaps ones that have internal checks on the
integrity of their memory and data paths. In this way the probability of an
undetected error is greatly reduced. However, when it comes to data, many
storage vendors are very protective and they want to add even further
corruption protection.
The architects of iSCSI have determined that they cannot ensure the integrity
of the data in installations that have less than perfect routers and switches.
Therefore, they have decided to include as an execution option their own 32-
bit CRC. Also, in order to permit iSCSI-specific routers and gateways, the
iSCSI CRC (when used) will be applied to the PDU header and data separately.
In this way, the iSCSI router will be able to change the PDU header and
reroute the PDU with its data, yet not have to reconstruct the data CRC.
Thereby it leaves the 32-bit CRC able to accompany the data end to end.
The fact that iSCSI can detect an error that TCP/IP has not detected, and
therefore that iSCSI must handle itself, is both good news, from an installation
standpoint, and bad news, from a protocol standpoint.
When TCP/IP finds the error, it silently retransmits it; the iSCSI layer does not
see any of it and does not need to do anything special for error recovery. On
the other hand, if TCP/IP does not discover the error but iSCSI does, iSCSI
must do the retry without assistance from TCP/IP. This will be discussed further
in Chapter 11.
Note that the CRC-32 feature is optional in iSCSI, and customers can choose
not to use it. Perhaps they have top-of-the-line equipment that does not need
the extra protection, or they may be operating with the integrity
(cryptographic digest) or privacy mode (encryption) of IPsec (See Chapter 12).
It is also expected that for most laptops and desktops that have a software
iSCSI implementation, the risk will be low enough and the overhead so
noticeable that they also will operate without the CRC feature.
Naming and Addressing
In order to explain some of the following, we must cover the naming issue.
Naming in iSCSI eases the administrative burden somewhat. As a goal, the
initiator node and the target node should each have a single name, which is
used as part of all sessions established between them. Also, there is a
requirement that iSCSI initiator and target nodes have one of two types of
names: iqn (iSCSI qualified name) names or eui (IEEE EUI-64 based) names.
Both types are meant to be long lived as well as unique in the world and
independent of physical location. Both have a central naming authority that
can ensure their uniqueness. Examples of each are iqn.1998-
03.com.xyc.ajax.wonder:jump and eui.acde48234667abcd.
The eui name is a string constructed from the identity formed using the IEEE
EUI (extended unique identifier) format (EUI-64). (EUI-64 identities are also
used in the formation of FC worldwide names.) Each EUI-64 identity is unique
in the world. Its format in an iSCSI name is eui. plus its 16 hex digits (64
bits).
The most significant 24 bits of the EUI-64 identity are the company id value,
which the IEEE Registration Authority assigns to a manufacturer (it is also
known as the OUI, or the organization unique identifier). The manufacturer
chooses the least significant 40-bit extension identifier; however, the results
must be a string of 16 hex digits, which is unique within the world.
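A small Python sketch of that construction follows; the helper name is ours, and the values simply reproduce the eui. example used above.

def make_eui_name(company_id: int, extension: int) -> str:
    # EUI-64: a 24-bit IEEE-assigned company id (OUI) followed by a 40-bit
    # manufacturer-chosen extension, rendered as "eui." plus 16 hex digits.
    assert 0 <= company_id < (1 << 24) and 0 <= extension < (1 << 40)
    return "eui." + format((company_id << 40) | extension, "016x")

print(make_eui_name(0xACDE48, 0x234667ABCD))   # -> eui.acde48234667abcd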
The iqn names are formed by a different set of policies. The characters
following the iqn. are the reverse DNS name (the name resolvable by the
Domain Name Service) assigned by a naming authority within a company,
which can assign additional unique qualifiers. For example, suppose the
company has the base name ajax.xyc.com. The naming authority would pick
an additional string of characters that make the initiator node name unique in
the world. Thus, if a fully qualified domain name (FQDN) format is
jump.wonder.ajax.xyc.com, the iqn. form of the iSCSI initiator node name will
be that string reversed: com.xyc.ajax.wonder.jump.
However, the name of an iSCSI node might last longer than the company that
created it, or the division of the company could be sold. Subsequent
administrators in the new company may not have a clue about what names
were previously allocated and thus could create a naming conflict with
subsequent allocations. To avoid this, a date field is required before the
Reversed FQDN name, the format of which is yyyy-mm where yyyy is the year
and mm is the month. For example, iqn.1998-03.com.xyc.ajax.wonder.jump.
The date should be when the root name was assigned to the company, that is,
when the company received the xyc.com DNS root name segments from a
Domain Names Registry company (such as VeriSign). In the case of a purchase
or merger, the date should be when that merger or purchase took place. To be
safe, choose a previous month in which the DNS was valid as of the first of
that month. The point is to ensure that a later sale of the company or the
name will not cause a conflict, since at the date of assignment the iqn name
was unique in the world.
Because some companies have suborganizations that do not know what other
parts of the organization are doing, iSCSI has added the option of demarcating
the name so that conflicts can be reduced. In other words, it is possible for one
organization to own the root and then hand out subdomain names like ajax to
one group and wonder.ajax to another. Then each suborganization can self-
assign its iSCSI node names. The ajax group might, however, assign the string
wonder.jump, and the wonder.ajax group might assign the string jump. The
resulting two locations might both have an iqn name that looks like the one
above. (Remember that the iqn name format is the reverse of the domain
name format.)
To avoid this duplication, iSCSI gives an installation the option to apply a colon
(:) as a demarcation between the DNS-assigned string and the iSCSI unique
string. The following are some examples (also see Figure 4-7): iqn.1998-
03.com.xyc.ajax:wonder.jump for the ajax.xyc.com group; iqn.1998-
03.com.xyc.ajax.wonder:jump for the wonder.ajax.xyc.com group; and
iqn.1998-03.com.xyc.ajax.wonder.jump for organizations without conflict.
Figure 4-7. iSCSI node names.
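The following Python sketch (the helper name is ours) applies the iqn-forming rules just described: the yyyy-mm date first, then the reversed DNS name, then an optional colon-demarcated string. It reproduces the example names shown above.

def make_iqn_name(year: int, month: int, domain: str, unique: str = "") -> str:
    # Prefix the yyyy-mm date on which the naming authority owned the DNS
    # name, append that DNS name reversed, and optionally add a colon
    # followed by the authority-assigned unique string.
    reversed_domain = ".".join(reversed(domain.split(".")))
    name = "iqn.%04d-%02d.%s" % (year, month, reversed_domain)
    return name + (":" + unique if unique else "")

print(make_iqn_name(1998, 3, "ajax.xyc.com", "wonder.jump"))
# -> iqn.1998-03.com.xyc.ajax:wonder.jump
print(make_iqn_name(1998, 3, "wonder.ajax.xyc.com", "jump"))
# -> iqn.1998-03.com.xyc.ajax.wonder:jump
print(make_iqn_name(1998, 3, "jump.wonder.ajax.xyc.com"))
# -> iqn.1998-03.com.xyc.ajax.wonder.jump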
The reverse name format ensures that the name is not confused with a real
FQDN name. This is necessary since the address of a target or an initiator is
actually both the TCP/IP address port and the iSCSI name. Now, since the IP
address port can be resolved from a real FQDN name, the FQDN can be used in
place of the absolute address port. Because the iqn is the reverse of the FQDN,
there will never be any confusion when they are used together. Also, the fully
qualified address will not have the same name repeated. It is intended to look
different and so is not resolvable to an IP address yet is unique in the world.
While we are on the subject, let's define the fully qualified address of an iSCSI
node. This address can be specified as a URL (uniform resource locator) as
follows: <domain-name>[:<port>[*]]/<iSCSI-name>. Note that items
enclosed in angle brackets (< >) are variables to be specified. Anything in
square brackets ([ ]) is optional. The slash (/) is syntactically required.
[*] The port can be omitted only when the well-known iSCSI port number (currently 3260) is used.
The domain name is either an IPv4 or an IPv6 address. The iSCSI name is an
iqn or eui name. If an IPv4 name is supplied, it is made up of four sets of
decimal numbers from 0 to 255, separated by periods, for example,
129.25.255.1. If an IPv6 name is supplied, it is enclosed in brackets with eight
groups of 2-byte hex numbers, separated by colons, for example:
[abfe:124a:fefe:237a:aeff:ccdd:aacc:bcad]. Examples of fully qualified URLs
are 129.25.255.1:3260[*]/iqn.1998-03.com.xyc.ajax:wonder.jump and
[abfe:124a:fefe:237a:aeff:ccdd:aacc:bcad]:3270/iqn.1998-
03.com.xyc.ajax.wonder:jump.
[*] This port number could have been omitted (it is the default).
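Here is a small illustrative helper (the function and constant names are ours, not part of the specification) that assembles the fully qualified form described above, bracketing IPv6 addresses and omitting the port when it is the well-known 3260:

WELL_KNOWN_ISCSI_PORT = 3260

def iscsi_address(host: str, iscsi_name: str,
                  port: int = WELL_KNOWN_ISCSI_PORT, ipv6: bool = False) -> str:
    # <domain-name>[:<port>]/<iSCSI-name>
    prefix = "[%s]" % host if ipv6 else host
    if port != WELL_KNOWN_ISCSI_PORT:
        prefix += ":%d" % port
    return prefix + "/" + iscsi_name

print(iscsi_address("129.25.255.1", "iqn.1998-03.com.xyc.ajax:wonder.jump"))
print(iscsi_address("abfe:124a:fefe:237a:aeff:ccdd:aacc:bcad",
                    "iqn.1998-03.com.xyc.ajax.wonder:jump",
                    port=3270, ipv6=True))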
It is also possible to have a fully qualified domain name (a host name) instead
of the resolved address. Below is an example of using the FQDN within the
iSCSI address. As with normal mail, one needs more than the address of the
entry point; it is often the case that names are needed so that the mail is
delivered to the right person at the address. That is why a fully qualified iSCSI
address has both the FQDN (optionally resolved into an IPV4 or IPV6 address)
and the iSCSI name. For example, customer-
1.wonder.mother.us.com:330/iqn.2000-01.com.us.mother.wonder:dbhost-1.
The important thing is that at any given IP address there could be more than
one iSCSI name. This is usually the case at iSCSI gateways, but it is legal in
other iSCSI entities such as target devices. Of course, it is more customary to
see multiple IP addresses with the same iSCSI name.
The iqn name form can be made unique in the world. Because of this
unique identification, a session can be established between any two iSCSI
network entities anywhere in the world.
iqn.1998-03.com.xyc.ajax.wonder.jump
iqn.1998-03.com.xyc.ajax:wonder.jump (note the : variant)
eui.acde48234667abcd
eui.f73ab973e5ac89d1
An iqn node name is DNS based but reversed to avoid confusion with the
DNS name itself.
We also discussed how to record the IP address and TCP port (depending
on whether it is an IPv4 or an IPv6 address).
Now that we have explained the concept of a URL and how we might combine
the iSCSI name with the iSCSI IP address and TCP port, we should keep in
mind that the URL syntax, as such, is not directly used in the iSCSI protocol.
iSCSI has no direct reason to combine the name string with the IP address and
TCP port string. However, the two strings are used in separate key=value pairs
in the login request, login response, and text PDUs. In spite of all that, the
URL form is a convenient way of explicitly specifying a fully qualified iSCSI
name and TCP port address for written communication and for administrator
input to management software.
Chapter Summary
Vendors are integrating iSCSI and TOE functions on the same HBA or chip.
With that integration, the iSCSI code can place iSCSI PDUs directly at final
locations in main memory, thereby avoiding host CPU cycles moving the
data from buffer to buffer.
iSCSI exploits TCP/IP checksums, Ethernet CRC trailers, and optional iSCSI
CRC-32 digests: the Ethernet CRC protects each frame on the link, the TCP/IP
checksum provides a weaker end-to-end check, and the optional iSCSI CRC-32c
digests add strong end-to-end protection for the PDU header and data.
Chapter 5. Session Establishment
Login PDUs
iSCSI Sessions
Login Keywords
Discovery Session
Chapter Summary
To the Reader
If you need a more in-depth understanding of iSCSI, read the entire chapter,
including:
A much more detailed description of the login process (also in this chapter)
Detailed descriptions of the login request and response PDUs, along with
the exact formats (in this chapter and in Appendix A)
Details of the keywords and their values (in this chapter and in Appendix
B)
Start the iSCSI TCP/IP connection (which in turn can establish a secure
IPsec connection).
The login is composed of requests and responses. This process of sending and
receiving login requests and responses is handled as if it were a single task,
which means that it is not completed until one side or the other abandons it
and drops the connection or both agree to go to "full-feature" phase. The full-
feature phase is the normal mode, which can actually send or receive SCSI
commands (CDBs) in the iSCSI protocol data units (PDUs). The connection
may not perform any other function until the login is complete and the full-
feature phase is entered.
There will be one or more login requests, each followed by a login response
from the target until the target sends the final response. That response either
accepts or rejects the login.
There are two types of iSCSI sessions:
Normal
Discovery
Until now we have been addressing the normal iSCSI session. The discovery
session will be addressed later.
The initial login must include in a field of the PDU called the DataSegment the
login parameters in text request format. This format consists of a series of
textual keywords followed by values. What follows is an example of login
keywords that must be sent on the first command of every connection's login.
These keywords are InitiatorName and TargetName (see Figure 5-1).
InitiatorName=iqn.1998-03.com.xyc.ajax.wonder:jump
TargetName=eui.acde48234667abcd
Login has several phases (discussed below), one of which deals with security
and authorization (covered later). For our purposes at this time, we will
assume that the security is handled appropriately when required and we will
focus on the other aspects of session establishment. Session establishment is
primarily identification of the remote site along with negotiations to establish
which set of functions (options) will operate between the initiator and the
target, along with the level of support or resources that will be used.
For example, it can be determined if the target can support the shipment of
data with the command PDUs. It can also be determined if the target can
support the initiator shipping data (in support of a SCSI write command, for
instance) without the target first soliciting the initiator for the data. Even if
the target does support the "unsolicited" arrival of data, it probably has some
limitations on the size of its "surprise", that is, how much "unsolicited" data
buffering capability it must reserve for surprises. Even the order of the data's
arrival (in order only or out of order accepted) can be negotiated between the
initiator and the target.
Generally, one side or the other may send a keyword and a single value, a
range of values, or a list of values that it supports, arranged in most preferred
to least preferred order, for the function represented by the keyword. The
opposite side is supposed to pick the first option or value that it supports. This
process continues back and forth until both sides agree that they are through
with the negotiation. At that point either the connection is broken or the full-
feature phase of the session is established. Then the initiator can begin
sending SCSI commands to the target node. If desired, additional connections
can be started and related to this one, so that an iSCSI session can be made
up of multiple connections. MC/S (multiple connections per session) will be
discussed later.
There are actually three modes/phases through which the session
establishment (or additional connection establishment) progresses: the
security negotiation phase, the login operational negotiation phase, and the
full-feature phase.
Only after the establishment of the full-feature phase can PDUs, containing
actual SCSI command CDBs, be sent and responses received. Likewise, it is
only after full-feature mode is established on the leading connection that other
connections can be created and made part of the session.
Login PDUs
The target, using the Login Response PDU, will respond to the Login Request
PDU sent by the initiator. These request and response PDUs can be issued
repeatedly, until both sides are satisfied with the parameters that are
negotiated.
The details in Figure 5-2 will be fully discussed in Appendix A, but some
highlights will be touched on here. Notice that the figure contains a field called
CID. This is the connection ID, which the initiator originates and sets in the
login PDU when it is starting a new session connection (either in a single
connection or in a multiple connection session). The CID is used by the logout
(explicit and implicit) function to identify a connection to terminate. Also
notice that the login PDU contains the initiator task tag (ITT) field. This field
accompanies all command PDUs so that responses can identify the command to
which they are responding. The iSCSI initiator sets the ITT, and the target
responds using the same ITT when the command completes. Technically the
login does not need ITTs since only one login command can be outstanding at
any one time on any specific connection. However, it is required here to enable
common code to handle both logins and normal commands.
All connections are between what iSCSI calls "portals," which are associated
with the initiator node and the target node. Portals carry an IP address and, in
the case of target portals, a TCP listening "port."
The target session identifying handle (TSIH) is set in the last target response
during the leading login. When a subsequent connection login is started, to be
part of the same session as a previous connection it may originate from the
same physical portal as that previous connection or from a different physical
connection. The target portal must be in the same portal group (i.e., have the
same TPGT) as the leading connection for the session. When the login for this
subsequent connection is issued, the initiator will replicate to the new login
PDU the ISID and TSIH fields obtained during the leading login of the session.
The version number fields (versionMax and versionMin) in the login PDU are
used to ensure that the initiator and target can agree on the protocol level
they support.
Let's look at moving from one login phase/stage to another in the login
process. (This topic will be revisited in Appendix A, but the information that
follows should give you a flavor of the login process.)
The CSG (current stage) field will specify the phase that the session is
currently in. The NSG (next stage) field will specify what phase is desired
next. When the initiator or target wishes to transit between its current phase
and another phase, the T bit (Transit bit) must be set and the desired phase
set in the NSG field. NSG is only valid when the T bit is set. (See example in
Figure 5-3.)
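As an illustration, the sketch below assembles the byte of the login PDU that carries the T bit, the C bit, and the CSG/NSG fields. The bit positions and stage codes reflect our reading of the login PDU layout, and the function and constant names are ours; Appendix A and the IETF specification remain the authoritative reference.

# Stage codes used in the CSG and NSG fields (per our reading of the draft):
SECURITY_NEGOTIATION = 0
LOGIN_OPERATIONAL_NEGOTIATION = 1
FULL_FEATURE_PHASE = 3

def login_stage_byte(csg: int, nsg: int = 0,
                     transit: bool = False, continue_bit: bool = False) -> int:
    # T bit (0x80), C bit (0x40), CSG in bits 3-2, NSG in bits 1-0.
    # NSG is meaningful only when the T (transit) bit is set.
    value = ((csg & 0x3) << 2) | ((nsg & 0x3) if transit else 0)
    if transit:
        value |= 0x80
    if continue_bit:
        value |= 0x40
    return value

# Initiator in the security phase asking to move to operational negotiation:
print(hex(login_stage_byte(SECURITY_NEGOTIATION,
                           LOGIN_OPERATIONAL_NEGOTIATION, transit=True)))  # 0x81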
The target sends the Login Response PDU whenever it wants to accept, reject,
or continue receiving the initiator's login requests. This may be a response to
the initiator's initial Login Request PDU or a subsequent continuing Login
Request PDU. (See Figure 5-3.)
Notice in Figure 5-4 that the login response contains fields similar to those
found in the login request. These fields are the T bit, the C bit, the CSG and
NSG, the version information (maximum and actual), the DataSegmentLength,
the ISID, the TSIH, and the ITT. It also contains the StatSN (status sequence
number), which is set on the first response and incremented by one each time
a new status is returned on a specific connection.
One of the important things the target specifies in the last response to the
leading login is to pick a TSIH and return it to the initiator. On the first login
request of the leading login, the TSIH will be zero, but on the last response the
target will create a unique value for the session. During the login process both
the initiator and the target will continue to reflect back to each other, in the
login request and login response PDUs, the ISID and TSIH fields. Also, the first
login response PDU should carry, in its DataSegment, the text field that
identifies the target portal group tag (TPGT) to which the initiator is
connected.
The most important item in the Login Response PDU is the return code found
in the status-class and status-detail fields, which represent one of the results
listed in Table 5-1. (Note: Refer to the IETF specification for the current codes
settings.)
Sessions in iSCSI can have single or multiple connections. Let's first examine
the single-connection session.
This "To-From" verification is the way all sessions begin and is part of the
security phase of login. Depending on the iSCSI authentication mode chosen,
there may have to be a further exchange of information, such as a user ID,
and a form of password or a certificate that can be verified by a third party.
The user ID can be either the iSCSI initiator name, already sent, or a true
user ID. It is recommended that installations use the existing iSCSI initiator
name as the user ID if at all possible, since this will reduce the administrative
load and keep the environment a little less confusing.
The iSCSI storage controllers need to keep their authorized user IDs in sync
with the iSCSI initiator names. Therefore, there needs to be a table within the
iSCSI target device that says, for example, "user ID Frank" is authorized to
use the iSCSI node iqn.1998-03.com.xyc.ajax.wonder:jump.
If the installation can avoid user IDs, it probably can reduce its storage
administrator's load. This is because the administrator will not have to deal
with managing the relationship between the ID and the iSCSI initiator names.
Authentication Routines
The authentication routines that iSCSI requires (must implement) and those it
permits (should/may implement) are listed below:
The above protocols are defined by their own IETF standard documentation
RFCs (Requests For Comment). SRP is defined in [IETF RFC2945]; CHAP, in
[RFC1994]; SPKM in [RFC2025]; and Kerberos v5, in [RFC1510].
iSCSI keywords are actually made up of something we call the key=value pair
(even when the value is a list). These characters are encoded in UTF-8
Unicode. The key part of the pair must be represented exactly as shown in the
iSCSI specification. Upper case and lower case are required in the key, as
shown in Appendix B (it is case sensitive), and there are no blank (white
space) characters or nulls. The = sign must immediately follow the key part,
without any blanks. The value(list) string must follow the = and is made up
of letters or numbers. The key=value pair is terminated by a null character
(hex 00). Numeric values can be represented in either decimal or
hexadecimal. Hexadecimal numbers are indicated by a leading 0x. An
example is 0xFA154c2B.
The key=list form of the key=value pair can be made up of several values
separated by commas. Values should not exceed 255 characters, unless
expressly specified by the keyword writeup. This 255-character limitation
applies to the internal value, not the string used to encode the value in the
text field. For example, the hex value expressed externally as 0x2a579b2f will
take up 4 bytes of internal representation, not 10.
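A short Python sketch of that encoding follows (the helper names are ours). It serializes key=value pairs into the null-terminated UTF-8 form that travels in a login or text PDU DataSegment, and parses them back:

def encode_kv_pairs(pairs: dict) -> bytes:
    # Each key=value pair is UTF-8 encoded and terminated by a null (hex 00).
    out = bytearray()
    for key, value in pairs.items():
        out += ("%s=%s" % (key, value)).encode("utf-8") + b"\x00"
    return bytes(out)

def decode_kv_pairs(segment: bytes) -> dict:
    pairs = {}
    for item in segment.split(b"\x00"):
        if item:
            key, _, value = item.decode("utf-8").partition("=")
            pairs[key] = value      # keys are case sensitive, kept as received
    return pairs

segment = encode_kv_pairs({
    "InitiatorName": "iqn.1998-03.com.xyc.ajax.wonder:jump",
    "TargetName": "eui.acde48234667abcd",
})
print(decode_kv_pairs(segment))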
It is also possible for a vendor to use the form X#** where ** is replaced by a
character string registered with IANA (the Internet Assigned Numbers Authority).
For example, X#ActionKeyNumber=25.
When the key=list form is used, it is supposed to have the values in the list
arranged in most preferred order. Say the list is made up of value1, value2,
and value3, with value3 the most preferred and value1 the next most
preferred; the key=list should be shown as key=value3,value1,value2. In
this way, the other side will pick the first value in the list that it wants to
support.
It is also possible for the key=value pair to specify a range. In this case the
values in the list may be the minimum and maximum separated by the tilde
(~). It is expected that the other side will respond with a key=value pair made
up of a number between (or equal to) these minimum and maximum values.
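The selection rules just described can be sketched as follows (the helper names are ours; the full negotiation rules, including special answers such as Reject, are in the specification and Appendix B):

def pick_from_list(offered: str, supported: set) -> str:
    # The offering side lists values most preferred first; the responder
    # picks the first value in the list that it supports.
    for value in offered.split(","):
        if value in supported:
            return value
    return "Reject"

def pick_from_range(offered: str, desired: int) -> int:
    # The offer is "minimum~maximum"; the responder answers with a number
    # between (or equal to) those two values.
    low, high = (int(v) for v in offered.split("~"))
    return min(max(desired, low), high)

print(pick_from_list("value3,value1,value2", {"value1", "value2"}))   # value1
print(pick_from_range("512~16384", 65536))                            # 16384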
In the iSCSI spec, and henceforth in this book, the syntactical notation key=
<something> denotes that "something" is a variable that may be used with the
corresponding key but without its surrounding angle brackets (< >).
With all that in mind, we will look at examples pulled from the iSCSI standard
specification. The specification includes flag bit settings and the like. However,
only the general ideas will be presented here. For more details see Appendix
A; the values themselves are shown in Appendix B. Note that in the
description I-> indicates that the initiator sends what follows and T-> means
that the target sends what follows. The quoted strings are comments that will
not flow on the wire.
Before we get into the examples, let's list some of the key=value items that
are important to the initial login PDU of any connection:
It should be noted that secondary connection logins must have the same
InitiatorName=value and the same TargetName=value as those of the first
connection in an iSCSI session.
The login normally starts in the security negotiation phase. In that case the
initial login PDU will have its CSG field set to the value that specifies security
negotiation phase (SNP) in progress. It should also set the NSG field to a value
that specifies that either the next phase should be login operational
negotiation phase (LONP, if the initiator has additional values to
negotiate/declare) or full-feature phase (FFP). It is expected that the target
will decide whether to move into login operational negotiation phase (or full-
feature phase) or stay in the security negotiation phase.
If the initiator does not want to negotiate security, it can initially set its CSG to
login operational negotiation phase (LONP), and its next stage to either the
same, or full-feature phase (FFP). However, if the target is unwilling to operate
without security, it may just end the session by returning a login reject
response (with an Authentication Error), and drop the connection. (See Login
Response PDU in Appendix A.)
The general rule for phase moves is that the initiator can request a transition
from one phase to another whenever it is ready. However, a target can respond
with a transition only after it is offered one by the initiator.
If the header digest and/or the data digest are negotiated and accepted
(during login operational negotiation phase), every PDU beginning with the
first one sent after the start of the full-feature phase must have the
appropriate header and/or data digests.
The following example attempts to show not only the login processes dealing
with key=<values> but also how the security authentication process works.
In the example below, I-> indicates that the initiator sends; T-> indicates that
the target sends. Also, the target, via SRP, authenticates only the initiator.
I-> Initial Login Request PDU with CSG set to SecurityNegotiation and NSG set to
LoginOperationalNegotiation (with the T bit set). "The PDU also contains the following keys in the
data field (comments are enclosed in quotation marks):"
T-> Login response from target with NSG set to SecurityNegotiation and T bit set. "The PDU also
contains the following key in the data field:"
TargetPortalGroupTag=3
AuthMethod=SRP "target will use SRP for authentication"
I-> Login command (continuing in security phase of login using SRP processes). "And the following
keys:"
SRP_U=<user> "user name is sent, if <user> isn't equivalent to iSCSI initiator name; otherwise,
U= isn't sent"
T-> Login response (continuing the SRP exchange). "With the following keys:"
SRP_GROUP=<G1,G2...>
SRP_s=<s>
I-> Login command (continuing the SRP exchange). "With the following keys:"
SRP_GROUP=<G> SRP_A=<A>
T-> Login response (continuing the SRP exchange). "With the following key:"
SRP_B=<B>
I-> Login command finishes with the SRP security protocols and sets NSG to
LoginOperationalNegotiation with the T bit set:
SRP_M=<M> "if the initiator authentication is successful, the target proceeds with:"
T-> Login response (ending security phase), setting the NSG to LoginOperationalNegotiation and
with the T bit set.
I-> Login command (host indicates its current phase by setting CSG to
LoginOperationalNegotiation but with the T bit not set):
HeaderDigest=CRC-32C,None "initiator wants to use CRC-32c but will accept no header digest"
T-> Login response (continues to exchange negotiations on parameters) with the T bit not set:
T-> Login (target indicates it's willing to go into full-feature phase by setting the NSG to
FullFeaturePhase with the T bit set), sets the TSIH and replies with an empty DataSegment.
At this point the login is complete and the session enters its full-feature phase,
in which commands and data will be sent from the initiator to the target for
execution by the appropriate LU. From this point on, any PDU on this
connection must have CRC-32C digests for the header and data.
You can see from the example that the login phase can be very chatty. Each
side exchanges its key=value pairs and receives the key=value pairs from the
other side. In spite of this verboseness, the base concept is very simple. Since
the sessions are very long lived, the overhead in this process is hardly
noticeable. However, since many of the key=value pairs can be included in one
request or response, and since there are many defaults, often the complete
login is accomplished with one exchange.
We have just described a leading login process, so named because there can
be parallel connections established within the same session between the same
iSCSI initiator endpoint (the SCSI initiator port) and the same iSCSI target
endpoint (the SCSI target port). Now we will describe how these nonleading
logins are processed in order to establish a secondary connection in a session.
First, the initiator finds an appropriate TCP/IP address and port on the target,
which is associated with the connection used to establish the "leading login"
(within the same target portal group). (See SendTargets in the Discovery
Session section to come.) Once the appropriate TCP/IP address and port are
determined, the initiator starts a new TCP/IP connection. The new connection
may originate either from a different physical network connection on the
initiator or from the same physical connection. In this way there can be
parallel connections from the host to the storage controller. If all the
connections in this set use different physical network connections, the total
bandwidth of a session will equal the sum of the individual connections.
A new connection within an existing session is started just like the original
connection, that is, via a socket call, which in turn causes the establishment
of the IPsec coverage, and so forth. The initiator then sends a login PDU to the
target. The PDU will contain a new CID (connection ID), created by the
initiator, so that the different connections within the session can be identified.
The initiator will insert the current session-wide CmdSN into the login PDU.
The session's ISID and TSIH will also be included. The initiator name, ISID,
and TSIH will indicate to the specified target the session to which this
connection belongs. All the other PDU fields and processing required as parts
of the "leading connection" authentication are repeated for the secondary
connection. The initiator and the target must still go through authentication
via the exchange of login key=value pairs for the secondary connection, even
if that was already done on the leading connection.
Any values that were negotiated on the leading connection will apply unless
reset by key negotiations on the secondary connection. Some key=value pairs
can be set only during the leading login; some, only during the full feature
phase; and others, in all phases.
All the keywords associated with the security phase are legal only during the
login security phase.
Other than the key=value pairs associated with the security phase, the login
on a secondary connection can exchange only those key=value pairs that are
permitted in all phases. The entire list of valid key=value pairs can be found in
Appendix B.
Several vendors are making iSCSI HBAs that will be able to support the
spanning of secondary session connections across multiple HBAs. Others will
mount multiple physical network connections on the same HBA and will support
secondary connections across those physical network connections. Still others
will support secondary connections both across the physical connections of a
single multiported HBA and across multiples of those HBAs. And for each
physical connection there can be multiple logical connections.
At this point it is appropriate to repeat the name given to the connection point
(the IP address and the TCP listening port on the target). iSCSI calls this a
portal, and it can be logical or physical. A collection of portals that are used
by an iSCSI initiator or an iSCSI target is called a portal group. (Figure 5-5
shows these portals and their addresses along with iSCSI initiator/target
names.)
Discovery Session
In this section we will cover the iSCSI discovery session and the information
that can be obtained from it. It should be understood that several different
techniques can be deployed as part of discovery and the discovery session is
only one of them. I cover it extensively here because it is a basic part of the
iSCSI protocol. The other types of discovery are only companion processes that
can be optionally implemented within iSCSI devices. A more extensive review
of the alternatives is offered in Chapter 12.
The discovery session is established like any other iSCSI session; however, the
initiator must send the key=value pair of SessionType=Discovery, which
must be in the initial login PDU of the session.
The discovery session may or may not be covered by IPsec and/or session
authentication. That decision depends on the installation. Like any iSCSI
session, it requires implementers (vendors) to support IPsec and session
authentication in their products. However, it is an installation's decision
whether to use this form of security. Some installations may decide that it is
not important to secure discovery sessions, which only provide the name and
address of iSCSI target devices. The iSCSI specification is silent on this issue.
The installation should bear in mind that security is always a set of fences, and
it is sometimes useful to set a fence that hides even information about
locations from potential intruders.
The following is the syntax of the response strings in the DataSegment of the
Text Response PDU for the SendTargets text request.
TargetName=<iSCSI-target-name>
TargetAddress=<ip-address>[:<port>],<portal-group-number>
[<additional-number-of-TargetAddress-text-responses>]
[<additional-TargetName-text-response>]
[<additional-number-of-TargetAddress-text-responses>]
(Note: The iSCSI default TCP/IP port is currently 3260; if it is used, the port
number field can be eliminated.)
If an all parameter value is used in the text request, the response will include
the TargetName response string for each TargetName known to the discovery
target. Following each TargetName will be a list of all IP addresses (and ports)
in the form of TargetAddress response strings, which can be used to address
that target.
When the SendTargets text request is issued on a normal session, with either
the null parameter value or the iSCSI target name value, only the information
about the implied/specified target is returned. The implied target for the null
request is the current target (the one sustaining the normal session). Only the
implied/specified target's name, target addresses, and TPGTs are returned in
the DataSegment of the Text Response PDU.
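A hypothetical sketch of how an initiator might parse such a response follows (the target name and addresses are invented for illustration); it assumes the DataSegment is the usual null-separated list of TargetName= entries, each followed by its TargetAddress= entries.

def parse_sendtargets(data_segment: bytes) -> dict[str, list[str]]:
    """Map each TargetName to its list of TargetAddress strings."""
    targets: dict[str, list[str]] = {}
    current = None
    for entry in data_segment.split(b"\x00"):
        if not entry:
            continue
        key, _, value = entry.decode("ascii").partition("=")
        if key == "TargetName":
            current = value
            targets[current] = []
        elif key == "TargetAddress" and current is not None:
            targets[current].append(value)   # "<ip>[:<port>],<portal-group-tag>"
    return targets

# Invented example data, for illustration only.
sample = (b"TargetName=iqn.2002-01.com.example:disk1\x00"
          b"TargetAddress=10.0.0.5:3260,1\x00"
          b"TargetAddress=10.0.0.6:3260,1\x00")
print(parse_sendtargets(sample))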
The part of the discovery process just described is directly supported by iSCSI
and an iSCSI session (the discovery session). However, iSCSI has defined not
only this capability within its own protocol but also how other IP protocols can
be used in a more comprehensive discovery process. More information on the
discovery process is provided in Chapter 12.
Chapter Summary
It was shown how to use the Login Request and Login Response PDUs to
begin the session and related connections.
A target responds with a Login Response PDU and its own key=value
pairs.
Examples were given to explain the give and take that occurs during
parameter negotiation.
The concepts of portal, portal group, and target portal group tag (TPGT)
were defined.
Chapter 6. Text Commands and Keyword Processing
To the Reader
Text Requests and Responses
Text Keywords and Negotiation
Chapter Summary
To the Reader
The iSCSI protocol includes a method to send non-SCSI commands from the
initiator to the target. It also has a technique for negotiating values between
the two sides. This is the same process and negotiating routine that is part of
the login process; even the syntax is the same. (See Chapter 5, the section
Keywords and the Login Process.)
Text Requests and Responses
The Text Request PDUs and the Text Response PDUs permit the iSCSI initiator
and target to exchange information that is unavailable to the SCSI layer. An
example of this is the SendTargets command, which the initiator sends to the
target via the Text Request PDU. The target responds to it via the Text
Response PDU. The text request and response PDUs permit new standard
extensions to iSCSI, and they permit vendors to provide additional value-
added functions.
The text field is in the DataSegment of the text request or response PDU. It
must be in the key=value format as detailed below (similar to the login text
field). The initiator initiates all requests. The target can respond only with a
Text Response PDU. The exchange process can continue across several text
request and response PDUs in a sequence.
PDU Fields
The text request and response PDUs have four fields that control the semantics
of the exchange process (see Chapter 11 for PDU layouts): the F (final) bit,
the C (continue) bit, the initiator task tag (ITT), and the target transfer
tag (TTT).
The initiator always assigns an ITT to each new task (SCSI or iSCSI) that it
starts. Each new Text Request PDU sequence will have an ITT assigned to it
that is unique across the session and which the target will return in its
responses. In this way requests and responses are clearly coordinated.
Text requests and responses can have more key=value pairs than fit within a
single PDU. This means that the pairs may continue on a subsequent PDU, in
which case the request or response PDU will have its F bit set to zero to
indicate that it is not the final PDU. When the F bit is set, the text request or
response is signaling that it has reached the end of its text entry and/or that
the negotiation is complete.
If it happens that a key=value pair spans a PDU entry, the sender must set the
C bit in its text request or response. The receiver of such a PDU must only
respond with empty text PDUs until it receives one with the C bit cleared.
Things may then continue as normal.
Whenever the target expects its text response to solicit additional related
text requests, it will set the TTT to some value that is useful to the target.
Upon receiving a TTT, if the initiator has any additional text to send, it
will return that same TTT to the target via its Text Request PDU.
The initiator will determine when the text communication is at an end. It does
so either by sending no more Text Request PDUs following a text response with
the F bit set, or by sending a new text request with the TTT set to its reserved
value of hex FFFFFFFF. Likewise, when the target believes the appropriate
response is at an end, it should set the F bit to 1 and send a TTT of hex
FFFFFFFF on its final response. The target is more limited in its ability to end a
communication, however, since it can set the F bit to 1 only if the initiator has
given its permission by setting its own F bit to 1 in the previous Text Request
PDU.
The target will assume that any text request received with a TTT of hex
FFFFFFFF is the beginning of a new request series and will reset all its internal
settings (if any) for the ITT specified. Since there can be only one text request
outstanding on a connection at any time, the target can also clear any internal
settings it has regarding any ITT used for a previous text request on that same
connection.
In summary, the iSCSI initiator can send commands to the iSCSI target using
the Text Request PDU. The target can respond to the initiator with some value
in the Text Response PDU. The flow of messages is controlled by use of the F
bit, the C bit, the ITT, and the TTT. The text in the request and response PDUs
is in key=value format, which is further explained in the next section.
Text Keywords and Negotiation
The format for the text field in the login and in the text request and response
PDUs is the key=value pair, which can be either declared or negotiated. The
format of a key=value declaration (where -> means that the following string is
sent) is
Declarer-> <key>=<valuex>
The format of a key=value negotiation is
Proposer-> <key>=<valuex>
Acceptor-> <key>=<valuey>|NotUnderstood|Irrelevant|Reject
The proposer, or declarer, can be either the initiator or the target, and the
acceptor can be either the target or the initiator. The target is not limited
to responding to key=value pairs proposed by the initiator; it may also
propose key=value pairs of its own.
Each key=value pair has a key name, with fields and character sets defined in
Appendix D, and a value, which can be one or more of the following: a text
string, a numerical value, a Boolean (Yes or No), or a list of values.
They may have many instances of various keys included in the same text field
of a Login Request, Login Response, Text Request, or Text Response PDU. Each
instance must of course be unique and separated from the next by at least one
null character (hex 00).
Their key names must not exceed 63 bytes.
They can span text request or response PDU boundaries (i.e., a pair can start
in one PDU and continue in the next).
If they are split between PDUs, the sending side must set the C bit, and the
receiving side must not respond with any text value until it receives a PDU
with the C bit cleared. (During that time, empty text PDUs are acceptable
responses to the proposer.)
The data length of a text field in request or response PDUs must not exceed
MaxRecvDataSegmentLength, a per-connection and per-direction declared
parameter. See Appendix B for details on this value. Text operations are
usually meant for parameter setting/negotiation, but they can also be used to
perform some long-lasting operations. Operations that are likely to take a
long time to obtain a response should be placed in their own text request.
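As a rough illustration of these rules, the following Python sketch (not from the book) serializes key=value pairs into a DataSegment, each pair terminated by a null byte, and splits the result into chunks no larger than the peer's MaxRecvDataSegmentLength, marking every chunk but the last as needing the C bit. Padding and the full character-set rules of Appendix D are ignored.

def build_data_segment(pairs: dict[str, str]) -> bytes:
    """Serialize key=value pairs, each terminated by a null byte."""
    out = bytearray()
    for key, value in pairs.items():
        if len(key) > 63:
            raise ValueError("key names must not exceed 63 bytes")
        out += f"{key}={value}".encode("ascii") + b"\x00"
    return bytes(out)

def split_for_pdus(segment: bytes, max_recv_data_segment_length: int):
    """Yield (chunk, c_bit) tuples no larger than the peer's declared limit."""
    for offset in range(0, len(segment), max_recv_data_segment_length):
        chunk = segment[offset:offset + max_recv_data_segment_length]
        c_bit = offset + len(chunk) < len(segment)   # more text follows
        yield chunk, c_bit

segment = build_data_segment({"HeaderDigest": "CRC-32C,None",
                              "DataDigest": "CRC-32C,None"})
for chunk, c_bit in split_for_pdus(segment, max_recv_data_segment_length=32):
    print(len(chunk), "bytes, C bit =", int(c_bit))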
1. All negotiations start out stateless; that is, the results are based only
on newly exchanged values. Each side, however, keeps state during the
negotiation itself.
In literal list negotiation, the proposer sends, for each key, a list of
options (literal constants, which may include None) in its order of
preference. The accepting party answers with the first value from the list
that it supports and is allowed to use for the specific proposer.
The constant None is used to indicate a missing function. However, it is a
valid selection only if it is explicitly offered.
For numerical values, the accepting party responds with the required key and
the value it selects, based on the result function specific to that key; the
selected value becomes the negotiation result.
All Boolean keywords have a result function, the value of which (specified in
Appendix B) is either AND or OR.
For Boolean negotiations (keys taking the value Yes or No) the accepting
party responds with the required key and the chosen value, or responds with
nothing if the result can be determined by the rules of that keyword. The last
value transmitted becomes the negotiation result.
The rules for selecting the value to respond with are expressed as Boolean
result functions (AND/OR) of the value received and the value that the
responding party would select in the absence of knowledge of the received
value. (See rule 12 and Appendix B for the keywords' appropriate Boolean
result functions.)
Based on rule 11, the two cases in which responses are optional are
When the Boolean function is AND and the value received is No. (This
makes the automatic outcome of the negotiation No and no response is
required.)
When the Boolean function is OR and the value received is Yes. (This
makes the automatic outcome of the negotiation Yes and no response is
required.)
Responses are required in all other Boolean cases, and the value chosen
and sent by the acceptor becomes the outcome of the negotiation.
For list value negotiation, the proposer arranges the values in the order it
prefers, and the acceptor chooses the first value in the list that it
supports. The value chosen by the acceptor becomes the result of the
negotiation. (A small sketch of this selection, and of the Boolean result
functions, appears after this list of rules.)
If a specific key is not relevant to the current negotiation, the acceptor may
answer with the constant Irrelevant for all types of negotiation.
The acceptor, without affecting basic function, may ignore any key not
understood. However, it must send back <key>=NotUnderstood.
The initiator signals its intention to end the negotiation by setting the F bit
(final flag) to 1.
When the initiator sends a text request that has the final flag set to 1:
If the target has only one response, it should set its final flag to 1.
If the target has more than one response, it should set the F bit to 0 in
each response except the last, and set the F bit to 1 on its last response.
Text request sequences are independent of each other. Thus, the F bit
settings in one text request-response pair have no bearing on the F bit
settings in the next pair.
A text request can be answered by the target with an F bit setting of 0, in
which case the initiator may continue the exchange with further Text Request
PDUs. Or it can be answered by the target with the F bit set to 1, indicating
that the target's response sequence is complete.
Whenever the target responds with the F bit set to 0, it must choose and
set the TTT to a value other than the default hex FFFFFFFF.
If a negotiation error is detected by the target, then the target must send a
Reject PDU with a reason of protocol error (see Appendix A). During login,
however, the Login Response PDU should reflect the error and the connection
should then be terminated.
If the error is detected by the initiator, then the initiator must reset the
negotiation in FFP, but during login it must terminate the connection.
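The following Python sketch (not from the book) illustrates the list and Boolean rules above from the acceptor's side. The supported values and the AND/OR assignments in the tables are illustrative only; the authoritative lists are in Appendix B.

SUPPORTED_LIST_VALUES = {"HeaderDigest": {"CRC-32C", "None"},
                         "DataDigest": {"None"}}

BOOLEAN_RESULT_FUNCTION = {"ImmediateData": "AND",   # illustrative assignments
                           "InitialR2T": "OR"}

def negotiate_list(key: str, offered: str) -> str:
    """Pick the first offered value the acceptor supports (list negotiation)."""
    for value in offered.split(","):
        if value in SUPPORTED_LIST_VALUES.get(key, set()):
            return value
    return "Reject"

def negotiate_boolean(key: str, offered: str, own_choice: str):
    """Apply the key's AND/OR result function; return None when no response
    is required because the offered value already determines the outcome."""
    func = BOOLEAN_RESULT_FUNCTION[key]
    if func == "AND" and offered == "No":
        return None                      # outcome is No, response optional
    if func == "OR" and offered == "Yes":
        return None                      # outcome is Yes, response optional
    if func == "AND":
        return "Yes" if offered == "Yes" and own_choice == "Yes" else "No"
    return "Yes" if offered == "Yes" or own_choice == "Yes" else "No"

print(negotiate_list("HeaderDigest", "CRC-32C,None"))   # -> CRC-32C
print(negotiate_boolean("InitialR2T", "No", "Yes"))     # -> Yes (OR function)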
In this chapter
We explained the syntax of key=value pairs, which are the basic element
of exchange.
We also covered the negotiation rules for key=value pairs and how the F bit,
the C bit, the ITT, and the TTT control the flow of Text Request and Text
Response PDUs.
Chapter 7. Session Management
To the Reader
Initiator Session ID
Connection Establishment
Sequencing
Chapter Summary
To the Reader
This chapter details the management of iSCSI sessions. Once again, skip
ahead to the Chapter Summary if you don't want an in-depth treatment of this
subject.
The iSCSI session begins with the login process described in Chapter 5 for the
main (first) connection. The initiator will set a unique connection ID (CID) in
the initial login PDU for the connection and set a new CID for each of the
connections that follow. The CID is a "handle" that the target and initiator can
use to queue various connection-related items. It will be presented not only
with each connection login, but also with each logout.
Initiator Session ID
In addition to the CID, the initial login sends the initiator session ID (ISID), a
48-bit field made up of the vendor ID and a unique qualifier that the vendor
assigns. Figure 7-1 is a general depiction of the ISID field. Figure 7-2 shows
the three ISID layouts that have been defined. The vendor type field is located
in byte 0, bits 0 and 1, of the ISID. Notice that the layout of the 6-byte ISID
depends on the setting of the vendor type field, as shown in Figure 7-2.
Figure 7-1. ISID field in the login request and login response
PDUs.
The vendor type field values define the following ISID layouts:
0 - 22 bits follow, i.e., the lower 22 bits of the IEEE OUI[*] (Organization
Unique Identifier), a.k.a. company ID. The remaining three bytes are a unique
qualifier.
1 - 6 bits of zeros follow, then 3 bytes of an IANA enterprise number (EN),
and then a 2-byte unique qualifier.
2 - 6 bits of zeros follow, then a random 3-byte value and a 2-byte qualifier.
3 - Reserved.
[*] The OUI is actually 24 bits, but only the lower 22 are of use here.
The important thing to understand is that the full ISID is made up of the
vendor ID and any additional qualifier that will create a unique identifier
within any host initiator system. Therefore, when the full ISID is concatenated
with the iSCSI initiator node name, it creates a unique name in the world,
which represents the iSCSI (SCSI) initiator port.
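A small sketch (not from the book) of packing a type 0 (OUI-based) ISID follows, assuming the layout just described: a 2-bit type field in the high-order bits of byte 0, the lower 22 bits of the IEEE OUI, and a 3-byte qualifier. The OUI and qualifier values are invented for illustration.

def make_oui_isid(oui: int, qualifier: int) -> bytes:
    """Build a 6-byte type 0 ISID from a 24-bit OUI and a 24-bit qualifier."""
    value = (0 << 46) | ((oui & 0x3FFFFF) << 24) | (qualifier & 0xFFFFFF)
    return value.to_bytes(6, "big")

# Hypothetical OUI and qualifier values, for illustration only.
isid = make_oui_isid(oui=0x00A0B8, qualifier=0x000001)
print(isid.hex())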
Unlike Fibre Channel, in which each FC HBA relates one to one with a SCSI
port, the iSCSI (SCSI) initiator port may actually span multiple HBAs.
Therefore, the initiator side of an iSCSI session can use more than one
initiator HBA. In fact, the iSCSI initiator may use several HBAs or portals
within various HBAs to support multiple connections in a single session.
An ISID value assigned by the installation's configuration or management
software (rather than the vendor's preset default):
Can be used instead of the preset default value, so that the preset default
value does not need to be the ongoing value of the ISID.
Can become the ongoing ISID, but must be made persistent across the
following:
Hot-swap with another HBA (in theory even from another vendor but
at the least from the same vendor)
Reboot
Power down, swap-out old HBA, swap-in new HBA, and reboot
In order to take advantage of the ISID identity, the iSCSI initiator establishes
its first connection with the target by sending a Login Request PDU. In this
PDU the initiator places the appropriate values in the CID and the ISID fields,
as well as in the text field. The text field will contain, among other things, the
name of the iSCSI initiator node and the name of the target node. The iSCSI
target responds appropriately and, on the last login response of the leading
login of the session, includes the unique (nonzero) target session identifier
handle (TSIH). The TSIH must be unique within the target node for each
session with the same named initiator. The implementation may assign to the
TSIH any value other than zero.
After the TSIH is returned to the initiator, the session is considered established
and in the full-feature phase (FFP). The initiator must then use that TSIH
when starting a subsequent connection with the target for the same session.
A session identifier (SSID) is unique for every session an initiator has with a
given target. (See Figure 7-3.) It is composed of the ISID and the TPGT.
When a target node sees that an initiator is issuing a login for an existing
SSID (with a nonzero TSIH), it assumes that the initiator is attempting to start
a new connection (or reestablish an old one, as discussed below). Once the
multiple connections are started, the initiator may send commands on any free
connection. The data and status responses related to a command must flow on
the same connection on which the command travels. This is known as
connection allegiance.
Note that, when the Login Request PDU is received by an iSCSI target, it first
checks to see if the TSIH is zero. If so, the target checks to see if the initiator
node name and ISID are currently active on that target. If they are, the login
forces a logout of the existing session and the establishment of a new session
in its place. In that case any tasks active on the old session are internally
terminated at the target and the new session starts fresh. (See Chapter 11 for
more details on handling session restarts.) However, if the session was
terminated some time before and all its tasks are now clear, a new session will
just be started. If the target has retained any "persistent reserves" from that
previous session, they will be associated with the new session.
If the initiator's Login Request PDU contains a nonzero TSIH, the target
assumes that it is a request to establish an additional connection to an existing
session. If that session is active and the new login request has a unique CID,
the request will be granted and the new connection will be started. If the CID
is the same as an existing CID, the current connection with the same CID will
be terminated. Any existing tasks assigned to that connection will be put on
suspension (if ErrorRecoveryLevel=2) until the initiator can get them
reassigned to this new connection or to another connection. It does this
through the Task Management Request PDU. (See Chapter 11.) If the error
recovery level is less than 2, the tasks associated with the terminated
connection are just terminated.
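The target-side decision just described can be summarized in a short sketch (not from the book), with the session and connection bookkeeping reduced to dictionaries keyed by initiator name, ISID, TSIH, and CID.

sessions = {}        # (initiator_name, isid) -> {"tsih": ..., "cids": set()}
next_tsih = 1        # any nonzero value is allowed; a simple counter is used here

def handle_login(initiator_name: str, isid: bytes, tsih: int, cid: int) -> int:
    """Return the TSIH the target would place in its final Login Response."""
    global next_tsih
    key = (initiator_name, isid)
    if tsih == 0:
        if key in sessions:
            # Session restart: the old session is implicitly logged out and
            # its active tasks are terminated internally.
            del sessions[key]
        sessions[key] = {"tsih": next_tsih, "cids": {cid}}
        next_tsih += 1
        return sessions[key]["tsih"]
    # Nonzero TSIH: a request to add a connection to an existing session.
    session = sessions[key]            # a real target would reject an unknown session
    if cid in session["cids"]:
        # Same CID: the current connection is terminated; at
        # ErrorRecoveryLevel=2 its tasks are held for reassignment.
        pass
    session["cids"].add(cid)
    return session["tsih"]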
The data may be write data, which is sent from the initiator either as
immediate data within the SCSI command PDU itself or within Data-Out PDUs
(unsolicited or solicited by R2T).
Or the data may be read data, in which case it is sent from the target to the
initiator within a Data-In PDU. In all cases the data is sent on the same
connection on which the command was sent. This connection is also used for
R2T PDUs, which the target uses to request that the initiator transmit the
write data.
The connection will have sequence numbers that it uses to keep track of the
flow within the connection. The primary connection-level sequence number is
the status sequence number (StatSN), a counter maintained per connection
that tallies the count of status response PDUs sent from the target to the
initiator. There are also sequence numbers dealing with the Data-In, Data-Out,
and R2T PDUs that flow between the initiators and the targets. The Data-
In/Out PDU and R2T PDU sequence numbers are all associated with
commands, so they start counting at zero for each command.
There are also sequence numbers that apply to the commands themselves.
Commands are sequenced across all connections within the session. Therefore,
the command sequence number (CmdSN) is a counter that applies across the
entire session (and any multiple connections therein as well). This sequencing
of commands enables the target to deliver them in order to the SCSI level of
processing in the target, regardless of the connection on which the command
traveled. CmdSN is the only counter maintained across the different
connections within a session.
It should be noted that the protocol maintains expected values for all these
counters. For example, the CmdSN has a corresponding expected command
sequence number (ExpCmdSN) counter. The various expected values are used
by the responder to acknowledge receipt of the various sequenced items, up to
but not including the next expected sequence number. Through the use of the
expected values, it is not required that a separate acknowledgment (ACK) of
the safe arrival of an individual PDU be sent. The responder may instead
"piggy-back" all previous arrival acknowledgments with the next command or
status response message. For example, the initiator can tell the target not only
what command number it is sending but also what it believes should be the
next status sequence number (StatSN) returned by the target. The target then
knows that the initiator has received all status responses up to that point. This
approach is especially useful given that an error seldom occurs that is not
detected and fixed by TCP/IP. Therefore, network bandwidth is not consumed
by acknowledgments that are almost never needed.
The SCSI Response PDU, which is sent by the target to acknowledge the
completion of a command, will not only contain its own StatSN but also will
contain the ExpCmdSN, thus informing the initiator of not only the status of
the specific command but also the safe arrival of all the commands up to the
ExpCmdSN.
The target has two ways to return ending status to the initiator. The normal
method is via the SCSI Response PDU; the other is via the last Data-In PDU (if
any). For example, if a read command completed successfully without any
exception conditions, that fact can be signalled in the ending Data-In PDU
without having to send a separate response PDU. This technique is called
"phase collapse," because the normal SCSI response phase is collapsed into
the PDU that ends the data transfer. In any event, the StatSN field counts
all status responses sent on this connection, whether they are carried on
ending Data-In PDUs or in SCSI Response PDUs (also called status PDUs).
The various PDUs that travel from the initiator to the target will return a value
in a field called ExpStatSN. This field is generally one higher than the last
StatSN value received by the initiator. The target can free up all resources for
commands whose StatSN is less than ExpStatSN.
At this point let's recap the above. The CmdSN is what the initiator uses to tell
the target the current sequence number for a command; the ExpStatSN is
what the initiator uses to tell the target what status response it thinks is next.
The target, via the StatSN, tells the initiator what status response it is on and,
via ExpCmdSN, what command PDU sequence number the target thinks is
next. All this is a means of sending acknowledgments of received commands or
responses without having the overhead and latency of individual
acknowledgments for each PDU. Acknowledgments just become part of other
commands or responses, so there could be a number of commands received
before an acknowledgment is sent. Therefore, by sending the value that the
target thinks should be the next ExpCmdSN whenever it has a response to
issue, the target implicitly acknowledges the receipt of all commands up to
that number and does not need to acknowledge each command.
If there are no pending response messages for a period of time, the responder
should generate a NOP (no-operation) message (either NOP-In, or NOP-Out)
and specify the next expected sequence number.
The CmdSN not only has a corresponding ExpCmdSN, it also has another
related field, the maximum command sequence number (MaxCmdSN). The
combination of ExpCmdSN and MaxCmdSN is included in every PDU sent
from the target to the initiator and can be used to provide a "windowing"
technique for the acceptance of new commands.
When the MaxCmdSN plus 1 is equal to the ExpCmdSN, the command window
is closed[*] and the initiator must not send any more commands until the
window opens, an event signaled by a difference between the MaxCmdSN plus
1 and the ExpCmdSN. This is also a time when the NOP-In message, if no
other PDUs are pending, can be used by the target to signal the increase in the
MaxCmdSN. Whether a NOP-In or some other target-to-initiator PDU
technique is used to inform the initiator that the window is open, it will signal
that resources at the target have been freed up and that the target can begin
accepting additional commands. It should be noted that the window need not
go from fully open to fully closed, but instead can be used to control the
number of concurrent commands that are in flight at any time from a specific
iSCSI (SCSI) initiator port. This is done by keeping the difference between
the current or expected CmdSN and the MaxCmdSN within n.
[*] By its definition, serial-number arithmetic has no operation defined for subtraction, so the
technically correct way to determine if the window is closed is to compare MaxCmdSN +1 to
ExpCmdSN to determine if they are equal (see Appendix A for a description of serial-number
arithmetic).
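A sketch (not from the book) of the window check in serial-number arithmetic, with all comparisons taken modulo 2^32:

MOD = 2 ** 32

def window_closed(max_cmd_sn: int, exp_cmd_sn: int) -> bool:
    """The window is closed when MaxCmdSN + 1 equals ExpCmdSN (modulo 2**32)."""
    return (max_cmd_sn + 1) % MOD == exp_cmd_sn % MOD

def commands_allowed(cmd_sn: int, max_cmd_sn: int) -> int:
    """How many more commands the initiator may send, assuming cmd_sn is the
    next sequence number it would use (an illustrative calculation)."""
    return (max_cmd_sn - cmd_sn + 1) % MOD

print(window_closed(max_cmd_sn=41, exp_cmd_sn=42))    # True: window closed
print(commands_allowed(cmd_sn=42, max_cmd_sn=49))     # 8 commands may be sent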
If the initiator has not received all the data, it may request that the target
resend the missing Data-In PDUs. To request the resend, the initiator must
issue a SNACK[*] (selective negative acknowledgment) PDU to the target.
A SNACK is the only method that the initiator can use to request that the
target resend the missing Data-In or status PDUs. It is intended for
environments operating at ErrorRecoveryLevel>0. At that level, the missing
Data-In PDUs must be resent by the target exactly as they were sent
originally, except that the ExpCmdSN and the MaxCmdSN must be current.
The initiator will not know that Data-In PDUs are missing until the status
response occurs. This is important to understand. In normal operation, TCP/IP
will perform all error recovery and retransmission. It is only when iSCSI adds
a CRC digest that it can tell that it has detected an error that slipped by TCP/IP
error detection. If it turns out that the error was detected in the data part of
the PDU via the data CRC digest (meaning that the header passed the CRC
correctness test), the initiator or target can tell enough about the data PDU to
ask for the data to be retransmitted. As mentioned above, the initiator does
this by issuing the SNACK PDU and requesting the retransmission of the
corrupted data. The target, on the other hand, deals with the CRC data digest
error by issuing a Reject PDU and then sending a recovery R2T. (The details of
the various SNACK modes can be found in Appendix A in the section on SNACK
PDU.)
However, if the digest error is in the header of the PDU, the initiator cannot
trust the information in the header and therefore cannot immediately ask the
target to resend the appropriate Data-In PDU. In fact, it cannot be sure even
that it was a Data-In PDU. Only the arrival of the status response PDU or a
command's last Data-In PDU with phase collapse status permits the initiator to
detect that a Data-In PDU is missing. Likewise, the receipt of any other PDU
from the target carrying a StatSN ahead of the initiator's expected value will
let the initiator know that a Data-In or SCSI Response PDU is missing. As
specified above, it then has the missing PDU retransmitted by making the
request via the SNACK PDU. A
similar process occurs on the target side with Data-Out PDUs; the only
difference is that the target issues the R2T PDU (instead of a SNACK) to
request the missing data, and lets the initiator detect and retransmit missing
commands.
It is important to understand at what point the target can free up the buffers
for an operation. The target needs an indication that the initiator has
successfully received the status response as well as all the data the target
sent. To do this, it checks the values in the next PDU sent on the connection
from the initiator. If the ExpStatSN is what the target thinks it should be, the
target can then free the buffers without a lot of handshaking and round-trip
delays. This acknowledgment is accomplished via any PDU sent from the
initiator to the target during full-feature phase.
If the traffic from the initiator to the target is inactive, the target can issue the
NOP-In PDU (in "ping" mode). In a normal situation the initiator will return the
"ping" (in a NOP-Out PDU) and contain the latest values for the CmdSN and
ExpStatSN. Also in a normal situation, all values will be correct and things will
continue.
Now let's follow a situation that is not exactly normal. In this case, when the
target wants to check on the health and status of an initiator, it may send a
NOP-In PDU. The NOP-In will contain the StatSN value that the target plans on
issuing next. The initiator must then check to see if it is missing a status
response PDU (detected by the fact that its own StatSN count is behind the
value in the StatSN field of the NOP-In PDU).
If the initiator detects a missing response, it issues a SNACK PDU. This SNACK
will request that the target resend the missing status response PDUs. As it
receives these PDUs the initiator may find that other things are missing, such
as Data-In PDUs. The initiator should then issue the SNACK and request the
missing Data-In PDUs. Figure 7-5 depicts this sequencing and recovery after a
CRC-detected header error.
We might think that, if the link is not actively sending commands or data,
there is no harm in tying up the target resources with unacknowledged status
responses until the link becomes active again. However, even if this connection
or session is currently inactive, other sessions may be active and probably
need the resources.
Let's now examine the write operation. Assuming that the target was sent a
write command earlier, it needs to ensure that it has received all the data
required for the command's execution. If not, the target will request from the
initiator any needed data that it has not received. Even if the data was
originally sent via unsolicited Data-Out PDUs, the target can request any
missing data via R2T PDUs. Therefore, if the unsolicited Data-Out PDU was
discarded because of a header digest error, the target can have it resent via
the R2T (called a recovery R2T).
Tables 7-2 through 7-4 are examples of normal read and write operations
along with the settings of the various sequence numbers and the Final bit (F
bit).
Recap
Because there can be header digest errors as well as data digest errors, the
sequence numbering of StatSN, ExpStatSN, DataSN, and R2TSN permits the
initiator and the target to proceed normally without extra acknowledgments
for each data, status, or request exchange. Yet when a digest error does
occur, sequence number management permits recovery from both error types.
Table 7-2. Read Operation Data Sequencing (Note: <<< or >>> Denote Direction of
Travel.)
Initiator Function PDU Type Target Function
Command request (read) SCSI command (read)>>> Prepare data transfer
Command complete
How iSCSI keeps track of the various connections that make up a session
(via the CID)
We also learned
How the Data-In and Data-Out PDUs are ordered with data sequence
numbers (DataSNs).
How the resultant status from a command has a status sequence number
(StatSN) carried in the SCSI response PDU.
How the PDUs that originate from the initiator generally carry a CmdSN or
a DataSN, along with the expected status sequence number (ExpStatSN)
to be received on the next response from the target.
The receiving and sending of data (data being read and written).
The concept of a target soliciting the initiator for data needed for a write
operation and how the request to transfer (R2T) PDU was used for that
purpose.
The use of NOP-In and NOP-Out to ensure that the expected counts
including those used for "window size" management are received and
synchronized between the initiator and the target, especially when link
activity is too low to piggyback the counts on other PDUs.
Finally it was explained that the main purpose of these sequence fields (with
the exception of MaxCmdSN) is to address the loss of a PDU because of a
header digest error or the loss of data because of a data digest error.
Chapter 8. Command and Data Ordering and Flow
To the Reader
Command Ordering
Command Windowing
Data Ordering
Chapter Summary
To the Reader
This chapter will present the details of flow control for both commands and
data. It will take the concepts introduced in Chapter 7 and expand them. As
always, if you desire just an overview, skip to the Chapter Summary.
Command Ordering
Commands such as read and write are sequenced by iSCSI initiators so that
the iSCSI target can present them to the target SCSI layer in the order in
which they were sent. As specified in Chapter 7, to do this each command is
given a unique number, called the command sequence number (CmdSN),
which is placed in the header of the various request PDUs sent from the
initiator to the target, starting with the initial login request of the first
connection within a session. The next sequence number is assigned to the
second nonimmediate request PDU following the final login message from the
initiator. The first request (command) PDU sent in full-feature phase (FFP)
contains the same CmdSN as the login request itself.
The thing to be understood here is that the login process is considered a single
immediate command, regardless of how many login messages are exchanged.
The login on any one connection can only have a sequential exchange of login
parameters with no overlapping of other commands or data in that connection,
before the process ends (when full-feature phase is reached on the
connection). Therefore, there is no need to increment the CmdSN until after
the login is complete, and, because it is considered to be an immediate
command, the CmdSN is not incremented even then. However, after the initial
login, for all nonimmediate commands from that point forward, the CmdSN is
increased by 1, regardless of the connection on which the command is actually
sent. The CmdSN is a global value, maintained throughout the entire session
across all connections.
Ignoring the error case, there is a potential exception to the in-order delivery
of commands, especially on a single-connection session. Let's focus on the
single-connection case, in which the iSCSI initiator HBA is given one command
(perhaps a write) that includes immediate data, and another command so close
behind that the second command is ready to be sent before the write
command is ready. This can happen when the first command is still gathering
the immediate data across the PCI bus, as the following command (perhaps a
read) arrives and is ready to be transferred across the PCI bus to the HBA. If
the commands were numbered before they were sent to the HBA, the second
command could actually be ready to be sent on the link before the first
command completed gathering all the data onto the HBA. If commands were
sent in this out-of-order manner, the physical link would be more completely
utilized; however, it would seem to the iSCSI target that the first command
was lost (as if it had experienced an iSCSI header digest error). Therefore, this
potential exception to the command ordering rules has not been accepted.
To prevent this error-processing trauma, and yet fully utilize the capacity of
the link, it is expected that many HBA vendors will open one or more
additional "logical" connections on the same physical link, so that commands
can be sent out immediately when they are ready without having to wait for
long data-related PDUs.
This same type of issue exists even when multiple physical links are used in
the same session; each link may face a similar problem. Therefore, it is
expected that some vendors will utilize both multiple physical links and
multiple logical connections per physical link within the same iSCSI session. In
this way they not only will be able to increase the bandwidth available to each
session, but also will be able to fully utilize the bandwidth of each physical
link.
Independent of the order in which the iSCSI target receives the command
PDUs (across several connections within a session), it is the job of the iSCSI
target to deliver the commands in CmdSN order to the SCSI target layer. Many
folks believe that SCSI only demands in-order delivery to the LU (logical unit)
from any specific SCSI initiator. However, if the iSCSI and SCSI layering are
kept completely isolated, the iSCSI layer will not understand the LUN (LU
number) received with the SCSI command. Therefore, the simplest thing for
any iSCSI target to do is ensure that the commands turned over to the SCSI
layer are delivered in CmdSN order. This way iSCSI will also be sure that the
LU ordering is maintained, and keep the iSCSI to SCSI processing as simple as
possible.
The CmdSN is set in all request PDUs sent from the initiator to the target,
except for the SNACK Request PDU. This means that the following PDUs sent
from the initiator to the target contain the CmdSN:
Text Request
Login Request
Logout Request
NOP-Out
The ExpCmdSN, in turn, is carried in the following PDUs sent from the target
to the initiator:
Data-In
Asynchronous Message
Text Response
Login Response
Logout Response
Reject
NOP-In
Task Management Function Response
In other words, all PDUs that flow from the target to the initiator will have the
ExpCmdSN value set. Therefore, every target response will be able to
acknowledge commands previously sent by the initiator. The value set in the
ExpCmdSN is equal to the last CmdSN sent to the target SCSI layer plus 1.
This statement reflects the following important concept:
iSCSI requires the iSCSI layer to deliver the commands to the SCSI layer
in order. By implication, if a "hole" exists in the CmdSN sequence, the
commands that have a CmdSN higher than the "expected value of the
hole" cannot be delivered to the SCSI layer until the missing commands
are received by iSCSI.
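The following sketch (not from the book) illustrates this rule: commands arriving with a CmdSN beyond the expected value are held until the hole is filled, and only then passed to the SCSI layer.

class CommandOrderer:
    def __init__(self, exp_cmd_sn: int):
        self.exp_cmd_sn = exp_cmd_sn
        self.pending = {}                 # CmdSN -> command PDU

    def receive(self, cmd_sn: int, command):
        """Queue a command; return the list now deliverable to the SCSI layer."""
        self.pending[cmd_sn] = command
        deliverable = []
        while self.exp_cmd_sn in self.pending:
            deliverable.append(self.pending.pop(self.exp_cmd_sn))
            self.exp_cmd_sn += 1          # wrap at 2**32 ignored for brevity
        return deliverable

orderer = CommandOrderer(exp_cmd_sn=10)
print(orderer.receive(11, "write B"))     # [] -- held, CmdSN 10 is missing
print(orderer.receive(10, "read A"))      # ['read A', 'write B']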
Command Windowing
Even though the iSCSI command window management is not precise, it can
work very well in throttling back buffer consumption. In general, iSCSI storage
controllers have more open sessions/connections than Fibre Channel (FC)
storage controllers have. This is because there will often be not only servers
interconnected to the storage controllers but also desktop and laptop systems.
Therefore, the iSCSI sessions will have a lot of inactive connections. This
might be a problem for Fibre Channel but probably is not a significant issue for
iSCSI. In Fibre Channel the storage controller needs to advertise the "buffer
credits" that can be used by each connected initiator, which often means a lot
of underutilized memory. These credit values are maintained by
acknowledgments that, if employed "at distance," would degrade overall
throughput.
iSCSI does not advertise credits or guarantee buffers. Instead, it uses its own
iSCSI windowing capability to slowly reduce the volume of commands received.
When that is not enough, it exploits the TCP/IP connection windowing
capability to reduce the total flow of bytes (data and commands) between
initiators and targets by applying "back pressure" on the devices between
them. This can be done because TCP/IP is, in general, a store-and-forward
technology and so its windowing capability is used as "the final gate."
iSCSI, unlike Fibre Channel, has to deal with large distances between initiators
and targets. As mentioned previously, techniques for managing buffers that
employ the fine-grained control permitted by FC credits will not be as useful in
at-distance iSCSI networks, because there is a long turnaround associated
with larger distances. This causes significant latency in Fibre Channel's
continuous credit-response approach, and that latency in turn reduces the
total effective line utilization. Therefore, approaches that do not guarantee
buffers and have a large control granularity seem to be the best fit for these
environments, especially when backstopped with the back pressure approach
available to TCP/IP.
What all this means is, when the iSCSI storage controllers overcommit their
buffer storage, they have windowing methods that reduce the command flow when
buffers are short.
Initiator Task Tag
There is another item included with the SCSI command PDU that is of major
importance, especially when it comes to buffer control on the HBA. That item
is the initiator task tag (ITT), a handle used to identify an active task within a
session. When the data arrives from the target, the ITT can be used by an HBA
to place the data directly into the intended host buffers without extra moves or
additional staging buffers (see additional comments on direct memory
placement, to come). This quickly frees up the critical HBA buffers for reuse by
other tasks. In order for this to work, the target must place the ITT of the
command into the corresponding Data-In PDUs, response PDUs, and so forth.
(Refer to the design example, which follows.)
The ITT can be reused when a command is completed unless, of course, the
command is part of a linked set of commands (see Chapter 10, Task
Management, for more information on linked commands), in which case the
initiator must keep the ITT pending for reuse by the next command in the task
chain. Sooner or later, however, at the completion of the last command in the
chain, the ITT can be reused. That means it can be a pointer or a handle that
the initiator and the target can use to track a command and its related data
and status.
One other point about the ITT is that it should be unique within the session,
not just within the connection. This is because, via task management PDUs,
the initiator can move a command and its allegiances from a failing connection
to a healthy connection within the same session, or perhaps even to a
different HBA. (See Chapter 10 for more information on command and
allegiance connection reassignment.)
Imagine that a host device driver creates and enqueues an entry for a
"command queue" that contains a partially completed iSCSI read request PDU.
Also suppose that the entry has, among other things, pointers to the host
buffers in which the data, when read from the target, can be placed. The
address of this command queue entry (CQE) can be given to the HBA. The
address of the HBA's copy can become the value of the ITT. The HBA might
place a copy of the command queue entry in its own memory by DMAing
(Direct Memory Accessing) it from the host memory. This copy is made up of
the partially completed PDU, the addresses of the input buffers, and so forth.
Thus, as soon as the HBA fills out the rest of the read request PDU, it can send
the command on its way to the target. Then, when the target responds with
Data-In PDUs, the HBA can extract the ITT from the PDUs' headers and
perform a lookup based on the ITT pointer. This will enable the HBA to find its
copy of the CQE and use the host buffer address found therein to place the
incoming data directly in the appropriate host buffers. In this way the host will
not need to move the data an additional time.
Figure 8-1 shows the read data sent from the target storage controller in a
Data-In PDU that is broken across two TCP segments. The header contains the
ITT value of 8192, which the initiator HBA uses as a displacement into its RAM
memory where it finds its copy of the CQE with the locations for the data in
the actual host memory. Then the data from the PDU can be placed directly
into those locations. Note that in this figure the second TCP/IP segment arrives
before the segment with the PDU header, so it cannot be placed until the
segment with the header arrives. This is because the header has the ITT
pointer. Also, since each segment identifies the TCP byte stream displacement,
it is a straightforward calculation to determine where the various data pieces
are placed in the host memory.
This design example shows that, by carefully picking the values placed in the
ITT, an engineer can create an HBA that does not require additional host CPU
cycles to move or place storage data directly in the final host I/O buffers.
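A sketch of the design example (not from the book, and with invented values): the ITT carried in each Data-In PDU header is used as a handle to find the command queue entry, whose host buffer receives the payload at the PDU's buffer offset, with no pass through staging buffers.

cqe_table = {}             # ITT -> host buffer (standing in for a full CQE)

def start_read(itt: int, expected_length: int):
    """The device driver enqueues a read; the HBA keeps a CQE keyed by the ITT."""
    cqe_table[itt] = bytearray(expected_length)

def place_data_in(itt: int, buffer_offset: int, payload: bytes):
    """On receipt of a Data-In PDU, place the payload straight into the host
    buffer located through the ITT (no staging copy)."""
    host_buffer = cqe_table[itt]
    host_buffer[buffer_offset:buffer_offset + len(payload)] = payload

start_read(itt=8192, expected_length=16)
place_data_in(itt=8192, buffer_offset=8, payload=b"89ABCDEF")  # arrives first
place_data_in(itt=8192, buffer_offset=0, payload=b"01234567")  # arrives later
print(bytes(cqe_table[8192]))          # -> b"0123456789ABCDEF"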
One other consideration regarding this use of the ITT is linked commands: the
same ITT value needs to be used by each of the linked commands as they are
sent. In those cases, the ITT value will need to be around for a long time. For
that reason (and in order to support MC/S across HBAs), the vendor may want
the ITT to be created by the HBA device driver; then the HBA can use it as an
indirect pointer to the HBA's version of the CQE. In that case the ITT value and
pointers to the CQEs located on the HBA and in main memory can be an entry
on an HBA "hash" lookup table (based on the ITT value).
If HBA RAM space is needed, it is even possible to free up the CQE copy on the
HBA after the command is sent. The CQE can be fetched back when the
pointers to main memory data locations are needed. Of course, the main
memory version of the CQE will contain the ITT and other values needed to
relate the linked commands to each other as they are passed down from the
SCSI layer.
What all this means is that with appropriate use of the ITT there are a number
of ways to implement direct data placement (DDP) in the initiator's main
memory.
Data Ordering
The data sent to the target can be located in the SCSI command PDU that
contains the write request. It may also be sent to the target in an unsolicited
Data-Out PDU. Unsolicited PDUs contain the data to be written to the target
device and are sent without solicitation from the target. It is also possible for
the target to request the data needed for a write command from the initiator.
To do this it sends the request to transfer (R2T) PDUs to the initiator when it is
ready to accept the data.
To understand why some of these values are connection specific and others are
session wide, it may be helpful to understand that the receiving buffers on an
HBA may be completely independent from the main memory of the host or the
storage controller. The session-wide values address the limitations of main
memory, while the connection-specific values address the limitations of
memory on the HBA.
The immediate data is contained in the same PDU that holds the command.
The maximum amount of data that can be included in a SCSI command PDU is
limited by MaxRecvDataSegmentLength. This value also sets the limit of data in
an unsolicited Data-Out PDU.
Consider an example in which the Data-Out PDU data segments are limited to 8K
(by MaxRecvDataSegmentLength), MaxBurstLength and FirstBurstLength are both
negotiated to 64K, and the SCSI (write) command PDU specifies a total data
length of 137K. The initiator needs to send the data in two sequences of eight
8K Data-Out PDUs each (with the last PDU in each sequence having the F bit
set), followed by a sequence of one 8K Data-Out PDU and one 1K Data-Out PDU
that has the F bit set. Figure 8-2 depicts this example.
To recap:
MaxBurstLength and FirstBurstLength are both negotiated to 64K.
The SCSI (write) command PDU specifies a total data length of 137K.
The last Data-Out PDU in a sequence will have its header's F bit set to 1.
If the F bit is set in the command PDU, it means that no unsolicited data
follows this PDU and only the data with this PDU (immediate data) is being
sent without an explicit R2T. Therefore, the R2T can begin with the offset,
which follows the immediate data, if any.
If the F bit is not set, then the target knows that the initiator must send
FirstBurstLength of data, so the target can immediately begin sending
R2Ts starting with the offset, which follows the FirstBurstLength.
In either case, the target can send the R2Ts as soon as it decodes the
command PDU.
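The arithmetic of the example can be reproduced with a short sketch (not from the book): split the write into bursts no larger than MaxBurstLength, split each burst into Data-Out PDUs no larger than the PDU data limit, and set the F bit on the last PDU of each burst.

K = 1024

def plan_data_out(total: int, max_burst: int, pdu_limit: int):
    """Return a list of bursts, each a list of (PDU size, F bit) tuples."""
    plan = []
    offset = 0
    while offset < total:
        burst = min(max_burst, total - offset)
        pdus = []
        sent = 0
        while sent < burst:
            size = min(pdu_limit, burst - sent)
            sent += size
            pdus.append((size, sent == burst))   # F bit on the last PDU of the burst
        plan.append(pdus)
        offset += burst
    return plan

for burst in plan_data_out(total=137 * K, max_burst=64 * K, pdu_limit=8 * K):
    print([(size // K, f_bit) for size, f_bit in burst])
# -> two bursts of eight 8K PDUs, then one burst of an 8K PDU and a 1K PDU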
Since the value of Yes for DataPDUInOrder invokes what is probably assumed
to be the normal way to exchange data, we might wonder why there is even a
value of No. The assumption that data arrives in order, however, is often
incorrect, because extracting from or sending to a disk can be done whenever
the disk read/write head passes over the appropriate spot. Years ago it was
common for the data sectors to be interleaved on the disk media, and a No
value would be useful in that type of environment. However, that does not
apply today. Instead, a virtualized hard disk can have its data sent to or
received from several different places, and those places probably are not
synchronized. Sometimes one part of the data is in a cache and another part is
on the hard disk. In these cases, data may be available to be transferred to or
from the media out of order. That explains the need for and value of the
key=value pair DataPDUInOrder=No. For a normal host-based initiator, this
value has no bearing since the initiator will have already allocated the needed
main memory; therefore, the data can be placed in that memory in any order
independent of arrival time.
As for R2T solicited data, as mentioned above, the target might know how it
wants the data delivered so that it can apply the appropriate arrival time
placement to the resultant real disks and other media. Therefore, the initiator
must respond to the R2T PDUs in the order they were sent. To ensure that this
ordering is possible, each R2T PDU carries its own sequence number, known as
R2TSN. This number is unique only within the scope of the command to which
it applies.
It should be noted that the R2T PDUs can be sent one after the other, with no
waiting for the corresponding initiator response, up to the negotiated limit in
the value known as MaxOutstandingR2T.
Even though the host initiator can handle out-of-order data on reads, a
sequential device such as a tape has no way to do so. In general, for most if
not all storage controllers,
if the input to the device must be in order, the output of the device probably
also needs to be in order. Therefore, the …InOrder value applies to both
directions of the data. With third-party copy commands, which have their
secondary initiator function within the actual tape storage controller, the
secondary initiator cannot send or receive out-of-order data. That is one more
reason why the values for DataPDUInOrder and DataSequenceInOrder apply
for both sending and receiving.
It already has been mentioned but can stand some reiteration and additional
emphasis here: Data and status must be sent on the same connection as their
related SCSI command PDU. Commands can be load-balanced across the
various connections within a session, but the data associated with them,
whether Data-Out or Data-In, must be sent on the same connection on which
those commands were sent. This ensures that the HBA that carried the
command will be able to correlate the data with the command. This applies
whether the data is going out or coming in, and it also applies to the
status/response PDUs, which also must travel on the same connection as the
corresponding command.
All unsolicited data PDUs must be sent in the same order as their related
commands. That is, unsolicited data PDUs for command N must precede the
unsolicited data PDUs for command N+1. Command N+1, however, may
precede the unsolicited data PDU for command N.
Note that there can be more than one command linked together in a single
task (see Chapter 10, Task Management, for more information on linked
commands). However, the allegiance of the data and status to the same
connection applies only to the specific commands and not to the task. Linked
commands within one task can be individually placed on any connection within
the session.
We have been talking as though immediate and unsolicited data PDUs are
always supported. This is not completely correct. By default, immediate data is
supported and unsolicited data is not; however, this is negotiable. The initiator
and target can negotiate not to support immediate data and can also decide to
support unsolicited data PDUs. R2T support is always enabled.
Target Transfer Tag
Often all the data that can be sent via unsolicited techniques will not be
enough to satisfy the data size specified in the SCSI command PDU. In that
case, the target will need to solicit the remaining data from the initiator by
sending it an R2T PDU. This PDU will contain the command's ITT so that the
initiator will know to which command the R2T applies. It will also contain a
target transfer tag (TTT), a "handle" the target assigns and then places in the
R2T PDU. This TTT has to be unique only within the current connection of the
current session and remains valid only as long as the corresponding command
is active. In order to make some error recovery easier across connections in a
multiple-connection session, the target may make the TTT unique within the
session, but that is an implementation decision.
Since the TTT is only a handle made up by the target, when it sends R2T PDUs,
it can be anything the target can fit in a 32-bit field. One of the things the
target will probably do is allocate a buffer big enough to hold the data being
requested (via the R2T) and then put the address of this buffer into the TTT.
That address will then be used to place the returned data directly in the
appropriate locations, which means that the storage controller does not need
to do any additional data movement. The reason this is possible is that iSCSI
requires the Data-Out PDUs to reflect back to the target the TTT that was used
in the R2T. (See Figure 8-3.)
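As a concrete, if simplified, picture of that flow, the following sketch shows a hypothetical target handing out a TTT with each R2T and using the TTT reflected in the Data-Out PDUs to place the payload straight into the buffer it allocated. The class and method names are invented for illustration.

```python
# A simplified sketch (invented class and method names) of a target using the TTT
# as a handle for the buffer it allocated when issuing an R2T. Because the Data-Out
# PDUs reflect the TTT, the payload can be placed without any extra data movement.

class TargetR2TState:
    def __init__(self):
        self.buffers = {}      # TTT -> bytearray allocated for the solicited data
        self.next_ttt = 1      # any value the target can fit in the 32-bit TTT field

    def issue_r2t(self, itt, offset, length):
        ttt = self.next_ttt & 0xFFFFFFFF
        self.next_ttt += 1
        self.buffers[ttt] = bytearray(length)
        # A real target would build and send an R2T PDU here; we just return the fields.
        return {"ITT": itt, "TTT": ttt, "BufferOffset": offset, "DesiredLength": length}

    def on_data_out(self, ttt, relative_offset, payload):
        # Direct placement: the reflected TTT locates the buffer, and the offset
        # locates the position in it, regardless of the order the PDUs arrive in.
        buf = self.buffers[ttt]
        buf[relative_offset:relative_offset + len(payload)] = payload

target = TargetR2TState()
r2t = target.issue_r2t(itt=0x11, offset=0, length=8)
target.on_data_out(r2t["TTT"], 4, b"WXYZ")     # second half arrives first
target.on_data_out(r2t["TTT"], 0, b"ABCD")
print(bytes(target.buffers[r2t["TTT"]]))       # b'ABCDWXYZ'
```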
Because of the ITT, the iSCSI initiator can start putting the data directly in the
host memory as soon as the HBA receives the TCP/IP segment containing the
appropriate Data-In PDU header (with the ITT). This is true even if the
subsequent TCP/IP packets arrive on the HBA out of order. What we have is a
form of remote direct memory access (RDMA)[*] which we will call iSCSI
RDMA.
[*] Technically we are talking about direct data placement (DDP), which is a subset of RDMA.
This form of RDMA is not as useful as it would be if the TCP/IP packet itself contained the appropriate RDMA address. In that case the HBA would never have to wait for the iSCSI Data-In PDU header to arrive when data is received out of order in the TCP/IP
buffers. With or without TCP direct placement, data placement directly into the
appropriate host buffers is valuable for the following reasons:
It reduces latency.
It removes the need for the host to perform additional data movement and
therefore reduces the impact on host CPU cycles.
The eddy buffers are just holding areas, since as soon as the Data-In
PDU header arrives the data can be placed in the final main memory
location regardless of the order in which the Data-In TCP/IP packets
were received.
There is current work going on within the IETF to create a TCP/IP feature for
RDMA. However, iSCSI has direct data placement now, which works very
effectively.
In order to permit the above direct data placement with the smallest possible
amount of on-HBA memory (especially on the 10-Gb/s links), "markers and
framing" may be useful. (See Chapter 13, Synchronization and Steering, for
more information on markers and framing.)
Chapter Summary
The command sequence number (CmdSN) is set in all request PDUs sent from the initiator to the target, except the SNACK Request and Data-Out PDUs. Among others, the following PDUs therefore carry the CmdSN:
Text Request
Login Request
Logout Request
NOP-Out
In the other direction, every PDU sent from the target to the initiator carries the ExpCmdSN and StatSN, including the following:
Data-In
Asynchronous Message
Text Response
Login Response
Logout Response
Reject
NOP-In
Thus, the two sides of the session can keep track of what each has received
without extra interactions and acknowledgments.
The concept of "command windowing" and how it is used to control the use of
memory in the iSCSI buffer pool was also explained. It was compared to the
credit approach used with Fibre Channel, and it was explained when and why it
is of value in the TCP/IP network.
The initiator task tag (ITT) was explained along with how it could be used to
perform remote direct memory access (RDMA), that is, direct data placement
(DDP) in the initiator's main memory without additional data movement.
Likewise, it was explained how the target transfer tag (TTT) performs RDMA,
that is, direct data placement into the storage controller's main memory
without the need for additional data movement.
The chapter also explained the meaning and appropriate handling of key=value statements such as DataPDUInOrder and DataSequenceInOrder.
Chapter 9. Structure of iSCSI and Relationship to SCSI
To the Reader
By necessity this chapter is a bit tedious; however, if you work through it, you
will have a good understanding of iSCSI structures and iSCSI's relationship to
SCSI concepts. Try to stay with the chapter as long as possible, then skip to
the Chapter Summary. (Note: you will find more detail on the topics here in
Appendix C.)
iSCSI Structure and SCSI Relationship
SCSI was originally defined to be a single connection between a host and a set
of storage devices (SCSI devices). This was extended to permit the host to
connect to multiple sets of storage units (multiple SCSI devices), all with
one or more logical units (LUs). iSCSI has stretched the normal SCSI
definitions in order to cover its own operation and yet ensure that it adheres
to SCSI semantics. The following sections discuss the makeup of the iSCSI
architecture and its relationship to SCSI architecture.
Figure 9-1 shows the iSCSI network client and server entities. The iSCSI
client is the OS that resides in a host computer; the iSCSI server is a storage
controller of some type. Within the iSCSI client network entity is the iSCSI
initiator node. In almost all configurations the iSCSI client network entity
and the iSCSI initiator node have a one-to-one relationship, but this is not the
case on the server side, where the iSCSI server network entity may often
have more than one iSCSI target node (see Figures 9-2 and 9-3). It should be
pointed out that the iSCSI target node has a one-to-one relationship to a SCSI
device, which is usually made up of more than one LU. In Figure 9-1, the
iSCSI (SCSI) initiator port has a one-to-one relationship with the iSCSI
initiator node, although this is not always the case, as will be seen later.
The IP network and connections will map to what SCSI calls its Service
Delivery Subsystem. The network itself as well as the network interface
cards (NICs), or HBAs, are part of that subsystem. The network interface has
connections known by their IP addresses and TCP ports. iSCSI calls such a
connection point a portal. The iSCSI portal connects to an iSCSI (SCSI) port
(which is also what SCSI calls its service delivery port). The iSCSI (SCSI) port
is where the various network portals come together and are treated as a single
SCSI connection. This grouping of iSCSI portals is defined by iSCSI to be a
portal group. A portal group can be located in either the iSCSI client network
entity or the iSCSI server network entity. However, only the target portal
groups are given a special identification, a target portal group tag (TPGT). This
tag is a number between 1 and 64K that is unique only within the scope of its iSCSI target node.
The target SCSI port delivers the commands to the LU in the order the
initiator SCSI port sent them. In normal SCSI bus environments, this is
straightforward. In iSCSI environments, however, it is more complicated. When
the iSCSI (SCSI) initiator port sends its commands through multiple
connections (often on different NICs), the commands may make their way
through the network on independent paths and may arrive at the iSCSI
(SCSI) target port in an order different from how they were sent. It is the
job of the iSCSI delivery subsystem to ensure that the commands are
delivered for SCSI processing as though they were sent on a single link.
Therefore, the endpoints of the iSCSI delivery subsystem are considered to be
the SCSI ports.
The iSCSI target node can be considered the SCSI device since both entities
contain the LUs. As shown in Figure 9-1, the iSCSI server network entity
contains only one iSCSI target node (or SCSI device); however, it may contain
more than one, as we will see later.
Now let's go back and define how instances of each of the objects described
above are given names.
In Chapter 4's Naming and Addressing section, we saw how to create a name
for an iSCSI initiator or target node (via either the iqn or eui name). The
iSCSI initiator node name almost always represents the total iSCSI client
network entity (or OS), but, as will be seen later, there can be multiple iSCSI
(SCSI) initiator ports, so it is important to create distinguishing names for
them. Concatenating the iSCSI initiator node name with an additional
identification of some type creates an iSCSI (SCSI) initiator port name. The
additional identification is a value we call the ISID (initiator session ID), which
is made up of a type flag followed by a vendor's coded ID and a qualifier. (Refer
to Chapter 7 for details.) In this way, the various HBA vendors can manage the
creation of their ISIDs independent of other vendors that might also have an
adapter in the same initiator node. The iSCSI (SCSI) initiator port now has a
complete name that is unique in the world, made up of the iSCSI initiator node
name concatenated to a vendor-created ISID. In Figure 9-1, the iSCSI initiator node name, iqn.1999-12.com.ajax:os1, was concatenated with the ISID. The ISID was made up of a vendor ID type code of 1, the enterprise number of the vendor (5), and a qualifier of 1. Thus, the iSCSI (SCSI) initiator port name is iqn.1999-12.com.ajax:os1+[1+5+1].
Note: The iSCSI (SCSI) initiator port name is actually made up of the iSCSI
initiator node name concatenated with ",i," and the hexadecimal
representation of the ISID. We will not show the ",i," or the hex in this chapter.
Figure 9-1 shows three initiator network interfaces, each with a TCP/IP
address (10.1.30.3, 10.1.30.4, or 10.1.30.5). Notice that the TCP/IP port
numbers are not shown, since the initiator does not listen for the
establishment of a connection so there is no reason to advertise its port. SCSI
requires that the initiator always be the entity that originates the contact to
the target, so only the target's TCP/IP port needs to be identified and
advertised.
In the figure we can also see five network interfaces on the target side with IP
addresses of 10.1.40.21, 10.1.40.22, 10.1.40.25, 10.1.40.26, and
10.1.40.27. Two different TCP/IP port numbers are specified as connection
establishment IP network ports, on which the targets listen for a connection
request. These are TCP/IP ports 3000 and 5000. The NIC that contains the IP
address of 10.1.40.22 has both IP port numbers, meaning that it could have
connections from initiators at either or both TCP/IP ports.
The iSCSI (SCSI) target port name can also be formed by beginning with the
iSCSI target node name and concatenating the TPGT, making the iSCSI (SCSI)
target port unique within the iSCSI server network entity (as well as within
the world). In Figure 9-1 the iSCSI (SCSI) target port name has a TPGT of 1,
so the resulting name is iqn.1921-02.com.ibm.ssg:12579+[1].
Note: The iSCSI (SCSI) target port name is actually made up of the iSCSI
target node name concatenated with the ",t," and the hexadecimal
representation of the TPGT. We will not show the ",t," or the hex in this
chapter.
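Putting the two notes together, the small sketch below builds the full port names, including the ",i," and ",t," separators and the hexadecimal ISID and TPGT that this chapter otherwise abbreviates with brackets. The helper functions are invented for illustration, and the ISID value shown is an arbitrary example rather than the exact encoding of [1+5+1].

```python
# Illustrative helpers only; the ",i,"/hex-ISID and ",t,"/hex-TPGT formats come from
# the notes above, while the ISID value below is an arbitrary example and not the
# exact encoding of [1+5+1].

def initiator_port_name(node_name: str, isid: int) -> str:
    # The ISID is a 6-byte (48-bit) value, shown here in hexadecimal.
    return f"{node_name},i,0x{isid:012x}"

def target_port_name(node_name: str, tpgt: int) -> str:
    return f"{node_name},t,0x{tpgt:x}"

print(initiator_port_name("iqn.1999-12.com.ajax:os1", 0x000005000001))
print(target_port_name("iqn.1921-02.com.ibm.ssg:12579", 1))
```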
Figure 9-2 is a bit more complicated. The iSCSI client network entity is the
same as in Figure 9-1; however, the iSCSI server network entity is
significantly changed. In this depiction, we see two iSCSI target nodes: eui.
02004567A425678A (designated as node A) and iqn.1921-
02.com.ibm.ssg:12579 (designated as node B). It also depicts three iSCSI
(SCSI) target ports and three target portal groups. They are designated as
node A's portal groups 1 and 2 and node B's portal group 1.
Figure 9-2 also shows two NICs and two iSCSI HBAs. NIC 1 has one IP address
and two TCP/IP ports (3000 and 5000); NIC 2 has only one TCP/IP port (3000)
and one IP address.
NIC 1 TCP/IP port 3000 can access either node B, as part of B's portal group 1,
or node A, as part of A's portal group 1. NIC 1 TCP/IP port 5000 can access
node A, as part of A's portal group 2. This same group is shared by NIC 2.
In Figure 9-2, notice the one-to-one correspondence between iSCSI (SCSI)
target ports and portal groups. That means that the port names can be made
unique in the iSCSI server network entity (as well as within the world) by
concatenating the iSCSI target node name with the TPGT (the same way it was
done in Figure 9-1). Therefore, the names of the three iSCSI (SCSI) target
ports are
eui.02004567A425678A+[1]
eui.02004567A425678A+[2]
iqn.1921-02.com.ibm.ssg:12579+[1]
The iSCSI (SCSI) target port is the terminus for a session that starts in the
iSCSI (SCSI) initiator port. The iSCSI target node (which is also known as a
SCSI device) is the owner of the LUs; thus, the target port operates through
the iSCSI target node to access the appropriate LUs.
Figure 9-2 depicts two HBAs, one with two portals (TCP/IP connections) and one with only one. All three of these portals, along with the portal (TCP/IP port
5000) in NIC 1, make up B's target portal group 1.
In Figure 9-3 the iSCSI server network entity remains as shown in Figure 9-2;
however, the iSCSI client network entity is different. The iSCSI initiator node
has two unique iSCSI (SCSI) initiator ports, labeled iqn.1999-12.
com.ajax:os1+[1+3+1] and iqn.1999-12.com.ajax:os1+[1+5+1]. Either can
contact the iSCSI target nodes via the appropriate portals.
The problems with this type of configuration are that both iSCSI (SCSI)
initiator ports can address the same LUs, and the arrival of the application I/O
(SCSI commands) needs to be coordinated in some manner, since command
ordering is usually important to the various host applications. On the other
hand, SCSI's persistent reserves can be used to prevent one iSCSI (SCSI)
initiator port from accessing the LUs accessed by the other, but that too needs
to be carefully coordinated.
To handle a problem like this in Fibre Channel, vendors often devised software
known in the industry as a wedge driver. Usually wedge drivers were vendor
specific, which could be problematic in a multiple target vendor environment.
It should be noted that wedge drivers are not covered by any SCSI standard,
and their function and capabilities are completely at their vendor's discretion.
One of the reasons that the multiple connections per session (MC/S) function
was put in the iSCSI specification was to prevent iSCSI from requiring wedge
drivers.
Figure 9-3 shows the case where two different iSCSI (SCSI) initiator ports
exist in one iSCSI initiator node, which points out that the target's "access
controls" can be applied to only the initiator node name and does not need to
be applied to the individual port names. This capability saves a lot of
administrative work when it comes to instructing the targets about which
connections can be accepted. This is a significant reduction in administrative
effort from having to assign access rights to individual host adapters, as is the
case with Fibre Channel.
If different users use the same host physical hardware and boot different
operating systems, each OS image should have its own iSCSI initiator node
name (perhaps concatenating the user ID to a generic iqn iSCSI initiator node
name). Access controls should then be assigned to each resultant node name.
Note: Another consideration here is the access controls to hosts that operate
multiple "virtual machines." Such systems should probably give a unique iSCSI
initiator node name to each virtual machine, so that individual machines can
have their own access controls.
In each of the three configurations (Figures 9-1, 9-2, and 9-3), the initiator
should be able to establish a discovery session to any of the portals in the
iSCSI server network entity and issue the SendTargets command. Table 9-1
lists the information to be returned in response to SendTargets for each of the
configurations shown in the figures.
The SendTargets response will list all the iSCSI target node names that the target knows about, as well as the portal addresses (IP address[TCP port]) that can be used
to reach those nodes. (The TCP/IP port may not be shown if the default port[*]
is used.) The TPGTs are included so that the initiator can tell what portals can
be used as part of a multiple-connection session. Under any specific iSCSI
target node name is a list of portal addresses that can be used to access the
named node. If the initiator wants to start a multiple-connection session with
that node, it uses the portal address with the same TPGT.
[*]iSCSI currently has a well-known port of 3260, but IANA will assign a "system" port number and
that will be the default.
Notice that two different portal groups can connect to target node
eui.02004567A425678A (TPGT 1 and TPGT 2). Also notice that for Figures 9-2
and 9-3 the SendTargets output shows the portal with the TCP/IP address of
10.1.40.22[3000] can contact both target nodes A and B. These nodes
happen to have the same TPGT, but that is a coincidence. The tag is only
unique within the scope of an iSCSI target node.
Table 9-1. SendTargets Responses from Configurations Shown in Figures 9-1, 9-2, and 9-3

Figure 9-1:
iqn.1921-02.com.ibm.ssg:12579
  10.1.40.25[5000],1
  10.1.40.26[5000],1
  10.1.40.27[5000],1
  10.1.40.21[3000],1
  10.1.40.22[5000],1
  10.1.40.22[3000],1

Figures 9-2 and 9-3:
iqn.1921-02.com.ibm.ssg:12579
  10.1.40.25[5000],1
  10.1.40.26[5000],1
  10.1.40.27[5000],1
  10.1.40.22[3000],1
eui.02004567A425678A
  10.1.40.21[3000],2
  10.1.40.22[5000],2
  10.1.40.22[3000],1
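On the wire, a SendTargets response is returned as text key=value pairs. Assuming the usual TargetName and TargetAddress keys, the sketch below shows how an initiator might fold such a response into the per-node portal lists of Table 9-1; the sample response text is illustrative only.

```python
# A sketch that assumes the usual TargetName=/TargetAddress= text keys of a
# SendTargets response; the sample response text is illustrative only.

def parse_send_targets(text: str):
    """Return {target node name: [(portal address, TPGT), ...]}."""
    targets = {}
    current = None
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if key == "TargetName":
            current = value
            targets[current] = []
        elif key == "TargetAddress" and current is not None:
            addr, _, tpgt = value.rpartition(",")
            targets[current].append((addr, int(tpgt)))
    return targets

response = (
    "TargetName=iqn.1921-02.com.ibm.ssg:12579\n"
    "TargetAddress=10.1.40.25:5000,1\n"
    "TargetAddress=10.1.40.22:3000,1\n"
    "TargetName=eui.02004567A425678A\n"
    "TargetAddress=10.1.40.21:3000,2\n"
)
for node, portals in parse_send_targets(response).items():
    print(node, portals)
```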
SCSI Nexus
A SCSI I-T nexus is the relationship between a SCSI initiator port and a SCSI target port; in iSCSI, an I-T nexus corresponds to a session between an iSCSI (SCSI) initiator port and an iSCSI (SCSI) target port. The reason we bother with this is that SCSI defines the behavior within an I-T nexus but has no definition of behavior between different I-T nexuses. iSCSI takes this
even further by defining two or more identical nexus to be an illegal
configuration.
We may have MC/S between two iSCSI (SCSI) ports (initiator and target), but
not two or more sessions between the same iSCSI (SCSI) port pairs.
The reason the configuration in Figure 9-3 is legal is that the initiator side is
anchored in two different iSCSI (SCSI) initiator ports. Yes, they are both
located in the same iSCSI client network entity and share the same iSCSI
initiator node name, but they anchor two different SCSI I-T nexuses. The
operation of the wedge driver shown is not defined by SCSI, but most
implementations attempt to make the different SCSI ports work in a
coordinated manner. As one might imagine, this is a very tricky proposition.
SCSI defines the requirement that commands be delivered to the SCSI target
port across an I-T nexus (i.e., an iSCSI session) in the same order in which
they were given to the SCSI initiator port. However, the wedge driver is not defined by SCSI, and it is not normal to have one on the target side, making
the task of keeping things in order for delivery to the SCSI target port difficult.
However, the SCSI definitions are a bit vague on the in-order delivery
statements, giving rise to the interpretation that they only apply to the I-T-L
nexus. That interpretation gives the wedge drivers a method to ensure that
the commands for a specific LU are outstanding only on a single I-T nexus
(session) at a time. (This approach is much less efficient than iSCSI's MC/S,
but for the most part it seems to work.)
If the wedge driver attempts any path balancing between the two sessions (I-T
nexus) by placing LU traffic on the first available path, the design becomes
very sophisticated, especially when it is required that the wedge drivers also
handle the SCSI concept of reserves (especially persistent reserves).
Table 9-2. SCSI Persistent Reserve Nexus for Configurations Shown in Figures 9-1, 9-2, and 9-3

Figure 9-1:
iqn.1999-12.com.ajax:os1+[1+5+1] and iqn.1921-02.com.ibm.ssg:12579+[1]

Figure 9-2:
iqn.1999-12.com.ajax:os1+[1+5+1] and eui.02004567A425678A+[1]
iqn.1999-12.com.ajax:os1+[1+5+1] and eui.02004567A425678A+[2]

Figure 9-3:
iqn.1999-12.com.ajax:os1+[1+5+1] and eui.02004567A425678A+[1]
iqn.1999-12.com.ajax:os1+[1+5+1] and eui.02004567A425678A+[2]
iqn.1999-12.com.ajax:os1+[1+3+1] and eui.02004567A425678A+[1]
iqn.1999-12.com.ajax:os1+[1+3+1] and eui.02004567A425678A+[2]
SCSI reserves permit access to a specific LU only via a specific I-T nexus.
Persistent reserves permit reservations to last even across a restart of SCSI I-
T nexus (such as a host reboot). That requires the wedge driver to know how
to interpret SCSI reserves and then to ensure that the LU access is
constrained to the appropriate session even across boots. This complexity is
not a problem with iSCSI's MC/S since it has components in both the initiator
and target that work together to permit multiple connections to operate as if
they were one. Wedge drivers, on the other hand, operate with code that is
only on the initiator side, thus making the job harder and less efficient.
Though building a wedge driver is a sophisticated project and is usually vendor
specific, this is the way multiple FC connections/sessions are handled.
You should understand by this point that the iSCSI capability of MC/S greatly
simplifies multiple connection problems for the host (initiator) side of the
connection.
Chapter Summary
The configurations in Figures 9-1 through 9-3 show how the iSCSI naming
conventions map to the various iSCSI entities and how those entities map to
SCSI concepts and standards.
Both the iSCSI client network entity and the iSCSI server network entity
contain iSCSI nodes.
The portal groups connect to the remote entity via the IP network.
The iSCSI (SCSI) initiator port is identified by the iSCSI initiator node
name concatenated with the ISID.
The ISID is composed of the vendor code type, the vendor code, and a
unique qualifier provided by the vendor's software or HBA.
The iSCSI (SCSI) target port is identified by the iSCSI target node name
concatenated with the appropriate target portal group tag (TPGT).
Since the iSCSI initiator node name is usually common to all iSCSI (SCSI)
initiators within an OS, it is appropriate to use the iSCSI initiator node
name (InitiatorName) as the identifiable name in the access control
process.
The constructs just given fit into the SCSI concept of a nexus: an iSCSI (SCSI) initiator port together with an iSCSI (SCSI) target port forms an I-T nexus (an iSCSI session), and adding the logical unit identifies an I-T-L nexus.
Chapter 10. Task Management
To the Reader
This is another very technical chapter, so again, if you only want an overview,
just skip to the Chapter Summary.
Each task is classified as "tagged" or "untagged." The SCSI initiator will allow
only one untagged SCSI task to be outstanding/in process/in flight for any
specific LU, but it will allow many tagged tasks to be outstanding/in process/ in
flight. The individual tagged commands, or the first one in a linked chain of
tagged commands, can have attributes that describe the type of queuing order
they follow:
Simple
Ordered
Head of queue
In iSCSI all commands have an initiator task tag (ITT), which can be used as
the SCSI task tag, but in iSCSI even the so-called untagged tasks are given an
ITT. Thus, the only way to understand what was intended to be an untagged
task is through the iSCSI attribute field in the SCSI command PDU. If the PDU
has none of the above attributes specified (a value of 0 in the attribute field),
the task must be untagged, which simply means, on the SCSI target side, that
it is treated with the simple queuing method. Other than that, there is no
special treatment performed by the iSCSI target.
Note that the initiator is supposed to control the untagged tasks so that no
more than one untagged task per LU is sent to the target at the same time.
This requires that the untagged tasks be queued in the SCSI initiator. In
contrast, tagged tasks are sent by the initiator as soon as it can do so. All
execution queuing for tagged tasks is done at the target SCSI layer. In
general, the iSCSI target is required to send tagged and untagged tasks to the
target SCSI layer as soon as it can ensure in-order delivery. Some queuing in
the iSCSI layer may occur, but that is the same for all nonimmediate
commands. All this means is that, at the target, there is no iSCSI difference in
the handling of tagged and untagged tasks. Nor does iSCSI do anything special
based on the task attributes.
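A minimal sketch of the initiator-side rule just described follows; it gates untagged tasks so that only one per LU is in flight, while tagged tasks are sent immediately. The class is hypothetical and ignores command windowing and ordering details.

```python
# A hypothetical initiator-side gate: at most one untagged task in flight per LU,
# while tagged tasks are issued immediately (their execution-order queuing happens
# only at the target's SCSI layer). Windowing and ordering details are ignored.

from collections import defaultdict, deque

class InitiatorTaskGate:
    def __init__(self, send):
        self.send = send                           # callable that issues the command
        self.untagged_busy = set()                 # LUNs with an untagged task in flight
        self.untagged_waiting = defaultdict(deque)

    def submit(self, lun, command, tagged):
        if tagged:
            self.send(lun, command)                # no initiator-side gating
        elif lun in self.untagged_busy:
            self.untagged_waiting[lun].append(command)
        else:
            self.untagged_busy.add(lun)
            self.send(lun, command)

    def on_untagged_complete(self, lun):
        if self.untagged_waiting[lun]:
            self.send(lun, self.untagged_waiting[lun].popleft())
        else:
            self.untagged_busy.discard(lun)

gate = InitiatorTaskGate(lambda lun, cmd: print("send", lun, cmd))
gate.submit(0, "untagged-1", tagged=False)
gate.submit(0, "untagged-2", tagged=False)   # held until the first one completes
gate.on_untagged_complete(0)
```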
SCSI initiator upper layer protocols (ULPs) control the submission of each
of the linked commands for execution.
The SCSI initiator, not the SCSI target, is the primary controller of queuing for untagged tasks. (Note: it is the SCSI initiator, not the iSCSI initiator, that controls the queuing of untagged tasks.)
The SCSI target controls the queuing of tagged tasks according to their
attribute field (simple, ordered, head of queue, and ACA).
The task attributes are carried by iSCSI in the SCSI command PDU, but
they have no effect on iSCSI operation.
The target SCSI layer applies the attribute field to the process of task
queuing in the LU's task set.
iSCSI initiators and targets treat tagged and untagged tasks the same.
Only the SCSI layer treats tagged and untagged tasks differently.
Both iSCSI and SCSI have a set of task management functions, invoked via
the iSCSI Task Management Request PDU. These functions provide the initiator
with the ability to control the execution of one or more tasks (SCSI or iSCSI).
Generally, "control" means the control to abort the action of specific tasks or
groups of tasks. However, iSCSI has the capability to reestablish a task's
connection allegiance to a different connection.
The Task Management Request PDU has a field called the referenced task tag
(RTT). This is an ITT that can be used to identify the task to be affected by a
task management function.
As you can probably understand, the iSCSI layer can identify a task by
referencing the ITT of the subject command. What I have not stated, up to
now, is that, when the iSCSI target layer hands off the command to the SCSI
target layer, it also needs to pass a "task tag" identifier. This identifier can be
anything, but the implementation is easier if it is the iSCSI ITT sent with the
iSCSI command PDU. If the ITT is used, the iSCSI and SCSI task management
service can easily identify the appropriate SCSI task and correctly apply the
requested task management function.
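The following sketch, with invented names, shows why reusing the ITT as the SCSI task tag keeps this simple: the RTT carried in a task management request is just the ITT, so one lookup finds the task, and the iSCSI layer can silently discard it if it has not yet been handed to the SCSI layer (the abort behavior is detailed below).

```python
# Invented names; the point is that the RTT in an Abort Task request is simply the
# ITT, so one lookup finds the task. If the command has not yet been delivered to
# the SCSI layer it can be silently discarded.

class TargetTaskRegistry:
    def __init__(self):
        self.tasks = {}    # ITT -> state of the task handed to (or queued for) SCSI

    def deliver_command(self, itt, cdb):
        self.tasks[itt] = {"cdb": cdb, "state": "queued in iSCSI"}

    def abort_task(self, rtt):
        task = self.tasks.get(rtt)
        if task is None:
            return "function complete"          # nothing left to do
        if task["state"] == "queued in iSCSI":
            del self.tasks[rtt]                 # silently discarded, never executed
        else:
            task["state"] = "abort passed to SCSI layer"
        return "function complete"

registry = TargetTaskRegistry()
registry.deliver_command(0x20, b"\x2a\x00")     # a queued write command, for example
print(registry.abort_task(0x20))                # -> function complete
```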
It should also be noted that task management commands are often marked as
"immediate" and so need to be immediately considered for execution as soon
as the target receives them. (It is possible that the operations involved, all or part of them, may be postponed to allow the target to receive all relevant
tasks.) Also, as with all commands marked as immediate, the CmdSN is not
advanced.
The following paragraphs describe the functions that can be invoked by the
task management PDU.
1. Abort Task. This aborts the single task identified by the referenced task tag (RTT). Both the iSCSI layer and the SCSI layer must work together to
prevent any command covered by the abort but not yet executed
from being executed. Any command to be aborted but not yet
delivered to the SCSI target layer must be silently discarded and
not executed by iSCSI. However, the task management response
"function complete" does need to be sent to the iSCSI initiator. If
the command to be aborted has already been delivered by iSCSI to
the target SCSI layer, then this abort will just be passed to the
target SCSI layer and SCSI will handle the abort. There will be a
normal response to the command's execution with either good or
bad error status, depending on whether the task completed or
aborted in the SCSI layer.
2. Abort Task Set. This causes an Abort Task for every task associated with the session (the one on which this command is issued) for the LU that corresponds to the specified LUN. This is a multiple application of 1, Abort Task. No previously
established conditions, such as reservation, will be changed by this action.
3. Clear ACA. This clears the auto contingent allegiance (ACA) condition for
the LU specified by the LUN (not discussed in this book; see [SPC-3]).
4. Clear Task Set. This aborts all tasks from all initiators for the LU specified by the LUN. It performs like 2, Abort Task Set, but additionally may throw an
abort "over the fence" to the SCSI processes handling other sessions, causing
the abort of every task in the LU's SCSI task set regardless of the session on
which the tasks arrived. (The target's mode page will determine the SCSI LU
target reaction, which is not covered here.)
Even though the session on which this command is sent will silently discard all tasks that are in the iSCSI queue and subject to the clear, this is not the case for other
sessions. The only effect on them is the abort of the tasks already in the SCSI
task set (the queue of commands within SCSI but not yet executed). For these
sessions, there will be no silent discard of commands that might still be in the
iSCSI layer. In addition, all sessions, except the one on which the Clear Task
Set management function PDU arrived, will have their aborted commands
reflected back to their initiators with one or more unit attentions (see [SAM-
2]). As stated regarding 1, Abort Task, the session on which the clear task set
arrived will have its undelivered, aborted tasks silently discarded by iSCSI.
5. Logical Unit Reset. SCSI will abort all tasks in its task set, which means that a 4, Clear Task Set, will occur as explained above. It will also clear the ACA if it exists and reset all reserves established via the reserve/release method.
However, persistent reservations are not affected. The LU operating mode is
reset to its appropriate initial conditions, and the mode select parameters
(parameters that apply to the operation of the specific LU; see [SPC-3]) will be
reset either to their last saved values or to their default values. This may
cause a unit attention condition to all other sessions using the LU (different
from the one on which the Logical Unit Reset was issued). (The target SCSI
action is defined by a mode page setting, which is not covered here.)
6. Target Warm Reset. This resets all LUs in the target device (iSCSI target
node) and aborts all tasks in all task sets. In effect, it performs an LU reset (as
explained in 5, Logical Unit Reset) for all LUs owned by the iSCSI target node.
7. Target Cold Reset. This performs all the functions of 6, Target Warm Reset,
plus termination of all TCP/IP connections to all initiators. In other words, all
sessions to the iSCSI target node are terminated and all connections are
dropped.
8. Task Reassign. This reassigns connection allegiance for the task identified
by the ITT field to this connection (the connection on which the Task Reassign
Task Management Request PDU arrived), thus resuming the iSCSI exchanges
for the task. The target must receive the task reassign only after the
connection used by the original task has been logged out.
Clearly the term "task management" has a positive sound; however, it is really
a form of clearing up a disappointing situation. The Abort Task can be used to
stop a specific task, but all of the other functions are a bit of a dull knife that
will affect many I/O tasks beyond those of a simple application's failure. The
higher the function number (up to and including 7), the more drastic the
action and the more widespread the effect. Target vendors should consider an
authorization permission list to ensure that none of those commands are
issued by accident or by people who want to inflict damage.
Chapter Summary
This chapter explained how a task is defined to SCSI and iSCSI and the
appropriate queuing techniques for each task type. It also explained the
various iSCSI or SCSI commands that need to be operated upon immediately.
Some of the commands that need immediate operation status will reset or
abort either iSCSI or SCSI status or actions.
Key commands that are part of task management include the following:
Abort Task
Clear ACA
Task Reassign
These are powerful and very disruptive to iSCSI and SCSI processes. For that
reason, they should not be used except in extreme emergencies.
Chapter 11. Error Handling
Error handling and recovery are two of the more difficult processes to
understand. A quick overview follows.
To the Reader
This chapter will now go into error recovery in some depth. If you do not need
this much information, skip forward to the Chapter Summary, as usual.
The kinds of errors that iSCSI must deal with include the following:

Protocol error. As its name implies, this is generally a program error and requires restarting the session and error recovery by the SCSI layer.

CRC detected error. This error could have been detected on the PDU header or data segment. It can be recovered by resending the data or response PDU or by reissuing the command PDU, depending on what was missing. Some implementations will not be able to recover from this error and will respond as for a protocol error.

Connection failure. The TCP/IP connection (or the link under it) is lost. It can be recovered by transferring the outstanding tasks to another connection in the session (or to a replacement connection), or by restarting the session.
The session restart, which must be used on protocol errors, can be used on
any of the other failures also. Because only session restart is mandatory, some
implementations are likely to have only that technique. That is, all error
recovery can use what is called technique 0.
Error Recovery Levels
The three recovery techniques are classified into corresponding recovery levels by iSCSI:

Error recovery level 0: session recovery (the session is terminated and restarted)

Error recovery level 1: recovery from digest (CRC) failures, by resending the affected PDUs within the connection

Error recovery level 2: connection recovery (the tasks on a failed connection are recovered on another connection within the session)
The link failures, which are likely to be the most prevalent, have been
addressed by iSCSI with an option to recover without SCSI or the application
being aware that anything happened. This is also optional, however, and
vendors can choose to revert to session recovery.
Error Recovery Level 0
This form of error recovery must be implemented, but hopefully will only be
used when all other types fail or are inappropriate. There will be some simple
iSCSI implementations that may have only this level of recovery. When an
error of some sort is detected while operating at level 0, the implementation
will end its processing of SCSI and iSCSI tasks. This means that all executing
and queued tasks for the given initiator at the target will be aborted. This is
true for the iSCSI tasks as well as the SCSI tasks. Simulated SCSI service
responses also need to be returned by the initiator's iSCSI layer to its SCSI
layer.
Also, all session TCP/IP connections are closed and then, once everything is
cleaned up, the session may be reestablished by the initiator; and all of its
connections need to log in as they did the first time.
On the target side the equivalent of the task management Abort Task function
will be issued for each task known to iSCSI within the failed session (see
Chapter 10). All iSCSI status is discarded, and all data for tasks in the session,
whether coming or going, is discarded. For details about what SCSI-related
status is reset, refer to the latest version of the [iSCSI] draft standard, the
section Clearing Effects on SCSI Objects.
As a rule, the initiator makes the determination that the session is to be torn
down and rebuilt. It signals this with a logout PDU (in case of extreme
confusion, the initiator may just drop the session). After the target receives
the logout request PDU, all Task Aborts have been completed, and the cleanup
is finished, the target will return a Logout Response PDU with a status of zero.
Then it will drop all the TCP/IP connections that make up the session and clean
up its state.
The initiator, after receiving the Logout Response PDU, will drop all its TCP/IP
connections that make up the session and return pseudo task status to the
SCSI layer for each outstanding task, indicating an appropriate error code. The
iSCSI initiator will then perform task cleanup, disposing of all data and status
regarding all tasks it may have had.
Upon the return of pseudo task status to the SCSI layer, the iSCSI initiator
may start reestablishing the session, perhaps with all the same connections it
had when it was first established. It may also start more connections or fewer.
If the initiator simply dropped the connection, without a logout, it should wait
the default length of time specified by the DefaultTime2Wait (key-value
pair), which permits the target to clean up its state and the initiator to return
the pseudo task statuses to the initiator SCSI layer. After the
DefaultTime2Wait, it may log back in and reestablish the session.
There is one other technique for terminating and reestablishing a session. This
can be done with a new login that contains a zero TSIH and the ISID of the
session to be restarted (and is from the same initiator node name and is to the
same target node name and target portal group). In this case, all active tasks
are terminated in the target and the initiator returns pseudo task status to the
initiator SCSI layer for each outstanding task, indicating an appropriate error
code.
Error Recovery Level 1
Like level 0, this level of recovery can accomplish session recovery, but it can
also recover from most CRC-detected errors. Error recovery level 1 is based on
the following approaches:
If a header digest error occurs, the PDU may be silently thrown away, and the "normal" (level-1) error recovery will realize that it is missing because of a "hole" in a sequence of one of the tracked sequence numbers.

If a data digest error occurs, either the target or the initiator may silently throw the PDU away and handle it later as if it had been a header digest error.

Targets can retransmit R2Ts when they think doing so is necessary. These are called recovery R2Ts.
We will first look at techniques for recovering from header digest errors.
A disclaimer is appropriate here. Unless the initiator and target are operating
with some form of synchronization and steering, such as FIM (see Chapter 13),
there is no absolute guarantee that, when a header CRC error occurs, the boundaries of the PDU can be determined. You cannot always determine the
length of the PDU because the CRC error could have been within the
TotalAHSLength field. Therefore, it will be difficult to silently discard the PDU
in all situations, and the implementation will have to restart either the session
or the connection.
Note: This section should be read as if the initiator and the target have
implemented the FIM support and agreed to use it. This means that they can
absolutely identify some valid PDU boundaries. In real life, it is expected that
many vendors will apply their own "secret sauce" (if FIM is not used) that
usually permits them to find a valid PDU boundary even if the PDU header has
a CRC error.
In error recovery level 1, if the initiator is able to detect that one of its
command/request PDUs is missing, it will attempt to recover it with the
techniques described below. (These recoverable PDUs are all the nonimmediate
PDUs that the initiator can send except Data-Out, NOP-Out, and SNACK.) The
initiator detects missing PDUs by inspecting those returned from the target to
see if the target's ExpCmdSN (expected command sequence number) is in sync
with what the initiator sent. It can do this because each of the pertinent PDUs
carries a CmdSN (command sequence number) and each side keeps a value of
what it thinks the next CmdSN (the ExpCmdSN) should be. When the target
sends back a value that the initiator thinks is too small, the initiator
understands that the PDU (the one with the CmdSN equal to the returned ExpCmdSN value) may have been lost and therefore needs to be resent.
The initiator then attempts to correct the problem by resending the missing
PDU. Because there may be a "race" condition in which the PDU arrives while
the initiator is resending the replacement, the target is required to discard any
duplicate PDUs that it receives. The key to this type of detection and recovery
is that the initiator not be too quick to resend what it thinks is a missing
command/request PDU. The full-duplex nature of a TCP/IP connection and the
different connections through the network may make it look as though some
commands are lost, when in fact their arrival has not been acknowledged yet.
Given some time they may well be acknowledged. In any case the discard
mentioned above will prevent any inappropriate action by the target.
Any missing Data-Out PDUs, though sent by the initiator, will be detected as
missing by the target, which will determine either that the appropriate amount
of data has not yet been received (and it thinks it should have) or that there is
a missing DataSN (data sequence number) in a sequence of Data-Out PDUs. In
either case, the target will issue an R2T PDU and explicitly request the missing
data.
The initiator handles missing SNACK PDUs simply by waiting for a period of
time to see if it receives the requested R2T, data, status, or other exception
event (which, among other things, might cause the session or connection to
terminate). If none of those responses occur within an implementation
determined time, the SNACK PDU can be reissued by the initiator.
The target may also complete the command and send final status while the
initiator is waiting for a missing Data-In PDU. If this happens, the initiator will
have received the final response PDU for the command. However, knowing it
has not received the Data-In PDU, it should send the SNACK requesting it. The
initiator should not advance its ExpStatSN (expected status sequence number)
until all missing Data-In PDUs have been received. This is because the target,
when it receives a PDU from the initiator with the ExpStatSN advanced, will
interpret that as an acknowledgment that everything is okay and clear all
information about the previous command, including the data.
On the other hand, the initiator may have a timeout in its ULP (upper-level
protocol), deciding that enough is enough, and then retrying, or not retrying,
the command.
Note: The initiator sets CmdSN on each PDU it sends, except for Data-Out and SNACK, but ExpStatSN is set on every PDU the initiator sends to the target.
iSCSI requires ExpCmdSN and StatSN on all the PDUs that flow from the
target to the initiator.
The important thing the initiator does with the StatSN is to detect missing status, which it can have resent by issuing a SNACK. It can also check the DataSN and ExpDataSN to see if there is a mismatch. If so, it can retrieve the missing data via SNACK as specified above. The initiator must also reflect the value of StatSN back to the target (in the ExpStatSN) to show that the status has arrived at the initiator, all data has been received, and the target can free up its buffers and state for all commands with lower StatSN.
Recap:
The initiator uses CmdSN and ExpCmdSN to detect that the command is
lost and then resend the command.
The initiator uses StatSN and ExpStatSN to detect that status is lost and
then request the status be resent via SNACK.
The initiator uses DataSN and ExpDataSN to detect missing data and then
request the data be resent via a SNACK.
The initiator reflects the ExpStatSN back to the target to inform the target
that the status up to that point was received along with all appropriate
data.
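A compact sketch of the first recap item follows: the initiator remembers what it sent, keyed by CmdSN, and treats a stalled ExpCmdSN from the target as a hint that a command PDU may need to be resent. Names and structure are illustrative, and serial-number wraparound is ignored.

```python
# Illustrative only (serial-number wraparound ignored): the initiator keeps what it
# has sent, keyed by CmdSN, and treats a stalled ExpCmdSN as a hint that the PDU
# with that CmdSN may have to be resent.

class CommandTracker:
    def __init__(self):
        self.sent = {}            # CmdSN -> command/request PDU, still unacknowledged
        self.next_cmdsn = 1

    def record_send(self, pdu):
        cmdsn = self.next_cmdsn
        self.sent[cmdsn] = pdu
        self.next_cmdsn += 1
        return cmdsn

    def on_target_expcmdsn(self, expcmdsn, resend):
        for cmdsn in [s for s in self.sent if s < expcmdsn]:
            del self.sent[cmdsn]           # acknowledged; the target has them
        if expcmdsn in self.sent:
            # A real initiator would wait a while first, to avoid the race described
            # above, before resending the possibly lost PDU.
            resend(self.sent[expcmdsn])

tracker = CommandTracker()
tracker.record_send(b"command-1")
tracker.record_send(b"command-2")
tracker.on_target_expcmdsn(2, resend=lambda pdu: print("resend", pdu))
```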
The target can recover from almost all header digest errors just by silently
discarding the PDU and allowing the initiator to notice a discontinuance in the
ExpCmdSN (expected command sequence number) and cause it to resend the
missing PDU.
When a SCSI command PDU is discarded, for example, the target will not be
able to advance its ExpCmdSN value. In that situation, each PDU sent from the
target to the initiator that contains the ExpCmdSN value will indicate to the
initiator that the sequence was disrupted; the initiator will then resend the
missing SCSI command PDU. The same thing will occur for Text Requests, Task
Management Requests, and Logout Request PDUs.
When a Data-Out PDU is discarded, the target cannot advance its DataSN
value. Thus, if it receives other Data-Out PDUs that show a discontinuance in
the DataSN value, it can explicitly request that missing Data-Out PDU by
sending the initiator an R2T PDU.
If the target sends an R2T PDU that the initiator discards because of a header digest error, unless it is the last R2T for the command, it is expected that the initiator will subsequently detect that something is missing and request that
the target resend it. It can identify the actual PDU in error when it does not
receive the expected value for R2TSN. At that point it will issue a SNACK to
request a resend of the missing PDU.
Note that discarded PDUs with DataSN, R2TSN, and StatSN are usually quickly
detected, because they are all connection allegiant and TCP/IP does not permit
them to be received out of order. Therefore, if one is missing, there must have
been a CRC digest error that caused it to be discarded. However, if the R2T
that is missing is the last R2T for the command, the initiator will not be able to
detect that it is missing. Instead the target must take overt action whenever it
does not receive the expected data within a reasonable time (on the last R2T
of a command). Initiators are required to discard duplicate R2Ts so no race
condition will occur in this situation.
A target can set the A (acknowledge) bit in a Data-In PDU to ask the initiator to confirm, with a DataACK SNACK, that the data has been received, so that the target can free the buffers holding that data; if the acknowledgment never arrives, the target must keep the data available in case it has to be resent. This is probably not a significant problem with a normal disk storage controller,
because most implementations use a shared buffer pool with enough total
memory to last through events such as a missing SNACK (and in fact probably
do not even use the Data-In PDU with the A bit set). However, for memory-
limited storage devices, such as a tape drive, a missing DataACK SNACK may
be a problem. For that reason targets like these should be designed to keep at
least two to three times the value of MaxBurstLength in old, sent buffers for
each outstanding I/O they support, just in case the A bit PDU is lost or ignored
by the initiator.
The only time that SNACK support is not required is when operating at error
recovery level 0. In that case the initiator may still issue a SNACK for some
PDU recovery capability. However, the target implementation may only know
how to operate at level 0 and therefore silently discard the SNACK, respond
with a Reject PDU, or request logout via the Asynchronous Message PDU.
(Anything other than discarding the SNACK is not defined by the specification,
so anything might happen.) Therefore, it is probably best to stick to level 0
recovery and restart the session.
The significant thing about data digest recovery is that the error condition can
be signaled to the opposite side and a requested recovery action thus can take
place without delay. This is because the same recovery actions are possible
when a data digest error is found directly, as when its discovery is delayed
until the sequence values can determine what is missing.
In the case of a data digest error, the PDU header is considered valid.
Therefore, it is possible to determine directly what the problem is and request
a resend of the PDU. That is, if the initiator detects the error it can issue the
SNACK, requesting the Data-In PDU again. If the target detects the data digest
error, it is required to issue a Reject PDU with a reason code of "Data (payload)
Digest Error" and to discard the in-error PDU. Then it should either request
that the data be resent via an R2T PDU or terminate the task with SCSI
Response PDU with the reason "Protocol Service CRC Error" and perform the
appropriate cleanup. In the latter case, it is up to the initiator to retry the
command or restart the session (as it would if the error recovery level were
0).
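The decision the target makes in that case can be condensed as follows. The PDU and CRC helpers in this sketch are placeholders, not a real API; the point is only the order of actions: discard, Reject with "Data (payload) Digest Error", then either a recovery R2T or task termination with "Protocol Service CRC Error".

```python
# Placeholder PDU/CRC helpers, not a real API; only the order of actions matters:
# discard the PDU, send a Reject, then either solicit the data again or end the task.

def handle_data_out(pdu, send_reject, send_recovery_r2t, terminate_task,
                    level1_capable=True):
    if not pdu.data_digest_ok():                       # placeholder CRC-32C check
        send_reject(pdu, reason="Data (payload) Digest Error")
        if level1_capable:
            # Explicitly ask the initiator to resend just the affected data.
            send_recovery_r2t(itt=pdu.itt, offset=pdu.buffer_offset,
                              length=pdu.data_length)
        else:
            # Otherwise the task ends; the initiator may retry the command or
            # restart the session, as it would at error recovery level 0.
            terminate_task(itt=pdu.itt, reason="Protocol Service CRC Error")
        return
    place_data(pdu)    # normal direct-placement path (see Chapter 8)

def place_data(pdu):
    pass               # placement into the buffer identified by the reflected TTT
```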
It is possible for all digest errors to be treated as header digest errors so that
the PDUs can just be discarded. That will cause a postponement in the
recovery until the sequence numbers can detect the problem. Of course, to be
sure that the problem is detected before the SCSI layer times out and retries
the command, we need to ensure that even an error at the end of a command
or data burst is addressed without delay. This requires a rule that a NOP be
added to the data flow whenever there is no data or command flowing on the
link for a period of time. Such a rule permits any holes in the sequence
numbers to be quickly detected, even at the end of the command or data
burst.
Error Recovery Level 2
Level-2 error recovery focuses on the recovery of the connection and the
session along with the tasks therein, given that there has been some type of
connection problem. The initiator may detect that a TCP connection has failed,
or it may have received an Asynchronous Message PDU from the target stating
that one or all the connections in a session will be dropped or requesting a
connection logout. Any of these messages from the target, or any of its own
information, can be enough for the initiator to attempt to recover the
connection.
Once that is done, the initiator either logs in a new connection to recover the
commands from the failed connection or uses an existing connection to recover
the commands. To do this it sends the target, via one or more other
connections, a series of Task Management Function Request PDUs, each of
which tells the target to change the allegiance of a suspended task from the
old connection to the connection on which the Task Management Function
Request PDU is issued. These change-of-allegiance task management functions
must be done task by task until all tasks that the initiator had pending on the
old connection are changed to another connection. At the completion of this
process, all the targeted tasks that were once on the old connection will
continue as if they had originally been issued on the new connection. (Any
immediate commands, however, are not recoverable.) If a new connection is
not started, it might be advisable to reinstate the suspended tasks across a
number of the other connections within the session, so as to balance the
workload. Since the link failure could have occurred at any point in the
command processing, when the new allegiance has been established, all
unacknowledged status and data will be resent automatically by the target. If
no status or data exist, the initiator may need to retry the command.
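A sketch of that reassignment loop, with invented helper names, is shown below; it simply walks the suspended tasks and issues one Task Reassign request per task, spreading them across the surviving connections.

```python
# Invented helper names; one Task Reassign request is issued per suspended task,
# spread round-robin across the connections that survive.

from itertools import cycle

def reassign_suspended_tasks(suspended_itts, surviving_connections, send_task_mgmt):
    if not surviving_connections:
        raise RuntimeError("no connection available; session recovery is required")
    conns = cycle(surviving_connections)
    for itt in suspended_itts:
        send_task_mgmt(next(conns), function="TASK REASSIGN", referenced_task_tag=itt)
        # After the reassignment the target resends any unacknowledged data or
        # status for this task on its new connection.

reassign_suspended_tasks(
    suspended_itts=[0x31, 0x32, 0x33],
    surviving_connections=["connection-2", "connection-3"],
    send_task_mgmt=lambda conn, **fields: print(conn, fields),
)
```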
Even if the session is made up of only one connection, the logout of the single
connection, by logging in a replacement (with the same initiator node name,
ISID, TSIH, and CID), can be used to reestablish the single connection session
on a new connection. This implies the opening of a second connection with the
sole purpose of cleaning up the first. The allegiance of the commands on the
old connection, even though it has the same CID, still needs to be transferred
to the new connection as soon as it is instantiated into full-feature phase.
Two of the values that can be set at login time are DefaultTime2Wait and DefaultTime2Retain. These key=values, which each have a default of three seconds if not changed by negotiation at login, are used as follows:

DefaultTime2Wait specifies how long (in seconds) both sides must wait after a connection goes away unexpectedly before the initiator can attempt to reconnect. This gives the target a chance to notice that the link is gone, do whatever cleanup is needed, and prepare for a reconnection.

DefaultTime2Retain specifies that, after the wait time, the initiator has an additional amount of time to reestablish the connection and the allegiance of suspended tasks. If this is not accomplished in that time, the target aborts and cleans up all tasks and state (except persistent reserves).
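Under those assumptions, an initiator's recovery timing might look roughly like the following sketch; the helper names and the one-second retry interval are illustrative only.

```python
# Helper names and the one-second retry interval are illustrative only.

import time

def recover_connection(time2wait, time2retain, try_reconnect, reassign_tasks):
    time.sleep(time2wait)                    # let the target notice and clean up
    deadline = time.monotonic() + time2retain
    while time.monotonic() < deadline:
        connection = try_reconnect()
        if connection is not None:
            reassign_tasks(connection)       # transfer allegiance of suspended tasks
            return connection
        time.sleep(1)
    # Past Time2Retain the target has aborted and cleaned up the suspended tasks
    # (persistent reserves excepted); only session recovery remains.
    return None
```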
Chapter Summary
At error recovery level 0, whenever there is a problem with the connection, the session is torn down and reestablished.
The initiator can use SNACK to request missing status PDUs, Data-In PDUs,
and other-than last R2T PDUs.
The initiator can detect missing command/request PDUs and resend them.
The target can detect missing Data-Out PDUs and issue R2Ts to recover
them.
The SCSI ULP (upper-level protocol) timeout will cause iSCSI to abort the
task; but the ULP may or may not know about the best retry process to
follow at that time. So, completion of iSCSI recovery should be attempted
before SCSI times out.
An implicit logout and re-login can be done keeping the same connection
ID; however, allegiances must still be transferred to the new connection.
The session does not need to be failed or restarted, but can be continued
by establishing another connection and using it to clean up and recover
the allegiances of the first. In this way the session can continue (on a new
TCP/IP connection) even if MC/S is not supported.
For foolproof operation at error recovery level 1 or 2, both the initiator and
the target should operate with some form of synchronization and steering,
such as iSCSI-defined FIMs (see Chapter 13).
Chapter 12. Companion Processes
The companion processes that make up the iSCSI protocol suite are:
Boot
Discovery
Security
We will cover these lightly in this book, but I encourage you to study the
appropriate IETF drafts if you need additional information.
To the Reader
Boot
The boot process exploits the features of the Dynamic Host Configuration Protocol (DHCP), the Service Location Protocol (SLP), or the Internet Storage
Name Service (iSNS). DHCP is used to obtain not only an IP address for the
host (if needed) but also the address of either the boot device or the server
that has its name and location. Once the name and location are obtained, an
iSCSI session can be established with the boot device, and the boot process
can continue normally. Note that, if the LUN is not included, the boot process
will assume it to be LUN 0.
It is expected that the BIOS (Basic Input/Output System) that is built into the
booting system will incorporate the boot process. Alternatively, the iSCSI HBA
will respond like a normal SCSI HBA to the unmodified system BIOS, and the
extended BIOS on the HBA will permit the iSCSI HBA to have the boot
procedures built in.
The boot function can be carried out by either of the following:

The normal remote PXE process, supported by current NICs and DHCP (and
optionally BIS)
The extended BIOS, to pose as a normal SCSI HBA, and then use an IP
Address:Port (and LUN) as a boot target, as any SCSI HBA does except in
this case using iSCSI.
The IP Address:Port (and LUN) can be obtained from any of the following:
DHCP
SLP
iSNS
Discovery
Note: In all of the following cases the initiator and target node names may be
set by an administrator, or default to factory settings.
First is an administrative technique for defining the targets to the host system.
This process lets the administrator specify the target node name and IP
Address:Port to the system or its HBA. All vendor iSCSI HBAs should allow an
administrator to do this. This type of discovery is useful in small installations
where more elaborate discovery is not needed. (See Figure 12-1.)
The use of environment probes such as broadcasts and pings is not defined by
the iSCSI protocol, and they are only useful on a small network or subnet.
However, it is possible for an implementation to perform such functions in
order to make the equipment acceptable in small environments.
The second discovery approach has the administrator set the iSCSI addresses of all discovery targets in the initiator. Then the initiator can issue SendTargets to
each of those addresses. Or the administrator can just enter a single IP
address in the initiator and then set up one of the discovery targets, if
permitted by the device, to contain all the other target devices' iSCSI
addresses (including TPGTs [target portal group tags]). It is possible for this
discovery port to contain the iSCSI addresses of targets not located within the
same physical device as the discovery port.
IP storage area network (IP SAN) management software can support the
SendTargets command while pretending to be an iSCSI discovery target. It
responds with all the targets to which any specific initiator is permitted access.
In this case the access control list (ACL) used with each real iSCSI target
device should be set up to permit only the management node to establish a
discovery session with it. This is an important reason why vendors should
ensure that their storage controllers have ACL capability to permit (or not) the
use of a discovery session by selected systems or software.
Figure 12-2 shows the administrator setting the IP addresses in the storage
controllers and the SendTargets address in the initiators. The various host
systems discover the devices by logging into a discovery session on each
controller and issuing a SendTargets command. Note that the iSCSI target
network entities permit the login and the issuance of SendTargets only if the
initiator is authorized via ACLs.
The third discovery approach uses SLP (Service Location Protocol) to locate the
iSCSI target devices. SLP operates with three elements: a user agent, a
service agent, and a directory agent (see Figure 12-3).
The service agent (SA) works on behalf of one or more services (in this case
iSCSI targets) to advertise their services and their capabilities. The directory
agent (DA) collects service advertisements (in this case information for
accessing an iSCSI target). iSCSI target advertisements are like register
entries that are made up of the iSCSI target node name and all the URLs of
the various paths to that node, along with the appropriate target portal group
tag (TPGT).
The SA (iSCSI target) can either give the advertisements to the DA or keep
them itself. If it keeps the advertisements, then when contacted by a UA
(iSCSI initiator) it can let the UA see its advertisements directly.
If a DA is present, all the SAs will be able to give their advertisements to it.
The DA can then answer the UA (iSCSI initiator) with all the installation's
targets that the UA can access, and the initiator's query need go only to the
single DA to get that information.
This implies that there must be a method for the administrator to specify to
the target which hosts can contact it. The target SA will then advertise the ACL
to the DA, which will use that information in pruning the target list given to
the initiator UA when it requests a list of appropriate targets.
It will also be possible for a storage management node to perform the duties of
the DA and then apply direct administrator control over the configuration and
access list from a central location. Or it could leave the DA intact but proxy
information, such as access permissions. However, even if the storage
management node proxies the permissions, it is still necessary for the iSCSI
storage controller to receive that same ACL information. For this the storage
management node will probably use a vendor-specific protocol or SNMP to set
the ACLs. (See the section MIB and SNMP later in this chapter.)
The iSCSI target, or the storage management node, can also export the list of
boot devices (IP Address:Port [and LUN]) it may contain. In this way, the boot
process can be configured to use the SLP, and the administrator can assign any
of the boot images to one or more initiators.
Figure 12-3 depicts the iSCSI initiator and target in an SLP environment. The
iSCSI initiator node has a driver that includes the SLP UA. The iSCSI target
node includes the SA.
If the network permits it, the UAs or SAs can Multicast to address
239.255.255.253 to see if any DA responds. In non-Multicasting networks, to
provide a general method of support the SLP requires its agents to be able to
get the DA's address from DHCP and then Unicast (or broadcast) their
messages to that location (using TCP or UDP port 427). This permits the
customer to place the DA in the DHCP database (option fields 78 and 79).
DHCP can thus also be used for obtaining the location of the SLP DA (and of the
iSNS server; see the section Discovery Using iSNS).
The bottom line is this: The UAs and SAs need to implement both Unicast and
Multicast, and must allow the administrator to configure the IP address of the
DA in each UA or SA. The UAs and SAs must also be able to request the DA's
location from the DHCP server. With this combination of support, the customer
will easily be able to locate the iSCSI target nodes in
any size network.
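One plausible ordering of those lookups is sketched below in Python. The function and payload are placeholders rather than a real SLP library; a real SLPv2 SrvRqst is a binary-encoded message, and the DHCP lookup would read option 78.

    import socket

    SLP_MCAST_ADDR = "239.255.255.253"   # SLP administratively scoped multicast group
    SLP_PORT = 427                        # SLP uses TCP or UDP port 427

    def locate_directory_agent(configured_da=None, dhcp_lookup=None):
        # 1. An administrator-configured DA address wins.
        if configured_da:
            return configured_da
        # 2. Otherwise ask DHCP (option 78 carries the SLP DA address).
        if dhcp_lookup:
            da = dhcp_lookup()
            if da:
                return da
        # 3. Otherwise multicast a DA discovery request and wait briefly.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(2.0)
        # Placeholder payload; a real SLPv2 SrvRqst is a binary-encoded message.
        sock.sendto(b"SRVRQST service:directory-agent", (SLP_MCAST_ADDR, SLP_PORT))
        try:
            reply, sender = sock.recvfrom(2048)
            return sender[0]
        except socket.timeout:
            return None   # no DA found; the UA must then query the SAs directly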
Figure 12-4 shows the administrative component and the other components in
the discovery process. Once the SLP DA address is set in the DHCP (and the
various IP addresses have been set), the storage controllers will be able to find
the DA and advertise their existence and features. The host will be able to query
the DA and receive information about the iSCSI storage device locations. From
the DA the host can obtain all the information necessary to start a session with
an iSCSI storage controller.
The fourth discovery approach uses a new protocol going through the IETF
standards organization in parallel with iSCSI. This protocol is called Internet
Storage Name Service (iSNS), and it will permit targets to register with the
central control point and allow the administrator to set discovery domains
(DD), that is, zones of access, such that when the host queries the central
control point for the locations of the iSCSI storage controllers, only the
authorized controllers are reported. iSNS also has a "Notification" protocol that
will permit the host to determine when a change in the DD occurs or when a
new storage controller it is authorized to access comes online.
iSNS has a "heartbeat" service that tells clients where the current operational
iSNS server is located. It also notifies a backup iSNS server that the primary
iSNS server is down so that the backup can start servicing iSNS requests itself.
Guided by the heartbeat, initiators and targets then begin sending their
messages to the backup instead of to the primary. In this way critical iSNS
functions can continue.
iSNS supports its own special protocol as well as SLP. Therefore, except for the
optional heartbeat and notification processes, iSCSI storage controllers and
host systems need only implement the SLP protocol to be compatible with both
SLP and iSNS discovery processes. Clearly things work better if the full iSNS
protocol is utilized. However, a centralized iSNS discovery management system
is good even when SLP support alone is implemented in a host or storage
controller.
Figure 12-5 shows the administrator actions and the resulting host and the
storage controller actions with the use of an iSNS server. After the iSNS server
is set up and its address is placed in the DHCP, the storage controllers send all
their discovery information to it. The hosts can then discover from it the
names of all the storage controllers they are permitted to see. The iSNS server
may be configured further as to which host is allowed to access which storage
controllers, and so forth.
When an installation uses SLP or iSNS, the iSCSI storage controllers' discovery
sessions should be set up to permit no access by any requester. In this way
any initiator making such a request will be denied and have to go through the
SLP or iSNS servers.
iSNS permits the registration of Fibre Channel (FC) devices as well as iSCSI
devices. (See Figure 12-6.) It supports both the FC SNS (Storage Name
Server) protocols and the iSNS iSCSI protocols. When the installation is using
an iSNS server, the FC devices can still invoke their SNS protocol and receive
their normal FC worldwide names (WWNs), and the like. However, when the
network has an FC-to-iSCSI router, the SNS request can also receive an FC-
compatible name, which represents the iSCSI device. In addition, iSNS will not
only operate in the normal iSCSI manner described above but, when queried,
returns to an iSCSI initiator an iSCSI-compatible node name (pseudo name)
and address for an FC device (as long as a router path exists between the
iSCSI entities and the FC units). This should be of significant help with the
high-end integration of iSCSI and FC devices. Finally, iSNS can be used to
house public key certificates that may be used in iSCSI security.
Nishan Systems has made its implementation of iSNS open source and thus
available to any vendor. This includes both server and client code. Its client
protocols, which hosts and storage controllers use, are very lightweight and
can be easily included in all iSCSI implementations.
Security Process
The basic must-implement protocols are IP Security (IPsec) and Internet Key
Exchange (IKE). When used together, they identify legitimate endpoints and
permit the encryption of the connection that binds them. IKE authenticates the
endpoints (and creates an encryption key) using either certificates or
preshared keys. A preshared key is a string of bits given to the various iSCSI
endpoints, probably via a manual process, so that they can be used in the
establishment of a secure connection.
With IPsec (along with IKE) the endpoints can be secured from a number of
threats, one being the "man in the middle" attack where a system changes the
packets as it pretends to each side to be the endpoint. The authentication and
integrity features of IPsec (with IKE) ensure that the endpoint is authorized.
IPsec can also ensure that each packet received is from that endpoint and
contains exactly what the other endpoint sent (leaving no chance for a "man in
the middle" attack to change the packet or pretend to be the endpoint).
Another threat is replay, in which an attacker records a session and plays it
back later. The replayed traffic may appear valid to the target, but replaying a
previous session can be destructive to values set in intervening sessions.
IPsec's anti-replay feature can protect against this damage.
An installation may also require a security feature for privacy. Privacy (i.e.,
confidentiality) is the encryption of the session so that no one except the
endpoints can understand the message. These IPsec features are options that
an installation can attach to its iSCSI connections. Figure 12-7 depicts the IKE
and IPsec processes.
In actuality, with self-signed certificates or group preshared keys, all that can
be accomplished is for the remote systems to validate the target as a valid
target. The target, however, cannot determine that the initiator is a valid iSCSI
initiator. The endpoints can still establish a secure connection, with integrity
and even privacy (as specified above); it is just impossible to know that the
initiator side is trustworthy.
iSCSI not only exploits the IPsec and IKE protocols to secure the links, but also
authenticates the endpoints via the security processes and protocols that
computing centers use to authenticate their other users. These processes and
protocols can be Kerberos (v5), SRP, CHAP (which uses RADIUS servers),
SPKM-1, or SPKM-2. (For additional information, see Introduction to the
Login Process in Chapter 5.)
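For the CHAP case, the response value exchanged during login is the standard computation from RFC 1994: an MD5 hash over the one-byte identifier, the shared secret, and the challenge. The login text keys that carry these values are CHAP_I, CHAP_C, CHAP_N, and CHAP_R; the sketch below shows only the arithmetic, with invented variable names.

    import hashlib
    import os

    def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
        # CHAP response per RFC 1994: MD5(identifier || secret || challenge)
        return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

    # Target side: issue a challenge; initiator side: compute and return the response.
    challenge = os.urandom(16)
    identifier = 1
    secret = b"shared-iscsi-secret"          # provisioned on both initiator and target

    response = chap_response(identifier, secret, challenge)          # initiator computes
    assert response == chap_response(identifier, secret, challenge)  # target verifies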
It should also be noted that all the various security processes and features are
optional to use, but some (e.g., CHAP and IPsec) are required to implement.
Therefore, the customer may choose, for example, not to turn on any security
(which is probably okay for small isolated networks), not to use IPsec
authentication and integrity, or not to use IPsec encryption. It is also possible
that the customer may not want to turn on iSCSI authentication, whether or
not IPsec is used. In other words, the customers can select only the security
features that they want.
To the Reader
You may not be interested in the level of detail to come; if so, you can skip to
the section Access Control Lists with little loss in overall understanding. The
information that follows can serve as a base for further investigation.
IPsec Features
IPsec has a set of features that vendors must, should, or may implement. I will
not go into them in any depth here except to mention them so that you may
do further research as appropriate.
ESP permits a number of security algorithms that fill in the ESP headers:

Encryption algorithms[*]

AES in CBC MAC mode with XCBC extensions was designated as should-implement
because of the code savings if AES is supported for encryption.

[*] At the time of writing, AES in counter mode was still subject to the IETF
IPsec workgroup's standardization plans.
IKE (Internet Key Exchange) must be used for peer authentication, negotiation
of security associations, and key management. It has the following
implementation details:
Manual keying cannot be used because it does not provide the necessary
re-keying support.
Peer authentication with digital signatures can be done using IKE Main Mode or
IKE Aggressive Mode.
The point is that, with the appropriate LU ACLs, the iSAN can be used without
worry that hosts will interfere with each other's LUs. It is strongly suggested
that each iSCSI storage controller implement ACLs for accessing the storage
controller and for accessing the various LUs within it. In this way the storage
controller will fit easily into both high-end and low-end environments, even if
full-feature storage SAN management software is not installed on the various
hosts. More important, the storage controller needs the full set of ACL
functions, because iSCSI may be connected via any network, including campus
intranets and the Internet.
MIB and SNMP
iSCSI has defined both read and write fields within the MIB that can be used to
manage the entity from afar. Therefore, version 2 or 3 of SNMP (SNMPv2 or
SNMPv3) should be used with iSCSI because both versions have the security
functions needed to permit the management node to both read and write MIB
fields. There have been notable security exposures even with SNMP, however,
and so the installation may need to use IPsec for any session that could
change the MIB. The implementation should thus support IPsec on the SNMP
connection.
One of the traditional uses of MIBs and SNMP has been the recording and
extraction of information on the performance or health of the object containing
a MIB. The iSCSI MIB is no different in that it has a number of defined entries
that hold statistics, which can be queried and used in management reports.
Figure 12-8 depicts the use of MIBs and SNMP in an iSCSI environment.
Chapter Summary
This chapter covered many of the companion processes that go into creating a
complete iSCSI environment. For example, any transport protocol that
supports SCSI should support the boot process. In this way the host can use
that protocol not only to access the application production data but also to
perform the operating system boot so that some other type of hard disk drive is not
required.
The boot process was missing from FC implementations for some time; this
was a setback for full FC utilization.
The iSCSI framers and architects defined the boot process to accomplish
this needed function.
When storage devices are connected via a network (IP or FC), hosts should
be permitted to access only the storage controllers they are meant to
contact.
Limiting which hosts can see which storage controllers is both a security
feature and an error avoidance feature.
Manual configuration
IPsec (with IKE), which operates transparently to iSCSI and protects the
links from being compromised.
IPsec support for authentication and integrity assurance, along with anti-
replay protection.
A number of security processes work with IPsec for iSCSI to provide the
authentication and integrity checks.
ESP must operate in tunnel mode and may operate in transport mode.
Successful iSCSI storage controller vendors will offer access control list (ACL)
management for both storage controllers and LUs. In so doing they can sell to
installations of any size (especially smaller ones) while protecting the
customer's data from unauthorized users.
In some cases SNMP may even set values and fields, thereby providing
completely remote management of the iSCSI network entity. Because this is a
companion standard protocol and because the layout of the MIB is being
standardized for iSCSI in general, the same management program should be
able to manage many different vendors' iSCSI storage controllers.
Chapter 13. Synchronization and Steering
As I began writing this book, there were two techniques before the IETF
dealing with the synchronization and steering of iSCSI data. One is included in
the base iSCSI document since it only applies to iSCSI. Called fixed interval
marking (FIM), it involves placing markers directly in the TCP/IP stream that
will identify the start of iSCSI PDU headers. The other technique, a framing
proposal, is called TCP upper-level-protocol framing (TUF). Since TUF is also
applicable to non-iSCSI upper-level protocols (ULPs), it has another path
through the IETF standards group and will be only lightly covered here. As I
complete this book, another proposal has been developed. I call this TUF/FIM,
as it is based on both TUF and FIM.
To the Reader
FIM adds things to the TCP/IP byte stream before TCP/IP sees it. Therefore, it
causes no modification to the TCP/IP stack for sending or receiving the
markers. In contrast, TUF requires changes to the base TCP/IP stack and can
only be used with TCP/IP implementations that have been upgraded to perform
TUF operations. Since it cannot be required that all TCP/IP implementations
change their TCP/IP stacks, iSCSI cannot carry TUF as a must-implement.
Also, because of TUF's lack of determinism (that is, it cannot always be known
where the TUF frame starts), it cannot become a part of the iSCSI protocol,
even as a may-implement, at least not until it has shown its usefulness as an
experimental IETF protocol. TUF is covered here for completeness, but it is not
part of the iSCSI standard and may or may not go on to become a standard in its
own right. Moreover, the TUF proposal is considered experimental at this time.
Main Memory Placement
Direct memory placement can begin as soon as the appropriate iSCSI PDU
header arrives in a TCP/IP buffer. When a basic header segment (BHS) is found
in the buffers, it can be used by the iSCSI function to place, or "steer," that
BHS and the rest of its PDU directly into the final main memory locations. In
this way the host or storage controller needn't expend the extra CPU cycles
and put up with the bus interference that occurs when extra data moves are
required. These extra moves normally occur when the main processor moves
the data from "anonymous" main memory "staging buffers" into final main
memory. The speedy and direct placement of data from the HBA into final main
memory also permits the HBA to keep its "on-board" RAM to a minimum.
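In rough terms, the steering step works as sketched below: the iSCSI function keeps a table of buffers registered per initiator task tag, and when a Data-In BHS arrives it looks up the tag, reads the buffer offset carried in the PDU, and copies the data segment straight to its final location. The byte offsets follow the iSCSI Data-In PDU layout loosely (48-byte BHS, no digests, no AHS); this is an illustration of the idea rather than a complete parser.

    import struct

    # Buffers registered per initiator task tag (ITT) when the command was issued.
    registered_buffers = {}   # itt -> bytearray sized for the expected data

    def steer_data_in(pdu: bytes) -> None:
        # DataSegmentLength is bytes 5-7, ITT is bytes 16-19,
        # BufferOffset (Data-In only) is bytes 40-43 of the 48-byte BHS.
        data_len = int.from_bytes(pdu[5:8], "big")
        itt = struct.unpack_from(">I", pdu, 16)[0]
        buffer_offset = struct.unpack_from(">I", pdu, 40)[0]
        data = pdu[48:48 + data_len]              # data segment follows the BHS
        final = registered_buffers[itt]           # the pre-registered destination
        final[buffer_offset:buffer_offset + data_len] = data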
There is one situation that prevents the HBA from using a small on-board
RAM, and that has to do with TCP segments that get dropped because of errors
detected by TCP or the link layer or because of congestion somewhere along
the transmission path. These include errors detected on the HBA.
The longer the transmission distances and the higher the bandwidth, the more
on-board memory is needed for the reassembly buffers.[*] If a TCP segment is
missing from the TCP buffers (because of errors or network congestion), a
relatively long time may pass before the segment is retransmitted and
received at the HBA. However, if that TCP segment does not contain an iSCSI
BHS (basic header segment), there is not much of a problem. The rest of the
data can be placed in main memory and will not cause a significant back-up in
the HBA's buffers. However, if the missing segment contains a BHS, the eddy
buffer[**] needs to be large enough to hold the stream of TCP/IP bytes as they
come down the line, until it receives the missing segment and can then begin
placing the data in main memory. Depending on line speed and distance, the
size requirements for the eddy buffer may become quite large.
[*]
See Chapter 8 for an explanation of reassembly buffers.
When the BHS is missing, the iSCSI HBA or chip cannot tell where the data
should be placed and, even more important, cannot tell where the next PDU
begins. When this happens, the iSCSI session connection is stalled until either
the missing segment arrives or TCP's error recovery process causes the TCP
segment to be resent and finally received without error. This is where marking
and framing come into the picture.
Markers assist the HBA by identifying subsequent BHSs and thereby relieve
the pressure on the eddy buffers. This is possible because subsequent BHSs
(those beyond the missing one) can be used to place themselves and their
data directly in main memory. Yes, the TCP/IP sections of the PDU with the
missing BHS still need to be queued until that missing BHS arrives. However,
that is a relatively small amount of buffer space, limited in size by the value of
the iSCSI variable called MaxRecvDataSegmentLength (plus the maximum
size of the PDU header). Even so, it should be noted that this might occur
concurrently on many of the connections in the HBA. In fact, with server
network congestion problems, many if not most of the connections may have
missing TCP segments, several of which may include BHSs. The actual amount
of needed eddy buffer in any specific HBA will be determined on the basis of
the vendor's probabilistic calculations.
A major error event, such as an Ethernet link or switch problem, can cause
missing TCP segments on most of the connections to an HBA. However, if the
event is quickly resolved, the HBA can continue to receive the subsequent
segments. Since most, if not all, connections will be queuing TCP segments
waiting for the resend, the probabilistic total eddy buffer size may then be in
jeopardy of overrunning. In this case, TCP generally reduces its window and
starts throwing TCP segments away so that the needed missing segments can
arrive and enable the normal flow of TCP traffic. Still, it is possible that the
discarded segments may also contain a BHS and, if discarded, could prevent a
quick return to normalcy. Because markers are able to point to the BHSs, the
iSCSI-integrated function can help TCP determine what segments not to
discard. That is, iSCSI should be able to identify segments with a BHS and
hopefully prevent them from being discarded.
Fixed-Interval Markers
The FIM techniques require something to be added to the TCP/IP byte stream
that permits the receiving process to recognize a subsequent BHS, even in the
presence of missing ones. For example, FIM inserts pointers into the data
stream, each of which points to the location of the next PDU. A FIM can be
readily inserted into the TCP/IP stream by iSCSI software implementations
without any changes in the operating system's TCP/IP stack. This makes
markers useful in environments where desktop and laptop systems are
connected to an iSCSI storage controller that has iSCSI HBAs or chips.
There is one problem with direct placement in memory without using eddy (or
reassembly) buffers. That is the issue of validating the CRC (cyclical
redundancy check) digest. Many vendors believe that it is necessary for the
complete PDU to be available before the CRC digest can be computed.
Therefore, when operating with CRC digests these vendors will require a larger
eddy buffer size than when not in CRC mode. Some other vendors have
determined that they can perform incremental CRC digest computation, and
for them the full value of markers is available without reduction, whether
using CRC digests or not.
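The incremental approach amounts to carrying the CRC state across chunks instead of buffering the whole PDU. The sketch below uses the CRC32c (Castagnoli) polynomial on which the iSCSI digests are based; it ignores the padding and byte-ordering details spelled out in the iSCSI draft and is meant only to show that chunk-at-a-time computation yields the same digest as computing over the complete PDU.

    CRC32C_POLY = 0x82F63B78   # Castagnoli polynomial, reflected form

    def crc32c_update(state: int, chunk: bytes) -> int:
        # Fold one chunk of data into the running CRC state.
        for byte in chunk:
            state ^= byte
            for _ in range(8):
                state = (state >> 1) ^ (CRC32C_POLY if state & 1 else 0)
        return state

    def crc32c(data: bytes) -> int:
        return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF

    # Chunk-at-a-time computation matches the one-shot computation.
    pdu = bytes(range(256)) * 4
    state = 0xFFFFFFFF
    for chunk in (pdu[:100], pdu[100:700], pdu[700:]):
        state = crc32c_update(state, chunk)
    assert (state ^ 0xFFFFFFFF) == crc32c(pdu)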
FIM Pointers
FIMs place two one-word pointers in the outgoing TCP/IP stream at fixed
intervals that point to the next PDU in the stream. This permits the receiving
side to determine where the next PDU is located or, if that segment is missing,
where the next marker is located. It is possible that the next marker's
segment may also be missing, but subsequent marker positions can be
computed, and when those segments arrive one will again have a pointer to a
subsequent PDU.
The two one-word pointers are required because the partitioning of data into
TCP/IP segments may occur at any point within a TCP/IP byte stream. This
implies that the segmentation can occur within the marker itself, thereby
making the single pointer unusable. To avoid this problem, the pointer is
doubled to guarantee that a whole pointer will be located within a segment.
Marker Implementation
The iSCSI specification states that vendors may implement the sending of
markers and may implement their receipt. Clearly, if the vendor does not
implement sending markers, the customer cannot turn them on, and the
receiving side won't be able to receive them. Likewise, if the receiver does not
implement receiving markers, it does little good to send them.
All this is necessary because TCP/IP delivers just a string of bytes and there is
no method of ensuring that the PDU header is always at a known displacement
within a segment.
FIM presents a simple scheme for synchronization that places a marker with
synchronization information at fixed intervals in the TCP stream. A marker is
shown in Figure 13-1 and consists of two pointers indicating the offset to the
next iSCSI PDU header. It is 8 bytes in length and contains two 32-bit offset
fields that indicate how many bytes to skip in the stream in order to find the
next header location. The offset is counted from the marker's end to the
beginning of the next header. The marker uses two copies of the pointer; thus,
a marker spanning a TCP segment boundary would leave at least one valid
copy in one segment.
The offset to the next iSCSI PDU header is counted in terms of the TCP stream
data. Anything inserted by iSCSI into the TCP stream is counted for the
offsetspecifically, any bytes "inserted" in the TCP stream but excluding any
"downstream" markers inserted between the one being examined and the next
PDU header. The marker interval can be small enough so that several markers
are placed between the one currently being examined and the next PDU.
However, the value placed in the current marker should be the same as the
value that would have been placed there if the interval were larger, with no
intermediate markers.
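A simplified sender-side sketch of that rule follows. It counts the interval in stream-data bytes, excludes the markers themselves from the offset (as the text requires), and for brevity points each marker at the PDU header that starts at or after the marker position; tracking the running position across many PDUs and connections is left out.

    def add_markers(pdus, interval):
        # Stream-data position at which each PDU header begins.
        starts, pos = [], 0
        for pdu in pdus:
            starts.append(pos)
            pos += len(pdu)
        stream = b"".join(pdus)

        out = bytearray()
        prev = 0
        for mark_at in range(interval, len(stream) + 1, interval):
            out += stream[prev:mark_at]
            # Offset from the end of this marker to the next PDU header,
            # counted in stream-data bytes (inserted markers are excluded).
            next_hdr = next((s for s in starts if s >= mark_at), len(stream))
            offset = next_hdr - mark_at
            out += offset.to_bytes(4, "big") * 2    # the pointer is doubled
            prev = mark_at
        out += stream[prev:]
        return bytes(out)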
Markers are placed at fixed intervals in the TCP byte stream. Each end of the
iSCSI session specifies during login the interval at which it is willing to receive
the marker, or it disables the marker altogether. If during negotiation a
receiver indicates that it desires a marker, the sender should agree (if it
implements markers) and provide the marker at the desired intervals.
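The login text keys that carry this negotiation are OFMarker and IFMarker (whether markers will flow in each direction) and OFMarkInt and IFMarkInt (the interval, expressed in 4-byte words). As an invented example, an initiator willing to receive markers roughly every 8KB might offer

    IFMarker=Yes
    IFMarkInt=2048~65535

and the target would answer with IFMarker=Yes (or No) and a single chosen interval, for example IFMarkInt=2048.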
In iSCSI terms, markers must indicate the offset to the next PDU header, which
must fall on a 4-byte word boundary in the stream. The last 2 bits of each
marker word are reserved and are considered zero for offset computation.
To enable the connection setup, including the login phase negotiation, marking
(if any) is started only at the first marker interval after the end of the login
phase. However, the marker interval is computed as if markers had been used
during the login phase. Figure 13-2 depicts a TCP/IP stream with PDUs and
markers.
A related technique for managing the data that backs up in HBA buffers is the TUF proposal,
which is moving through the IETF on a separate track. If it moves from
experimental status into "Last Call" and beyond, a future iSCSI specification
may be updated to reference TUF and recommend it as a "should" or "may"
implement. However, the IETF has other interrelated efforts in flight to develop
new protocols, called DDP (Direct Data Placement) and RDMA (Remote Direct
Memory Access), and these protocols may choose instead a related technique
which I call TUF/FIM (covered later).
The TUF Scheme
TUF can be used as a way to package iSCSI PDUs so that they fit into a TCP
segment. It ensures that the PDU starts at the beginning of a TCP segment.
The application, in this case iSCSI, needs to ensure that it makes the PDU no
bigger than the TUF frame. The maximum size PDU an application such as
iSCSI can give TUF is specified by a yet-to-be-defined TUF API. TUF sets its
frame size by determining the effective maximum segment size (EMSS) of the
network and subtracting 8 bytes for a header of its own. It is possible to place
more than one iSCSI PDU within a TUF frame; however, no PDU can span
multiple frames. Therefore, every TCP segment that arrives at the iSCSI HBA
or chip can be placed directly in main memory since it always has a BHS at its
beginning and it might have more than one within the segment. In any case,
the BHS can always be located, and based on the information it holds, it and
the data within the PDU can be placed directly in main memory. Moreover, if
more than one PDU is present in the segment, all of them can be placed
directly in main memory.
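A sketch of that packing rule, assuming a fixed EMSS and the 8-byte TUF header (2-byte length plus 6-byte key) described below:

    import os

    def pack_tuf_frames(pdus, emss):
        # Each TUF frame must fit in one TCP segment: an 8-byte header
        # followed by one or more complete PDUs, never a partial one.
        payload_limit = emss - 8
        frames, current = [], b""
        for pdu in pdus:
            if len(pdu) > payload_limit:
                raise ValueError("PDU larger than a TUF frame allows")
            if len(current) + len(pdu) > payload_limit:
                frames.append(current)
                current = b""
            current += pdu
        if current:
            frames.append(current)
        key = os.urandom(6)   # randomly chosen key carried in the frame headers
        return [len(f).to_bytes(2, "big") + key + f for f in frames]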
The important thing about TUF is that there is no requirement for an eddy
buffer within the HBA/chip; hence, the cost of the HBA can be greatly reduced
and the latency caused by reassembly buffers can be eliminated.
That last statement is not completely true. One minor condition demands that
TUF still have a small eddy buffer. To understand it we need to look at the TUF
header. TUF needs its header to determine if its frame has been segmented
while traveling through the network. The header, currently made up of a 2-
byte frame length and a 6-byte randomly chosen key, is used in an important,
though rare, situation: when the EMSS dynamically changes as TCP segments
travel through the network. In any case either the TCP segment contains a
complete TUF frame or TUF needs to detect, via the header, that a
segmentation occurred and then gather up smaller TCP segments and present
them to iSCSI. This is a type of eddy buffer, but it is generally less than two
times the size of a TUF frame and it is only used on a highly exceptional basis.
In fact, its use should be so rare that the probability of an error condition at
the same time is small enough to treat the segment recovery in a normal
TCP/IP manner. Even then it would only have a very small total memory set
aside for use in this rare case. (The total buffer is probably small enough to be
maintained directly on an iSCSI chip.)
The TUF/FIM Scheme
The fixed interval markers for TUF/FIM are placed every 512 bytes and they
point to the beginning of the frame. This means that instead of pointing
forward to the next PDU, the marker points backward by carrying the number
of bytes between the beginning of the frame and the start of the marker. A
special case exists if a frame begins in the first byte following the marker; in
that case the marker has a value of 0. (See Figure 13-4.)
Using this approach, the marker is not doubled (as in FIM, described
previously). If synchronization to the frame is ever lost, the receiver may
compute the location of the next marker and use it to reestablish the frame's
beginning. As with TUF, explained above, it is assumed that the frame can be
made to start on a segment boundary; however, if not, TUF/FIM will still
operate.
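A sketch of how a receiver could use these backward markers to resynchronize follows. It assumes markers sit at every 512-byte boundary of the stream, that each marker carries the count of bytes from the start of the current frame to the start of the marker, and a 4-byte marker size; reading the marker value once its segment has arrived is abstracted behind a placeholder callable.

    MARKER_INTERVAL = 512     # markers at every 512 bytes of the stream
    MARKER_SIZE = 4           # assumed size of the single (undoubled) marker

    def recover_frame_start(lost_at, read_marker):
        # Find the first marker at or beyond the point where sync was lost.
        next_marker = ((lost_at // MARKER_INTERVAL) + 1) * MARKER_INTERVAL
        back_offset = read_marker(next_marker)
        if back_offset == 0:
            # Special case: a frame begins in the first byte after the marker.
            return next_marker + MARKER_SIZE
        return next_marker - back_offset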
Chapter Summary
In this chapter we covered the advantages of integrating the TOE function with
the iSCSI on the same HBA or chip.
This integration permits the iSCSI process to pull TCP/IP segments directly
from the TCP/IP buffers and place them in their final main memory
positions.
Three proposals were covered: FIM, currently part of the iSCSI standard;
TUF, which is only in an experimental state; and TUF/FIM, which is being
proposed as part of a possible RDMA/DDP standard.
The FIM specification requires insertion of two 4-byte pointers into the
TCP/IP byte stream.
The first marker is not inserted until after the login is complete, at the first
marker interval in full-feature phase.
The markers are placed at a position that would have been correct if the
marker had been inserted during login.
The TUF protocol is not currently part of iSCSI and is on the IETF's
experimental track.
TUF places PDUs in segments so that the segments always start with PDUs.
TUF detects when networks re-segment the TCP/IP segment and adjusts
accordingly.
The TUF/FIM protocol may be developed as part of a possible RDMA/DDP
standard.
The fixed interval markers in TUF/FIM are backward pointers that are
placed every 512 bytes and point backward to the beginning of the frame.
Chapter 14. iSCSI Summary and Conclusions
To the Reader
Summary
Conclusions
The Future
Summary of Conclusions
To the Reader
The first part of this chapter is a summary of previous chapters. Following that
are my conclusions as well as my expectations for iSCSI and how it may be
deployed. I also discuss how iSCSI may evolve in the future. All readers should
be able to follow this information.
Summary
iSCSI is a transport for the SCSI protocol, carrying commands and data from
the host processor to the storage controller and carrying data and status from
the storage controller to the host processor. The host processor is considered
the initiator system; the storage controller, the target device.
iSCSI has chosen to base its protocol on TCP/IP because TCP/IP is the most
prevalent and robust networking technology available today. Customers are
familiar with TCP/IP and currently depend on it for their "bet-your-business"
applications.
Because it uses TCP/IP, iSCSI's protocol data units (PDUs) are wrapped in TCP
segments, which have IP routing protocol headers. Thus, iSCSI flows on top of
many different link protocols such as Ethernet, SONET, and ATM. Therefore,
iSCSI protocols can cross almost any physical network on their way from the
initiator system to the target storage controller.
The advantage of network storage is, of course, the distance between the
initiator and the target. More than that, it is the flexibility obtained by having
storage externally connected and the interconnect shared with several
different systems and storage controllers. This permits the storage to have
more than one system connected to it, which means that the installation can
balance its total resources among all its connected systems. Instead of one
system having unused storage and another one being storage constrained, the
total storage resource can be balanced across the systems without either
overcommitting the resource or underutilizing it. Also, pooled storage is
generally less expensive per megabyte.
The value of pooling can best be seen in its use with a tape library. Clearly it is
not practical for every server or desktop system to have its own tape library,
especially if we are talking about the enterprise type. With a network connect
and the right software, however, the library can be shared by all systems. The
story is less obvious with disk storage, but as customers begin to pool their
storage, they will see the economies of scale in both the total price and the
robustness of the resultant storage controller.
Fibre Channel (FC) offers a network solution that permits pooling. However,
Fibre Channel, though robust, is generally more expensive than what a
midrange to low-end system can afford. In fact, the cost of Fibre Channel may
reduce or eliminate any savings that would otherwise have accrued from
pooling.
iSCSI, on the other hand, is being priced low enough by its vendors to be
readily accepted by midrange and small office servers, and it is as robust as
Fibre Channel on the high end. Further, it can be used with desktop and laptop
systems that have software iSCSI device drivers.
This book explained the range and flexibility of iSCSI by describing the
environments in which it can operate. These environments are
Computer-store software
Computer-store software
Specialty software
Optionally a dual-dialect device that replaces both the NAS and the
iSCSI storage controllers
Multiple servers
Optionally a dual-dialect device that replaces both the NAS and the
iSCSI storage controllers
High-end environment: campus area (departmental)
Computer-store software
Optionally a dual-dialect device that replaces both the NAS and the
iSCSI storage controllers
iSCSI storage devices connected via office LANs for use by both
servers and desktops/laptops
Optionally a dual-dialect device replacing both the NAS and the iSCSI
storage controller
FC network
As you can see, iSCSI has a position in almost every environment. However,
Fibre Channel has not gone away and will probably be ingrained in the high-
end environment, especially in the central location, far into the future. This
will ensure a strong business for FC to/from iSCSI bridges, routers, and
gateways for a very long time. That said, it is important to know that many
pundits predict that iSCSI will surpass Fibre Channel. I believe that the
crossover point will come in 2006 or 2007.
As they worked on the protocol, the designers began to see how hardware
HBAs and chips could be created to make the protocol work not just on low-
end software implementations but on high-end performance-critical systems.
At the core of this hardware approach was the TCP/IP offload engine (TOE). As
development progressed it became obvious how an iSCSI engine could be used
with the HBA/chip to also offload most of the iSCSI processing from the host.
The designers also realized that when iSCSI and TOE were combined, the
iSCSI processor could see into the TCP/IP buffers and place the data directly
into the final main memory locations. This saved even more processor
overhead and meant that the resultant host CPU cycles would be as low as
those of SCSI or Fibre Channel.
The iSCSI designers knew that they had to make some resource-heavy
features optional, so they created a key=value negotiation process for the
initiator and target to agree on items that could support different vendor
solutions and customer needs. In order not to put the customers' data at risk,
they added additional CRC digests to ensure that no data error would go
undetected.
To make the administrative process easier and to solve some of the problems
encountered with Fibre Channel, the iSCSI designers created world-wide
unique names that applied to the initiator node (the whole OS) and target
node. In this way, customers could apply the same access control to any iSCSI
session from the same OS.
The designers also decided to define both a boot and a discovery protocol as
part of the family of iSCSI-related standards. This included definitions of the
iSCSI MIB and how it would be used with SNMP and the normal IP network
management functions. Boot and discovery capabilities came very late to Fibre
Channel, and the FC MIB is just now arriving. In contrast, these items were
defined from the very beginning of the iSCSI standardization effort.
iSCSI's designers realized that discovery needs were dependent on host and
storage network size. Thus, simple administrator settings were all that was
needed for home office environments, and iSCSI discovery sessions were all
that was needed for small office configurations. The Service Location Protocol
(SLP) was appropriate for the midrange networks. It was also determined that
the iSNS protocol was appropriate for enterprise configurations, especially
since iSNS could work for the various installations across a campus (including
the departmental systems) and work remotely for satellite locations. The
designers worried about tape operation at-distance, and so ensured that the
protocol had the necessary capabilities to permit continuous streaming of data.
The IETF would not accept iSCSI unless it had strong security that included
privacy (encryption). In today's world the solution is known as IPsec. As things
stand now, the iSCSI standard implements IPsec with a selection of pertinent
features and functions.
By the time the iSCSI Protocol Standards Draft document was written, the
designers were sure that iSCSI could be successful in environments of all types
and sizes. It would perform well, it could be implemented in both hardware
and software, and it would come with a complete suite of support protocols, all
of which would make it robust, secure, and reliable.
Conclusions
When IETF assigns an RFC number to the iSCSI protocol, iSCSI will be applied
to the various topologies described above. This will entail heavy education and
marketing to the storage industry. Expect the SNIA (Storage Networking
Industry Association) to take the lead in this via the SNIA IP Storage Forum. It
will explain what iSCSI will do for the customer without reference to any
specific product. This will be done to build general demand for the various
iSCSI products. The various vendors will then be able to tell customers about
their own specific products, without having to build and organize demand by
themselves.
Probably the most important factor in iSCSI's market success will be storage
network management. There will be a demand for storage network
management similar to Fibre Channel, but integrated with IP network
management. Customers will expect to have a rendering of both nonstorage
network entities and iSCSI entities. They will also expect the management
software to help the administrator monitor storage usage, as well as assign
and authorize the use of the various storage target devices by the appropriate
host systems. Customers will also expect LUN management and allocation to
be part of the total storage management package.
Ease of Administration
Customers will expect the iSCSI network to be easy to use as well as manage.
Storage controller setup needs to be straightforward. For example, a customer
should feel that it is as easy to add a logical LU to, say, "Frank's" system as it
is to add Frank to a NAS file system. If that happens, then iSCSI will be
assured of success.
Expect heavy use of iSCSI as part of remote backup and disaster recovery.
Remote backup has always been considered the "killer app" for iSCSI, because
users want the backup location for their data to be somewhere their primary
system is not, and iSCSI clearly facilitates that. However, since the 9/11
disaster, there has been much additional interest in disaster recovery. This
includes remote site recovery, and not just a single remote site but perhaps an
extra backup site located at a "significant distance" to protect it should a
disaster spread from the local site to the near-remote site.
In the aftermath of 9/11, many businesses found, not only that their primary
site was taken out, but also that they could not get into the backup site
because access in the general area was restricted. Others had only limited
contracts with their backup site provider and within a few weeks needed yet
another backup site but did not have one.
A new focus of "disaster recovery backup" thus permits key applications to be
quickly started at a remote location. This is called "active remote backup."
Today, the most prevalent active remote backup configuration uses a set of
"edge connect" boxes (one local and one remote) that convert the I/O
protocols into a form that permits operation across various WANs (wide area
networks). The techniques and equipment used, along with the amount of
bandwidth to the remote storage devices, determine the amount of data that
can be sent to a remote location, with as little data loss in a disaster as is
consistent with a company's business impact tolerance. Some companies find a
data loss of more than a few minutes intolerable, and others are willing to lose
hours. iSCSI will not directly affect the amount of data loss or the bandwidth
needed to minimize that loss. However, it may have an indirect effect given
that it doesn't need the special "edge connect" protocol converters and so is
cheaper to implement. This permits the purchase of additional (IP tone)
bandwidth from various common carriers.
Edge connect devices are not needed because iSCSI can directly address the
remote storage unit across the WAN, and can write anything directly to the
storage controller regardless of its location. With the carrier delivering an "IP
tone" to the host and the storage devices, a virtual private network (VPN) can
be established that allows iSCSI to send data directly from hosts to remote
recovery centers. As a rule, common carriers that provide the "IP tone" charge
much less than they do for dedicated lines or "dark fiber," again providing
more bandwidth at lower cost. With more and cheaper bandwidth made
available, budget-conscious iSCSI solutions can be put together that permit
less data loss than with non-iSCSI solutions.
Even if the host is using an FC network as its primary storage path, mirroring
software can write directly to the remote storage unit just as easily as it can
mirror the writes to a local FC storage controller.
To sum up, with the sudden revived interest in disaster preparation and
recovery, we can look at iSCSI as a JIT (just in time) technology.
Performance
At first, one of the key worries about iSCSI was whether it would perform as
well as Fibre Channel. However, HBA and chip vendors are now reporting that
their products do perform as well. This brings us to two additional ways to
gauge iSCSI performance:
Since Ethernet is now taking the lead in defining the optical link characteristics
of 10Gb (and above) links, we should see the jump from 10Gb to 40Gb links in
Ethernet before Fibre Channel. However, there is even more potential with
iSCSI. Since you will be able to use MC/S with any speed connection, it will be
possible to create a single session made up of multiple 10Gb links and, beyond
that, multiple 40Gb links.
Remote direct memory access (RDMA) with direct data placement (DDP) is a
new protocol you will see coming to market in the next few years. It will
permit the RDMA protocols developed for InfiniBand to be used on TCP/IP as
well as SCTP/IP. You can expect current iSCSI vendors to support this protocol
on their HBAs and chip sets while also supporting iSCSI. You should also
expect an extension of iSCSI to operate with the RDMA transmission protocols,
perhaps called iSER (iSCSI Extension for RDMA). However, since these vendors
will also be required to support iSCSI version 1, the movement of the industry
into iSER, if at all, should be gradual and well planned.
The Future
iSCSI is a reality today. How successful it will become will depend not just on
its compelling possibilities, but also on how well IP network management
software can integrate with storage networking software to make the
environment easy to use and administer. Part of this will also depend on how
well the iSAN will integrate and be managed with upgraded versions of
existing SAN management software.
SAN/iSAN coexistence needs will last a long time. Thus, the key to quick and
long-lasting deployments will be how soon, and how well, they can interoperate
and be managed together (in bet-your-business configurations).
It is tempting for iSCSI bigots to speculate on how long Fibre Channel will last
once iSCSI hits its stride. In fact, it will last for as long as it provides a solution
to customer needs that are not acceptably addressed by other technologies.
iSCSI is one of those technologies; its only important difference is that it can
play in areas where Fibre Channel cannot play, as well as in the same central
computing facilities in which Fibre Channel currently dominates. Whether this
is reason enough to drop Fibre Channel and move to iSCSI is not clear.
In any event, expect both Fibre Channel and iSCSI to be around for the
foreseeable future, and for iSCSI to continue to evolve and be supported by
several additional transmission technologies that will permit it to extend into
even wider use.
Summary of Conclusions
iSCSI networks will contain more initiators than Fibre Channel has, so its
ease of administration is critical.
Backup is the "killer app" for iSCSI, because iSCSI was made for "at-
distance" I/O.
iSCSI will be part of active remote backup, either in the host system or in
the storage controller.
iSCSI will perform well "at distance," since that was a key part of the
design of the iSCSI protocol.
In these descriptions, as in all things in this book, the official IETF iSCSI drafts
are the last word on any differences between them and what is written here.
Serial Number Arithmetic
The 32-bit serial number fields in the PDUs are treated as 32-bit serial
arithmetic numbers. Serial number arithmetic is fully defined in [RFC1982]
and will be lightly covered here.
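A compact rendering of the two operations iSCSI needs from [RFC1982], 32-bit wrapping addition and ordered comparison, is sketched below.

    SERIAL_MOD = 1 << 32       # serial numbers are 32 bits
    HALF = 1 << 31

    def serial_add(a: int, n: int) -> int:
        # Addition wraps modulo 2**32 (RFC 1982 requires n < 2**31).
        return (a + n) % SERIAL_MOD

    def serial_less_than(a: int, b: int) -> bool:
        # True when a precedes b in serial-number order.
        return a != b and ((b - a) % SERIAL_MOD) < HALF

    # Wraparound example: a value just below 2**32 still precedes a small one.
    assert serial_less_than(0xFFFFFFF0, 5)
    assert not serial_less_than(5, 0xFFFFFFF0)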
An asynchronous message may be sent from the target to the initiator without
corresponding to a particular command. The target specifies the reason for the
event and sense data. Some asynchronous messages are strictly related to
iSCSI whereas others are related to SCSI. (See [SAM2] for more information
on SCSI messages.)
LUN (logical unit number) is the number of the SCSI logical unit to which
the message applies. This field must be valid if AsyncEvent is 0. Otherwise this
field is reserved.
StatSN (status sequence number) is a number that the target iSCSI layer
generates per connection that enables the initiator to acknowledge reception
of status. Asynchronous messages are considered acknowledgeable events,
which means that the StatSN local variable is incremented.
AsyncEvent is an asynchronous event code. The following table lists the codes
used for iSCSI asynchronous messages (events). All other event codes are
reserved.
Parameter2 is the time to wait (Time2Wait). See the description in the table.
0: A SCSI asynchronous event is reported in the sense data. Sense data that
accompanies the report in the data segment identifies the condition. If the
target supports SCSI asynchronous event reporting (see [SAM2]), as indicated in
the standard INQUIRY data (see [SPC3]), its use may be enabled by parameters in
the SCSI control mode page (see [SPC3]).

1: Target requests logout. This message must be sent on the same connection as
the one requested to be logged out. The initiator must honor this request by
issuing a logout as early as possible, but no later than Parameter3 seconds. The
initiator must also send a logout with a reason code of "Close the connection"
(if not the only connection) to cleanly shut down the connection, or with a
reason code of "Close the session" to completely close the session, thus closing
all connections. Once this message is received, the initiator should not issue
new iSCSI commands on this connection. The target may reject any new I/O
requests it receives on this connection after this message with the reason code
"Waiting for Logout." If the initiator does not log out within Parameter3
seconds, the target should send an async PDU with the iSCSI event code "Dropped
the connection" if possible, or simply terminate the transport connection.
Parameter1 and Parameter2 are reserved.

2: Target will drop the connection. The Parameter1 field indicates the CID of
the connection that will be dropped. The Parameter3 field (also known as
Time2Retain) indicates the maximum time to reconnect and/or reassign commands
(say, after a reconnection) after the initial wait (Parameter2). If the
initiator does not attempt to reconnect and/or reassign the outstanding commands
within the time specified by Parameter3, or if Parameter3 is 0, the target will
terminate all outstanding commands on the connection. No other responses should
be expected from the target for the outstanding commands on this connection.

3: Target will drop all the connections of the session. The Parameter3 field
(also known as Time2Retain) indicates the maximum time to reconnect and/or
reassign commands after the initial wait (Parameter2). In this case, the target
will terminate all outstanding commands in the session; no other responses
should be expected from the target for those commands. A value of 0 for
Parameter2 indicates that reconnect can be attempted immediately.

4: Target requests parameter negotiation on this connection. The initiator must
honor this request by issuing a text request (which can be empty) on the same
connection as early as possible, but no later than Parameter3 seconds, unless a
text request is already pending on the connection or the initiator issues a
logout request. If the initiator does not issue a text request, the target may
reissue the asynchronous message requesting parameter negotiation.

255: Vendor-specific iSCSI event. The AsyncVCode details the vendor code, and
vendor-specific data may be included in the DataSegment.
Note: This PDU does not support digests, since such checking cannot occur
until after the session goes into full-feature phase (which, by definition, is
after the login is complete).
There are three stages/phases through which the login process must transit;
they are listed and numbered as follows:
0 Security negotiation
1 Login operational negotiation
3 Full-feature phase
These phases are recorded in the current stage (CSG) and next stage (NSG)
fields described below.
T (transit bit), when set to 1, indicates that the initiator is ready to transit to
the next stage/phase. If the NSG is set to the value of full-feature phase, the
initiator is ready for the final login response PDU. The target may answer with
the T bit set to 1 in a login response PDU only if the bit was set to 1 in the
previous login request PDU.
C (continue bit), when set to 1, indicates that the text (i.e., the set of
key=value pairs) in this login request PDU is not complete (it will be continued
on a subsequent login request PDU); otherwise, it indicates that this login
request PDU ends a set of key=value pairs. If the C bit is set to 1, the T bit
must be set to 0.
CSG and NSG are fields that associate the login negotiation commands and
responses with a specific stage/phase in the session (security negotiation,
login operational negotiation, and full-feature) and may indicate the next
stage/phase to which the initiator or target wants to move. (See Chapter 5,
the section Introduction to the Login Process.) The next-stage value is valid
only when the T bit is set to 1 and is reserved otherwise.
Version-Max is the maximum version supported. It must be the same for all
login requests within the login process. The target must use the value
presented with the first login request.
Version-Min is the minimum version supported. It must be the same for all
login requests within the login process. The target must use the value
presented with the first login request.
Type (T)  Naming Authority and Qualifier Format
00b       Lower 22 bits of IEEE OUI across fields A and B (the bits known as
          I/G and U/L are omitted); qualifier is in fields C and D.
01b       IANA enterprise number (EN) across fields B and C; qualifier is in
          field D (field A is reserved).
10b       "Random" number across fields B and C; qualifier is in field D
          (field A is reserved).
11b       A reserved value.
If the value in the type field is 2 (10b), the naming authority field should be
set to a random or pseudo-random 24-bit unsigned integer value in network
byte order (big-endian). The random value only needs to be unique within the
specific host initiator node. It is intended to be used by universities and by
individuals who do not have an OUI or an EN.
The qualifier field is a 16- to 24-bit unsigned integer value that provides a
range of possible values for the ISID within the type and naming authority
namespace. It may be set to any value within the constraints specified in the
iSCSI protocol.
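To make the field layout concrete, the sketch below packs a 6-byte ISID of the "random" type just described: the two type bits at the top of the first byte, a 24-bit random value across fields B and C, and a 16-bit qualifier in field D. The exact bit positions should be confirmed against the ISID figure in the iSCSI draft; this is an illustration only.

    import os

    def make_random_isid(qualifier: int) -> bytes:
        # Type 10b ("random" naming authority); the rest of field A is reserved (0).
        field_a = 0b10 << 6
        rand24 = int.from_bytes(os.urandom(3), "big")     # fields B and C
        return bytes([field_a]) + rand24.to_bytes(3, "big") + qualifier.to_bytes(2, "big")

    isid = make_random_isid(qualifier=0x0001)
    assert len(isid) == 6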
The same ISID should be used by an initiator iSCSI (SCSI) port in all its
sessions to all its targets. This is considered conservative reuse (see [iSCSI]).
If the ISID is derived from something assigned by a vendor to a hardware
adapter or interface as a preset default value, it must be configurable by an
administrator or management software to a new default value. The ISID value
must be configurable so that a chosen ISID may be applied to a portal group
containing more than one interface. In addition, any preset default value
should be automatically adjusted to a common ISID when placed in a network
entity as part of a portal group. Any configured ISID must also be persistent
(e.g., across power cycles, reboots, and hot swaps). (See [iSCSI] for name and
ISID/TSIH use.)
TSIH (target assigned session identifying handle) must be set in the first
login request. The reserved value, zero, must be used on the first connection
for a new session. Otherwise, the initiator must send the TSIH that was returned by the
target at the conclusion of successful login of the first connection for this
session. The TSIH, when nonzero, identifies to the target the associated
existing session for this new connection. It must be the same for all login
requests within a login process.
The target must respond with login response PDUs that contain the value
presented with the first login request of the series. All subsequent login
responses in the same series must also carry this value, except the last if the
TSIH has a value of zero. The last login response of the login phase must
replace a zero TSIH value, if set, with a nonzero unique tag value that the
target creates.
The TSIH is the target-assigned tag for a session with a specific named
initiator. The target generates the TSIH during session establishment, and its
internal format and content are not defined except that it must not be zero.
Zero is reserved and used by the initiator on the first connection for a new
session to indicate that a new session is wanted. The TSIH is returned to the
target during additional connection establishment for the same session.
ITT (initiator task tag) is the initiator-assigned identifier for this login
request PDU. If the command is sent as part of a sequence of login requests
and responses, the ITT must be the same for all requests within the sequence.
CID (connection ID) is a unique ID for this connection within the session. It
must be the same for all login requests within a login series. The target must
use the value presented with the first login request of the connection.
If the TSIH is not zero, and if the CID value is not in use within the session, a
new connection within the session is started. However, if the CID is currently
in use, that will cause the corresponding connection to be terminated and a
new connection started with the same CID. If the error recovery level is 2, any
active tasks will be suspended and made ready for their allegiance to be
reassigned (one at a time in response to individual task management requests
for task reassign). This reassignment will be to this new connection or to some
other connection within the same session. If the error recovery level is less
than 2, the tasks that were active within the old CID are internally terminated.
During secondary connection logins, commands may continue to flow from the
initiator to the target on any connection that is in full-feature phase. These
commands continue on other connections, as they would if the login process
were not currently active. The CmdSN used by the logging-in secondary
connection is also used for the very next command flowing to the target within
the session. This is because the login command is treated as an immediate
command and, as such, does not advance the CmdSN or the ExpCmdSN.
Therefore, the very next command sent within the session will also use the
same CmdSN.
The following table lists the values of ISID, TSIH, and CID and the actions to
be taken when the indicated values are set.
The login response PDU indicates the progress of and/or the end of the login
process.
Note: this PDU does not support digests, since such checking cannot occur
until after the connection goes into full-feature phase (which by definition is
after the login is complete).
The three stages/phases through which the login process must transit (with
their values) are
0 Security negotiation
1 Login operational negotiation
2 (Not used)
3 Full-feature phase
These stages/phases are recorded in the CSG and NSG fields described below:
T (transit bit), when set to 1, indicates that the target is ready to transit to
the next stage/phase. When it is set to 1 and the NSG field is set to the value
of the full-feature phase (3), the target is sending the final login response.
When set to 0, it is a "partial" response, which means "more negotiation is
needed."
If the status-class is zero, the target can respond with the T bit set to 1 only if
that setting was in the previous Login Request PDU. A login response with a T
bit set to 1 must not contain key=value pairs that may require additional
answers from the initiator within the same stage/phase.
C (continue bit), when set to 1, indicates that the text (a set of key=value
pairs) in this login response PDU is not complete (it will be continued on a
subsequent login response PDU); otherwise, it indicates that this PDU ends a
set of key=value pairs. A Login Response PDU with the C bit set to 1 must have the T bit set to 0.
CSG and NSG are fields that associate the login negotiation commands and
responses with a specific stage/phase in the session (security negotiation,
login operational negotiation, and full-feature) and may indicate the next
stage to which the initiator or target wants to move (refer to stage/phase
values above). The next-stage value is valid only when the T bit is set to 1; it
is reserved otherwise.
All login responses within the login phase must carry the same Version-
Active. The initiator must use the value presented as a response to the first
login request.
ISID (initiator session ID) is the same value specified in the corresponding
login request PDU, which the target must copy into this PDU.
TSIH (target-assigned session identifying handle) is the tag for use with
a specific named initiator. The target generates the TSIH during session
establishment, and its internal format and content are not defined except that
it must not be zero, which is reserved and used by the initiator to indicate a
new session. The TSIH value is generated by the target and returned to the
initiator on the last login response from the target on the leading login. In all
other cases the field should be set to the TSIH provided by the initiator in the
first login request of the series.
ITT (initiator task tag) matches the tag used in the initial login request
PDU. The target must copy the ITT from that PDU into the corresponding login
response PDU.
StatSN (status sequence number) for the first login response PDU (the
response to the first login request PDU) is the starting status sequence number
for the connection, which can be set to any "in-range" value. The next
response of any kind, including the next login response if any in the same
login phase, will carry this number plus 1. This field is valid only if the Status-
Class is 0.
Status-Class 2 (initiator error, not a format error) means the initiator likely caused the error, perhaps by requesting a resource for which it does not have permission. The request should not be tried again.
The following table lists all of the currently allocated status codes, shown in
hexadecimal. ("ITN" stands for iSCSI target node.)
Status (Status-Class, Status-Detail): Description
Success (00, 00): Login is proceeding okay.[*]
Target Moved Temporarily (01, 01): Requested ITN has moved temporarily to the address provided.
Target Moved Permanently (01, 02): Requested ITN has moved permanently to the address provided.
Initiator Error (02, 00): Miscellaneous iSCSI initiator errors.
Authentication Failure (02, 01): Initiator could not be successfully authenticated.
Authorization Failure (02, 02): Initiator is not allowed access to the given target.
Not Found (02, 03): Requested ITN does not exist at this address.
Target Removed (02, 04): Requested ITN has been removed and no forwarding address is provided.
[*]If the response T bit is set to 1 (in both the request and the response) and the NSG is in full-
feature phase (in both the request and the response), the login phase is finished and the initiator
may issue SCSI commands.
If the Status Class is not 0, the initiator and target must close the TCP
connection. If the target rejects the login request for more than one reason, it
should return the primary reason to the initiator.
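A hypothetical initiator might act on the Status-Class and Status-Detail values along the following lines (a sketch only; the codes are those shown in the table above):

def handle_login_status(status_class, status_detail):
    """Rough initiator-side handling of the login Status-Class (sketch only)."""
    if status_class == 0x00:
        return "login proceeding / enter full-feature phase"
    if status_class == 0x01:
        # Redirection: retry the login against the address the target provided.
        permanent = (status_detail == 0x02)
        return "redirect (permanent)" if permanent else "redirect (temporary)"
    if status_class == 0x02:
        # Initiator error: do not retry; the TCP connection must be closed.
        return "initiator error - close connection, do not retry"
    return "other error - close connection"

print(handle_login_status(0x01, 0x01))  # redirect (temporary)
print(handle_login_status(0x02, 0x02))  # initiator error - close connection, do not retry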
All rules dealing with text requests/responses hold for login requests/
responses. Chapter 6 discussed the rules dealing with text keys and their
negotiation. Keys and their explanations are listed in Appendix B.
Logout Request PDU
Reason code indicates the reason for logout as one of the following:
0 Close the session. All commands associated with the session (if any) are terminated.
1 Close the connection. All commands associated with the connection (if any) are terminated.
2 Remove the connection for recovery. The connection is closed, and any tasks associated with it remain ready for allegiance reassignment.
ITT (initiator task tag) is the initiator-assigned identifier for this Logout
Request PDU.
After sending the Logout Request PDU, an initiator must not send any new
iSCSI commands on the closing connection. Moreover, if the logout is intended
to close the session, no new iSCSI commands can be sent on any of the
connections participating in the session.
When receiving a logout request with a reason code of "close the connection"
or "close the session," the target must terminate all pending commands,
whether acknowledged via ExpCmdSN or not, on that connection or session,
respectively. When receiving a logout request with the reason code "Remove
connection for recovery":
The target discards all requests not yet acknowledged via ExpCmdSN that
were issued on the specified connection.
The target then issues the logout response and half-closes the TCP
connection (sends FIN).
After receiving the logout response and attempting to receive the FIN (if
still possible), the initiator completely closes the logging-out connection.
A failing connection can be cleaned up at the target in either of two ways: by a logout request, which cleans up the target end of the failing connection and enables recovery to start, or by a login request with a nonzero TSIH and the same CID on a new connection (connection reinstatement).
In sessions with a single connection, the connection can be closed and a new one opened; a reinstatement login can then be used for recovery.
Successful completion of a logout request with the reason code "close the
connection" or "remove the connection for recovery" results in some
unacknowledged commands received on this connection being discarded at the
target. (An example is tasks allegiant to the connection being logged out that are waiting in the target's command-reordering queue behind one or more commands with smaller CmdSN.) These "holes" in command sequence numbers have to
be handled by appropriate recovery unless the session is also closed. (See
Chapter 11, Error Handling, the section Error Recovery Level 1.)
Note:
A target implicitly terminates the active tasks for three reasons having to do with the iSCSI protocol:
When a connection is implicitly or explicitly logged out with the reason code "close the connection" and there are active tasks allegiant to that connection
When a connection fails and the connection state eventually times out and there are active tasks allegiant to that connection
When a successful recovery logout is performed while there are active tasks allegiant to the connection and the task allegiances are not reassigned before the Time2Retain period expires
If the tasks terminated in any of the above cases are SCSI tasks, they must be
internally terminated with CHECK CONDITION status with a sense key of unit
attention and ASC/ASCQ values of 0x6E/0x00 (COMMAND TO LOGICAL UNIT
FAILED). This status is meaningful only for appropriate handling of the internal
SCSI state with respect to ordering aspects such as queued commands,
because it is never communicated back as a terminating status to the initiator.
Logout Response PDU
The logout response is used by the target to indicate that the cleanup
operation for the connection has completed. After logout, the TCP connection
referred by the CID must be closed at both ends. If the Logout Request PDU
reason code was for session close, all connections in the session must be
logged out and the respective TCP connections closed at both ends.
2 Connection recovery not supported (if the logout reason code was "remove connection for recovery" and the target does not support it, as indicated by the error recovery level being less than 2)
Initiator task tag (ITT) matches the tag used in the Logout Request PDU.
The target must copy it from there into the corresponding Logout Response
PDU.
Time2Wait If the logout response code is 2 or 3, this field specifies the minimum time to wait before attempting a new implicit or explicit logout. If Time2Wait is 0, the reassignment or a new logout may be attempted immediately.
Time2Retain is the maximum time after the initial wait of Time2Wait that the
target waits for the allegiance reinstatement for any active task, after which
the task state is discarded. If the error recovery level is less than 2 and the
logout response code is 0, this field should be ignored. It is not valid if the
logout response code is 1.
If the logout response code is 2 or 3, this field specifies the maximum time, in
seconds, after Time2Wait that the target waits for a new implicit or explicit
logout. If this is the last connection of a session, the entire session state is discarded after Time2Retain has passed. If Time2Retain is 0, the target
has already discarded the connection (and possibly the session) state along
with the task states. No reassignment of tasks or logout is required or possible
in this case.
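Taken together, Time2Wait and Time2Retain define a simple window during which task reassignment (or a new logout) may be attempted. A rough sketch, assuming both values are given in seconds:

import time

def reassignment_window(time2wait, time2retain, logout_time=None):
    """Return (earliest, latest) absolute times for attempting reassignment (sketch).

    Time2Wait is the minimum wait before attempting reassignment or a new logout;
    Time2Retain is how long after that wait the target keeps the task state.
    """
    if logout_time is None:
        logout_time = time.time()
    if time2retain == 0:
        return None  # the target has already discarded the connection (and task) state
    earliest = logout_time + time2wait
    latest = earliest + time2retain
    return (earliest, latest)

print(reassignment_window(time2wait=2, time2retain=20, logout_time=1000.0))  # (1002.0, 1022.0)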
NOP-In PDU
Either the initiator or the target may originate a NOP-type PDU. However, the
target cannot send data with a NOP-In that it originates.
Zero is a valid value for the DataSegmentLength and indicates the absence of
ping data. An unsolicited ping request from the target will always have this
field set to zero, since the target cannot originate ping data.
LUN (logical unit number) is the logical unit number that accompanies a
ping from the target to the initiator. The initiator must return it along with the
TTT whenever the TTT is not set to hex FFFFFFFF.
The LUN is included here because a hang-up may often be seen as relating to
a specific LUN, or an implementation that is queuing based on a LUN might
detect at that point that something is amiss. This permits the target to force a
round-trip of the PDU, which causes the hardware and software path to be
completely exercised. Clearly if this is not a LUN-related issue, the target
might place any valid value here, such as zero.
This must be set to a valid number whenever the TTT is not set to the reserved
value of hex FFFFFFFF.
The StatSN field always contains the next StatSN. However, when the ITT is
set to hex FFFFFFFF, the StatSN for the connection is not advanced.
DataSegment (ping data) contains the ping data that the initiator asked the target to reflect back; this NOP-In PDU must reflect that data. No data can be included when the target itself is originating the NOP-In PDU (ping); in that case the ITT must also be set to the value of hex FFFFFFFF.
NOP-Out PDU
I (immediate bit) is set to 1 when the initiator just wants to send the target
the latest value of ExpStatSN. It is useful when there has been no other PDU
carrying the ExpStatSN in a long time. When this flag is set, the CmdSN is not
advanced and the ITT is set to hex FFFFFFFF.
LUN (logical unit number) is the number that may accompany a ping from
the initiator to the target. The target must return it with the ITT and the ping
data whenever the ITT is not set to hex FFFFFFFF.
When the TTT is set to a value other than hex FFFFFFFF, the LUN must also be
copied from the NOP-In PDU. When the NOP-Out being sent by the initiator is
not a response to a ping by the target, this TTT is set to hex FFFFFFFF.
DataSegment (ping data) is the ping data that the initiator wants the target
to send back to it via a NOP-In PDU.
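The ping semantics can be illustrated with a small round trip. The dictionaries below are simplified stand-ins for the NOP-Out and NOP-In PDUs, not a wire-format encoding:

RESERVED_TAG = 0xFFFFFFFF

def build_nop_out(itt, ping_data=b"", lun=0):
    """Initiator-originated ping; the target must echo ping_data in a NOP-In."""
    return {"opcode": "NOP-Out", "ITT": itt, "TTT": RESERVED_TAG,
            "LUN": lun, "data": ping_data}

def answer_nop_out(nop_out):
    """Target's NOP-In answer to an initiator ping: ITT, LUN, and data are reflected."""
    return {"opcode": "NOP-In", "ITT": nop_out["ITT"], "TTT": RESERVED_TAG,
            "LUN": nop_out["LUN"], "data": nop_out["data"]}

ping = build_nop_out(itt=0x1234, ping_data=b"hello-target", lun=5)
echo = answer_nop_out(ping)
assert echo["data"] == b"hello-target" and echo["ITT"] == 0x1234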
Ready To Transfer (R2T) PDU
When the initiator submits a SCSI command that requires it to send data to
the target, such as in a write, the target may need to ask for that data
explicitly using an R2T PDU. The target may specify which blocks of data it is
ready to receive, and it may request, via the R2T, that the data blocks be
delivered in an order convenient for the target at that particular instant. These
instructions are all sent to the initiator from the target in this R2T PDU.
As explained in Chapter 8, the section Data Ordering, R2T PDUs are used when
the immediate or unsolicited data PDUs have handled all the data permitted to
them and additional data remains to be transferred.
After receiving an R2T, the initiator may respond with one or more SCSI Data-
Out PDUs with a matching TTT.
ITT (initiator task tag) is the unique value that the initiator gave to each
task to identify the commands (as explained in Chapter 8). It is returned with
this PDU. In this case it serves as an identifier to enable the initiator to find
the corresponding output (e.g., write) command that has more data to send
and is waiting for this (R2T) request before sending the additional data.
TTT (target transfer tag) is the value the target assigns to each R2T request
it sends to the initiator. The target can easily use it to identify the data it
receives. The TTT and LUN are copied into the outgoing data PDUs by the
initiator and used by the target only. The TTT cannot be set to hex FFFFFFFF,
but any other value is valid.
StatSN (status sequence number) contains the next StatSN; however, the target's local StatSN for this connection is not advanced after this PDU is sent.
R2TSN (R2T sequence number) is the number of this R2T PDU. Its values
start at 0 and are incremented by 1 each time an R2T PDU for a specific
command is created. The maximum value is 2^32 - 1 (hex FFFFFFFF). The ITT
identifies the corresponding command. R2T and Data-In PDUs used with
bidirectional commands must share the numbering sequence (assign numbers
from a common sequencer).
Buffer offset The target can request that the order of data sent to it from the
initiator in response to this R2T PDU be something other than the actual data
byte order. The buffer offset field specifies a displacement (offset) into the total
buffer where the data transfer should begin.
The R2T can be used to request the data to arrive in a number of separate
bursts and in a specific order.
Desired data transfer length is the amount of data (in bytes) that should be
transferred by the initiator in response to this R2T PDU. It will begin at the
point in the buffer specified by the buffer offset and extend to the length
specified here. The value should be greater than zero and less than or equal to
MaxBurstLength.
The target may request the data from the initiator in several chunks, not
necessarily in the data's original order. Order is actually determined by the
setting of DataSequenceInOrder: if set to Yes, consecutive R2Ts should refer to continuous nonoverlapping ranges; if set to No, the ranges can be requested in any order.
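In effect, an initiator answering an R2T slices its write buffer at the requested offset and length, further limited by the negotiated maximum data segment length per Data-Out PDU. A simplified sketch (names and limits are illustrative):

def data_out_pdus(buffer, offset, desired_length, max_seg_len):
    """Return (DataSN, buffer offset, payload) tuples for the Data-Out PDUs answering one R2T.

    Sketch only: each PDU carries at most max_seg_len bytes, DataSN starts at 0
    within the solicited sequence, and the last tuple corresponds to the PDU
    that would carry the F bit.
    """
    pdus = []
    datasn = 0
    pos = offset
    end = offset + desired_length
    while pos < end:
        chunk = buffer[pos:min(pos + max_seg_len, end)]
        pdus.append((datasn, pos, chunk))
        datasn += 1
        pos += len(chunk)
    return pdus

write_buffer = bytes(range(256)) * 64   # 16 KiB of data to be written
print(len(data_out_pdus(write_buffer, offset=8192, desired_length=4096, max_seg_len=1024)))  # 4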
R2T PDUs may also be used to recover Data-Out PDUs. Such an R2T
(Recovery-R2T) is generated by a target upon the detection of the loss of
one or more Data-Out PDUs due to a header digest error, a sequence error, or
a sequence timeout.
A Recovery-R2T carries the next unused R2TSN, but requests part or all of the
entire data burst that an earlier R2T (with a lower R2TSN) had already
requested.
Reject PDU
Reject is used to indicate that a target has detected an iSCSI error condition (protocol error, unsupported option, etc.).
Reason is the reason for the reject. The codes are shown in this table. All
other values are reserved.
[*] For an iSCSI Data-Out PDU, retransmission is done only if the target requests it with a recovery
R2T. However, if this is the data digest error on immediate data, the initiator chooses how to
retransmit the whole PDU, including the immediate data. It may decide to send the PDU again
(including the immediate data) or resend the command without the data. If the command is sent
without the data, the data can be sent as unsolicited or the initiator can wait for an R2T from the
target.
[**]A target should use this reason code for all invalid values of PDU fields that are meant to
describe a task, a response, or a data transfer. Some examples are invalid TTT/ITT, buffer offset,
LUN qualifying a TTT, or an invalid sequence number in a SNACK.
Targets must not implicitly terminate an active task just by sending a reject
PDU for any PDU exchanged during the task's life. If the target decides to
terminate, it must return a response PDU (SCSI, text, task, etc.). If the task
was not active before the reject (i.e., the reject is on the command PDU), the
target should send no further responses since the command itself is being
discarded.
This means that the initiator can eventually expect a response even on rejects
if the reject is not for the command itself. The noncommand rejects have only
diagnostic value in logging the errors but they may be used by the initiators
for retransmission decisions as well. The CmdSN of the rejected PDU (if it
carried one) must not be considered received by the target (i.e., a command
sequence gap must be assumed). This is true even when the CmdSN can be
reliably ascertained, as in the case of a data digest error on immediate data.
However, when the DataSN of a rejected data PDU can be ascertained, a target
must advance ExpDataSN for the current burst if a recovery R2T is being
generated. The target may also advance its ExpDataSN if it does not attempt
to recover the lost data PDU.
SCSI (Command) Request PDU
For bidirectional operations, either or both the R bit and the W bit may be 1 when the corresponding expected data transfer lengths are 0, but they cannot both be 0 when the expected data transfer length and/or the bidirectional read expected data transfer length are not 0.
ATTR (task attributes) have one of the following integer values (see
[SAM2]):
0 Untagged
1 Simple
2 Ordered
3 Head of queue
4 ACA
5-7 Reserved
TotalAHSLength is the total length (in 4-byte words) of the additional header
segments (if any). This value will include any padding.
LUN (logical unit number) is the number of the SCSI logical unit to which
the command applies.
ITT (initiator task tag) is the unique value given to each task, used to
identify the commands (as explained in Chapter 8, Command and Data
Ordering and Flow).
Expected data transfer length For bidirectional operations (both R and W flags are set to 1), this field contains the number of data bytes involved in the write transfer. An additional header segment (AHS) must be present in the PDU that indicates the bidirectional read expected data transfer length.
If the expected data transfer length for a write and the length of the
immediate data part that follows the command (if any) are the same, no more
data PDUs are expected to follow. In this case, the F bit must be set to 1. If the
expected data transfer length is higher than the FirstBurstLength (the
negotiated maximum length of unsolicited data the target will accept), the
initiator must send the maximum length of unsolicited data or only the
immediate data, if any.
Upon completion of a data transfer, the target informs the initiator (through
residual counts) of the number of bytes actually processed (sent and/or
received) by the target.
AHS (additional header segment) has the general format shown on the
next page. It contains the fields described in the following paragraphs.
AHSType is coded as follows:
Bits 0-1 Reserved
Bits 2-7 (AHS code):
0 Reserved
1 Extended CDB
2 Expected bidirectional read data length
3-59 Reserved
60-63 Non-iSCSI extensions
Extended CDB has the format shown in the figure below. This type of AHS must not be used if the total CDB length is less than 17. Note that CDB Length - 15 is used (instead of - 16) for AHSLength to account for one reserved byte. (The reserved byte is counted in AHSLength.) The first 16 bytes of the CDB are carried in the 16-byte CDB field in the BHS; the remainder is carried in this AHS. The padding is not included in AHSLength.
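The "CDB length minus 15" rule and the padding treatment can be shown with a small helper (a sketch; the function and its return values are illustrative only):

def extended_cdb_ahs(cdb):
    """Split a long CDB into the 16-byte BHS CDB field and an Extended CDB AHS (sketch).

    AHSLength counts the reserved byte plus the CDB bytes beyond the first 16
    (CDB length - 15), but not the padding that rounds the AHS to a 4-byte boundary.
    """
    if len(cdb) < 17:
        raise ValueError("Extended CDB AHS must not be used for CDBs shorter than 17 bytes")
    bhs_cdb = cdb[:16]
    extended_part = cdb[16:]
    ahs_length = len(cdb) - 15            # 1 reserved byte + remaining CDB bytes
    padding = (4 - (ahs_length % 4)) % 4  # pads the AHS to a 4-byte multiple; not counted
    return bhs_cdb, extended_part, ahs_length, padding

_, _, length, pad = extended_cdb_ahs(bytes(32))
print(length, pad)   # 17 3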
SCSI Response PDU
The SCSI Response PDU is sent from the target to the initiator to signal the
completion of a SCSI command and carries information about the command,
such as whether it completed successfully or not, residual data counts, ending
status, and in some cases sense data.
Note: if a SCSI device error is detected while data from the initiator is still
expected (the command PDU did not contain all the data and the target has
not received a Data-Out PDU with the F bit set), the target must wait until it
receives a Data-Out PDU with the F bit set in the last expected sequence
before sending this response PDU.
O (overflow bit) is set for residual overflow. In this case, the Residual
Count indicates the number of bytes that were not transferred because the
initiator's expected data transfer length was not sufficient. For a
bidirectional operation, the Residual Count contains the residual for the
write operation.
U (underflow bit) is set for residual underflow. In this case, the Residual
Count indicates the number of bytes that were not transferred out of the
number of bytes expected to be transferred. For a bidirectional operation,
the Residual Count contains the residual for the write operation.
Notes:
All other response codes are reserved. A nonzero response field indicates a
failure to execute the command, in which case the status and sense fields are
undefined.
Status is used to report the SCSI status and is valid only if the response code
is "Command completed at target." Some of the status codes defined for SCSI
are
0x00 GOOD
0x08 BUSY
A complete list and definitions of all status codes can be found in [SAM2].
ITT (initiator task tag) is the unique value given to each task to identify the
commands (as explained in Chapter 8, Command and Data Ordering and
Flow).
SNACK Tag contains a copy of the SNACK Tag of the last R-Data SNACK
accepted by the target on the same connection and for the command for which
the response is issued. Otherwise it is reserved and should be set to 0.
After issuing an R-Data SNACK, the initiator must discard any SCSI status
unless contained in an SCSI Response PDU carrying the same SNACK Tag as
the last issued R-Data SNACK for the SCSI command on the current
connection.
In case MaxCmdSN changes at the target and the target has no pending PDUs
to convey this information to the initiator, the target should generate a NOP-In
to carry the new MaxCmdSN.
Bidirectional Read Residual Count is valid only when either the u bit or the
o bit is set. If neither bit is set, it should be zero.
If the o bit is set, the Bidirectional Read Residual Count indicates the number
of bytes not transferred to the initiator because the initiator's expected
bidirectional read transfer length was not sufficient. If the u bit is set, it
indicates the number of bytes not transferred to the initiator out of the
number of bytes expected.
Residual Count is valid only when either the U bit or the O bit is set. If neither bit is set, the field should be zero.
If the O bit is set, Residual Count indicates the number of bytes not
transferred because the initiator's expected data transfer length was
insufficient. If the U bit is set, it indicates the number of bytes not transferred
out of the number of bytes expected.
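Residual handling reduces to comparing the initiator's expected transfer length with what the target actually processed. A minimal sketch:

def residuals(expected_length, actual_length):
    """Return (overflow_bit, underflow_bit, residual_count) as described above (sketch)."""
    if actual_length > expected_length:
        # Overflow: bytes that could not be transferred because the initiator's
        # expected data transfer length was not sufficient.
        return True, False, actual_length - expected_length
    if actual_length < expected_length:
        # Underflow: bytes expected but not transferred.
        return False, True, expected_length - actual_length
    return False, False, 0

print(residuals(4096, 4096))  # (False, False, 0)
print(residuals(4096, 1024))  # (False, True, 3072)
print(residuals(4096, 6144))  # (True, False, 2048)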
DataSegment (sense and response data) iSCSI targets have to support
and enable a function called Autosense. Autosense requires the SCSI layer to
retrieve the SCSI sense information automatically so that iSCSI can make it
part of the command response whenever a SCSI CHECK CONDITION occurs. If
the status is a CHECK CONDITION, the DataSegment contains sense data for
the failed command. If the DataSegmentLength field is not zero, then the
format of the DataSegment field is as shown in the figure.
Sense Data is the sense data returned as part of the status response and
includes the sense key, additional sense codes, and a qualifier. It contains
detailed information about a check condition; [SPC3] specifies its format and
content.
Note: Certain iSCSI conditions result in the command being terminated at the
target as outlined in the following table.
The target reports the "Incorrect amount of data" condition if, during data
output, the total data length is greater than FirstBurstLength and the initiator
sent unsolicited nonimmediate data, but the total amount of unsolicited data is
different from FirstBurstLength. The target reports the same error when the
amount of data sent as a reply to an R2T does not match the amount
requested.
SCSI Data-In PDU
The SCSI Data-In PDU for read operations has the following format.
The SCSI Data-In PDU carries read data and may also contain status on the
last Data-In PDU for a read command, as long as the command did not end
with an exception (i.e., the status is GOOD, CONDITION MET, or
INTERMEDIATE CONDITION MET). For bidirectional commands, the status is
always sent in a SCSI Response PDU.
If the command is completed with an error, the response and sense data will be
sent in a SCSI Response PDU, not in a SCSI data packet.
F (final bit), for incoming data, is 1 for the last input (read) data PDU of a
sequence. Input can be split in several sequences, each one having its own F
bit. This does not affect DataSN counting on Data-In PDUs.
The F bit may also be used as a "change direction" indication for bidirectional
operations that need such a change. For bidirectional operations, the F bit is 1
for the end of the input sequences as well as for the end of the output
sequences.
A (acknowledge bit), when set to 1, asks the initiator to acknowledge the read data it has received so far (with a SNACK of type DataACK, as described below). The target should use the A bit moderately, setting it to 1 only once every MaxBurstLength bytes or on the last Data-In PDU that concludes the entire requested read data transfer for the task (from the target's perspective).
On receiving a Data-In PDU with the A bit set to 1, if there are no holes in the
read data up to that PDU, the initiator must issue a SNACK of type DataACK.
The exception to this is when the initiator can acknowledge the status for the
task immediately via ExpStatSN on other outbound PDUs. That assumes that
the status for the task is also received.
If the initiator has detected holes in the read data before that Data-In PDU, it
must postpone the DataACK SNACK until the holes are filled. Also, an initiator
cannot acknowledge the status for the task before the holes are filled.
O (residual overflow bit) When set, the Residual Count indicates the
number of bytes not transferred because the initiator's expected data transfer
length was not sufficient.
U (residual underflow bit) When set, the Residual Count indicates the
number of bytes not transferred out of the number of bytes expected.
The S bit can only be set if the command did not end with an exception (i.e.,
the status must be GOOD, CONDITION MET, INTERMEDIATE, or INTERMEDIATE
CONDITION MET).
The StatSN, status, and Residual Count fields have meaningful content only if
the S bit is set to 1.
Status This PDU can return only status that does not generate sense (error
conditions). Following are some of the acceptable status values:
Hex 00 Good
Hex 10 Intermediate
LUN (logical unit number) is the number of the SCSI logical unit from which
the data was taken. (See additional information under TTT, below.)
ITT (initiator task tag) is the unique value that the initiator gives to each
task, used to identify the commands (as explained in Chapter 8, Command and
Data Ordering and Flow) and returned with this PDU. In this case it is used as
an identifier to enable the initiator to find the corresponding command that
issued the request for the data, which arrives as part of this PDU.
TTT (target transfer tag), on incoming data, must be provided by the target
if the A bit is set to 1. The TTT and the LUN are copied by the initiator into the
SNACK of type DataACK that it issues as a result of receiving a SCSI Data-In
PDU with the A bit set to 1.
The TTT values are not specified by this protocol except that the value hex
FFFFFFFF is reserved and means that the TTT is not supplied. If the TTT is
provided, the LUN field must hold a valid value and be consistent with
whatever was specified with the command; otherwise, the LUN field is
reserved.
DataSN, for input (read) or bidirectional Data-In PDUs, is the input data PDU
number (starting with 0) in the data transfer for the command identified by
the initiator task tag. R2T and Data-In PDUs used with bidirectional commands
must share the numbering sequence (i.e., assign numbers from a common
sequencer). The maximum number of input data PDUs in a sequence is 2^32 (counting the first PDU, which is numbered 0).
Buffer Offset contains the offset of the PDU data payload within the complete
data transfer. The sum of the buffer offset and the length should not exceed
the expected transfer length for the command.
Residual Count is valid only where either the U bit or the O bit is set. If
neither bit is set, it should be zero.
If the O bit is set, this field indicates the number of bytes not transferred
because the initiator's expected data transfer length was not sufficient. If the
U bit is set, it indicates the number of bytes not transferred out of the number
of bytes expected.
DataSegment contains the data sent from the target to the initiator with this
PDU.
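The Buffer Offset field is what allows each Data-In payload to be placed directly into the read buffer; a simple bounds check enforces the rule that offset plus length must stay within the expected transfer length. A sketch:

def place_data_in(read_buffer, expected_length, buffer_offset, payload):
    """Place one Data-In payload at its Buffer Offset (sketch of direct data placement)."""
    if buffer_offset + len(payload) > expected_length:
        raise ValueError("buffer offset + length exceeds the expected transfer length")
    read_buffer[buffer_offset:buffer_offset + len(payload)] = payload

buf = bytearray(8192)                            # expected data transfer length for the command
place_data_in(buf, 8192, 4096, b"\xAA" * 1024)   # payloads may arrive for any valid offset
place_data_in(buf, 8192, 0, b"\x55" * 1024)
print(buf[0], buf[4096])                         # 85 170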
SCSI Data-Out PDU
The SCSI Data-Out PDU for write operations has the format shown below.
F (final bit), for outgoing data, is 1 for the last PDU of unsolicited data or for
the last PDU of a sequence answering an R2T. For bidirectional operations, it is
1 for the end of the input sequences as well as for the end of the output
sequences.
LUN (logical unit number) is the number of the SCSI logical unit to which
the data applies. (For additional information, see TTT below.)
ITT (initiator task tag) is the unique value given to each task, used to
identify the commands (as explained in Chapter 8, Command and Data
Ordering and Flow). In this case it enables the target to find the corresponding
command that requires the data carried in this PDU.
TTT (target transfer tag), on outgoing data, is provided to the target if the
transfer is honoring an R2T. In this case, it is a replica of the TTT provided with
the R2T.
TTT values are not specified by this protocol. However, the value hex FFFFFFFF
is reserved and means that the TTT is not supplied. If the TTT is provided, the
LUN field must hold a valid value; otherwise, LUN is reserved.
DataSN (data sequence number), for output (write) data PDUs, is the data
PDU number (starting with 0) within the current output sequence. The current
output sequence is identified by the ITT (for unsolicited data), or it is a data
sequence generated for one R2T (for data solicited through R2T). The
maximum number of output data PDUs in a sequence is 2^32 (counting the first PDU, which is numbered 0).
Buffer Offset contains the offset of this PDU data payload within the complete
data transfer. The sum of the buffer offset and the length should not exceed
the expected transfer length for the command.
SNACK Request PDU
Support for all SNACK types is mandatory if the supported error recovery level of the implementation is greater than zero.
LUN (logical unit number) contains the LUN field from the Data-In PDU that
had the A bit set. In all other cases this field is reserved.
ITT (initiator task tag) is the initiator-assigned identifier for the referenced
command or hex FFFFFFFF. For status SNACK and DataACK, it is reserved with
a value of hex FFFFFFFF. In all other cases, it must be set to the value of the
ITT of the referenced command.
TTT (target transfer tag) or SNACK Tag must contain a value other than
FFFFFFFF. When used as a SNACK Tag the initiator picks a unique nonzero ID
for the task identified by the ITT. The value must be copied to the last or only
SCSI Response PDU by the target into a field also known as SNACK Tag. For
DataACK, this field must contain a copy of the TTT and LUN provided with the
SCSI Data-In PDU with the A bit set to 1.
In all other cases, the Target Transfer Tag field must be set to the reserved
value of hex FFFFFFFF.
RunLength must be set to 0 for a DataACK SNACK as well as for an R-Data SNACK.
The first data SNACK after a task management request of TASK REASSIGN
(see the section Task Management Function Request PDU) for a command
whose connection allegiance was just changed should be an R-Data SNACK
with RunLength equal to 0.
Resegmentation
A target that has received an R-Data SNACK must return a SCSI Response PDU that contains a copy of the R-Data SNACK's SNACK Tag; this SNACK Tag value must be placed in the SCSI Response SNACK Tag field of the target's last or only response. This means that if it has already sent a response containing another
value in the SNACK Tag field or had the status included in the last Data-In
PDU, it must send a new SCSI Response PDU. If a target sends more than one
SCSI Response PDU due to this rule, all SCSI responses must carry the same
StatSN. If an initiator attempts to recover a lost SCSI Response PDU (with a
Status SNACK) when more than one response has been sent, the target will
send the SCSI Response PDU with the latest content known to the target,
including the last SNACK Tag for the command.
If a SCSI command is reassigned to another connection (Allegiance
Reassignment), any SNACK Tag it holds for a final response from the original
connection should be deleted and the default value of 0 should be used
instead. If the new connection has a different MaxRecvDataSegmentLength
than the old connection, the ExpDataSN (if greater than 0) that is sent on the
Reassign Task Management Request may not be interpreted reliably by the
iSCSI target. In such a case, the target must behave as if an R-Data SNACK
were issued and retransmit all unacknowledged data. Also note that status-
piggybacking is not to be used for delivering the response, even if it was used
the first time for delivering the nonrecovery response on the original
connection.
The numbered Data-In PDUs requested by a Data SNACK (not an R-Data SNACK) have to be delivered as exact replicas of those the initiator missed, except for the ExpCmdSN and MaxCmdSN fields, which must carry the current values.
Any SNACK requesting a numbered response, data, or an R2T that was not
sent by the target must be rejected with a reason code of "Protocol error." A Data, R-Data, or R2T SNACK for a command must precede status acknowledgment
for it. Specifically, the ExpStatSN must not be advanced until all Data-In or
R2T PDUs have been received.
An iSCSI target that does not support recovery within a connection (because
its error recovery level is 0) may reject status SNACK with a Reject PDU. This
will probably cause the SCSI level to time-out and perform its own error
recovery. It is therefore possible to build a very simple error recovery model in
which iSCSI ignores these error types and lets the SCSI level retry the
operation. The timeout will be fairly long, so hopefully this approach will not
be used on "enterprise class" installations.
The DataACK is used to free resources at the target and not to request or
imply data retransmission. This feature is useful when large amounts of data
are being read, perhaps from tape, and the resources tied up by the operation
in the target can be freed up incrementally, as the target can be sure that the
data has arrived.
Task Management Function Request PDU
Function holds the task management function codes that provide an initiator
with a way to explicitly control the execution of one or more tasks (SCSI and
iSCSI tasks). These functions are as follows. (For a more detailed description,
see Chapter 10 and [SAM2].)
ABORT TASK SET Aborts all tasks issued via this session on the logical unit.
CLEAR TASK SET Aborts all tasks in the appropriate task set as defined by the TST field in the control mode page (see [SPC3]).
LOGICAL UNIT RESET Performs a Clear Task Set for the LU and then resets various states within it.
TARGET WARM RESET Performs a Logical Unit Reset for all LUs within the SCSI device (iSCSI target node).
TARGET COLD RESET Performs a Target Warm Reset and then drops all connections.
For all these functions, the task management function response PDU must be
returned. The functions apply to the referenced tasks regardless of whether
they are proper SCSI tasks or tagged iSCSI operations.
ITT (initiator task tag) is the unique value given to each task, used to
identify the commands (as explained in Chapter 10, Task Management).
Referenced task tag is the initiator task tag of the task to be aborted (Abort
Task) or reassigned (Task Reassign). If the function is independent of any
specific command to be aborted or reassigned, the value should be set to hex
FFFFFFFF.
RefCmdSN, for the Abort Task function, must always be set by the initiator to the CmdSN of the task identified by the Referenced Task Tag field. Targets must use this field when the task identified by that tag is not with the target.
ExpDataSN For recovery purposes the iSCSI target and initiator maintain a data acknowledgment reference number, the first input DataSN number unacknowledged by the initiator. When issuing a new command this number is
set to 0. If the function is TASK REASSIGN, which establishes a new
connection allegiance for a previously issued read or bidirectional command,
ExpDataSN will contain either an updated data acknowledgment reference
number or the value 0, the latter indicating that the data acknowledgment
reference number is unchanged. The initiator must discard any data PDUs from
the previous execution that it did not acknowledge, and the target must
transmit all Data-In PDUs (if any) starting with the data acknowledgment
reference number. The number of retransmitted PDUs may or may not be the
same as the original transmission depending on whether there was a change in
MaxRecvDataSegmentLength in the reassignment. The target may also send
no more Data-In PDUs if all data has been acknowledged.
The value of ExpDataSN must be either 0 or higher than the DataSN of the
last acknowledged Data-In PDU, but not larger than DataSN+1 of the last
Data-In PDU sent by the target. The target must ignore any other value.
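The ExpDataSN validity rule for a TASK REASSIGN can be expressed directly (a sketch; the argument names mirror the quantities in the text, not any real API):

def expdatasn_is_valid(exp_datasn, last_acked_datasn, last_sent_datasn):
    """Check the TASK REASSIGN ExpDataSN rule described above (sketch).

    ExpDataSN must be 0 (reference number unchanged) or higher than the DataSN of the
    last acknowledged Data-In PDU, and no larger than DataSN + 1 of the last Data-In
    PDU sent by the target; any other value is ignored.
    """
    if exp_datasn == 0:
        return True
    if exp_datasn <= last_acked_datasn:
        return False
    return exp_datasn <= last_sent_datasn + 1

print(expdatasn_is_valid(0, 10, 20))   # True  (reference number unchanged)
print(expdatasn_is_valid(15, 10, 20))  # True
print(expdatasn_is_valid(5, 10, 20))   # False (not above the last acknowledged DataSN)
print(expdatasn_is_valid(25, 10, 20))  # False (beyond the last sent DataSN + 1)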
According to [SAM2], the iSCSI target must ensure that no tasks covered by
the task management response (i.e., with a CmdSN less than the task
management command CmdSN) have their responses delivered to the initiator
SCSI layer after the task management response. However, the iSCSI initiator
may deliver any responses received before the task management response. It
is a matter of implementation whether SCSI responses, received before the task management response but after the task management request was issued, are delivered to the SCSI layer by the iSCSI layer in the initiator.
For Abort Task Set and Clear Task Set, the issuing initiator must continue to
respond to all valid TTTs (received via R2T, Text Response, NOP-In, or SCSI
Data-In PDUs) related to the affected task set, even after issuing the task
management request. However, the issuing initiator should terminate these
response sequences as quickly as possible (by setting the F bit to 1),
preferably with no data. The target must wait for responses on all affected TTTs
before acting on either of these two task management requests. A case in
which all or part of the response sequence is not received for a valid TTT
(because of digest errors) may be treated by the target as within-command
error recovery (if it is supporting an error recovery level of 1 or higher).
Alternatively the target may drop the connection to complete the requested
task set function.
The Target Reset function (Warm and Cold) is optional. Target Warm Reset may
be subject to SCSI access controls for the requesting initiator. When
authorization fails at the target, the appropriate response as described in the
section Task Management Function Response PDU (following) must be returned
by the target. The Target Cold Reset is not subject to SCSI access controls, but
its execution privileges may be managed by iSCSI via Login Authentication.
For the Target Warm Reset and Target Cold Reset functions, the target cancels
all pending operations on all LUs known to the initiator. Both functions are
equivalent to the Target Reset function as specified by [SAM2]. They can affect
many other initiators logged into the same servicing SCSI target port.
The use of Target Cold Reset may be limited by iSCSI access controls but not
by SCSI access controls. It is handled as a power-on event so, when the Target
Cold Reset function is complete, the target must terminate all of its TCP
connections to all initiators (all sessions are terminated). Therefore, the
service responses for this function may not be reliably delivered to the issuing
initiator port.
For the Task Reassign function, the target should reassign allegiance to the
connection on which this command is executed (and thus resume iSCSI
exchanges for the task). The target must receive Task Reassign only after the
connection on which the command was previously executing has been
successfully logged out. The task management response must be issued before
the reassignment becomes effective.
Task Management Function Response PDU
The target performs the requested task management function and sends the initiator a task management response for the following functions: Abort Task, Abort Task Set, Clear ACA, Clear Task Set, Logical Unit Reset, Target Warm Reset, Target Cold Reset, and Task Reassign. For Target Cold Reset and Target Warm Reset, the target cancels
all pending operations across all logical units known to the issuing initiator. For
Target Cold Reset, the target must then close all of its TCP connections to all
initiators (i.e., terminate all sessions). As a result, the response may not be
delivered to the initiator reliably. For Task Reassign, the new connection
allegiance must only become effective at the target after the target issues the
Task Management Function Response PDU.
Response is provided by the target, and it may take the values listed here. All
other values are reserved.
0 Function complete
1 Task does not exist
2 LUN does not exist
3 Task still allegiant
4 Task allegiance reassignment not supported
5 Task management function not supported
6 Function authorization failed
255 Function rejected
The mapping of the response code onto an initiator SCSI service response code
value, if needed, is outside the scope of this book. However, in symbolic terms,
Response value 0 maps to the SCSI service response of FUNCTION COMPLETE.
All other Response values map to the SCSI service response of FUNCTION
REJECTED. If a Task Management Function Response PDU does not arrive
before the session is terminated, the initiator SCSI service response is
SERVICE DELIVERY OR TARGET FAILURE.
The response to Abort Task Set and Clear Task Set must be issued by the target only after it has received all affected commands and the outstanding TTT response sequences and the target SCSI layer has completed the function (see the sequence of events described later in this section).
If the Referenced Task Tag does not identify an existing task, but if the
CmdSN indicated by the RefCmdSN field in the task management function
request is within the valid CmdSN window (between MaxCmdSN and
ExpCmdSN), the target must consider the CmdSN received and return the
"Function complete" response.
If the Referenced Task Tag does not identify an existing task, and if the
CmdSN indicated by the RefCmdSN field in the task management function
request is outside the valid CmdSN window, the target must return the
"Task does not exist" response.
ITT (initiator task tag) is the unique value the initiator gives to each task,
used to identify the commands (as explained in Chapter 10, Task Management)
and returned on this response PDU.
The execution of Abort Task Set and Clear Task Set consists of the following
sequence of events on each of the entities:
The initiator:
Continues to respond to each valid TTT received (via R2T, Text Response,
NOP-In, or SCSI Data-In PDUs) for the affected task set.
Receives any responses for the tasks in the affected task set (it may
process them as usual since they are guaranteed to be valid).
Receives the task set management response, thus concluding all tasks in
the affected task set.
The target:
Waits for all TTTs to be responded to and for all affected tasks in the task
set to be received.
Propagates the command up to, and receives the response from, the target
SCSI layer.
Text Request PDU
The Text Request PDU allows the exchange of information and future
extensions. It permits the initiator to inform a target of its capabilities or to
request some special operations. (For a further explanation of this PDU, see
Chapter 6, the section Text Requests and Responses.)
C (continue bit), when set to 1, indicates that the text request (a set of
key=value pairs) is not complete (it will be continued on a subsequent Text
Request PDU); otherwise, it indicates that this Text Request PDU ends a set of
key=value pairs. A Text Request PDU with the C bit set to 1 must have the F
bit set to 0.
LUN (logical unit number) If the TTT is not hex FFFFFFFF, this field must contain the LUN sent by the target in the Text Response PDU.
ITT (initiator task tag) is the initiator-assigned identifier for this text
request PDU. If the command is sent as part of a sequence of text requests
and responses, the ITT must be the same for all the requests within the
sequence.
TTT (target transfer tag) is set to the reserved value of hex FFFFFFFF when
the initiator originates a Text Request PDU to the target. However, when the
initiator answers a Text Response PDU from the target, this field must contain
the value the initiator copies from that PDU.
The target sets the TTT in a Text Response PDU to a value other than the
reserved value hex FFFFFFFF whenever it wants to indicate that it has more
data to send or more operations to perform that are associated with the
specified ITT. The target must do this whenever it sets the F bit to 0 in the
response. By copying the TTT from the response into the next Text Request
PDU it sends, the initiator tells the target to continue the operation for the
specific ITT. The initiator must ignore the TTT in the Text Response PDU when
the F bit is set to 1.
When the initiator sets the TTT in this PDU to the reserved value hex
FFFFFFFF, it tells the target that this is a new request and the target should
reset any internal state associated with the ITT (resets the current negotiation
state). This mechanism allows the initiator and target to transfer a large
amount of textual data over a sequence of text command/text response
exchanges, or to perform extended negotiation sequences. A target may reset
its internal negotiation state if the initiator stalls an exchange for a long time
or if it is running out of resources.
A key=value pair can span text request or response boundaries (i.e., it can
start in one PDU and continue in the next). In other words, the end of a PDU
does not necessarily signal the end of a key=value pair.
The target sends its response back to the initiator. The response text format is
similar to the request text format. That text response may refer to key=value
pairs presented in an earlier text request, and the text in that request may
refer to a still earlier response.
Text operations are usually meant for parameter setting/negotiations, but can
also perform some long-lasting operations. Those that take a long time should
be placed in their own Text Request PDU.
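Mechanically, the C bit amounts to chopping a null-delimited key=value stream into segments no larger than the peer's maximum data segment length, with only the last segment clearing the C bit. A sketch (the keys in the example are merely illustrative here):

def text_segments(pairs, max_seg_len):
    """Split key=value pairs into Text Request data segments (sketch).

    Each pair is terminated by a 0x00 delimiter; pairs may span segment boundaries,
    and every segment except the last is sent with the C (continue) bit set to 1.
    """
    data = b"".join(f"{k}={v}".encode("utf-8") + b"\x00" for k, v in pairs)
    segments = [data[i:i + max_seg_len] for i in range(0, len(data), max_seg_len)]
    return [(seg, i < len(segments) - 1) for i, seg in enumerate(segments)]  # (data, C bit)

pairs = [("HeaderDigest", "CRC32C,None"), ("MaxRecvDataSegmentLength", "262144")]
for seg, c_bit in text_segments(pairs, max_seg_len=32):
    print(c_bit, seg)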
Text Response PDU
The Text Response PDU contains the target's response to the initiator's text
request. Its text field format matches that of the Text Request PDU.
F (final bit) when set to 1 in response to a Text Request PDU that has its F bit
set to 1, indicates that the target has finished the entire operation. An F bit
set to 0 in response to a Text Request PDU that has its F bit set to 1 indicates
that the target has more work to do (i.e., it invites a follow-on text request).
Other value settings are as follows:
The F bit set to 1, in response to a Text Request PDU with the F bit set to
0, is a protocol error.
If the F bit is set to 1, the PDU must not contain key=value pairs that
require additional answers from the initiator.
If the F bit is set to 1 the PDU must have its TTT field set to the reserved
value of hex FFFFFFFF.
If the F bit is set to 0, the PDU must have a TTT field set to a value
different from the reserved hex FFFFFFFF.
C (continue bit), when set to 1, indicates that the text (a set of key=value
pairs) in this Text Response PDU is not complete (it will be continued on a
subsequent Text Response PDU). A C bit with a 0 value indicates that the PDU
ends a set of key=value pairs. A Text Response PDU with the C bit set to 1
must have the F bit set to 0.
LUN (logical unit number) may be set to a valid significant value if the TTT
is not hex FFFFFFFF; otherwise, it is reserved.
ITT (initiator task tag) matches the ITT in the initial Text Request PDU.
TTT (target transfer tag) When a target has more text data than it can send
in a single Text Response PDU or it has to continue the negotiations (and has
enough resources to proceed), this field must be a valid value (not the
reserved value of hex FFFFFFFF); otherwise, it must be set to hex FFFFFFFF.
If the TTT is not hex FFFFFFFF, the initiator must copy it and the LUN field
from this PDU into its next request to indicate that it wants the rest of the
data. Whenever the target receives a Text Request PDU with the TTT set to the
reserved value of hex FFFFFFFF, it resets its internal information (resets state)
associated with the given ITT. A target may reset its internal state associated with an ITT (the current negotiation state), expressed through the TTT, if the initiator fails to continue the exchange for some time. It may also reject subsequent text requests that have the TTT set to the same, now "stale," value.
When a target cannot finish the operation in a single text response and does not have enough resources to continue, it should reject the Text Request PDU with a Reject PDU that contains the appropriate reason code (see Reject PDU).
A key=value pair can span Text Response PDU boundaries (i.e., a key=value
pair can start in one PDU and continue in the next). In other words, the end of
a PDU does not necessarily signal the end of a key=value pair. To get the
missing part(s) of a key=value pair, an initiator may have to send an empty
Text Request PDU. Text for text requests and responses can span several PDUs.
If the length of a PDU does not allow it to hold the whole text request, the text
response may refer to key=value pairs presented in an earlier text request.
Although the initiator is the requesting party and controls the request
response initiation and termination, the target can offer key=value pairs of its
own as part of a sequence and not only in response to the initiator. The text
response may refer to key=value pairs presented in an earlier text request,
and the text in that request may refer to key=value pairs in still earlier
responses.
Appendix B. Keys and Values
In the following table, each keyword is described according to the following:
I Initiator
T Target
I&T Both
SW Session-wide
CO Connection only
Default value
Example use
Comments
AND The result is the Boolean AND of the offered value and the
responding value.
OR The result is the Boolean OR of the offered value and the responding
value.
Authentication keys: CHAP_<key>, KRB_<key>, SPKM_<key>, SRP_<key>
No default.
Examples: CHAP_A=<A1,A2>, KRB_AP_REQ=<krb_ap_req>, SPKM_REQ=<spkm-req>, SRP_U=<userid>
TargetName
Senders: initiator and target (I&T); scope: SW; declarative (D); value: <iSCSI-name-value>; no default.
Example: TargetName=iqn.2001-02.com.wonder.zz
Comment: Must be sent by the initiator on the first login request per connection. Not sent by the initiator on a discovery session. Sent by the target in response to a SendTargets command.
TargetPortalGroupTag
Sender: target (T); use: IO; declarative (D); scope: SW; value: <unsigned-integer-1-to-65535> (actually a binary value); no default.
Example: TargetPortalGroupTag=1
Comment: Value is the 16-bit TPGT of the connection. It is returned to the initiator in the first login response PDU of the session.
X-<VendorSpecificKey>
Use: all; senders: initiator and target (I&T); value: <Vendor Specific Values>; no default.
Example: X-com.ibm.DoTheRightThing=ofcourse
Comment: The vendor's reverse DNS name should follow the X- and precede the function name.
X#<IANA-registered-string>
Use: all; senders: initiator and target (I&T); value: <IANA-registered values>; no default.
Example: x#my_stuff=5
Comment: The key and the values must be registered by IANA.
SCSI device According to [SAM2], an entity that contains other SCSI entities.
For example, a SCSI initiator device contains one or more SCSI initiator ports
and zero or more application clients. A SCSI target device contains one or
more SCSI target ports and one or more logical units. For iSCSI, the SCSI
device is the component within an iSCSI node that provides SCSI functionality.
As such, there can be only one SCSI device within a given iSCSI node. Access
to the SCSI device can only be achieved in an iSCSI normal, operational
session. The SCSI device name is the iSCSI node name, and its use is
mandatory in the iSCSI protocol.
SCSI port According to [SAM2], an entity in a SCSI device that provides SCSI
functionality to interface with a service delivery subsystem or transport. For
iSCSI, the definition of a SCSI initiator port and that of a SCSI target port are
different.
The SCSI target port Maps to an iSCSI target portal group. The SCSI
target port name and the SCSI target port identifier are both defined to be
the iSCSI target name, together with a label that identifies it as a target
port name/identifier and the target portal group tag.
The SCSI port name Mandatory in iSCSI, when used in SCSI parameter
data. SCSI port names have a maximum length of 255 bytes. It should be
formatted as follows, in the order given:
The iSCSI name of the node, followed by ",i," and the ISID (for an initiator port) or ",t," and the target portal group tag (for a target port).
Zero to three null pad bytes so that the complete format is a multiple of four bytes long.
I-T nexus A relationship between a SCSI initiator port and a SCSI target port.
For iSCSI, this relationship is a session, defined as a relationship between an
iSCSI initiator's end of the session (the initiator port) and the iSCSI target's
portal group. The I-T nexus can be identified by the conjunction of the SCSI
port names. Specifically, its identifier is the tuple (iSCSI initiator name + ",i," + ISID, iSCSI target name + ",t," + target portal group tag). Note that
the I-T nexus identifier is not the same as the session identifier (SSID).
Consequences of the Model
Between an iSCSI (SCSI) initiator port and an iSCSI (SCSI) target port, at any
given time only one I-T nexus (session) can exist. Said another way, no more
than one nexus relationship is allowed (no parallel nexus).
ISID rule Between an iSCSI initiator and an iSCSI target portal group (SCSI
target port), there can be only one session with a given ISID that identifies the
SCSI initiator port. The ISID contains a naming authority component that
facilitates compliance with this rule.
The iSCSI initiator node is expected to manage the assignment of ISIDs prior
to session initiation. The ISID rule does not preclude the use of the same ISID
from the same iSCSI initiator with different target portal groups on the same
iSCSI target or on other iSCSI targets. Allowing this is analogous to a single
SCSI initiator port having relationships (nexus) with multiple SCSI target ports
on the same SCSI target device or the same SCSI target ports on other SCSI
target devices. It is also possible to have multiple sessions with different ISIDs
to the same target portal group. Each such session is considered to be with a
different initiator even when the sessions originate from the same initiator
device. A different iSCSI initiator may use the same ISID because it is the
iSCSI name together with the ISID that identifies the SCSI initiator port.
A consequence of the ISID rule and the specification for the I-T nexus
identifier is that two nexus with the same identifier should not exist at the
same time.
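The I-T nexus identifier can be formed mechanically from the names and tags involved; the exact string formatting below is only illustrative:

def it_nexus_identifier(initiator_name, isid, target_name, tpgt):
    """Build the (initiator port name, target port name) tuple identifying an I-T nexus."""
    initiator_port = f"{initiator_name},i,0x{isid:012x}"   # ISID is a 6-byte value
    target_port = f"{target_name},t,0x{tpgt:04x}"          # TPGT is a 16-bit value
    return (initiator_port, target_port)

print(it_nexus_identifier("iqn.1998-01.com.example:host1", 0x00023D000002,
                          "iqn.2001-02.com.wonder.zz", 1))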
TSIH rule The iSCSI target selects a nonzero value for the TSIH at session
creation (when an initiator presents a zero value at login). After being selected
the same TSIH value must be used whenever the initiator or the target refers
to the given session and a TSIH is required.
If the SCSI logical unit device server does not maintain initiator-specific mode
pages, and if an initiator makes changes to port-specific mode pages, the
changes may affect all other initiators logged in to that iSCSI target through
the same target portal group.
Changes via mode pages to the behavior of a portal group via one iSCSI target
node should not affect the behavior of this portal group with respect to other
iSCSI target nodes, even if the underlying implementation of a portal group
serves multiple iSCSI target nodes in the same network entity.
Appendix D. Numbers, Characters, and Bit
Encodings
The sections that follow describe iSCSI numbers, characters, and bit
encodings.
Text Format
The initiator and target send a set of key=value pairs encoded in UTF-8
Unicode. All the text keys and text values in this book are case sensitive and
should be used in the case in which they appear.
The term "key" is used frequently in this book with the meaning of "key-
name." A value is whatever follows the = up to a 0-byte delimiter that
separates one key=value pair from the next one or marks the end of the data
(for the last key=value pair if the PDU C bit is set to 0).
Any iSCSI target or initiator must be able to receive at least 8,192 bytes of
key=value data in a negotiation sequence. When proposing or accepting
authentication methods that require support for very long authentication items
(such as public key certificates), the initiator and target must be able to
receive 64 kilobytes of key=value data.
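Parsing the negotiation text is just a matter of splitting on the 0-byte delimiters and then on the first "=". A minimal sketch, which assumes the data segment is complete (C bit 0):

def parse_keyvalue_data(data):
    """Parse a complete key=value data segment into a dict (sketch)."""
    pairs = {}
    for item in data.split(b"\x00"):
        if not item:
            continue                     # trailing delimiter or empty slot
        key, _, value = item.partition(b"=")
        pairs[key.decode("utf-8")] = value.decode("utf-8")
    return pairs

sample = b"InitiatorName=iqn.1998-01.com.example:host1\x00MaxBurstLength=262144\x00"
print(parse_keyvalue_data(sample))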
Appendix E. Definitions
What follows is a quick index of various iSCSI-related terms. Much, but not all,
of it is from the [iSCSI].
Alias
Connection
iSCSI device
iSCSI layer
iSCSI name
iSCSI node
Represents a single iSCSI initiator or target. There are one or more iSCSI
Nodes within a network entity, accessible via one or more network portals.
The separation of the iSCSI node name from the addresses used by and for
the node allows multiple iSCSI nodes to use the same addresses and the
same iSCSI node to use multiple addresses.
iSCSI target node
Also known simply as the target, an iSCSI node within the iSCSI server network entity.
iSCSI task
iSCSI transfer direction
Defined with regard to the initiator; thus, outbound transfers are from the initiator to the target, whereas inbound transfers are from the target to the initiator.
I-T nexus
According to [SAM2], a relationship between a SCSI initiator port and a
SCSI target port. For iSCSI, this relationship is a session, defined as a
relationship between an iSCSI initiator's end (SCSI initiator port) and an
iSCSI target's portal group. The I-T nexus can be identified by the
conjunction of the SCSI port names; that is, the I-T nexus identifier is the
tuple (iSCSI initiator name + ",i," + ISID, iSCSI target name + ",t," +
portal group tag).
Network entity
Network portal
Originator
PDU (protocol data unit)
The boundary units into which initiators and targets place their messages. Also known as iSCSI PDU.
Portal Groups
Defines a set of Network portals within an iSCSI node that supports the
coordination of a session with connections spanning these portals. Not all
network portals within a portal group participate in every session
connected through that group. One or more portal groups may provide
access to an iSCSI node. Each network portal as utilized by a given iSCSI
node belongs to exactly one portal group within that node.
Portal group tag
A 16-bit bitstring that identifies the portal group within an iSCSI node. All network portals with the same portal group tag in a given iSCSI node are in the same portal group.
Recovery R2T
An R2T generated by a target upon detecting the loss of one or more Data-
Out PDUs through one of the following means: a digest error, a sequence
error, or a sequence timeout. A recovery R2T carries the next unused
R2TSN, but requests part of or the entire data burst that an earlier R2T
(with a lower R2TSN) had already requested.
Responder
SCSI Device
According to [SAM2], an entity that contains one or more SCSI ports that
are connected to a service delivery subsystem and supports a SCSI
application protocol. For example, a SCSI initiator device contains one or
more SCSI initiator ports and zero or more application clients; a SCSI
target device contains one or more SCSI target ports and one or more
device servers and associated logical units. The SCSI device is the
component within an iSCSI node that provides SCSI functionality. There
can be at most one such device within a node. Access to the SCSI device
can only be in an iSCSI normal operational session. The SCSI device name
is the iSCSI name of the node.
SCSI layer
Session
The group of TCP connections that link an initiator with a target and thus
form a session (loosely equivalent to a SCSI I-T nexus). TCP connections
can be added and removed from a session. Across all connections within a
session, an initiator sees one and the same target.
SCSI target port name and SCSI target port identifier
Both defined to be the iSCSI target name together with (a) a label that identifies it as a target port name/identifier and (b) the portal group tag.
TPGT (Target Portal Group Tag)
References that have a direct impact on iSCSI (other than SCSI and
security)
[DHCP iSNS] J. Tseng, "DHCP Options for Internet Storage Name Service,"
http://www.ietf.org/internet-drafts/draft-ietf-dhc-isnoption-03.txt.
[iSCSI-SLP] M. Bakke, "Finding iSCSI Targets and Name Servers Using SLP,"
http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-slp-03.txt.
[AESCBC] S. Frankel, S. Kelly, and R. Glenn, "The AES Cipher Algorithm and
Its Use with IPsec," Internet draft http://www.ietf.org/internet-drafts/draft-
ietf-ipsec-ciph-aes-cbc-03.txt (in progress).
[SEC-IPS] B. Aboba et al., "Securing Block Storage Protocols over IP," Internet
draft, http://www.ietf.org/internet-drafts/draft-ietf-ips-security-16.txt.