Net Topics
Andi Kleen
SuSE Labs
Contents
" Non blocking Sockets
" Unix Sockets
" Netlink
" Error handling
" Path MTU discovery
" IPv6 sockets
Non blocking sockets
" Allows servers to scale well without threads
" Not much locking overhead (=none)
" Requires state machines
" fcntl(socketfd, F_SETFL, O_NONBLOCK);
" Needed to handle many sockets (threads are
costly)
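A minimal sketch of the fcntl() call above, assuming you want to keep the descriptor's other status flags (hence the F_GETFL first). The names set_nonblocking and demo_nonblocking_read are hypothetical helpers, not part of any API.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put fd into non-blocking mode while preserving its other status flags.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical demo: a read on an empty non-blocking socket fails at once
 * with EAGAIN/EWOULDBLOCK instead of sleeping.
 * Returns 1 if that happened, 0 if not, -1 on setup error. */
int demo_nonblocking_read(void)
{
    int sv[2], would_block;
    char buf[16];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (set_nonblocking(sv[0]) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    n = read(sv[0], buf, sizeof buf);
    would_block = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    close(sv[0]);
    close(sv[1]);
    return would_block;
}
```

The state machine then treats EAGAIN as "come back when the kernel reports an event" rather than as an error.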
Network events
" Incoming data
" Socket ready for writing (socket buffer has room)
" Connection finished
" Error occurred
" Disconnect
" Urgent data arrived.
poll/select
" Ask the kernel in a main loop about events on the
descriptors with poll(2)
" Process event, run state machine on socket and
continue
" Copies a full table into and out of the kernel
" Does not scale well: kernel and user both have to walk
big tables.
" Very portable and great for small servers.
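One iteration of such a main loop might look like the sketch below. It uses two socketpairs as stand-ins for client connections; demo_poll_dispatch is a hypothetical name, and a real server would run its per-socket state machine where the comment indicates.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: two idle sockets, data arrives on only one of them.
 * poll() must report exactly that one as readable.
 * Returns 1 on the expected result, 0 otherwise, -1 on setup error. */
int demo_poll_dispatch(void)
{
    int a[2], b[2], ready, ok;
    struct pollfd fds[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) == -1)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) == -1) {
        close(a[0]);
        close(a[1]);
        return -1;
    }

    /* The table that is copied into and out of the kernel on every call. */
    fds[0].fd = a[0]; fds[0].events = POLLIN; fds[0].revents = 0;
    fds[1].fd = b[0]; fds[1].events = POLLIN; fds[1].revents = 0;

    /* Make the first socket readable. */
    if (write(a[1], "x", 1) != 1)
        ready = -1;
    else
        ready = poll(fds, 2, 1000 /* ms */);

    /* A server would now walk the table and run the state machine
     * for every entry whose revents is non-zero. */
    ok = (ready == 1 &&
          (fds[0].revents & POLLIN) != 0 &&
          (fds[1].revents & POLLIN) == 0);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return ok;
}
```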
Signals vs Realtime signals
" Signal is just a bit in a mask (cannot be lost)
" Many events compress into one bit
" Realtime signals between SIGRTMIN and
SIGRTMAX
" Realtime signals carry data and are delivered in
order
" Can get lost when the queue overflows
Queued SIGIO
" You get a signal for an event.
" Scales well, no big tables to copy or search.
" Kernel supplies siginfo to the signal handler
" Signals are tied to threads or process groups.
Queued SIGIO HOWTO
" fcntl(socketfd, F_SETOWN, getpid())
" fcntl(socketfd, F_SETSIG, rtsig)
" fcntl(socketfd, F_SETFL, O_ASYNC | O_NONBLOCK) turns on signal delivery
" SA_SIGINFO signal handler gets siginfo_t
argument.
" siginfo->si_fd contains the fd
" On overflow you get SIGIO and use poll to pick
up events.
" sigtimedwait is a nice main loop if you don’t want
signal handlers.
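The steps above can be sketched with a sigtimedwait main loop reduced to a single wait. The sketch also sets O_ASYNC with F_SETFL, which fcntl(2) requires for signal-driven I/O. The choice of SIGRTMIN + 1 and the helper names are assumptions for illustration; F_SETSIG is Linux specific and needs _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Ask the kernel to queue realtime signal rtsig for events on fd. */
int enable_queued_sigio(int fd, int rtsig)
{
    int flags;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;
    if (fcntl(fd, F_SETSIG, rtsig) == -1)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical demo: wait for the queued signal instead of using a handler.
 * Returns 1 if the signal arrived and si_fd names the right socket.
 * Note: the signals stay blocked in the calling thread afterwards. */
int demo_queued_sigio(void)
{
    int sv[2], ok = 0;
    int rtsig = SIGRTMIN + 1;          /* assumed free in this process */
    sigset_t set;
    siginfo_t info;
    struct timespec timeout = { 2, 0 };

    /* Block the signals so they queue up for sigtimedwait. */
    sigemptyset(&set);
    sigaddset(&set, rtsig);
    sigaddset(&set, SIGIO);            /* sent instead on queue overflow */
    if (sigprocmask(SIG_BLOCK, &set, NULL) == -1)
        return -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (enable_queued_sigio(sv[0], rtsig) == 0 && write(sv[1], "x", 1) == 1) {
        int sig = sigtimedwait(&set, &info, &timeout);
        ok = (sig == rtsig && info.si_fd == sv[0]);
    }
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

If sigtimedwait returns SIGIO instead of rtsig, the queue overflowed and a poll() pass over all descriptors is needed to catch up.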
Unix Sockets
" Some basics:
" Unix sockets are for local communication
" PF_UNIX; AF_UNIX in POSIX speak
" Two flavors: stream socket and datagram socket.
" Fast (your X runs through them)
" Commonly used for local desktop use (e.g.
GNOME’s ORBit ORB or X11)
Abstract namespace
" Socket endpoints of well known services are
found via socket nodes in the filesystem.
" They do not go away after reboot or when the
server crashes.
" There is no easy way to check if a server has
crashed so recovery is difficult.
" Abstract namespace is a non portable trick to
solve these problems
Abstract namespace 2
" How to use? Simply pass a 0 byte as the first
character of the sockaddr_un.sun_path and then
the abstract name.
" Abstract name only exists as a hash table
internally.
" Goes away when the last reference is gone.
" Very simple semantics unlike file system objects
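A sketch of binding an abstract name, assuming Linux. The helper names and the demo name passed in are hypothetical. Note that the address length passed to bind() must cover the leading 0 byte plus the name, because the name is not NUL terminated.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill addr with an abstract name. Returns the address length to pass to
 * bind()/connect(), or 0 if the name does not fit. */
socklen_t make_abstract_addr(struct sockaddr_un *addr, const char *name)
{
    size_t len = strlen(name);

    if (len > sizeof(addr->sun_path) - 1)
        return 0;
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    /* sun_path[0] stays 0: that marks the abstract namespace. */
    memcpy(addr->sun_path + 1, name, len);
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
}

/* Hypothetical demo: the name is busy while one socket holds it and is
 * free again as soon as that socket is closed (no stale node left behind).
 * Returns 1 if all three steps behave that way, -1 on setup error. */
int demo_abstract_bind(const char *name)
{
    struct sockaddr_un addr;
    socklen_t len = make_abstract_addr(&addr, name);
    int s1, s2, first, second, third;

    if (len == 0)
        return -1;
    s1 = socket(AF_UNIX, SOCK_DGRAM, 0);
    s2 = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (s1 == -1 || s2 == -1) {
        if (s1 != -1) close(s1);
        if (s2 != -1) close(s2);
        return -1;
    }
    first = bind(s1, (struct sockaddr *)&addr, len);   /* takes the name */
    second = bind(s2, (struct sockaddr *)&addr, len);  /* name is in use */
    close(s1);                                         /* last reference */
    third = bind(s2, (struct sockaddr *)&addr, len);   /* free again */
    close(s2);
    return first == 0 && second == -1 && third == 0;
}
```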
Control messages
" Berkeley and POSIX sockets have supported control
messages for some time.
" Only works for SOCK_DGRAM sockets.
" Control messages are passed out of band with
datagrams by the kernel.
" Sockets API supplies some standard macros to
encode them.
" Standardized in POSIX/IPv6 API.
Control messages: what are they good for?
" Credentials passing for Unix sockets.
" File descriptor passing for Unix sockets.
" Setting and receiving interface index/TOS/TTL
for IP and IPv6 packets.
" Sending and receiving IP options (alternative to
RAW sockets)
" Sending and receiving IPv6 extension headers.
Credentials passing
" Often local servers want to check the user and
group id of client processes.
" Management using group rights of file system
sockets is clumsy and works only for well defined
restrictions, not for logging.
" Credentials passing gives you the process and
user and group id of the process that sent the
message.
" Relatively portable if well encapsulated.
Credentials passing, HOWTO
" SO_PASSCRED enables receiving of credentials.
" For connected SOCK_STREAM sockets: use the
SO_PEERCRED getsockopt.
" For SOCK_DGRAM the sender can send an
SCM_CREDENTIALS control message with the
datagram. It contains pid/uid/gid.
" Sender sets its own values, but kernel checks
them. Root can override it. If client sends nothing
the kernel fills in defaults.
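For the connected stream case, the SO_PEERCRED lookup can be sketched as follows. The demo uses a socketpair, so the "peer" is the calling process itself; the helper names are hypothetical, and struct ucred needs _GNU_SOURCE with glibc.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch pid/uid/gid of the process on the other end of a connected
 * PF_UNIX stream socket. Returns 0 on success, -1 on error. */
int get_peer_cred(int fd, struct ucred *cred)
{
    socklen_t len = sizeof *cred;

    return getsockopt(fd, SOL_SOCKET, SO_PEERCRED, cred, &len);
}

/* Hypothetical demo: both ends of a socketpair belong to this process,
 * so the reported credentials must match our own pid and effective ids.
 * Returns 1 on a match, 0 on a mismatch, -1 on error. */
int demo_peer_cred(void)
{
    int sv[2], ok;
    struct ucred cred;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (get_peer_cred(sv[0], &cred) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    ok = (cred.pid == getpid() &&
          cred.uid == geteuid() &&
          cred.gid == getegid());
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

A server would call get_peer_cred() on the descriptor returned by accept() and log or check the result before serving the request.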
File descriptor passing
" Passing file descriptors from one process to
another (»remote dup«)
" Pass an SCM_RIGHTS control message via a
PF_UNIX socket. It contains an fd array.
" Use at least a one byte message to carry it.
" Allows authentication servers for fd resources
" Allows you to use separate processes instead of
threads for better fault isolation.
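A sketch of passing a single descriptor with SCM_RIGHTS, using the standard CMSG_* macros and a one byte carrier message. send_fd, recv_fd and the demo are hypothetical helpers; the demo passes the read end of a pipe across a datagram socketpair inside one process.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one int, aligned for struct cmsghdr. */
typedef union {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
} fd_ctl;

/* Send fd over a PF_UNIX socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                       /* the one byte carrier */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd. Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);
    return fd;
}

/* Hypothetical demo: the received descriptor is a new number that refers
 * to the same pipe. Returns 1 if data written to the pipe can be read
 * through the received descriptor, 0 otherwise, -1 on setup error. */
int demo_fd_passing(void)
{
    int sv[2], p[2], got = -1, ok = 0;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;
    if (pipe(p) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], p[0]) == 0 && (got = recv_fd(sv[1])) >= 0 &&
        write(p[1], "k", 1) == 1 && read(got, &c, 1) == 1)
        ok = (c == 'k' && got != p[0]);
    if (got >= 0)
        close(got);
    close(p[0]); close(p[1]);
    close(sv[0]); close(sv[1]);
    return ok;
}
```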
Netlink
" Message based kernel/user space communication.
" Simple protocol to detect message loss (e.g.
because of out of memory)
" User interface via PF_NETLINK sockets.
" Currently used for routing messages, interface
setup, firewalling, netlink queuing, arpd, ethertap.
Each has its own netlink family.
Netlink messages
" Has a common header with sequence number,
type, flags, length, sender pid.
" Sender can request an ACK or an ECHO for
reliability.
" Multipart messages are used for table dumps.
" The kernel passes back an nlmsgerr message when a
problem occurs.
Sending a netlink message
" Netlink message buffers are set up through
macros from linux/netlink.h
" Compute the length of the buffer with
NLMSG_SPACE, passing the payload length
" Allocate a buffer. Set up the nlmsghdr at the beginning
of the buffer. nlmsg_len is computed by
NLMSG_LENGTH.
" Get a pointer to payload using NLMSG_DATA
and set it up.
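The buffer setup above can be sketched for one concrete request. The choice of an RTM_GETLINK dump with a struct ifinfomsg payload is an assumption made for illustration, and build_getlink_request is a hypothetical helper. Nothing is sent here; the caller would pass the buffer and nlmsg_len to send() on a PF_NETLINK socket.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Build an RTM_GETLINK dump request. The caller frees the result. */
struct nlmsghdr *build_getlink_request(unsigned int seq)
{
    size_t payload = sizeof(struct ifinfomsg);
    /* NLMSG_SPACE: header + payload, rounded up to netlink alignment. */
    struct nlmsghdr *nlh = calloc(1, NLMSG_SPACE(payload));
    struct ifinfomsg *ifi;

    if (nlh == NULL)
        return NULL;
    /* NLMSG_LENGTH: header + payload, without trailing padding. */
    nlh->nlmsg_len = NLMSG_LENGTH(payload);
    nlh->nlmsg_type = RTM_GETLINK;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    nlh->nlmsg_seq = seq;              /* lets us match the replies */
    nlh->nlmsg_pid = 0;                /* left 0 in this sketch */

    /* NLMSG_DATA: pointer to the payload right after the header. */
    ifi = NLMSG_DATA(nlh);
    ifi->ifi_family = AF_UNSPEC;
    return nlh;
}

/* Hypothetical demo: check that the macros produced the expected layout.
 * Returns 1 if so, 0 if not, -1 if allocation failed. */
int demo_request_layout(void)
{
    struct nlmsghdr *nlh = build_getlink_request(1);
    int ok;

    if (nlh == NULL)
        return -1;
    ok = (nlh->nlmsg_len == NLMSG_HDRLEN + sizeof(struct ifinfomsg) &&
          (char *)NLMSG_DATA(nlh) == (char *)nlh + NLMSG_HDRLEN &&
          NLMSG_SPACE(sizeof(struct ifinfomsg)) >= nlh->nlmsg_len);
    free(nlh);
    return ok;
}
```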
Receiving a netlink message
" Fill a buffer using recvmsg() from a netlink
socket.
" The first nlmsghdr is at the beginning of the buffer.
" Check that it is not truncated using NLMSG_OK
" Check the type and, if you’re interested in it, get
the payload using NLMSG_DATA. For rtnetlink
don’t forget the rta attributes.
" Get next message using NLMSG_NEXT
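The receive walk can be sketched as a function over an already filled buffer. In real use the buffer comes from recvmsg() on a netlink socket; so that the sketch runs without one, the demo builds a buffer by hand. DEMO_TYPE and the helper names are hypothetical.

```c
#include <string.h>
#include <linux/netlink.h>

#define DEMO_TYPE 0x20   /* hypothetical message type for the demo */

/* Walk all messages in buf and count those of wanted_type.
 * Stops at NLMSG_DONE; returns -1 if the kernel reported an error. */
int count_netlink_messages(void *buf, int len, int wanted_type)
{
    struct nlmsghdr *nlh;
    int count = 0;

    /* NLMSG_OK rejects truncated messages; NLMSG_NEXT advances nlh
     * and decrements len by the aligned message length. */
    for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == NLMSG_DONE)
            break;
        if (nlh->nlmsg_type == NLMSG_ERROR) {
            struct nlmsgerr *err = NLMSG_DATA(nlh);
            if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof *err) || err->error != 0)
                return -1;
            continue;                  /* error == 0 is just an ACK */
        }
        if (nlh->nlmsg_type == wanted_type)
            count++;                   /* payload is at NLMSG_DATA(nlh) */
    }
    return count;
}

/* Demo helper: write one message at p, return the position after it. */
static char *append_message(char *p, int type, const void *payload, size_t n)
{
    struct nlmsghdr *nlh = (struct nlmsghdr *)p;

    memset(p, 0, NLMSG_SPACE(n));
    nlh->nlmsg_len = NLMSG_LENGTH(n);
    nlh->nlmsg_type = type;
    if (n > 0)
        memcpy(NLMSG_DATA(nlh), payload, n);
    return p + NLMSG_SPACE(n);
}

/* Hypothetical demo: two DEMO_TYPE messages followed by NLMSG_DONE.
 * Returns the number of DEMO_TYPE messages the walk finds. */
int demo_walk(void)
{
    struct nlmsghdr storage[8];        /* 128 bytes, suitably aligned */
    char *p = (char *)storage;
    int value = 42;

    p = append_message(p, DEMO_TYPE, &value, sizeof value);
    p = append_message(p, DEMO_TYPE, &value, sizeof value);
    p = append_message(p, NLMSG_DONE, NULL, 0);
    return count_netlink_messages(storage, (int)(p - (char *)storage),
                                  DEMO_TYPE);
}
```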
Netlink multicast groups
" sockaddr_nl contains a nl_groups bitmask that
allows 32 multicast groups.
" Groups are specific to the netlink family.
" Only root or the kernel can send to a multicast
group.
" User processes bind to them.
" Useful for listening to updates of some common
resource.
Rtnetlink
" Rtnetlink is used to configure the IP stack.
" Superset of the old ioctl interface.
" Can configure and watch interfaces, routes, IP
addresses, routing rules, neighbours (ARP
entries), queueing disciplines and other stuff.
" Kernel uses it internally (ioctls are turned into
netlink)
" User interface in iproute2
" Some groups: Link, Neighbour, Route, Mroute,
Rtnetlink messages
" Messages start with a standard netlink header
(struct nlmsghdr) and a type specific header.
" They come in NEW, GET, DEL flavours for each
object that can be touched.
" GET can dump all objects in the database or only
a matching one.
" Messages carry attributes after the main headers.
" Attributes are like small netlink messages with a
struct rtattr header.
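Walking the attributes works much like walking netlink messages, with the RTA_* macros from linux/rtnetlink.h. The sketch below searches an address message for one attribute. The message is built by hand so the code runs without a socket; the label "lo:demo" and the helper names are made up for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_addr.h>

/* Return the first attribute of the given type, or NULL if absent. */
struct rtattr *find_attr(struct rtattr *rta, int len, int type)
{
    /* RTA_OK checks for truncation, RTA_NEXT advances and shrinks len. */
    for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len))
        if (rta->rta_type == type)
            return rta;
    return NULL;
}

/* Demo helper: append one attribute to the end of a message. */
static void append_attr(struct nlmsghdr *nlh, int type,
                        const void *data, size_t n)
{
    struct rtattr *rta =
        (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));

    rta->rta_type = type;
    rta->rta_len = RTA_LENGTH(n);
    memcpy(RTA_DATA(rta), data, n);
    nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

/* Hypothetical demo: build an RTM_NEWADDR message carrying IFA_LOCAL and
 * IFA_LABEL, then look the label up again.
 * Returns 1 if the label was found intact, 0 otherwise. */
int demo_find_label(void)
{
    union { struct nlmsghdr nlh; char bytes[128]; } msg;
    struct ifaddrmsg *ifa;
    struct rtattr *rta;
    unsigned char local[4] = { 127, 0, 0, 1 };
    const char label[] = "lo:demo";    /* made-up alias name */

    memset(&msg, 0, sizeof msg);
    msg.nlh.nlmsg_type = RTM_NEWADDR;
    msg.nlh.nlmsg_len = NLMSG_LENGTH(sizeof *ifa);
    ifa = NLMSG_DATA(&msg.nlh);
    ifa->ifa_family = AF_INET;
    append_attr(&msg.nlh, IFA_LOCAL, local, sizeof local);
    append_attr(&msg.nlh, IFA_LABEL, label, sizeof label);

    /* IFA_RTA: first attribute after the ifaddrmsg header.
     * IFA_PAYLOAD: number of attribute bytes in this message. */
    rta = find_attr(IFA_RTA(ifa), IFA_PAYLOAD(&msg.nlh), IFA_LABEL);
    return rta != NULL && strcmp(RTA_DATA(rta), label) == 0;
}
```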
A few rtnetlink messages:
" NEW/GET/DEL
" ROUTE: uses struct rtmsg and describes a routing
table entry. Has lots of attributes like
RTA_GATEWAY, RTA_OIF, RTA_IIF etc.
" ADDR: uses struct ifaddrmsg and describes a local IP
address. Has attributes like IFA_LOCAL (local
IP), IFA_LABEL (alias name), etc.
" See include/linux/rtnetlink.h and rtnetlink(7) for a
lot more messages and the details.
Some rtnetlink applications
" Waiting for interface up and down by binding to
RTMGRP_LINK and watching for link up/down
[when the network driver supports the netif_carrier* interface in 2.4 this allows HA
failover and watching for network problems]
struct sock_extended_err {
    u_int32_t ee_errno;   /* errno */
    u_int8_t  ee_origin;  /* Where it came from; see below */
    u_int8_t  ee_type;    /* ICMP type */
    u_int8_t  ee_code;    /* ICMP code */
    u_int8_t  ee_pad;     /* padding */
    u_int32_t ee_info;    /* ICMP specific info (gateway or pmtu) */
    u_int32_t ee_data;    /* other data */
    /* data follows */
};