1©2017 Open-NFP
Accelerating Networked Applications
with Flexible Packet Processing
Antoine Kaufmann, Naveen Kr. Sharma,
Thomas Anderson, Arvind Krishnamurthy
Timothy Stamler, Simon Peter
University of Washington; The University of Texas at Austin
Networks are becoming faster
[Figure: Ethernet bandwidth [bits/s] by year of standard release (1990-2020), rising from 100 MbE through 1 GbE, 10 GbE, 40 GbE, and 100 GbE to 400 GbE.]
5ns inter-arrival time for 64B packets at 100Gbps
...but software packet processing is slow
Recv+send TCP stack processing time (2.2 GHz)
▪ Linux: 3.5µs
▪ Kernel bypass: ~1µs
Single core performance has stalled
Parallelize? Assuming 1µs per packet at 100Gb/s, and ignoring Amdahl's law:
▪ 64B packets => 200 cores
▪ 1KB packets => 14 cores
Many cloud apps dominated by packet processing
▪ Key-value storage, real-time analytics, intrusion detection, file service, ...
▪ All rely on small messages: latency & throughput equally important
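The inter-arrival and core-count figures above can be reproduced with a back-of-the-envelope calculation. The sketch below is an illustration only: it assumes 1µs of software processing per packet, ignores Ethernet framing overhead (preamble, inter-frame gap) and any parallel-scaling losses, and the function names are made up for this example. The slide's exact assumptions are not stated, so the results land near, not exactly on, its rounded numbers.

```python
# Back-of-the-envelope estimate under the assumptions stated above.
LINK_BPS = 100e9            # 100 Gb/s link
PER_PACKET_SECONDS = 1e-6   # ~1 µs software processing per packet


def inter_arrival_ns(packet_bytes, link_bps=LINK_BPS):
    """Time between back-to-back packets of the given size, in nanoseconds."""
    return packet_bytes * 8 / link_bps * 1e9


def cores_needed(packet_bytes, per_packet_s=PER_PACKET_SECONDS, link_bps=LINK_BPS):
    """Cores required to keep up with line rate if each packet costs per_packet_s."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return packets_per_second * per_packet_s


if __name__ == "__main__":
    for size in (64, 1024):
        print(f"{size}B: {inter_arrival_ns(size):.2f} ns between packets, "
              f"{cores_needed(size):.1f} cores")
```

With these inputs the 64B case comes out just under 200 cores and the 1KB case at roughly a dozen, which is the same order as the slide's figures.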
What are the alternatives?
RDMA
▪ Bypasses server software entirely
▪ Not well matched to client/server processing (security, two-sided for RPC)
Full application offload to NIC (FPGA, etc.)
▪ Application now at slower hardware-development speed
▪ Difficult to change once deployed
Fixed-function offloads (segmentation, checksums, RSS)
▪ Good start!
▪ Too rigid for today’s complex server & network architecture (next slide)
Flexible function offload to NIC (NFP, FlexNIC, etc.)
▪ Break down functions (e.g., RSS) and provide API for software flexibility
Fixed-function offloads are not well integrated
Wasted CPU cycles
▪ Packet parsing and validation repeated in software
▪ Packet formatted for network, not software access
▪ Multiplexing, filtering repeated in software
Poor cache locality, extra synchronization
▪ NIC steers packets to cores by connection
▪ Application locality may not match connection
A more flexible NIC can help
With multi-core, NIC needs to pick destination core
▪ The “right” core is application specific
NIC is perfectly situated – sees all traffic
▪ Can scalably preprocess packets according to software needs
▪ Can scalably forward packets among host CPUs and network
With kernel-bypass, only NIC can enforce OS policy
▪ Need flexible NIC mechanisms, or go back into kernel
Talk Outline
• Motivation
• FlexNIC model
• Experience with Agilio-CX as prototyping platform
• Accelerating packet-oriented networking (UDP, DCCP)
• Key-value store
• Real-time analytics
• Network Intrusion Detection
• WiP: Accelerating stream-oriented networking (TCP)
FLEXNIC MODEL
FlexNIC: A Model for Integrated NIC/SW Processing
[ASPLOS’16]
• Implementable at Tbps line rate & low cost
Match+action pipeline:
[Figure: Packet → Parser → extracted header fields → M+A Stage 1 → M+A Stage 2 → ... → modified fields. Each M+A stage consists of a match table and an action ALU.]
Match+Action Programs
Supports:
▪ Steer packet
▪ Calculate hash/Xsum
▪ Initiate DMA operations
▪ Trigger reply packet
▪ Modify packets
Does not support:
▪ Loops
▪ Complex calculations
▪ Keeping large state
Example:
Match:
IF udp.port == kvs
Action:
core = HASH(kvs.key) % ncores
DMA hash, kvs TO Cores[core]
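To make the match+action example above concrete, here is a minimal software sketch of one pipeline stage. It is an emulation for illustration, not FlexNIC or P4 code: `KVS_PORT`, `NCORES`, the `Packet` fields, and the use of CRC32 as the hash are all assumptions. Each action does a fixed amount of work, with no loops and no large state, in keeping with the restrictions listed above.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

KVS_PORT = 11211   # hypothetical UDP port of the key-value store
NCORES = 4         # hypothetical number of host cores


@dataclass
class Packet:
    udp_port: int
    key: bytes
    key_hash: Optional[int] = None   # filled in by the action
    core: Optional[int] = None       # destination core chosen by the action


def match_kvs(pkt):
    """Match: IF udp.port == kvs."""
    return pkt.udp_port == KVS_PORT


def action_steer_by_key(pkt):
    """Action: core = HASH(kvs.key) % ncores (CRC32 stands in for HASH)."""
    pkt.key_hash = zlib.crc32(pkt.key)
    pkt.core = pkt.key_hash % NCORES


# A pipeline is a fixed sequence of (match, action) stages.
PIPELINE = [(match_kvs, action_steer_by_key)]


def run_pipeline(pkt):
    """Pass the packet through every stage once; apply actions whose match hits."""
    for match, action in PIPELINE:
        if match(pkt):
            action(pkt)
    return pkt
```

A packet to any other port passes through unmodified (`core` stays `None`), which corresponds to falling back to default steering.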
FlexNIC: M+A for NICs
Efficient application level processing in the NIC
▪ Improve locality by steering to cores based on app criteria
▪ Transform packets for efficient processing in SW
▪ DMA directly into and out of application data structures
▪ Send acknowledgements on NIC
[Figure: FlexNIC architecture: ingress pipeline, egress pipeline, DMA pipeline, and queues.]
Netronome Agilio-CX
We use Agilio-CX to prototype FlexNIC
• Implement M&A programs in P4
• Run on NIC
Our experience with Agilio-CX:
▪ Improve locality by steering to cores based on app criteria
▪ Transform packets for efficient processing in SW
▪ DMA directly into and out of application data structures
▪ Send acknowledgements on NIC
ACCELERATING PACKET-ORIENTED NETWORKING
Example: Key-Value Store
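Receive-side scaling:
core = hash(connection) % N
[Figure: Clients 1, 2, and 3 request keys K = 3, 4; K = 4, 7; and K = 7, 8. The NIC distributes connections across Core 1 and Core 2, which share one hash table; keys 4 and 7 are accessed from both cores.]
• Lock contention
• Poor cache utilization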
Key-based Steering
[Figure: Clients 1, 2, and 3 request keys K = 3, 4; K = 4, 7; and K = 7, 8. The NIC steers requests by key to Core 1 and Core 2, which hold keys 3, 4, 7, and 8 in the hash table.]
Match:
IF udp.port == kvs
Action:
core = HASH(kvs.key) % N
DMA hash, kvs TO Cores[core]
• No locks needed
• Higher cache utilization
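The difference between connection-based and key-based steering can be illustrated with the three clients from the slide. This is a sketch under assumptions: CRC32 stands in for the NIC's hash function, connections are identified by a made-up client name, and `N = 2` cores as in the figure. Under key-based steering every key maps to exactly one core by construction; under connection-based steering a key may be handled by more than one core, depending on how the connections happen to hash.

```python
import zlib

N = 2  # number of cores, as in the figure

# (connection, key) pairs: Client 1 asks for 3 and 4, Client 2 for 4 and 7, ...
REQUESTS = [
    ("client1", 3), ("client1", 4),
    ("client2", 4), ("client2", 7),
    ("client3", 7), ("client3", 8),
]


def rss_core(connection, key):
    """Receive-side scaling: core = hash(connection) % N (the key is ignored)."""
    return zlib.crc32(connection.encode()) % N


def key_core(connection, key):
    """Key-based steering: core = HASH(kvs.key) % N (the connection is ignored)."""
    return zlib.crc32(str(key).encode()) % N


def cores_per_key(steer):
    """Map each key to the set of cores that end up handling it."""
    seen = {}
    for connection, key in REQUESTS:
        seen.setdefault(key, set()).add(steer(connection, key))
    return seen


if __name__ == "__main__":
    print("RSS:      ", cores_per_key(rss_core))
    print("Key-based:", cores_per_key(key_core))
```

Because each key has a single owning core under key-based steering, that core can access its partition of the hash table without locks.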
Custom DMA
DMA to application-level data structures
Requires packet validation and transformation
[Figure: Item log (Item 1, Item 2) and event queue holding GET (G) and SET (S) entries.]
Event queue entries:
▪ GET: Client ID, Hash, Key
▪ SET: Client ID, Item Pointer
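The custom DMA idea can be sketched in software as follows. This is an emulation for illustration only: the class and function names, the item layout in the log (two 16-bit lengths followed by key and value), the `MAX_KEY_LEN` limit, and CRC32 as the hash are assumptions, not details from the talk. The point it shows is that the NIC validates the request and writes it in the form the application consumes, so software never parses the raw packet.

```python
import struct
import zlib
from collections import deque
from dataclasses import dataclass

MAX_KEY_LEN = 250  # hypothetical validation limit


@dataclass
class GetEvent:
    client_id: int
    key_hash: int
    key: bytes


@dataclass
class SetEvent:
    client_id: int
    item_offset: int   # "item pointer": offset of the item in the log


class CoreQueues:
    """Per-core structures that the NIC would DMA into."""

    def __init__(self):
        self.item_log = bytearray()
        self.events = deque()


def nic_deliver(q, client_id, op, key, value=b""):
    """Validate a request and write it straight into application structures."""
    if op not in ("GET", "SET") or not key or len(key) > MAX_KEY_LEN:
        return False  # invalid requests never reach the application
    if op == "GET":
        q.events.append(GetEvent(client_id, zlib.crc32(key), key))
    else:
        offset = len(q.item_log)
        # Assumed item layout: key length, value length, key bytes, value bytes.
        q.item_log += struct.pack("<HH", len(key), len(value)) + key + value
        q.events.append(SetEvent(client_id, offset))
    return True
```

The application core then only drains `events`: a GET carries the precomputed hash, and a SET points at an item that is already in place in the log.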
Evaluation of the Model
• Measure impact on application performance
• Key-based steering: Use NIC
• Custom DMA: Software emulation of M&A pipeline
• Workload: 100k 32B keys, 64B values, 90% GET
• 6 Core Sandy Bridge Xeon 2.2GHz, 2x10G links
Key-based steering
• Better scalability
▪ PCIe is bottleneck for 4+ cores
• 45% higher throughput
• Processing time reduced to 310ns
[Figure: Throughput [million op/s] vs. number of CPU cores (1-5) for FlexKVS/RSS, FlexKVS/Key, FlexKVS/Linux, and Memcached.]
Custom DMA reduces time to 200ns
Real-time Analytics System
(De-)multiplexing threads are the performance bottleneck
• 2 CPUs required for 10 Gb/s => 20 CPUs for 100 Gb/s
[Figure: NIC and software components: Rx queue, Demux, ACKs, Count and Rank workers, Mux, Tx queue.]
Real-time Analytics System
Offload (de)multiplexing and ACK generation to FlexNIC
• No CPUs needed for packet processing => better energy efficiency
[Figure: NIC and software components: Rx queue, Demux, ACKs, Count and Rank workers, Mux, Tx queue.]
Performance Evaluation
[Figure: Throughput [million tuples/s] for the Balanced and Grouped workloads; series: Apache Storm, FlexStorm/Linux, FlexStorm/Bypass, FlexStorm/FlexNIC. Bar labels: 0.5x, 1x, 2x (Balanced); 0.3x, 1x, 2.5x (Grouped).]
• Cluster	of	3	machines
• Determine	Top-n	Twitter	posters	(real	trace)
• Measure	attainable	throughput
Network Intrusion Detection
Snort sniffs packets and analyzes them
• Parallelized by running multiple instances
• Status quo: Receive-side scaling
FlexNIC:
• Analyze rules loaded into Snort
• Partition rules among cores to maximize caching
• Fine-grained steering to cores
Result: 1.6x higher throughput, 30% fewer cache misses
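The talk does not say how the rules are partitioned, so the following is only one plausible sketch of the idea: group rules by the destination port they apply to, assign each group to the least-loaded core, and steer packets with the resulting port-to-core table. The rule list, the grouping criterion, and the fallback for ports without rules are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rule set: (rule_id, destination port the rule applies to).
RULES = [(1, 80), (2, 80), (3, 80), (4, 22), (5, 53), (6, 53), (7, 443)]


def partition_rules(rules, ncores):
    """Assign each port's rule group to one core, balancing rule counts."""
    by_port = defaultdict(list)
    for rule_id, port in rules:
        by_port[port].append(rule_id)

    load = [0] * ncores
    port_to_core = {}
    # Greedy: place the largest groups first, each on the least-loaded core.
    for port, ids in sorted(by_port.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        core = load.index(min(load))
        port_to_core[port] = core
        load[core] += len(ids)
    return port_to_core, load


def steer(dst_port, port_to_core, ncores):
    """Fine-grained steering: send a packet to the core that holds its rules."""
    return port_to_core.get(dst_port, dst_port % ncores)
```

Each core then only needs the rules for its own ports in cache, which is the effect the slide attributes to partitioning.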
ACCELERATING STREAM-ORIENTED NETWORKING
Ongoing work: Stream protocols
Full TCP processing is too complex for M&A processing
▪ Significant connection state required
▪ Tricky edge cases: reordering, drops
▪ Complicated algorithms for congestion control
But the common case is simpler: it can be offloaded
▪ Reduces the critical path in software
Opportunity: Enforce correct protocol behavior on untrusted apps
▪ Focus: congestion control
FlexTCP ideas
Safety critical & common processing on NIC
▪ Includes filtering, validating ACKs, enforcing rate limits
Handle all non-common cases in software
▪ E.g. packet drops, re-ordering, timeouts, …
Requires small per-flow state
▪ 64 bytes (SEQ/ACK, queues, rate-limit, …)
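To give a feel for what fits into 64 bytes of per-flow state, here is a hypothetical layout. The slide only names the categories (SEQ/ACK, queues, rate limit); every field name and width below is an assumption chosen so that the record adds up to 64 bytes, not the actual FlexTCP layout. The `is_common_case` helper is likewise an illustrative guess at a fast-path check.

```python
import struct
from collections import namedtuple

# Little-endian, no implicit padding; field widths are assumptions.
FLOW_STATE = struct.Struct(
    "<"
    "I"    # next sequence number to send
    "I"    # next sequence number expected from the peer
    "Q"    # receive queue base address
    "I"    # receive queue length
    "I"    # receive queue head offset
    "Q"    # transmit queue base address
    "I"    # transmit queue length
    "I"    # transmit queue head offset
    "I"    # rate limit set by the kernel
    "I"    # rate-limiter tokens currently available
    "I"    # count of ECN-marked packets
    "I"    # count of acknowledged packets
    "H"    # flags
    "6x"   # padding up to 64 bytes
)

FlowState = namedtuple(
    "FlowState",
    "tx_seq rx_expected rx_base rx_len rx_head tx_base tx_len tx_head "
    "rate_limit tokens ecn_marked acked flags",
)


def is_common_case(state, segment_seq):
    """In-order segment: handle on the NIC; anything else goes to software."""
    return segment_seq == state.rx_expected
```

A record packs and unpacks with `FLOW_STATE.pack(*state)` and `FlowState(*FLOW_STATE.unpack(buf))`, and `FLOW_STATE.size` confirms the 64-byte budget.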
FlexTCP overview
Flexible congestion control offload
NIC enforces per-flow rate limits set by trusted kernel
▪ Flexibility to choose congestion control
Example: DCTCP
Common-case processing on NIC
▪ Echo ECN marks in generated ACK
▪ Track fraction of ECN marked packets per flow
Kernel implements control policy (DCTCP)
▪ Use NIC-reported fraction of packets that are ECN marked
▪ Adapt rate limit according to DCTCP protocol
Result: Indistinguishable from pure software implementations
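The kernel-side control policy can be sketched as below. The moving-average update of `alpha` and the `1 - alpha/2` reduction follow the DCTCP algorithm (RFC 8257, with its recommended gain of 1/16); applying them to a per-flow rate limit instead of a congestion window, the additive increase step, the initial `alpha` of 0, and the rate bounds are assumptions made for this illustration and are not taken from the talk.

```python
G = 1 / 16  # DCTCP estimation gain (value recommended in RFC 8257)


class DctcpRateController:
    """Kernel-side policy: turns NIC-reported ECN fractions into a rate limit."""

    def __init__(self, rate_bps, min_rate_bps=1e6, max_rate_bps=10e9, step_bps=10e6):
        self.rate_bps = rate_bps
        self.min_rate_bps = min_rate_bps
        self.max_rate_bps = max_rate_bps
        self.step_bps = step_bps      # assumed additive increase per interval
        self.alpha = 0.0              # running estimate of the marked fraction

    def update(self, ecn_fraction):
        """Called once per control interval with the fraction reported by the NIC."""
        self.alpha = (1 - G) * self.alpha + G * ecn_fraction
        if ecn_fraction > 0:
            # Congestion seen: back off in proportion to the marked fraction.
            self.rate_bps *= 1 - self.alpha / 2
        else:
            # No marks: probe for more bandwidth.
            self.rate_bps += self.step_bps
        self.rate_bps = min(max(self.rate_bps, self.min_rate_bps), self.max_rate_bps)
        return self.rate_bps
```

The NIC only counts marks and enforces whatever rate the kernel last wrote, which is why a different congestion control policy can be swapped in without touching the NIC program.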
FlexTCP overhead evaluation
• We implemented FlexTCP in P4
• Run on Agilio-CX with null application
• Compare throughput to basic NIC (wiretest)
[Figure: Throughput [Gb/s] vs. packet size (256, 512, 1024, 1500 bytes) for the Basic and Full configurations.]
Summary
Networks are becoming faster, CPUs are not
▪ Server applications need to keep up
▪ Fast I/O requires efficient I/O path to application
Flexible offloads can eliminate inefficiencies
▪ Application control over where packets are processed
▪ Efficient steering, validation, transformation
Case studies: Key-value store, real-time analytics, IDS
▪ Up to 2.5x throughput & latency improvement vs. kernel-bypass
▪ Vastly more energy-efficient (no CPUs for packet processing)
