Linux HTTPS/TCP/IP Stack for the Fast and Secure Web

Kernel HTTPS/TCP/IP stack
for HTTP DDoS mitigation
Alexander Krizhanovsky
Tempesta Technologies, Inc.
ak@tempesta-tech.com

Who am I?
CEO at Tempesta Technologies, Inc.
Developing Tempesta FW - an open source Linux Application Delivery Controller (ADC)
Custom software development in:
● High-performance network traffic processing
e.g. WAF mentioned in the Gartner Magic Quadrant
https://www.ptsecurity.com/ww-en/products/af/
● Databases
e.g. MariaDB SQL System-Versioned Tables
https://mariadb.com/kb/en/library/system-versioned-tables/
https://mariadb.com/conference/session/querying-data-previous-point-time

Problem: HTTP filtration
2013: WAF development at the request of Positive Technologies
● Web attacks
● L7 HTTP/HTTPS DDoS attacks
Nginx, HAProxy, etc. are perfect HTTP accelerators, but not HTTP filters
Netfilter works in TCP/IP stack (softirq) => HTTP(S)/TCP/IP stack
Tempesta FW: a hybrid of HTTP accelerator & firewall

Tempesta FW:
Application Delivery Controller (ADC)

Web Application Firewall (WAF) acceleration
WAFs are slow (Machine learning, DOM, regexps etc.)
Advanced load balancing among more powerful and slow WAFs
Simple & fast web attacks filtering
Some DDoS attacks can simply be served from a fast web cache

Web-accelerators are slow
Slow & non-scalable network I/O (queues are bad for CPU caches)
Data copies & syscalls
Dummy HTTP FSMs
HTTP strings are special: libc functions don’t work well
Corner cases are neglected (good DDoS targets)
TLS data copies (even with kTLS & QUIC), no TCP awareness
Filesystem-based Web-cache (except ATS)
Sometimes request blocking is slower than serving it :)

Application layer DDoS
           Service from cache   Rate limit
Nginx      22us                 23us
(Additional logic in limiting module)
Fail2Ban: write to the log, parse the log, write to the log, parse the log… Really, in the 21st century?!
A tight integration of a Web accelerator and a firewall is needed

Other DDoS filters: firewall & NIDS
IPtables strings, BPF, XDP, NIC filters
● HTTP headers can cross packet bounds
● Scan large URI or Cookie for Host value?
NIPS (e.g. Suricata)
● Powerful rules syntax at L3-L7
● Not a TCP end point => evasions are possible
● TLS terminator is required (data copies & context switches) or double TLS processing
● Double HTTP parsing
● Doesn’t improve Web server performance (mitigation != prevention)

Web-accelerators are slow: SSL/TLS copying
User-kernel space copying
● Copy network data to user space
● Encrypt/decrypt it
● Copy the data to kernel for transmission
Kernel-mode TLS (Linux kTLS)
● Modern kTLS eliminates ingress & egress data copies
● Unaware of TCP transmission state (cwnd & rwnd)
● Doesn’t use SIMD for memcpy() & memset()
● TLS 1.3 is good, but it’s profitable for DDoS bots to be legacy clients
● TLS handshake is still an issue

Linux kernel TLS & DDoS
Most Facebook users have established sessions
TLS handshake is still an issue
● TLS 1.3 has 1-RTT handshake
● TLS 1.2 must live for a long time for legacy clients
https://www.netdevconf.org/0x12/session.html?kernel-tls-handshakes-for-https-ddos-mitigation
9.11% libcrypto.so.1.1 [.] __ecp_nistz256_mul_montx
7.80% libc-2.24.so [.] _int_malloc
7.03% libcrypto.so.1.1 [.] __ecp_nistz256_sqr_montx
3.54% libcrypto.so.1.1 [.] sha512_block_data_order_avx2
3.05% libcrypto.so.1.1 [.] BN_div
2.43% libc-2.24.so [.] _int_free
1.89% libcrypto.so.1.1 [.] OPENSSL_cleanse
1.61% libc-2.24.so [.] malloc_consolidate
1.49% libcrypto.so.1.1 [.] ecp_nistz256_avx2_gather_w7
1.41% libc-2.24.so [.] malloc
1.24% libcrypto.so.1.1 [.] ecp_nistz256_point_doublex
1.20% libcrypto.so.1.1 [.] ecp_nistz256_ord_sqr_montx
1.01% libcrypto.so.1.1 [.] __ecp_nistz256_sub_fromx
1.00% libcrypto.so.1.1 [.] BN_lshift
0.87% libcrypto.so.1.1 [.] BN_num_bits_word
0.86% libcrypto.so.1.1 [.] bn_correct_top
0.84% libcrypto.so.1.1 [.] BN_CTX_get
0.81% libc-2.24.so [.] __memset_avx2_unaligned_erms
0.77% libc-2.24.so [.] free
0.74% libcrypto.so.1.1 [.] __ecp_nistz256_mul_by_2x
0.71% libcrypto.so.1.1 [.] BN_rshift
0.59% libcrypto.so.1.1 [.] BN_uadd
0.59% libcrypto.so.1.1 [.] int_bn_mod_inverse
0.54% libc-2.24.so [.] __memmove_avx_unaligned_erms
0.53% libcrypto.so.1.1 [.] aesni_ecb_encrypt

Web-accelerators are slow: profile
% symbol name
1.5719 ngx_http_parse_header_line
1.0303 ngx_vslprintf
0.6401 memcpy
0.5807 recv
0.5156 ngx_linux_sendfile_chain
0.4990 ngx_http_limit_req_handler
=> flat profile

Web-accelerators are slow: syscalls
epoll_wait(.., {{EPOLLIN, ....}},...)
recvfrom(3, "GET / HTTP/1.1\r\nHost:...", ...)
write(1, "...limiting requests, excess...", ...)
writev(3, "HTTP/1.1 503 Service...", ...)
sendfile(3,..., 383)
recvfrom(3, ...) = -1 EAGAIN
epoll_wait(.., {{EPOLLIN, ....}}, ...)
recvfrom(3, "", 1024, 0, NULL, NULL) = 0
close(3)

Web-accelerators are slow: HTTP parser
Start: state = 1, *str_ptr = 'b'
while (++str_ptr) {
    switch (state) {            /* 1. check state */
    case 1:
        switch (*str_ptr) {
        case 'a':
            ...
            state = 1;
            break;
        case 'b':
            ...
            state = 2;          /* 2. set state */
            break;
        }
        break;
    case 2:
        ...                     /* 4. do something */
    }
    ...                         /* 3. jump to while */
}
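For contrast with the switch()-driven loop, a parser can encode the current state in the code position itself and jump straight from one state to the next, so no state variable has to be stored and re-checked for every character. The following is a minimal sketch of that idea in plain C, not Tempesta FW's actual parser: the names parse_header_line and struct hdr, and the simplified header grammar, are assumptions made for illustration.

```c
#include <stddef.h>

struct hdr {
    size_t name_len;  /* bytes before ':' */
    size_t val_off;   /* offset of the first value byte */
    size_t val_len;   /* value length, excluding CRLF */
};

/* Parse one "Name: value\r\n" line from a length-delimited buffer.
 * Returns 0 on success, -1 on malformed or incomplete input. */
static int parse_header_line(const char *buf, size_t len, struct hdr *h)
{
    size_t i = 0;

st_name:                          /* state: inside the header name */
    if (i == len)
        return -1;
    if (buf[i] == ':') {
        if (i == 0)
            return -1;            /* empty name */
        h->name_len = i++;
        goto st_ows;              /* direct jump, no re-dispatch */
    }
    if (buf[i] == ' ' || buf[i] == '\r' || buf[i] == '\n')
        return -1;
    ++i;
    goto st_name;

st_ows:                           /* state: optional whitespace */
    if (i == len)
        return -1;
    if (buf[i] == ' ' || buf[i] == '\t') {
        ++i;
        goto st_ows;
    }
    h->val_off = i;

st_value:                         /* state: inside the header value */
    if (i == len)
        return -1;
    if (buf[i] == '\r') {
        h->val_len = i - h->val_off;
        ++i;
        goto st_lf;
    }
    if (buf[i] == '\n')
        return -1;
    ++i;
    goto st_value;

st_lf:                            /* state: expect LF after CR */
    if (i == len || buf[i] != '\n')
        return -1;
    return 0;
}
```

Calling parse_header_line() on "Host: example.com\r\n" fills in the name length and the value span without copying or NUL-terminating anything. A production parser would also need to resume across packet boundaries, which this sketch does not attempt.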

Web-accelerators are slow: strings
We have AVX2, but GLIBC still doesn’t use it
HTTP strings are special:
● No ‘\0’-termination (if you’re zero-copy)
● Special delimiters (‘:’ or CRLF)
● strcasecmp(): no need for case conversion of one of the strings
● strspn(): limited number of accepted alphabets
switch()-driven FSM is even worse

Fast & secure HTTP parser
http://natsys-lab.blogspot.ru/2014/11/the-fast-finite-state-machine-for-http.html
● 1.6-1.8 times faster than Nginx’s
HTTP-optimized AVX2 strings processing:
http://natsys-lab.blogspot.ru/2016/10/http-strings-processing-using-c-sse42.html
● Injection attacks prevention: strict sets of allowed characters
● strncasecmp() ~x3 faster than GLIBC’s
● URI matching ~x6 faster than GLIBC’s strspn()
● kernel_fpu_begin()/kernel_fpu_end() for whole softirq shot
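The string properties listed on these slides can be illustrated without SIMD. The sketch below is a scalar stand-in, not the AVX2 code from the linked post: it compares a length-delimited input against a pattern that is already stored in lower case (so only one side is case-folded), and it measures the prefix of a buffer that stays within a fixed set of allowed characters. The names eq_nocase_lower, struct charset, charset_init and span_allowed are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Case-insensitive equality for len bytes. The pattern must already be
 * lower case, so only the input bytes are folded. No NUL terminator is
 * required on either side. */
static int eq_nocase_lower(const char *in, const char *lower_pat, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];

        if (c >= 'A' && c <= 'Z')
            c |= 0x20;                    /* ASCII upper -> lower */
        if (c != (unsigned char)lower_pat[i])
            return 0;
    }
    return 1;
}

/* A fixed alphabet precomputed as a 256-entry lookup table. */
struct charset {
    unsigned char ok[256];
};

static void charset_init(struct charset *cs, const char *accept)
{
    memset(cs->ok, 0, sizeof(cs->ok));
    for (; *accept; accept++)
        cs->ok[(unsigned char)*accept] = 1;
}

/* strspn()-like scan over a length-delimited buffer: returns the number
 * of leading bytes that belong to the allowed set. */
static size_t span_allowed(const struct charset *cs, const char *s, size_t len)
{
    size_t i = 0;

    while (i < len && cs->ok[(unsigned char)s[i]])
        i++;
    return i;
}
```

If span_allowed() returns less than the field length, the field contains a character outside the configured set and the request can be rejected, which is the idea behind strict character sets as an injection filter.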

Web-accelerators are slow: async I/O
Web cache also resides in CPU caches and evicts requests

Web cache: TempestaDB
In-memory database for Web-cache and firewall rules
Cache-conscious Burst Hash Trie
● Short offsets instead of pointers
● (Almost) lock-free
Lock-free block allocator on huge pages for virtually contiguous memory
https://www.percona.com/live/data-performance-conference-2016/sessions/linux-kernel-extension-databases
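The "short offsets instead of pointers" point can be shown with a tiny node pool. This is only an illustration of the general technique, not TempestaDB's trie layout: links are 32-bit indices into one contiguous array, so each link is half the size of a 64-bit pointer and the whole pool stays valid wherever it is mapped. The names struct pool, node_alloc and chain_has are hypothetical.

```c
#include <stdint.h>

#define POOL_CAP 1024
#define NIL UINT32_MAX              /* "null" link */

struct node {
    uint32_t key;
    uint32_t next;                  /* index of the next node, or NIL */
};

struct pool {
    uint32_t used;                  /* number of allocated nodes */
    struct node nodes[POOL_CAP];    /* one contiguous memory area */
};

/* Bump-allocate a node that links to next.
 * Returns the new node's index, or NIL if the pool is full. */
static uint32_t node_alloc(struct pool *p, uint32_t key, uint32_t next)
{
    if (p->used == POOL_CAP)
        return NIL;
    p->nodes[p->used].key = key;
    p->nodes[p->used].next = next;
    return p->used++;
}

/* Walk the chain starting at head; returns 1 if key is present. */
static int chain_has(const struct pool *p, uint32_t head, uint32_t key)
{
    for (uint32_t i = head; i != NIL; i = p->nodes[i].next)
        if (p->nodes[i].key == key)
            return 1;
    return 0;
}
```

Smaller links mean more nodes per cache line, which is the cache-conscious part of the argument.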

The HTTPS/TCP/IP stack
(A crossbreed of an HTTP accelerator and a firewall)
Alternative to user space TCP/IP stacks
HTTPS is built into TCP/IP stack
● HTTP pipelining even for HTTP/1.1
Kernel TLS handshakes (fork from mbedTLS)
HTTP/L7 firewall in addition to nftables and BPF
● TCP & TLS end point (vs. NIPS such as Suricata)
Very fast HTTP parser and strings processing using AVX2
Cache-conscious in-memory Web-cache for DDoS mitigation
TODO: HTTP QoS for asymmetric DDoS mitigation

L7 DDoS mitigation: sticky cookie
User/session identification
● Cookie challenge for dummy DDoS bots
● Persistent/sessions scheduling (no rescheduling on a server failure)
timestamp | HMAC(Secret, User-Agent, timestamp, Client IP)
enforce: HTTP 302 redirect
sticky name=__tfw_user_id__ enforce;

L7 DDoS mitigation: JavaScript challenge
Effectively slows bots down

L7 DDoS mitigation: limits
Rate limits
● request_rate, request_burst
● connection_rate, connection_burst
● concurrent_connections
● http_resp_code_block – blocks password crackers
Slow HTTP
● client_header_timeout, client_body_timeout
● http_header_cnt
● http_header_chunk_cnt, http_body_chunk_cnt
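A rate limit combined with a burst allowance is commonly implemented as a token bucket. The sketch below shows that generic technique for one client; it is not Tempesta FW's limiting code, and treating its rate and burst parameters as analogues of request_rate and request_burst is an assumption. The caller is assumed to pass a monotonic millisecond clock.

```c
#include <stdint.h>

struct bucket {
    uint64_t rate;     /* tokens refilled per second */
    uint64_t cap;      /* capacity in milli-tokens (burst * 1000) */
    uint64_t tokens;   /* current fill in milli-tokens */
    uint64_t last_ms;  /* time of the previous check */
};

static void bucket_init(struct bucket *b, uint64_t rate, uint64_t burst,
                        uint64_t now_ms)
{
    b->rate = rate;
    b->cap = burst * 1000;
    b->tokens = b->cap;            /* start with a full bucket */
    b->last_ms = now_ms;
}

/* Returns 1 if the request is within the limit, 0 if it should be
 * rejected. Each accepted request consumes one token. */
static int bucket_allow(struct bucket *b, uint64_t now_ms)
{
    uint64_t elapsed = now_ms - b->last_ms;

    b->last_ms = now_ms;
    /* 1 token per second equals 1 milli-token per millisecond. */
    if (b->rate && elapsed >= b->cap / b->rate)
        b->tokens = b->cap;        /* long idle: refill completely */
    else
        b->tokens += elapsed * b->rate;
    if (b->tokens > b->cap)
        b->tokens = b->cap;

    if (b->tokens < 1000)
        return 0;
    b->tokens -= 1000;
    return 1;
}
```

With rate 2 and burst 3, three back-to-back requests pass, a fourth is rejected, and one more is admitted after 500 ms. Keeping such a check next to the parser avoids the log-and-reparse round trip criticized earlier in the deck.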

Web Application Security (WAF acceleration)
Length limits: http_uri_len, http_field_len, http_body_len
Content validation: http_host_required, http_ct_required, http_ct_vals, http_methods
HTTP Response Splitting: count and match requests and responses
Injections: verify allowed (by an administrator) character sets
● Resistant to large HTTP fields (AVX2)
https://natsys-lab.blogspot.ru/2016/10/http-strings-processing-using-c-sse42.html
TODO: decoding before character sets validation

HTTP tables
HTTP load balancer and a firewall (~nftables)
mark-integration with nftables
# nft add rule inet filter input ip saddr 192.168.100.1 mark set 1
# cat etc/tempesta_fw.conf
srv_group backend { server 127.0.0.1:8080; }
vhost protected_host { proxy_pass backend; }
http_chain multi_layer_rules {
    hdr "Referer" == "badhost.com/*" -> block;
    -> protected_host; # all checks are passed
}
http_chain {
    mark == 1 -> multi_layer_rules;
    -> protected_host; # pass all by default
}

Performance
https://github.com/tempesta-tech/tempesta/wiki/HTTP-cache-performance
Most HTTP floods can be mitigated w/o any special filtering!

Performance analysis: comparison w/ Nginx
[Chart: requests per second (rps, 0 to 2.5x10^6) vs. number of connections (ticks at 1, 10, 100, 1000, 10000)]
Tempesta FW vs Nginx; E5-1650v3; HTTP/1.1, 8B response, keep-alive
Series: Nginx 1.11.5, Tempesta FW 0.5.0-pre5

Performance analysis: kernel bypass
Similar to DPDK/user-space TCP/IP stacks
http://www.seastar-project.org/http-performance/
...bypassing Linux TCP/IP isn’t the only way to get a fast Web server
...lives in Linux infrastructure: LVS, tc, IPtables, eBPF, tcpdump etc.

User space HTTP proxying
1. Receive request at CPU1
2. Copy request to user space
3. Update headers
4. Copy request to kernel space
5. Send the request from CPU2
3 data copies
Access TCP control blocks and data buffers from different CPUs

Synchronous sockets: HTTPS/TCP/IP stack
Socket callbacks call TLS and HTTP processing
Everything is processed in softirq (while the data is hot)
No receive & accept queues
No file descriptors
Less locking
Lock-free inter-CPU transport
=> faster socket reading
=> lower latency
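The slides do not describe how the lock-free inter-CPU transport is built. As a rough illustration of the general technique, here is a single-producer/single-consumer ring buffer using C11 atomics: one CPU pushes work items and another pops them without taking a lock. The names struct ring, ring_push and ring_pop are hypothetical, and kernel code would use the kernel's own atomic and barrier primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8                 /* must be a power of two */

struct ring {
    atomic_size_t head;             /* next slot to write (producer) */
    atomic_size_t tail;             /* next slot to read (consumer) */
    void *slot[RING_SIZE];
};

static void ring_init(struct ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns 1 on success, 0 if the ring is full. */
static int ring_push(struct ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return 0;
    r->slot[head & (RING_SIZE - 1)] = item;
    /* Publish the slot contents before moving head forward. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side. Returns the oldest item, or NULL if the ring is empty. */
static void *ring_pop(struct ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return NULL;
    void *item = r->slot[tail & (RING_SIZE - 1)];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```

This variant is only safe with exactly one producer and one consumer per ring; supporting several producers needs a different scheme.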

skb page allocator:
zero-copy HTTP messages adjustment
Add/remove/update HTTP headers w/o copies
skb and its head are allocated in the same page fragment or a compound page

Beta (exp. early 2019)
We’re in alpha (0.5.x)
Beta (1.0, exp. early 2019)
● Tempesta TLS (GPU offload - TBD)
https://www.netdevconf.org/0x12/session.html?kernel-tls-handshakes-for-https-ddos-mitigation
● TLS 1.3
● HTTP/2
● Tunable HTTP proxy buffering & streaming (like Tengine)
● HTTP QoS for asymmetric DDoS mitigation (some ML)
● HTTP URI/Cookie/POST normalization (protection against injection attacks)

Thanks!
Web-site: http://tempesta-tech.com
Availability: https://github.com/tempesta-tech/tempesta
Blog: http://natsys-lab.blogspot.com
E-mail: ak@tempesta-tech.com
