Linux Performance
2018
Brendan Gregg
Senior Performance Architect
Oct 2018
http://neuling.org/linux-next-size.html
Post frequency:
- https://kernelnewbies.org/Linux_4.18: 4 per year
- https://lwn.net/Kernel/: 4 per week
- LKML, http://vger.kernel.org/vger-lists.html#linux-kernel: 400 per day
https://meltdownattack.com/
Cloud Hypervisor (patches)
Linux Kernel (KPTI)
CPU (microcode)
Application (retpoline)
KPTI: Linux 4.15 & backports
Server A: 31353 MySQL queries/sec
Server B: 22795 queries/sec (27% slower)
serverA# mpstat 1
Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx) 02/09/2018 _x86_64_ (36 CPU)
01:09:13 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:09:14 AM all 86.89 0.00 13.08 0.00 0.00 0.00 0.00 0.00 0.00 0.03
01:09:15 AM all 86.77 0.00 13.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:09:16 AM all 86.93 0.00 13.02 0.00 0.00 0.00 0.03 0.00 0.00 0.03
[...]
serverB# mpstat 1
Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx) 02/09/2018 _x86_64_ (36 CPU)
01:09:44 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:09:45 AM all 82.94 0.00 17.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:09:46 AM all 82.78 0.00 17.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:09:47 AM all 83.14 0.00 16.86 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[...]
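The 27% figure follows directly from the two throughput numbers. A minimal sketch of the arithmetic, using only the values quoted above (the function name is illustrative, not part of any tool):

```python
def percent_slower(baseline: float, observed: float) -> float:
    """Return how much lower `observed` is than `baseline`, as a percentage."""
    return (baseline - observed) / baseline * 100.0


server_a_qps = 31353  # MySQL queries/sec on Server A (from the slide)
server_b_qps = 22795  # queries/sec on Server B (from the slide)

slowdown = percent_slower(server_a_qps, server_b_qps)
print(f"Server B is {slowdown:.1f}% slower than Server A")
```

The mpstat output points the same way: Server B spends roughly 17% of CPU time in %sys versus roughly 13% on Server A.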
[Diagram: the CPU sends a virtual address to the MMU, which checks the TLB. On a hit, the physical address goes straight to main memory; on a miss, the MMU walks the page table in main memory.]
Linux KPTI patches for Meltdown flush the Translation Lookaside Buffer
Server A: TLB miss walks 3.5% of cycles
Server B: TLB miss walks 19.2% of cycles (about 16 percentage points higher)
serverA# ./tlbstat 1
K_CYCLES K_INSTR IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC K_ITLBCYC DTLB% ITLB%
95913667 99982399 1.04 86588626 115441706 1507279 1837217 1.57 1.92
95810170 99951362 1.04 86281319 115306404 1507472 1842313 1.57 1.92
95844079 100066236 1.04 86564448 115555259 1511158 1845661 1.58 1.93
95978588 100029077 1.04 86187531 115292395 1508524 1845525 1.57 1.92
[...]
serverB# ./tlbstat 1
K_CYCLES K_INSTR IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC K_ITLBCYC DTLB% ITLB%
95911236 80317867 0.84 911337888 719553692 10476524 7858141 10.92 8.19
95927861 80503355 0.84 913726197 721751988 10518488 7918261 10.96 8.25
95955825 80533254 0.84 912994135 721492911 10524675 7929216 10.97 8.26
96067221 80443770 0.84 912009660 720027006 10501926 7911546 10.93 8.24
[...]
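The derived columns in the tlbstat output can be reproduced from the raw counters. The sketch below assumes that IPC is K_INSTR / K_CYCLES and that DTLB% and ITLB% are walk cycles as a share of total cycles; that reading is consistent with the numbers shown, but it is an inference from the output rather than the tool's documented definition.

```python
def tlb_summary(k_cycles, k_instr, k_dtlbcyc, k_itlbcyc):
    """Derive IPC and TLB-walk percentages from one row of raw counters."""
    ipc = k_instr / k_cycles
    dtlb_pct = 100.0 * k_dtlbcyc / k_cycles   # data TLB walk cycles
    itlb_pct = 100.0 * k_itlbcyc / k_cycles   # instruction TLB walk cycles
    return ipc, dtlb_pct, itlb_pct, dtlb_pct + itlb_pct


# First row of each server's output above.
server_a = tlb_summary(95913667, 99982399, 1507279, 1837217)
server_b = tlb_summary(95911236, 80317867, 10476524, 7858141)

for name, (ipc, dtlb, itlb, total) in (("A", server_a), ("B", server_b)):
    print(f"Server {name}: IPC {ipc:.2f}  DTLB {dtlb:.2f}%  ITLB {itlb:.2f}%  total {total:.1f}%")
```

Summing the two percentage columns gives the headline "TLB miss walks" figure for each server; the exact total varies slightly from row to row.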
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
Enhanced BPF (also known as just "BPF"), Linux 4.*
Kernel runtime: BPF verifier, BPF actions
Event targets: kprobes, uprobes, tracepoints, sockets, perf_events
User-defined BPF programs: SDN Configuration, DDoS Mitigation, Intrusion Detection, Container Security, Observability, Firewalls (bpfilter), Device Drivers, …
eBPF is solving new things: off-CPU + wakeup analysis
eBPF bcc Linux 4.4+
https://github.com/iovisor/bcc
e.g., identify multimodal disk I/O latency and outliers with bcc/eBPF biolatency
# biolatency -mT 10
Tracing block device I/O... Hit Ctrl-C to end.
19:19:04
msecs : count distribution
0 -> 1 : 238 |********* |
2 -> 3 : 424 |***************** |
4 -> 7 : 834 |********************************* |
8 -> 15 : 506 |******************** |
16 -> 31 : 986 |****************************************|
32 -> 63 : 97 |*** |
64 -> 127 : 7 | |
128 -> 255 : 27 |* |
19:19:14
msecs : count distribution
0 -> 1 : 427 |******************* |
2 -> 3 : 424 |****************** |
[…]
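The rows in this output are power-of-two buckets, which is why a bimodal distribution (peaks at 4-7 ms and 16-31 ms) stands out. A minimal Python sketch of that bucketing follows; it illustrates the idea only, it is not bcc's implementation, and the sample latencies are made up.

```python
from collections import Counter


def log2_bucket(value):
    """Return the (low, high) power-of-two range that contains `value`."""
    if value < 2:
        return (0, 1)
    k = value.bit_length() - 1          # floor(log2(value))
    return (1 << k, (1 << (k + 1)) - 1)


def histogram(latencies_ms):
    """Count how many latencies fall in each power-of-two bucket."""
    return Counter(log2_bucket(v) for v in latencies_ms)


sample = [0, 1, 3, 5, 6, 20, 21, 130]   # hypothetical latencies in ms
for (low, high), count in sorted(histogram(sample).items()):
    print(f"{low:>6} -> {high:<6} : {count}")
```

Bucketing by powers of two keeps the table short while still covering several orders of magnitude, at the cost of coarse resolution within each bucket.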
bcc/eBPF programs are laborious: biolatency
# (excerpt: imports, argument parsing, and the debug/countdown setup are omitted)
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

typedef struct disk_key {
    char disk[DISK_NAME_LEN];
    u64 slot;
} disk_key_t;
BPF_HASH(start, struct request *);
STORAGE

// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

// output
int trace_req_completion(struct pt_regs *ctx, struct request *req)
{
    u64 *tsp, delta;

    // fetch timestamp and calculate delta
    tsp = start.lookup(&req);
    if (tsp == 0) {
        return 0;   // missed issue
    }
    delta = bpf_ktime_get_ns() - *tsp;
    FACTOR

    // store as histogram
    STORE

    start.delete(&req);
    return 0;
}
"""

# code substitutions
if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
    label = "msecs"
else:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
    label = "usecs"
if args.disks:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, disk_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'disk_key_t key = {.slot = bpf_log2l(delta)}; ' +
        'void *__tmp = (void *)req->rq_disk->disk_name; ' +
        'bpf_probe_read(&key.disk, sizeof(key.disk), __tmp); ' +
        'dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')
if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# load BPF program
b = BPF(text=bpf_text)
if args.queued:
    b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
else:
    b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
    b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
    fn_name="trace_req_completion")

print("Tracing block device I/O... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_log2_hist(label, "disk")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()
… rewritten in bpftrace (launched Oct 2018)!
#!/usr/local/bin/bpftrace

BEGIN
{
    printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_completion
/@start[arg0]/
{
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}
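Both the bcc and bpftrace versions follow the same pattern: record a timestamp when a request is issued, keyed by the request, then compute the delta on completion and delete the key. A plain-Python simulation of that pattern, with a made-up event list standing in for the kprobe firings:

```python
def latencies_us(events):
    """Pair start/done events by request id and return latencies in microseconds.

    `events` is an iterable of (kind, request_id, timestamp_ns) tuples;
    this shape is an assumption made for the illustration.
    """
    start = {}                      # request id -> issue timestamp (ns)
    results = []
    for kind, req, ts in events:
        if kind == "start":
            start[req] = ts
        elif kind == "done":
            began = start.pop(req, None)   # lookup and delete in one step
            if began is None:
                continue                   # missed the issue event; skip
            results.append((ts - began) // 1000)
    return results


events = [
    ("start", "a", 1_000),
    ("start", "b", 2_000),
    ("done", "a", 501_000),
    ("done", "c", 600_000),      # no matching start: ignored
    ("done", "b", 4_002_000),
]
print(latencies_us(events))
```

The guard for a missing start timestamp plays the same role as the `/@start[arg0]/` filter in the bpftrace script and the `tsp == 0` check in the bcc program.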
eBPF bpftrace (aka BPFtrace) Linux 4.9+
https://github.com/iovisor/bpftrace
# Syscall count by program
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Read size distribution by process
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'
# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'
# Trace kernel function
bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s", comm); }'
# Trace user-level function
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }'
…
Good for one-liners & short scripts; bcc is good for complex tools
bpftrace Internals
eBPF XDP
https://www.netronome.com/blog/frnog-30-faster-networking-la-francaise/
Linux 4.8+
eBPF bpfilter
https://lwn.net/Articles/747551/
Linux 4.18+
ipfwadm (1.2.1)
ipchains (2.2.10)
iptables
nftables (3.13)
bpfilter (4.18+)
jit-compiled
NIC offloading
BBR
TCP congestion control algorithm
Bottleneck Bandwidth and RTT
1% packet loss: we see 3x better throughput
Linux 4.9
https://twitter.com/amernetflix/status/892787364598132736
https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/
https://queue.acm.org/detail.cfm?id=3022184
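The two quantities in BBR's name, bottleneck bandwidth and round-trip time, multiply to give the bandwidth-delay product (BDP): the amount of data that fills the path without building a queue. The sketch below shows only that product; it is not the BBR algorithm, and the link figures are hypothetical.

```python
def bdp_bytes(bottleneck_bw_bits_per_sec: float, min_rtt_sec: float) -> float:
    """Bandwidth-delay product in bytes: bandwidth (bits/s) * RTT (s) / 8."""
    return bottleneck_bw_bits_per_sec * min_rtt_sec / 8.0


# Hypothetical path: 100 Mbit/s bottleneck, 40 ms minimum round-trip time.
inflight = bdp_bytes(100e6, 0.040)
print(f"About {inflight / 1000:.0f} kB in flight fills this path")
```

A loss-based algorithm backs off whenever packets are dropped, while a model built on these two measurements does not need to treat each loss as a congestion signal, which is in keeping with the better throughput reported above at 1% packet loss.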
Kyber
Multiqueue block I/O scheduler
Tune target read & write latency
Up to 300x lower 99th percentile latencies in our testing
Linux 4.12
[Diagram: Kyber (simplified). Reads (sync) and writes (async) each pass through their own queue to dispatch; completions feed back to adjust the queue sizes.]
https://lwn.net/Articles/720675/
Hist Triggers
Linux 4.17
https://www.kernel.org/doc/html/latest/trace/histogram.html
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
[…]
{ stacktrace:
         __kmalloc+0x11b/0x1b0
         seq_buf_alloc+0x1b/0x50
         seq_read+0x2cc/0x370
         proc_reg_read+0x3d/0x80
         __vfs_read+0x28/0xe0
         vfs_read+0x86/0x140
         SyS_read+0x46/0xb0
         system_call_fastpath+0x12/0x6a
} hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768
ftrace advanced summaries
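Conceptually, a hist trigger groups events by a key, sums the chosen value fields, counts hits, and sorts the result, all inside the kernel. A user-space Python sketch of that aggregation follows; the event records and the `call_site` key are invented for illustration, and the code is not how ftrace implements it.

```python
from collections import defaultdict


def hist_aggregate(events, key, vals, sort):
    """Group `events` by `key`, sum each field in `vals`, and sort by `sort`."""
    table = defaultdict(lambda: {"hitcount": 0, **{v: 0 for v in vals}})
    for event in events:
        row = table[event[key]]
        row["hitcount"] += 1
        for v in vals:
            row[v] += event[v]
    # Ascending order here, so the largest totals come last.
    return sorted(table.items(), key=lambda item: item[1][sort])


# Hypothetical kmalloc events keyed by a simplified call site.
events = [
    {"call_site": "seq_read", "bytes_req": 4096, "bytes_alloc": 4096},
    {"call_site": "seq_read", "bytes_req": 4096, "bytes_alloc": 4096},
    {"call_site": "sock_alloc", "bytes_req": 100, "bytes_alloc": 128},
]
rows = hist_aggregate(events, "call_site", ["bytes_req", "bytes_alloc"], "bytes_alloc")
for site, row in rows:
    print(site, row)
```

The arguments mirror the `keys=`, `vals=`, and `sort=` fields of the trigger shown above. Doing the summary in kernel context avoids copying every event to user space.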
PSI
Pressure Stall Information
More saturation metrics!
Linux 4.? (not merged yet)
https://lwn.net/Articles/759781/
[Diagram: The USE Method. For each resource, check Utilization (%), Saturation, and Errors.]
/proc/pressure/cpu
/proc/pressure/memory
/proc/pressure/io
10-, 60-, and 300-second averages
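Each pressure file reports those three averages per line. The parser below is a sketch that assumes the `some avg10=… avg60=… avg300=… total=…` line format; the slide does not show the format, and since the feature was not yet merged at the time of the talk, the layout is an assumption and the sample text is made up.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} dictionaries."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            name, value = field.split("=")
            # Averages are percentages; total is a cumulative counter.
            stats[name] = int(value) if name == "total" else float(value)
        result[kind] = stats
    return result


# Made-up sample in the assumed format.
sample = (
    "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=789\n"
)
pressure = parse_psi(sample)
print(pressure["some"]["avg10"])
```

On a kernel that provides the interface, the same function could be fed the contents of `/proc/pressure/memory` or `/proc/pressure/io`.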
More perf 4.4 - 4.19 (2016 - 2018)
● TCP listener lockless (4.4)
● copy_file_range() (4.5)
● madvise() MADV_FREE (4.5)
● epoll multithread scalability (4.5)
● Kernel Connection Multiplexor (4.6)
● Writeback management (4.10)
● Hybrid block polling (4.10)
● BFQ I/O scheduler (4.12)
● Async I/O improvements (4.13)
● In-kernel TLS acceleration (4.13)
● Socket MSG_ZEROCOPY (4.14)
● Asynchronous buffered I/O (4.14)
● Longer-lived TLB entries with PCID (4.14)
● mmap MAP_SYNC (4.15)
● Software-interrupt context hrtimers (4.16)
● Idle loop tick efficiency (4.17)
● perf_event_open() [ku]probes (4.17)
● AF_XDP sockets (4.18)
● Block I/O latency controller (4.19)
● CAKE for bufferbloat (4.19)
● New async I/O polling (4.19)
… and many minor improvements to:
• perf
• CPU scheduling
• futexes
• NUMA
• Huge pages
• Slab allocation
• TCP, UDP
• Drivers
• Processor support
• GPUs
Take Aways
1. Run latest
2. Browse major features, e.g., https://kernelnewbies.org/Linux_4.19
Some Linux perf Resources
- http://www.brendangregg.com/linuxperf.html
- https://kernelnewbies.org/LinuxChanges
- https://lwn.net/Kernel
- https://github.com/iovisor/bcc
- http://blog.stgolabs.net/search/label/linux
- http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

ATO Linux Performance 2018
