A Practical Deep Dive into io_uring

04290629

License: CC 4.0 BY-SA

Category: Linux. Tags: linux, backend

Article link: https://blog.csdn.net/weixin_45817413/article/details/137697118

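This article discusses the limitations of traditional blocking I/O on Linux, the introduction of libaio and its shortcomings, and then focuses on io_uring, a natively asynchronous I/O interface that improves performance through shared memory and ring queues. It also covers how io_uring is implemented and gives a simple usage example.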

The Evolution of Linux I/O

fd-based blocking I/O

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);

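Blocking system calls: when a program calls these functions, it goes to sleep and is scheduled out until the I/O operation completes. As storage devices get faster and programs get more complex, blocking I/O struggles to meet performance requirements.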
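
To make the blocking behavior concrete, here is a minimal sketch of a helper built on read(). The name read_full is hypothetical and not part of any library; it simply loops until the requested number of bytes has arrived or the file ends, and the calling thread can do nothing else in the meantime.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `count` bytes from `fd`, blocking until
 * they arrive, EOF is reached, or an error occurs.
 * Returns the number of bytes read, or -1 on error (errno is set).
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        /* The calling thread sleeps inside read() until data is ready. */
        ssize_t n = read(fd, (char *) buf + done, count - done);

        if (n == 0)
            break;              /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        done += (size_t) n;
    }
    return (ssize_t) done;
}
```

A caller that wants to overlap several such reads has to use extra threads, which is the cost the asynchronous interfaces below try to avoid.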

libaio: Linux kernel native async I/O

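The Linux 2.6 kernel introduced native asynchronous I/O, which user space reaches through libaio:

The user submits I/O requests with io_submit(),
then calls io_getevents() a little later to check which events are ready.
This lets the user write asynchronous code.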
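
The submit-then-collect flow can be sketched as follows. This is an illustrative example, not code from the original article: to avoid a dependency on the libaio library it invokes the underlying system calls directly through syscall(), and the function name aio_read_start is hypothetical. It opens the file without O_DIRECT for simplicity, in which case the kernel may service the request synchronously inside io_submit().

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the first `len` bytes of `path` with Linux
 * native AIO. Returns the byte count, or a negative errno value.
 */
static long aio_read_start(const char *path, void *buf, size_t len)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *batch[1] = { &cb };
    struct io_event ev;
    long res;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -errno;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) {   /* create an AIO context */
        res = -errno;
        close(fd);
        return res;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = (uint32_t) fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t) (uintptr_t) buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    /* One system call to submit, a second one to collect the result. */
    if (syscall(SYS_io_submit, ctx, 1L, batch) != 1)
        res = -errno;
    else if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1)
        res = -errno;
    else
        res = (long) ev.res;    /* bytes read, or a negative errno */

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return res;
}
```

Note that every request costs at least two system calls here, which is the overhead discussed in the list of drawbacks.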

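Drawbacks of libaio:

High system call overhead: io_submit() and io_getevents() are both system calls, and every system call requires a context switch. At high IOPS, these context switches consume a large amount of CPU time.
Direct I/O only: native AIO can only be used with the O_DIRECT flag, so requests cannot go through the file system's page cache. In practice this makes it suitable mainly for applications that manage their own caching, such as database systems.
Size and alignment restrictions on data: the size of every write must be a multiple of the file system block size (typically 4 KB), and the buffer must be aligned to the memory page size.
Poor extensibility: the interface was not designed with extensibility in mind.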

io_uring

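Asynchronous by design. The application only places requests into a queue and does not have to wait; once a request completes, its result appears in the result queue.
Supports multiple types of I/O: cached files, direct-access files, and so on.
Flexible and extensible: the design is general enough that Linux system calls could, in principle, be reimplemented on top of io_uring.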

Design

The application and the kernel communicate through shared memory:

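io_uring creates three main shared memory regions:

Submission queue (SQ): a ring buffer stored in one contiguous block of memory. It holds the I/O operations to be executed, in the form of indexes into the submission queue entry array.
Completion queue (CQ): a ring buffer stored in one contiguous block of memory. It holds the results returned after I/O operations complete.
Submission queue entry (SQE) array: the array of entries that the submission queue indexes into; each entry describes one I/O operation.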
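
As an illustration of how these three regions become visible to the application, the sketch below calls io_uring_setup() directly and maps each region with mmap(), using the offsets defined in <linux/io_uring.h>. The struct ring_maps type and the function names are hypothetical helpers for this example; liburing's io_uring_queue_init() performs equivalent work internally. On kernels or sandboxes without io_uring support, the setup call fails and the function returns a negative errno value.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical bookkeeping type for this sketch. */
struct ring_maps {
    int fd;
    void *sq_ring, *cq_ring, *sqes;
    size_t sq_len, cq_len, sqes_len;
};

static void *map_region(int fd, size_t len, off_t off, int *err)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, fd, off);
    if (p == MAP_FAILED && *err == 0)
        *err = errno;           /* remember the first failure */
    return p;
}

static void ring_maps_destroy(struct ring_maps *m)
{
    if (m->sq_ring != MAP_FAILED) munmap(m->sq_ring, m->sq_len);
    if (m->cq_ring != MAP_FAILED) munmap(m->cq_ring, m->cq_len);
    if (m->sqes != MAP_FAILED)    munmap(m->sqes, m->sqes_len);
    close(m->fd);
}

/* Returns 0 on success or a negative errno value. */
static int ring_maps_create(unsigned entries, struct ring_maps *m)
{
    struct io_uring_params p;
    int err = 0;

    memset(&p, 0, sizeof(p));
    m->fd = (int) syscall(__NR_io_uring_setup, entries, &p);
    if (m->fd < 0)
        return -errno;

    /* The kernel reports where each ring's fields live via sq_off/cq_off. */
    m->sq_len = p.sq_off.array + p.sq_entries * sizeof(__u32);
    m->cq_len = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    m->sqes_len = p.sq_entries * sizeof(struct io_uring_sqe);

    m->sq_ring = map_region(m->fd, m->sq_len, IORING_OFF_SQ_RING, &err);
    m->cq_ring = map_region(m->fd, m->cq_len, IORING_OFF_CQ_RING, &err);
    m->sqes = map_region(m->fd, m->sqes_len, IORING_OFF_SQES, &err);
    if (err) {
        ring_maps_destroy(m);
        return -err;
    }
    return 0;
}
```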

Submission queue (SQ)

struct io_uring_sq {
    unsigned *khead;    // queue head
    unsigned *ktail;    // queue tail
    // Deprecated: use `ring_mask` instead of `*kring_mask`
    unsigned *kring_mask;
    // Deprecated: use `ring_entries` instead of `*kring_entries`
    unsigned *kring_entries;
    unsigned *kflags;
    unsigned *kdropped;
    unsigned *array;
    struct io_uring_sqe *sqes;  // pointer to the SQE array

    unsigned sqe_head;
    unsigned sqe_tail;

    size_t ring_sz;
    void *ring_ptr;

    unsigned ring_mask;
    unsigned ring_entries;

    unsigned pad[2];
};

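The application places I/O operations directly into the ring buffer of the io_sq_ring structure. In kernel polling mode this requires no system call, which avoids context switches. A kernel thread takes the pending I/O operations from the io_sq_ring ring buffer and issues the I/O requests.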

Submission queue entry (SQE)

/*
 * IO submission data structure (Submission Queue Entry)
 */
struct io_uring_sqe {
    __u8    opcode;     /* type of operation for this sqe */
    __u8    flags;      /* IOSQE_ flags */
    __u16   ioprio;     /* ioprio for the request */
    __s32   fd;     /* file descriptor to do IO on */
    union {
        __u64   off;    /* offset into file */
        __u64   addr2;
        struct {
            __u32   cmd_op;
            __u32   __pad1;
        };
    };
    union {
        __u64   addr;   /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;        /* buffer size or number of iovecs */
    ...
};

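When the user creates an io_uring instance with the io_uring_setup() system call, the kernel allocates an array of io_uring_sqe structures.
To submit an I/O operation, the application first takes a free io_uring_sqe from the submission queue entry array, fills it in (the I/O opcode, the file descriptor to operate on, and so on), and then writes that entry's index in the array into the submission queue.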
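
The fill-then-publish sequence can be modeled in plain user-space C. The sketch below is a simplified, single-threaded model and not the kernel's data layout: the model_sqe and model_sq types and both functions are hypothetical, the entry has only a few fields, and the memory barriers a real producer and consumer need are reduced to comments.

```c
#include <stddef.h>

#define MODEL_SQ_ENTRIES 4u                 /* must be a power of two */
#define MODEL_SQ_MASK (MODEL_SQ_ENTRIES - 1)

/* Hypothetical, cut-down stand-in for struct io_uring_sqe. */
struct model_sqe {
    unsigned char opcode;
    int fd;
    unsigned len;
    unsigned long long user_data;
};

struct model_sq {
    unsigned head;                          /* advanced by the consumer */
    unsigned tail;                          /* advanced by the producer */
    unsigned array[MODEL_SQ_ENTRIES];       /* ring of indexes into sqes[] */
    struct model_sqe sqes[MODEL_SQ_ENTRIES];
};

/* Producer side. Returns 0 on success, -1 if the ring is full. */
static int model_sq_push(struct model_sq *sq, const struct model_sqe *req)
{
    if (sq->tail - sq->head == MODEL_SQ_ENTRIES)
        return -1;

    /* Simplification: the SQE index is the same as the ring slot. */
    unsigned slot = sq->tail & MODEL_SQ_MASK;

    sq->sqes[slot] = *req;      /* 1. fill in a free entry */
    sq->array[slot] = slot;     /* 2. write its index into the ring */
    sq->tail++;                 /* 3. publish (a store-release in real code) */
    return 0;
}

/* Consumer side. Returns the next entry, or NULL if the ring is empty. */
static const struct model_sqe *model_sq_pop(struct model_sq *sq)
{
    if (sq->head == sq->tail)
        return NULL;

    const struct model_sqe *e = &sq->sqes[sq->array[sq->head & MODEL_SQ_MASK]];
    sq->head++;
    return e;
}
```

Because head and tail only ever increase and are masked on use, the difference tail - head is the number of queued entries even after the counters wrap around.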

Completion queue (CQ)

When the kernel completes an I/O operation, it saves the result of the operation in the completion queue.

struct io_uring_cq {
    unsigned *khead;
    unsigned *ktail;
    // Deprecated: use `ring_mask` instead of `*kring_mask`
    unsigned *kring_mask;
    // Deprecated: use `ring_entries` instead of `*kring_entries`
    unsigned *kring_entries;
    unsigned *kflags;
    unsigned *koverflow;
    struct io_uring_cqe *cqes;

    size_t ring_sz;
    void *ring_ptr;

    unsigned ring_mask;
    unsigned ring_entries;

    unsigned pad[2];
};
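
The way an application drains the completion queue can be modeled in the same spirit. The following is a simplified, single-threaded sketch with hypothetical names (model_cqe, model_cq, model_cq_post, model_cq_reap); it only illustrates the head/tail arithmetic and omits the memory barriers that real code needs.

```c
#define MODEL_CQ_ENTRIES 8u                 /* must be a power of two */
#define MODEL_CQ_MASK (MODEL_CQ_ENTRIES - 1)

/* Hypothetical, cut-down stand-in for struct io_uring_cqe. */
struct model_cqe {
    unsigned long long user_data;   /* copied from the matching request */
    int res;                        /* result: byte count or negative errno */
};

struct model_cq {
    unsigned head;                  /* advanced by the application */
    unsigned tail;                  /* advanced by the kernel */
    struct model_cqe cqes[MODEL_CQ_ENTRIES];
};

/* Kernel side of the model: post one result. Returns -1 if the ring is full. */
static int model_cq_post(struct model_cq *cq, unsigned long long user_data, int res)
{
    if (cq->tail - cq->head == MODEL_CQ_ENTRIES)
        return -1;
    cq->cqes[cq->tail & MODEL_CQ_MASK].user_data = user_data;
    cq->cqes[cq->tail & MODEL_CQ_MASK].res = res;
    cq->tail++;
    return 0;
}

/* Application side: copy out up to `max` results and release their slots. */
static unsigned model_cq_reap(struct model_cq *cq, struct model_cqe *out, unsigned max)
{
    unsigned n = 0;

    while (cq->head != cq->tail && n < max) {
        out[n++] = cq->cqes[cq->head & MODEL_CQ_MASK];
        cq->head++;             /* slot may now be reused by the producer */
    }
    return n;
}
```

Unlike the submission side, completion entries are stored in the ring itself rather than through an index array, which matches the cqes pointer in the structure above.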

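SQ thread

In kernel polling mode (enabled with the IORING_SETUP_SQPOLL flag), the kernel creates a kernel thread named io_uring-sq (the SQ thread). This thread continuously reads I/O operations from the submission queue and issues the I/O requests.

When an I/O request completes, the SQ thread writes the result of the operation into the completion queue, and the application can then read the result from the completion queue.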
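
A minimal sketch of how kernel polling mode might be requested through the raw system call interface follows. The helper names are hypothetical. The second function reflects one detail the text above does not mention: after sq_thread_idle milliseconds without work the SQ thread goes to sleep and sets IORING_SQ_NEED_WAKEUP in the ring's flags word, and the application must then call io_uring_enter() with IORING_ENTER_SQ_WAKEUP. Depending on kernel version and privileges, creating an SQPOLL ring may fail, in which case a negative errno value is returned.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a ring whose submission queue is drained by
 * a kernel thread. Returns the ring fd, or a negative errno value.
 */
static int setup_sqpoll_ring(unsigned entries, unsigned idle_ms,
                             struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_SQPOLL;
    p->sq_thread_idle = idle_ms;    /* idle time before the thread sleeps */

    int fd = (int) syscall(__NR_io_uring_setup, entries, p);
    return fd < 0 ? -errno : fd;
}

/*
 * Hypothetical helper: `sq_flags` points at the flags word of the mapped
 * SQ ring (located at sq_off.flags). Returns 0 if the SQ thread is still
 * running, 1 if a wakeup was sent, or a negative errno value.
 */
static int wake_sq_thread_if_needed(int ring_fd, const unsigned *sq_flags)
{
    if (!(__atomic_load_n(sq_flags, __ATOMIC_ACQUIRE) & IORING_SQ_NEED_WAKEUP))
        return 0;               /* no system call on the fast path */

    long r = syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                     IORING_ENTER_SQ_WAKEUP, NULL, 0);
    return r < 0 ? -errno : 1;
}
```

As long as the SQ thread stays awake, submission consists only of writes to shared memory, which is where the savings in system calls come from.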

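Summary of steps

The basic io_uring workflow in kernel polling mode:

Step 1: The application submits an I/O operation to the io_uring submission queue.
Step 2: The SQ kernel thread reads the I/O operation from the submission queue.
Step 3: The SQ kernel thread issues the I/O request.
Step 4: When the I/O request completes, the SQ kernel thread writes its result into the io_uring completion queue.
Step 5: The application reads the result of the I/O operation from the completion queue.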

Demo

/* SPDX-License-Identifier: MIT */
/*
 * Simple app that demonstrates how to setup an io_uring interface,
 * submit and complete IO against it, and then tear it down.
 *
 * gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring
 */
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include "liburing.h"

#define QD  4

int main(int argc, char *argv[])
{
    struct io_uring ring;
    int i, fd, ret, pending, done;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec *iovecs;
    struct stat sb;
    ssize_t fsize;
    off_t offset;
    void *buf;

    if (argc < 2) {
        printf("%s: file\n", argv[0]);
        return 1;
    }

    // 1. Initialize an io_uring instance
    ret = io_uring_queue_init(QD, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // 2. Open the file with O_DIRECT. I/O polling mode (IORING_SETUP_IOPOLL)
    //    requires it; this demo passes flags=0 above, so it is optional here.
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fstat(fd, &sb) < 0) {
        perror("fstat");
        return 1;
    }

    printf("file size=%lld\n", (long long) sb.st_size);

    // 3. Allocate 4 page-aligned read buffers
    fsize = 0;
    iovecs = calloc(QD, sizeof(struct iovec));
    for (i = 0; i < QD; i++) {
        if (posix_memalign(&buf, 4096, 4096))
            return 1;
        iovecs[i].iov_base = buf;
        iovecs[i].iov_len = 4096;
        fsize += 4096;
    }

    // 4. Prepare up to 4 SQE read requests, each reading into one iovec
    offset = 0;
    i = 0;
    do {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe)
            break;
        
        printf("prepare sqe %d\n",i);

        // The data read later will be stored in iovecs[i]
        io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset);
        offset += iovecs[i].iov_len;
        i++;
        if (offset > sb.st_size)
            break;
    } while (1);

    // 5. Submit the SQE read requests
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    } else if (ret != i) {
        fprintf(stderr, "io_uring_submit submitted less %d\n", ret);
        return 1;
    }

    // 6. Wait for the read requests to complete (CQEs)
    done = 0;
    pending = ret;
    fsize = 0;

    printf("pending=%d\n",pending);

    for (i = 0; i < pending; i++) {
        ret = io_uring_wait_cqe(&ring, &cqe); // wait for one read-completion event
        if (ret < 0) {
            fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
            return 1;
        }

        done++;
        ret = 0;
        if (cqe->res != 4096 && cqe->res + fsize != sb.st_size) {
            fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
            ret = 1;
        }
        fsize += cqe->res;

        printf("iteration %d\n",i);
        printf("ret=%d\tcqe->res=%d\n",ret,cqe->res);
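        // Completions can arrive in any order, so iovecs[i] is only a guess at
        // the matching buffer; the data is not NUL-terminated, so bound the print.
        if (cqe->res > 0)
            printf("%.*s\n", cqe->res, (char *) iovecs[i].iov_base);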

        io_uring_cqe_seen(&ring, cqe); // mark the CQE as consumed so its slot can be reused
        if (ret)
            break;
    }

    printf("Submitted=%d, completed=%d, bytes=%lu\n", pending, done,
                        (unsigned long) fsize);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}