io_uring gets described as “async I/O done right”, which is true but unhelpful. The model that made it click for me: it’s two shared ring buffers between you and the kernel.

  • SQ (submission queue): you write descriptions of work you want done.
  • CQ (completion queue): the kernel writes results back.

Because both rings live in memory shared with the kernel, you can submit and reap work without a syscall per operation. That’s the whole point: the per-operation syscall overhead that caps throughput on read/write-heavy workloads largely disappears.
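To make the ring model concrete, here is a toy single-producer/single-consumer ring in plain C. It is not the kernel's layout: struct toy_ring, its field names, and the 8-slot size are all made up for illustration. What it shares with io_uring is the idea that the producer only advances the tail, the consumer only advances the head, and each side learns about the other's progress by reading memory rather than making a call.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_ENTRIES 8u                     /* must be a power of two */
#define TOY_RING_MASK (TOY_RING_ENTRIES - 1u)

/* Hypothetical layout for illustration; the real SQ/CQ structs differ. */
struct toy_ring {
    _Atomic uint32_t head;                      /* advanced only by the consumer */
    _Atomic uint32_t tail;                      /* advanced only by the producer */
    uint64_t slots[TOY_RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == TOY_RING_ENTRIES)
        return false;
    r->slots[tail & TOY_RING_MASK] = value;
    /* Release: the slot write must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & TOY_RING_MASK];
    /* Release: finish reading the slot before handing it back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In io_uring you are the producer on the SQ and the consumer on the CQ; the kernel plays the opposite role on each.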

The minimal loop

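#include <liburing.h>

// Assumes fd, buf, len and offset are already set up by the caller.
struct io_uring ring;
int ret = io_uring_queue_init(256, &ring, 0);         // 0 on success, -errno on failure

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   // NULL when the SQ is full
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_submit(&ring);                 // one syscall, however many SQEs are queued

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);         // blocks until a completion arrives
// cqe->res is the byte count (or -errno)
io_uring_cqe_seen(&ring, cqe);          // mark the CQE consumed

io_uring_queue_exit(&ring);             // tear the ring down when finished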
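
The comment about cqe->res deserves a closer look, because a positive result can still be smaller than what you asked for. Here is a sketch of how one might classify it. The enum and function names are hypothetical, not part of liburing.

```c
#include <stddef.h>

/* Hypothetical names for illustration; liburing defines none of these. */
enum read_outcome {
    READ_FAILED,    /* res < 0: res is -errno */
    READ_EOF,       /* res == 0: nothing left to read */
    READ_SHORT,     /* 0 < res < requested: resubmit for the remainder */
    READ_COMPLETE   /* res >= requested */
};

/* res is the value copied from cqe->res; requested is the len that was submitted. */
enum read_outcome classify_read(int res, size_t requested)
{
    if (res < 0)
        return READ_FAILED;
    if (res == 0)
        return READ_EOF;
    if ((size_t)res < requested)
        return READ_SHORT;
    return READ_COMPLETE;
}
```

Copy res out before calling io_uring_cqe_seen, since the slot can be reused afterwards. On READ_SHORT, advance buf and offset by res and queue another read for the remaining bytes.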

Where it shines (and doesn’t)

It’s a real win for many small, independent operations — think a database flushing thousands of pages, or a proxy fanning out connections. With registered buffers and IORING_SETUP_SQPOLL you can get to near-zero syscalls in the hot path.

It’s not a magic speedup for a single sequential stream — there you were already bandwidth-bound, not syscall-bound. Measure first. I’ve watched people bolt on io_uring and get nothing because their bottleneck was the disk, not the interface to it.
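
A back-of-the-envelope check can tell the two cases apart before you reach for a profiler. The function below is a deliberately crude model (one syscall per operation, fixed cost per syscall), and the 1 µs cost and workload figures in the note after it are assumed round numbers, not measurements. Substitute your own.

```c
/* Fraction of one core's time spent on per-operation syscall overhead,
 * assuming one syscall per operation and a fixed cost per syscall. */
double syscall_overhead_fraction(double ops_per_sec, double syscall_cost_ns)
{
    return ops_per_sec * syscall_cost_ns / 1e9;
}
```

With an assumed 1000 ns per syscall, a sequential stream doing about 2,000 large reads per second gives 0.002, or 0.2% of a core: nothing worth optimizing. The same cost at 500,000 small reads per second gives 0.5, half a core spent just crossing into the kernel. That second case is where io_uring has room to help.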

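A caveat worth knowing: the feature surface moved fast across kernel versions, so an opcode or flag you rely on may be missing on an older kernel, and there have been enough security advisories that some operators disable io_uring outright. On shared hosts and in containers, check whether it’s even enabled before designing around it.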
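
One way to check at runtime is to try creating a tiny ring and look at the error. This sketch uses the raw io_uring_setup syscall so it has no liburing dependency. The function name is hypothetical, and it assumes kernel headers recent enough to ship linux/io_uring.h and define __NR_io_uring_setup.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a ring could be created, otherwise a
 * negative errno. Typical failures are -ENOSYS (kernel too old or built
 * without io_uring) and -EPERM (disabled by policy, e.g. sysctl or seccomp). */
int io_uring_probe_available(void)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    /* Ask for a 4-entry ring with no flags; we only care whether it works. */
    long fd = syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0)
        return -errno;

    close((int)fd);     /* the ring is not needed, only the answer */
    return 0;
}
```

On Linux 6.6 and later there is also a kernel.io_uring_disabled sysctl you can read directly. The probe additionally catches cases where a sandbox or container runtime blocks the syscall. A success here only says a ring can be created, not that every opcode you want is supported.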