Wine-NSPA – io_uring I/O Architecture

Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-15 | Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Design Principles
  3. I/O Architecture: Before and After
  4. Phase 1: Synchronous Poll Replacement
  5. Phase 2: Async File I/O Server Bypass
  6. Phase 3: Socket I/O Server Bypass
  7. File Manifest
  8. Status Summary

1. Overview

This document is the deep-dive companion to the Wine-NSPA Architecture Overview. It covers the design decisions, tradeoffs, and implementation details of io_uring integration across all three phases. For background on Wine’s I/O model and why io_uring matters for RT audio, see the architecture page.

The two bottlenecks targeted by this work:

  1. Syscall overhead. 4+ kernel transitions per async file read (register + epoll + alert + read). io_uring collapses this to 1 (io_uring_enter).
  2. Global lock contention. Every fd in server epoll extends global_lock hold time. Fewer server-monitored fds = shorter hold = less contention for shmem dispatchers.

Relationship to Existing NSPA Infrastructure

NSPA Component – io_uring Interaction

Shmem IPC (v1.5) – Orthogonal. Shmem handles the request/reply protocol; io_uring handles file/socket I/O. Different fd sets, no conflict.
PI global_lock – Indirect benefit. Fewer fds in server epoll = shorter main-loop iterations = shorter global_lock hold.
NTSync (/dev/ntsync) – Integrated (Phase 3). The ntsync uring_fd extension wakes threads blocked in ntsync waits when io_uring CQEs arrive. The pad field in ntsync_wait_args carries the io_uring eventfd; the kernel returns NTSYNC_INDEX_URING_READY on CQE. This lets sync socket waits drain CQEs inline. The ntsync PI v2 kernel fixes were committed independently.
CS-PI (FUTEX_LOCK_PI) – No conflict. io_uring operations happen client-side in ntdll and never acquire server locks.
RT scheduling (SCHED_FIFO/RR) – Compatible. COOP_TASKRUN ensures completions run in the submitting thread’s context, preserving RT priority.
Manual PI boost (v2.5) – No conflict. v2.5 (cached sched state) is shipped. The FUTEX_LOCK_PI redesign was attempted and REJECTED (SMP deadlocks – see the Status Summary).

Reference: rbernon’s Archived Attempt

Rémi Bernon attempted a full wineserver main-loop replacement with io_uring circa 2021-2022 (gitlab.winehq.org/rbernon/wine, branch archive/iouring). It was abandoned because io_uring was immature at the time – missing features, kernel bugs, API instability. That approach replaced the server’s epoll entirely (~500 LOC across server/fd.c, request.c, thread.c).

Wine-NSPA’s approach is fundamentally different: with the shmem fast path already handling request/reply IPC, the server main loop is no longer the bottleneck. Instead, we target the remaining server-dependent paths – file and socket I/O – from the client side, keeping changes isolated in a new io_uring.c file with minimal modifications to existing code.


2. Design Principles

Per-Thread Ring Architecture

[Diagram: Per-Thread io_uring Ring + Pool Allocator (TLS)]

Each thread owns, in TLS:

  thread_ring – SINGLE_ISSUER | COOP_TASKRUN, SQ: 32 entries, CQ: 64 entries
  op_pool[32] – static array of uring_async_op structs
  op_free_head – free list [0]->[1]->...->[31]->NULL; O(1) alloc (pop head), O(1) free (push head)
  ring_initialized / ring_init_failed flags
  ring_efd – eventfd (Phase 3)

ensure_ring() performs lazy init: op_pool_init() + eventfd() + IORING_REGISTER_EVENTFD.

Submission path: NtReadFile (async) → op_pool_alloc() (zero malloc) → dup(fd) for lifetime safety → io_uring_submit() (1 syscall) → kernel async I/O.

Completion path: server_wait() / server_select() entry point → process_completions() drains the CQ → complete_uring_op() completes the IOSB + event/IOCP (Phases 1-2/3) → op_pool_free() + close(dup_fd).
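The op_pool free list above is simple enough to sketch in full. The following is an illustrative reconstruction, not the actual Wine-NSPA code: the uring_async_op fields are assumed, and only the O(1) pop/push free-list mechanics follow the diagram.

```c
#include <stddef.h>

#define OP_POOL_SIZE 32

struct uring_async_op
{
    int dup_fd;                   /* dup()'d unix fd owned by this op (assumed field) */
    void *iosb;                   /* completion target (assumed field) */
    struct uring_async_op *next;  /* free-list link */
};

static __thread struct uring_async_op op_pool[OP_POOL_SIZE];
static __thread struct uring_async_op *op_free_head;
static __thread int op_pool_ready;

static void op_pool_init(void)
{
    int i;
    for (i = 0; i < OP_POOL_SIZE - 1; i++) op_pool[i].next = &op_pool[i + 1];
    op_pool[OP_POOL_SIZE - 1].next = NULL;
    op_free_head = &op_pool[0];
    op_pool_ready = 1;
}

static struct uring_async_op *op_pool_alloc(void)
{
    struct uring_async_op *op;
    if (!op_pool_ready) op_pool_init();
    op = op_free_head;               /* O(1): pop head */
    if (op) op_free_head = op->next;
    return op;                       /* NULL => pool exhausted, caller falls back */
}

static void op_pool_free(struct uring_async_op *op)
{
    op->next = op_free_head;         /* O(1): push head */
    op_free_head = op;
}
```

Because the pool is TLS and bounded, the submit path never takes a lock or calls malloc, which is what makes it RT-safe.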

3. I/O Architecture: Before and After

I/O Path Comparison Diagram

[Diagram: I/O path comparison]

Vanilla Wine (server-mediated):
  NtReadFile(handle, async=TRUE)
  → register_async_file_read()
  → SERVER: register_async round-trip
  → Server: epoll monitors fd (global_lock held during dispatch)
  → Server: async_wake_up(ALERTED)
  → async_read_proc() callback
  → SERVER: server_get_unix_fd (again)
  → read(fd, buf, len)
  → set IOSB, signal event, IOCP
  Cost per async read: 2 server round-trips (register + get_fd), 1 epoll monitoring cycle (global_lock), 1 read() syscall = 4+ kernel transitions

Wine-NSPA + io_uring:
  NtReadFile(handle, async=TRUE)
  → server_get_unix_fd() (cached, no trip)
  → dup(fd) → ring_fd (lifetime safe)
  → io_uring_prep_read(ring_fd, buf, len)
  → io_uring_submit(), return STATUS_PENDING
  ... thread does other work ...
  → server_wait() → drain CQ
  → CQE: result = bytes_read
  → complete_uring_op() → file_complete_async()
  → set IOSB, signal event, IOCP; close(ring_fd)
  Cost per async read: 0 server round-trips, 0 epoll monitoring (no global_lock), 1 io_uring_enter = 1 kernel transition

Vanilla Wine: Server-Mediated Async File I/O

Client Thread                              Wineserver
─────────────                              ──────────
NtReadFile(async)
  server_get_unix_fd()        ──→          get_handle_fd
  register_async()            ──→          register_async
                                           queue_async(&fd->read_q)
                                           set_fd_events(POLLIN)
                              ←──          epoll monitors fd
  return STATUS_PENDING
  ... (thread does other work)             main_loop_epoll():
                                             global_lock.lock()
                                             epoll_pwait2() → fd ready
                                             fd_poll_event →
                                               async_wake_up(STATUS_ALERTED)
                                             global_lock.unlock()
  ... (thread enters alertable wait)
  async_read_proc():
    server_get_unix_fd()      ──→          get_handle_fd (again)
    read(fd, buf, len)
    set IOSB, signal event

Syscalls per async read: 2 server round-trips (register + get_fd) + epoll_wait + read = 4+ kernel transitions

Wine-NSPA with io_uring: Client-Side Async File I/O

Client Thread                              Wineserver
─────────────                              ──────────
NtReadFile(async)
  server_get_unix_fd()        ──→          get_handle_fd
                                           (cached, usually no round-trip)
  dup(unix_fd) → ring_fd
  io_uring_prep_read(ring_fd, buf, len)
  io_uring_submit()                        (server never sees this I/O)
  return STATUS_PENDING
  ... (thread enters server_wait)
  ntdll_io_uring_process_completions():
    CQE ready → bytes_read
    file_complete_async()
    close(ring_fd)

Syscalls per async read: 1 io_uring_enter (submit+wait batched) = 1 kernel transition

The server is completely bypassed for the I/O monitoring and data transfer. It still handles the initial fd lookup (usually cached) and completion port notifications if needed.

Synchronous I/O: poll() Replacement

Before:                                After:
  poll(fd, POLLIN, timeout)              ntdll_io_uring_poll(fd, POLLIN, timeout)
  read(fd, buf, len)                     read(fd, buf, len)   ← unchanged

The read()/write() still goes through virtual_locked_read() for write-watch safety. Only the poll wait is replaced.


4. Phase 1: Synchronous Poll Replacement

Status: COMMITTED (2d903dda200)

What Changed

In NtReadFile and NtWriteFile, the synchronous blocking path does poll(fd, events, timeout) to wait for fd readiness, then loops back to read()/write(). Phase 1 replaces the poll() call with ntdll_io_uring_poll().

Integration Points

File Change LOC
dlls/ntdll/unix/file.c NtReadFile sync wait: poll() → ntdll_io_uring_poll() with fallback ~15
dlls/ntdll/unix/file.c NtWriteFile sync wait: same pattern ~15

Fallback

If ntdll_io_uring_poll() returns -ENOSYS (ring unavailable), the code immediately falls back to the original poll() call. Zero behavioral change for non-io_uring systems.

Why virtual_locked_read is Preserved

The actual read()/write() still uses virtual_locked_read() / virtual_locked_pread(), which handles EFAULT from write-watched pages by retrying inside virtual_mutex. io_uring only replaces the poll wait, not the data transfer.


5. Phase 2: Async File I/O Server Bypass

Status: COMMITTED (2d903dda200), pool allocator COMMITTED (82cea7143ff)

What Changed

When NtReadFile or NtWriteFile would register an async operation with the server (register_async / SERVER_START_REQ(register_async)), the io_uring path instead:

  1. dup()s the unix fd for lifetime safety
  2. Submits IORING_OP_READ or IORING_OP_WRITE to the per-thread ring
  3. Returns STATUS_PENDING
  4. The CQE is processed later at the next server_wait() or server_select() call

fd Lifetime Safety

The current async model re-fetches the unix fd via server_get_unix_fd() in every callback. io_uring requires the fd to remain valid for the duration of the in-flight SQE. Solution: dup() the fd before submission. The duplicate is owned exclusively by the uring_async_op struct and closed on CQE completion or cancellation.

This is analogous to the thread-map fd reference pattern used elsewhere in NSPA – hold a ref for the lifetime of the operation.

Completion Delivery

CQEs are drained cooperatively:

// server.c — called before any blocking wait
unsigned int server_select(…) {
    ntdll_io_uring_process_completions();  // drain CQ
    …
}
unsigned int server_wait(…) {
    ntdll_io_uring_process_completions();  // drain CQ
    …
}

When a CQE arrives, complete_uring_op() translates the result to NTSTATUS and calls file_complete_async() – the same function used by Wine’s normal async completion path. This handles setting the IOSB, signaling the associated event, and queuing any completion-port (IOCP) notification.

EFAULT Handling

If io_uring’s kernel read hits EFAULT (buffer in a write-watched page), the CQE result is -EFAULT. The completion handler detects this and frees the operation – the caller should retry through the server async path, which uses virtual_locked_read() with proper page fault handling. This is the graceful fallback for an edge case that rarely occurs in practice.
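The translation step is pure logic and can be sketched directly. The status constants below are stand-ins (the real code uses Wine's NTSTATUS values and a full errno mapping); sk_translate_cqe and the retry flag are invented names for illustration.

```c
#include <errno.h>

/* Illustrative status stand-ins, not Wine's actual NTSTATUS definitions */
#define SK_STATUS_SUCCESS      0x00000000u
#define SK_STATUS_END_OF_FILE  0xC0000011u
#define SK_STATUS_UNSUCCESSFUL 0xC0000001u

struct sk_completion
{
    unsigned int status;       /* NTSTATUS stand-in */
    unsigned int information;  /* bytes transferred */
    int retry_via_server;      /* 1: re-issue through the server async path */
};

static void sk_translate_cqe(int res, int was_read, struct sk_completion *out)
{
    out->retry_via_server = 0;
    if (res >= 0)
    {
        /* a 0-byte read at EOF maps to END_OF_FILE, matching the server path */
        out->status = (was_read && res == 0) ? SK_STATUS_END_OF_FILE
                                             : SK_STATUS_SUCCESS;
        out->information = (unsigned int)res;
        return;
    }
    out->information = 0;
    if (res == -EFAULT)
    {
        /* buffer in a write-watched page: free the op and retry via the
         * server path, which uses virtual_locked_read() */
        out->retry_via_server = 1;
        out->status = SK_STATUS_UNSUCCESSFUL;  /* never surfaced to the app */
        return;
    }
    /* other errors: the real code maps errno to a specific NTSTATUS */
    out->status = SK_STATUS_UNSUCCESSFUL;
}
```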

Integration Points

File Change LOC
dlls/ntdll/unix/file.c Async read: try io_uring before register_async_file_read() ~8
dlls/ntdll/unix/file.c Async write: try io_uring before server registration ~10
dlls/ntdll/unix/server.c Completion drain in server_select() and server_wait() +2
dlls/ntdll/unix/thread.c Ring cleanup in pthread_exit_wrapper() +1

6. Phase 3: Socket I/O Server Bypass

Status: COMMITTED — sync + overlapped WORKING (a645adc66ed)

The Challenge

Socket I/O (sock_recv / sock_send) is tightly coupled to the server’s async lifecycle.

Unlike file I/O (where the server’s only role is fd monitoring), the socket code has the server actively participating in the protocol state machine.

Implemented Approach: E2 Bitmap + ALERTED-State Interception

Option – Description – Benefit/Drawback – Verdict

B1: Server flag – Add a client_poll flag to recv/send; server skips epoll. – Clean, ~50 LOC – Evaluated, not used
B2: Both poll – Server and client both monitor; first to fire wins. – No global_lock benefit – Rejected
E2: Shared bitmap – Process-level bitmap; client sets a bit per fd; server checks it in sock_get_poll_events(). – No protocol change; generalizes. – IMPLEMENTED
C: Full bypass – Skip the server entirely for connected TCP. – Breaks the socket state machine – Rejected

How It Works

The key innovation is ALERTED-state interception: the client steps in inside the ALERTED block, before set_async_direct_result() is called:

Server: recv_socket → STATUS_ALERTED + wait_handle
Client: try_recv(fd) → EAGAIN (not ready)
                                              ← interception point
  BEFORE: set_async_direct_result(PENDING)    ← would restart async on server
  NOW:    set bitmap + io_uring POLL_ADD      ← async stays ALERTED (frozen)
          return STATUS_PENDING
  … io_uring monitors fd …
CQE fires:
  try_recv(fd) → SUCCESS (data available)
  set_async_direct_result(SUCCESS, bytes)     ← server accepts (ALERTED preserved)
  Server: completes async, signals event/IOCP
[Diagram: ALERTED-State Interception Flow (Phase 3) – same flow as the trace above. Old path: 2 server round-trips plus an epoll cycle under global_lock. New path: io_uring monitors the fd in kernel with no server involvement; when the CQE fires, the eventfd wakes the ntsync wait (sync) or the CQ is drained at the next server_wait (overlapped), leaving a single server call carrying the final result.]

Why this works: When an async is ALERTED on the server, terminated=1 and async_waiting() returns false. The server does not monitor the fd via epoll. The bitmap provides additional safety (sock_get_poll_events returns -1). Only one call to set_async_direct_result ever happens – from the CQE handler with the final result.
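The bitmap side of this can be sketched as simple bit operations. In the real implementation the bitmap lives in memory shared between client and server; a static array stands in here, the sk_* names are illustrative, and the server-side function is only shaped like sock_get_poll_events(), not a copy of it.

```c
#include <limits.h>

#define SK_BITMAP_FDS 4096  /* illustrative capacity */

/* One bit per unix fd, set by the client before submitting POLL_ADD. */
static unsigned long sk_fd_bitmap[SK_BITMAP_FDS / (sizeof(unsigned long) * CHAR_BIT)];

#define SK_WORD(fd) ((fd) / (sizeof(unsigned long) * CHAR_BIT))
#define SK_BIT(fd)  (1UL << ((fd) % (sizeof(unsigned long) * CHAR_BIT)))

static void sk_bitmap_set(int fd)   { sk_fd_bitmap[SK_WORD(fd)] |= SK_BIT(fd); }
static void sk_bitmap_clear(int fd) { sk_fd_bitmap[SK_WORD(fd)] &= ~SK_BIT(fd); }
static int  sk_bitmap_test(int fd)  { return !!(sk_fd_bitmap[SK_WORD(fd)] & SK_BIT(fd)); }

/* Server side, shaped like the sock_get_poll_events() check: a
 * client-monitored fd reports no epoll interest at all. */
static int sk_sock_get_poll_events(int fd, int wanted_events)
{
    if (sk_bitmap_test(fd)) return -1;  /* client owns monitoring via io_uring */
    return wanted_events;
}
```

Because the bit is set before the interception point returns, there is no window in which both the server's epoll and the client's io_uring could monitor the same fd.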

Why previous approaches failed (4 attempts):

  1. Inline NtSetEvent from CQ drain: signal reentrancy crash – NtSetEvent requires Wine’s signal manipulation, which is unsafe from the CQ-drain context.
  2. Deferred completion flush: Deadlock – event must be signaled during the wait to wake inproc_wait, not after.
  3. Direct ntsync ioctl: Double completion – set_async_direct_result(PENDING) restarted the async, server monitored via epoll AND io_uring monitored → race.
  4. Bitmap after set_async_direct_result: Same race – bitmap set too late, async already restarted.

Sync vs Overlapped Path

Sync path:
  ALERTED block: intercept, submit POLL_ADD
  Return: wait_async(wait_handle) – blocks
  CQE wakeup: ntsync uring_fd → retry loop drains CQ → set_async_direct_result → ntsync signals wait_handle
  Fallback (EAGAIN in CQE): set_async_direct_result(PENDING) → server restarts async → epoll

Overlapped path:
  ALERTED block: same – intercept, submit POLL_ADD
  Return: STATUS_PENDING – returns immediately
  CQE wakeup: set_async_direct_result → server signals event/IOCP
  Fallback (EAGAIN in CQE): same

Global Lock Impact

Every socket fd removed from server epoll = one fewer event in main_loop_epoll() = shorter global_lock hold per iteration. This benefits ALL shmem dispatchers, not just the thread doing socket I/O.

Before: main_loop_epoll() processes N socket events while holding global_lock
After:  main_loop_epoll() processes (N - M) events (M = client-monitored sockets)

Test Results (v4-overlapped, 2026-04-15)

Phase A (immediate recv):  2000/2000, avg 95us, p99 162us
Phase B (overlapped recv): 2000/2000, avg 113us, p99 189us, 2000 async (PENDING)
Full suite: 22/22 PASS (11 tests x baseline + rt)


7. File Manifest

New Files

File Lines Purpose
dlls/ntdll/unix/io_uring.c ~760 Per-thread ring management, all Phase 1-3 functions, pool allocator

Modified Files

File Changed Lines Purpose
dlls/ntdll/unix/unix_private.h +30 io_uring function declarations, bitmap helpers
dlls/ntdll/unix/file.c ~30 Sync poll replacement + async read/write bypass
dlls/ntdll/unix/socket.c ~120 Phase 3: ALERTED interception, CQE handler, bitmap set/clear
dlls/ntdll/unix/sync.c ~40 ntsync uring_fd retry loop, deferred completion flush
dlls/ntdll/unix/server.c +2 Completion drain at server_select/server_wait
dlls/ntdll/unix/thread.c +1 Ring cleanup at thread exit
server/sock.c ~40 E2 bitmap check in sock_get_poll_events, bitmap cache on struct sock
dlls/ntdll/Makefile.in +2 io_uring.c source + URING_LIBS
configure.ac +8 liburing detection

Build Dependency

liburing.so.2 (system package). Available as liburing in Arch Linux [extra]. Detected at configure time via AC_CHECK_LIB(uring, io_uring_queue_init).


8. Status Summary

Component Status Test Coverage
io_uring ring management COMMITTED All tests PASS
Phase 1: sync poll replacement COMMITTED All tests PASS
Phase 2: async file I/O bypass COMMITTED All tests PASS
Pool allocator (TLS, 32 ops) COMMITTED RT-safe, zero malloc in submit path
Phase 3: socket I/O (sync) COMMITTED (cc9610c187c) socket-io Phase A 2000/2000
Phase 3: socket I/O (overlapped) COMMITTED (a645adc66ed) socket-io Phase B 2000/2000
E2 bitmap (server sock.c) COMMITTED sock_get_poll_events returns -1 for client-monitored fds
ntsync uring_fd extension COMMITTED (kernel patch) Enables CQE wakeup in ntsync waits
ntsync PI kmalloc fix COMMITTED (kernel patch) Fixes __schedule_bug on PREEMPT_RT
ntsync PI v2 (kernel driver) COMMITTED 3 bugs fixed, ntsync 8/8 PASS
Shmem PI futex redesign REJECTED Deadlocks on SMP
ntsync URING_CMD Shelved ntsync is synchronous
io_uring futex PI Research Kernel lacks PI flag

Next Actions

  1. Profile: measure server epoll fd reduction and global_lock hold time under socket load
  2. Test with real socket-heavy applications (networked DAWs, MIDI over network)
  3. Investigate multishot recv + provided buffers for streaming socket optimization
  4. Investigate K3 (ntsync multishot poll) for event-driven audio loops