Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-15 | Author: Jordan Johnston
This document is the deep-dive companion to the Wine-NSPA Architecture Overview. It covers the design decisions, tradeoffs, and implementation details of io_uring integration across all three phases. For background on Wine’s I/O model and why io_uring matters for RT audio, see the architecture page.
The two bottlenecks targeted by this work:

1. Per-operation syscall overhead. A server-mediated async read costs 4+ kernel transitions; the io_uring path reduces it to a single batched io_uring_enter.
2. global_lock hold time. Fewer server-monitored fds = shorter hold = less contention for shmem dispatchers.

| NSPA Component | io_uring Interaction |
|---|---|
| Shmem IPC (v1.5) | Orthogonal. Shmem handles request/reply protocol. io_uring handles file/socket I/O. Different fd sets, no conflict. |
| PI global_lock | Indirect benefit. Fewer fds in server epoll = shorter main loop iterations = shorter global_lock hold. |
| NTSync (/dev/ntsync) | Integrated (Phase 3). ntsync uring_fd extension wakes threads blocked in ntsync waits when io_uring CQEs arrive. The pad field in ntsync_wait_args carries the io_uring eventfd; kernel returns NTSYNC_INDEX_URING_READY on CQE. This enables sync socket waits to drain CQEs inline. ntsync PI v2 kernel fixes committed independently. |
| CS-PI (FUTEX_LOCK_PI) | No conflict. io_uring operations happen client-side in ntdll, never acquiring server locks. |
| RT scheduling (SCHED_FIFO/RR) | Compatible. COOP_TASKRUN ensures completions run in the submitting thread’s context, preserving RT priority. |
| Manual PI boost (v2.5) | No conflict. v2.5 (cached sched state) is shipped. FUTEX_LOCK_PI redesign was attempted and REJECTED (SMP deadlocks — see Section 7). |
Rémi Bernon attempted a full wineserver main-loop replacement with io_uring circa 2021-2022 (gitlab.winehq.org/rbernon/wine, branch archive/iouring). It was abandoned because io_uring was immature at the time – missing features, kernel bugs, API instability. That approach replaced the server’s epoll entirely (~500 LOC across server/fd.c, request.c, thread.c).
Wine-NSPA’s approach is fundamentally different: with the shmem fast path already handling request/reply IPC, the server main loop is no longer the bottleneck. Instead, we target the remaining server-dependent paths – file and socket I/O – from the client side, keeping changes isolated in a new io_uring.c file with minimal modifications to existing code.
- `IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_COOP_TASKRUN`: per-thread rings, so no cross-thread submission and no locking.
- COOP_TASKRUN ensures CQE processing happens in the submitting thread’s context – preserving SCHED_FIFO priority. No kernel worker threads at default priority.
- Every entry point returns `-ENOSYS` if the ring is unavailable. Callers fall back to the existing code path. Wine apps see identical behavior whether io_uring is present or not.
- `dup()` the unix fd before submitting to the ring. The duplicate is owned by the in-flight SQE and closed on CQE completion. This prevents use-after-close if the server-side handle table changes during the operation.
- Completions are drained at the `server_select()` and `server_wait()` entry points – the natural places where a thread is about to block. This ensures completions are delivered promptly without adding threads or signals.
- All new code is isolated in one file (`io_uring.c`, ~760 lines). Existing files get thin conditionals: `if (!ntdll_io_uring_submit_*()) { return STATUS_PENDING; } /* fallback */`.
- A per-thread pool of 32 `uring_async_op` structs (matches RING_SIZE). Freelist-based O(1) alloc/free – no malloc/free in the submit path. Initialized once at ring setup.
Client Thread Wineserver
───────────── ──────────
NtReadFile(async)
server_get_unix_fd() ──→ get_handle_fd
register_async() ──→ register_async
queue_async(&fd->read_q)
set_fd_events(POLLIN) ←── epoll monitors fd
return STATUS_PENDING
... main_loop_epoll():
(thread does other work) global_lock.lock()
epoll_pwait2() → fd ready
fd_poll_event → async_wake_up(STATUS_ALERTED)
global_lock.unlock()
...
(thread enters alertable wait)
async_read_proc():
server_get_unix_fd() ──→ get_handle_fd (again)
read(fd, buf, len)
set IOSB, signal event
Syscalls per async read: 2 server round-trips (register + get_fd) + epoll_wait + read = 4+ kernel transitions
Client Thread Wineserver
───────────── ──────────
NtReadFile(async)
server_get_unix_fd() ──→ get_handle_fd (cached, usually no round-trip)
dup(unix_fd) → ring_fd
io_uring_prep_read(ring_fd, buf, len)
io_uring_submit() (server never sees this I/O)
return STATUS_PENDING
...
(thread enters server_wait)
ntdll_io_uring_process_completions():
CQE ready → bytes_read
file_complete_async()
close(ring_fd)
Syscalls per async read: 1 io_uring_enter (submit+wait batched) = 1 kernel transition
The server is completely bypassed for the I/O monitoring and data transfer. It still handles the initial fd lookup (usually cached) and completion port notifications if needed.
Before: After:
poll(fd, POLLIN, timeout) ntdll_io_uring_poll(fd, POLLIN, timeout)
read(fd, buf, len) read(fd, buf, len) ← unchanged
The read()/write() still goes through virtual_locked_read() for write-watch safety. Only the poll wait is replaced.
Status: COMMITTED (2d903dda200)
In NtReadFile and NtWriteFile, the synchronous blocking path does poll(fd, events, timeout) to wait for fd readiness, then loops back to read()/write(). Phase 1 replaces the poll() call with ntdll_io_uring_poll().
| File | Change | LOC |
|---|---|---|
| `dlls/ntdll/unix/file.c` | NtReadFile sync wait: `poll()` → `ntdll_io_uring_poll()` with fallback | ~15 |
| `dlls/ntdll/unix/file.c` | NtWriteFile sync wait: same pattern | ~15 |
If ntdll_io_uring_poll() returns -ENOSYS (ring unavailable), the code immediately falls back to the original poll() call. Zero behavioral change for non-io_uring systems.
The actual read()/write() still uses virtual_locked_read() / virtual_locked_pread(), which handles EFAULT from write-watched pages by retrying inside virtual_mutex. io_uring only replaces the poll wait, not the data transfer.
Status: COMMITTED (2d903dda200), pool allocator COMMITTED (82cea7143ff)
When NtReadFile or NtWriteFile would register an async operation with the server (register_async / SERVER_START_REQ(register_async)), the io_uring path instead:
1. dup()s the unix fd for lifetime safety
2. submits IORING_OP_READ or IORING_OP_WRITE to the per-thread ring
3. returns STATUS_PENDING
4. completes the async on the next server_wait() or server_select() call

The current async model re-fetches the unix fd via server_get_unix_fd() in every callback. io_uring requires the fd to remain valid for the duration of the in-flight SQE. Solution: dup() the fd before submission. The duplicate is owned exclusively by the uring_async_op struct and closed on CQE completion or cancellation.
This is analogous to the thread-map fd reference pattern used elsewhere in NSPA – hold a ref for the lifetime of the operation.
CQEs are drained cooperatively:
// server.c — called before any blocking wait
unsigned int server_select(…) {
ntdll_io_uring_process_completions(); // drain CQ
…
}
unsigned int server_wait(…) {
ntdll_io_uring_process_completions(); // drain CQ
…
}
When a CQE arrives, complete_uring_op() translates the result to NTSTATUS and calls file_complete_async() – the same function used by Wine’s normal async completion path. This handles:
- event signaling (NtSetEvent)
- APC delivery (NtQueueApcThread)
- completion port notification (add_completion())

If io_uring’s kernel read hits EFAULT (buffer in a write-watched page), the CQE result is -EFAULT. The completion handler detects this and frees the operation – the caller should retry through the server async path, which uses virtual_locked_read() with proper page fault handling. This is the graceful fallback for an edge case that rarely occurs in practice.
| File | Change | LOC |
|---|---|---|
| `dlls/ntdll/unix/file.c` | Async read: try io_uring before `register_async_file_read()` | ~8 |
| `dlls/ntdll/unix/file.c` | Async write: try io_uring before server registration | ~10 |
| `dlls/ntdll/unix/server.c` | Completion drain in `server_select()` and `server_wait()` | +2 |
| `dlls/ntdll/unix/thread.c` | Ring cleanup in `pthread_exit_wrapper()` | +1 |
Status: COMMITTED — sync + overlapped WORKING (a645adc66ed)
Socket I/O (sock_recv / sock_send) is tightly coupled with the server’s async lifecycle:
- the server allocates a wait_handle for the async operation
- the server tracks socket event state (pending_events, reported_events)
- the client calls set_async_direct_result() to report completion back to the server
- sock_get_poll_events() in the server decides what to monitor based on queue state

Unlike file I/O (where the server’s only role is fd monitoring), the socket code has the server actively participating in the protocol state machine.
| Option | Description | Benefit | Verdict |
|---|---|---|---|
| B1: Server flag | Add `client_poll` flag to recv/send. Server skips epoll. | Clean, ~50 LOC | Evaluated, not used |
| B2: Both poll | Server + client both monitor. First wins. | No global_lock benefit | Rejected |
| E2: Shared bitmap | Process-level bitmap. Client sets bit per fd. Server checks in `sock_get_poll_events()`. | No protocol change. Generalizes. | IMPLEMENTED |
| C: Full bypass | Skip server entirely for connected TCP. | Breaks socket state machine | Rejected |
The key innovation is ALERTED-state interception – intercepting in the ALERTED block before set_async_direct_result is called:
Server: recv_socket → STATUS_ALERTED + wait_handle
Client: try_recv(fd) → EAGAIN (not ready)
← interception point
BEFORE: set_async_direct_result(PENDING) ← would restart async on server
NOW: set bitmap + io_uring POLL_ADD ← async stays ALERTED (frozen)
return STATUS_PENDING
… io_uring monitors fd …
CQE fires:
try_recv(fd) → SUCCESS (data available)
set_async_direct_result(SUCCESS, bytes) ← server accepts (ALERTED preserved)
Server: completes async, signals event/IOCP
Why this works: When an async is ALERTED on the server, terminated=1 and async_waiting() returns false. The server does not monitor the fd via epoll. The bitmap provides additional safety (sock_get_poll_events returns -1). Only one call to set_async_direct_result ever happens – from the CQE handler with the final result.
Why previous approaches failed (4 attempts):
set_async_direct_result(PENDING) restarted the async, so the server monitored the fd via epoll AND io_uring monitored it → race.

| | Sync | Overlapped |
|---|---|---|
| ALERTED block | Intercept, submit POLL_ADD | Same |
| Return | wait_async(wait_handle) – blocks | STATUS_PENDING – returns immediately |
| CQE wakeup | ntsync uring_fd → retry loop drains CQ → set_async_direct_result → ntsync signals wait_handle | set_async_direct_result → server signals event/IOCP |
| Fallback (EAGAIN in CQE) | set_async_direct_result(PENDING) → server restarts async → epoll | Same |
Every socket fd removed from server epoll = one fewer event in main_loop_epoll() = shorter global_lock hold per iteration. This benefits ALL shmem dispatchers, not just the thread doing socket I/O.
Before: main_loop_epoll() processes N socket events while holding global_lock
After: main_loop_epoll() processes (N - M) events (M = client-monitored sockets)
Phase A (immediate recv): 2000/2000, avg 95us, p99 162us
Phase B (overlapped recv): 2000/2000, avg 113us, p99 189us, 2000 async (PENDING)
Full suite: 22/22 PASS (11 tests x baseline + rt)
| File | Lines | Purpose |
|---|---|---|
| `dlls/ntdll/unix/io_uring.c` | ~760 | Per-thread ring management, all Phase 1-3 functions, pool allocator |

| File | Changed Lines | Purpose |
|---|---|---|
| `dlls/ntdll/unix/unix_private.h` | +30 | io_uring function declarations, bitmap helpers |
| `dlls/ntdll/unix/file.c` | ~30 | Sync poll replacement + async read/write bypass |
| `dlls/ntdll/unix/socket.c` | ~120 | Phase 3: ALERTED interception, CQE handler, bitmap set/clear |
| `dlls/ntdll/unix/sync.c` | ~40 | ntsync uring_fd retry loop, deferred completion flush |
| `dlls/ntdll/unix/server.c` | +2 | Completion drain at `server_select`/`server_wait` |
| `dlls/ntdll/unix/thread.c` | +1 | Ring cleanup at thread exit |
| `server/sock.c` | ~40 | E2 bitmap check in `sock_get_poll_events`, bitmap cache on `struct sock` |
| `dlls/ntdll/Makefile.in` | +2 | io_uring.c source + URING_LIBS |
| `configure.ac` | +8 | liburing detection |
The only new dependency is liburing.so.2 (system package), available as liburing in Arch Linux [extra]. It is detected at configure time via AC_CHECK_LIB(uring, io_uring_queue_init).
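The configure-time detection would look roughly like the standard autoconf pattern below. Only the AC_CHECK_LIB call is taken from the text; the HAVE_LIBURING define and the URING_LIBS wiring are a plausible sketch, not the actual patch.

```
dnl Sketch of liburing detection (names beyond AC_CHECK_LIB are assumptions).
AC_CHECK_LIB(uring, io_uring_queue_init,
             [AC_DEFINE(HAVE_LIBURING, 1, [Define if liburing is available.])
              URING_LIBS="-luring"],
             [URING_LIBS=""])
AC_SUBST(URING_LIBS)
```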
| Component | Status | Test Coverage |
|---|---|---|
| io_uring ring management | COMMITTED | All tests PASS |
| Phase 1: sync poll replacement | COMMITTED | All tests PASS |
| Phase 2: async file I/O bypass | COMMITTED | All tests PASS |
| Pool allocator (TLS, 32 ops) | COMMITTED | RT-safe, zero malloc in submit path |
| Phase 3: socket I/O (sync) | COMMITTED (cc9610c187c) | socket-io Phase A 2000/2000 |
| Phase 3: socket I/O (overlapped) | COMMITTED (a645adc66ed) | socket-io Phase B 2000/2000 |
| E2 bitmap (server sock.c) | COMMITTED | sock_get_poll_events returns -1 for client-monitored fds |
| ntsync uring_fd extension | COMMITTED (kernel patch) | Enables CQE wakeup in ntsync waits |
| ntsync PI kmalloc fix | COMMITTED (kernel patch) | Fixes __schedule_bug on PREEMPT_RT |
| ntsync PI v2 (kernel driver) | COMMITTED | 3 bugs fixed, ntsync 8/8 PASS |
| Shmem PI futex redesign | REJECTED | Deadlocks on SMP |
| ntsync URING_CMD | Shelved | ntsync is synchronous |
| io_uring futex PI | Research | Kernel lacks PI flag |