Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-18 | Author: Jordan Johnston
Footnote: Why memfd and not session shmem
Wine’s windowing model routes every PostMessage / SendMessage call through the wineserver: the sender writes a request, the server allocates a struct message, inserts it into the receiver’s queue, the receiver polls via GetMessage / PeekMessage (another wineserver round-trip), and for synchronous sends a reply_message round-trip closes the loop. On a typical RT audio workload this costs hundreds to thousands of wineserver RTTs per second – the NSPA profiler captured 6,239 send_message RTTs / 60 s from Ableton Live’s AudioCalc thread alone during a single adversarial recording session.
Wine-NSPA message bypass replaces that round-trip chain for same-process cross-thread window messages with a direct shared-memory ring:
The sender writes the message into a ring slot and wakes the receiver directly via its NTSync event; the receiver consumes the slot without a get_message server request. The feature is invisible to Win32 applications – the same PostMessage / SendMessage API, the same delivery semantics, the same window procedure dispatch. It is same-process-only by design (cross-process messaging continues through the server, because HWND handles and pointer-carrying WPARAM / LPARAM values only make sense in the sender’s address space and handle table).
| Source | RTTs / 60 s (bypass off) |
|---|---|
| AudioCalc threads (send_message) | 6,239 |
| DWM-Sync (posts + sync sends) | several thousand |
| Total busy Ableton playback traffic | ~500 – 1000 / sec |
The bypass targets the AudioCalc + DWM-Sync -> MainThread hot path that dominates this profile.
| Component | Interaction |
|---|---|
| Shmem IPC (v1.5) | Orthogonal. Shmem IPC handles request/reply protocol for ntdll <-> wineserver. Msg bypass is a peer-to-peer window-message path that sidesteps the server entirely. |
| NTSync (/dev/ntsync) | Direct wake. Sender calls wine_server_signal_internal_sync() on the receiver’s queue sync event – an ntsync ioctl, no wineserver round-trip. Receiver wakes via ntsync_schedule. |
| PI global_lock | Load relief. Every bypass message is one fewer send_message request the server handles under global_lock. Reduces contention for shmem dispatchers. |
| CS-PI (FUTEX_LOCK_PI) | No conflict. Bypass operates in client code only; no server locks are acquired on the fast path. |
| RT scheduling (SCHED_FIFO/RR) | RT-safe fast path. After warm-up, a bypass POST/SEND is atomic CAS + memory reads/writes on mlock()-pinned memory. No syscalls, no page faults. |
| io_uring I/O bypass | Compatible, independent. Different bottleneck, different ring. |
Design principles:

- Each thread’s message queue owns a memfd_create() region containing its bypass ring. The server allocates on demand; the client receives the fd via SCM_RIGHTS and mmaps it locally. Rings never live in Wine’s session shmem (§8 explains why that matters).
- Clients issue nspa_get_thread_queue requests early.
- The fast path returns FALSE / NULL on any corner case (bypass disabled, no own ring, lookup failure, cross-process destination, DDE message, thread-message with hwnd == 0). Callers fall back to the legacy wineserver path. Wine apps see identical behavior whether bypass is enabled or not.
- The fast path uses __atomic_* operations only. Shared memory is MAP_POPULATE-prefaulted and mlock-pinned so no demand paging happens on hot access. Warm-up cost (memfd create + map + lock) is paid once per peer, off the RT-critical path.
- Each layer is gated by an NSPA_* environment variable. Users can enable the layers they trust and fall back to stricter Phase 3 behavior by unsetting the later gates.
- Two server requests carry the protocol: nspa_get_thread_queue (peer lookup) and nspa_ensure_own_bypass (own bootstrap). Server handlers reuse a single nspa_alloc_bypass_shm() helper for both.

The reduction is not just in RTT count – vanilla Wine’s send_message handler acquires global_lock to insert the new message into the receiver’s queue. Under heavy traffic this contended mutex becomes a serialization point. The memfd ring sidesteps global_lock entirely: slot reservation is a lock-free CAS in shared memory.
Each msg_queue owns a nspa_queue_bypass_shm_t region consisting of a 64-slot forward-message ring and a 16-slot reply ring. Sizes are compile-time constants chosen to keep the whole region around 10 KB.
The forward ring’s state transitions are CAS-claimed (multi-producer head, single-consumer tail). The reply ring’s generation field discriminates against stale writebacks: a sender that times out marks its slot FREE and a later writer that finds state != PENDING drops the reply without touching user memory.
| From | To | Actor | Semantic |
|---|---|---|---|
| EMPTY | WRITING | sender | ring_reserve_slot CAS on head allocates; the transition happens pre-fill |
| WRITING | READY | sender | release store after all slot fields written |
| READY | CONSUMED | receiver | consume via CAS in client pump OR server arbitration – whichever wins |
| CONSUMED | EMPTY | receiver | batched run at tail advance after consumption |
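The sender-side transitions above can be sketched in plain C11 atomics. This is a simplified, hypothetical layout – the real nspa_queue_bypass_shm_t and ring_reserve_slot carry more fields and checks – but it shows the multi-producer claim (EMPTY -> WRITING via CAS) and the release publish (WRITING -> READY):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified slot states and ring layout (hypothetical names). */
enum slot_state { SLOT_EMPTY, SLOT_WRITING, SLOT_READY, SLOT_CONSUMED };

#define RING_SLOTS 64  /* forward-ring size from the design */

struct ring_slot {
    _Atomic unsigned state;
    uintptr_t hwnd;
    unsigned  msg;
    uintptr_t wparam, lparam;
};

struct msg_ring {
    _Atomic unsigned head;           /* multi-producer claim counter */
    unsigned tail;                   /* single consumer */
    _Atomic unsigned pending_count;
    struct ring_slot slots[RING_SLOTS];
};

/* EMPTY -> WRITING: producers race on head; the CAS on the slot's
 * state is the arbitration point.  Returns slot index, or -1 when the
 * ring is full (caller falls back to the wineserver path). */
static int ring_reserve_slot(struct msg_ring *ring)
{
    for (unsigned tries = 0; tries < RING_SLOTS; tries++)
    {
        unsigned idx = atomic_fetch_add_explicit(&ring->head, 1,
                                memory_order_relaxed) % RING_SLOTS;
        unsigned expected = SLOT_EMPTY;
        if (atomic_compare_exchange_strong_explicit(&ring->slots[idx].state,
                &expected, SLOT_WRITING,
                memory_order_acquire, memory_order_relaxed))
            return (int)idx;
    }
    return -1;
}

/* WRITING -> READY: release store only after all slot fields are
 * written, so the consumer's acquire load sees a complete message. */
static void ring_publish_slot(struct msg_ring *ring, int idx,
                              uintptr_t hwnd, unsigned msg,
                              uintptr_t wparam, uintptr_t lparam)
{
    struct ring_slot *slot = &ring->slots[idx];
    slot->hwnd = hwnd; slot->msg = msg;
    slot->wparam = wparam; slot->lparam = lparam;
    atomic_fetch_add_explicit(&ring->pending_count, 1, memory_order_relaxed);
    atomic_store_explicit(&slot->state, SLOT_READY, memory_order_release);
}
```

Because the claim is a per-slot CAS rather than a lock, a stalled producer in WRITING never blocks other producers – they simply claim different slots.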
Every queue’s bypass region is backed by an anonymous memfd_create() file. The fd’s lifetime follows the queue: created on first use, closed on queue destroy. Clients that need to talk to a peer receive the fd over the wineserver socket via SCM_RIGHTS and mmap it locally.
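The allocation side of this lifecycle is ordinary Linux plumbing. A minimal sketch, with hypothetical helper names (the fd passing over the wineserver socket via SCM_RIGHTS is omitted here); the MAP_POPULATE prefault and mlock pin match the RT-safety rules stated earlier:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Server side: back a bypass region with an anonymous memfd. */
static int create_bypass_memfd(size_t size)
{
    int fd = memfd_create("nspa-msg-ring", MFD_CLOEXEC);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) < 0) { close(fd); return -1; }
    return fd;
}

/* Client side: mmap the received fd, prefault (MAP_POPULATE) and pin
 * (mlock) so the RT fast path never takes a page fault. */
static void *map_bypass_region(int fd, size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    if (ptr == MAP_FAILED) return NULL;
    mlock(ptr, size);  /* best effort: failure only re-allows paging */
    return ptr;
}
```

Both sides of the fd see the same physical pages, so a CAS performed through one mapping is visible through the other – exactly the property the ring protocol relies on.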
Wine’s session shmem is a single large mmap’d region shared across the entire wineserver + all client processes. alloc_shared_object() sub-allocates from that region via a bump allocator with per-object headers containing an id and a seqlock seq. This is great for small, long-lived shared records (process/thread metadata, desktop state).
The original msg-bypass implementation put the ring inside session shmem too. It produced a reliable, reproducible Ableton regression: library panel would not populate whenever MainThread had a ring allocated – and every runtime gate tested showed the bug persisted independent of ring reads/writes. The allocation machinery itself was the trigger. See §8 for the full story.
memfd gives each queue a private page-aligned region with zero interaction with session shmem’s allocator, seqlock, or layout. The library regression vanished on the redesign.
Warm-cache RT cost estimate (sender side, steady state):
| Step | Cost |
|---|---|
| TLS cache lookup (hash + probe) | ~10 ns |
| ring_reserve_slot CAS on head | ~20 ns |
| 10 aligned stores (filling the slot) | ~10 ns |
| Release store on state | ~5 ns |
| wine_server_signal_internal_sync (ntsync ioctl) | 200 – 500 ns kernel round-trip |
| Total per message (warm) | < 1 µs |
Cold-cache cost (first publish to a peer) adds one SERVER_START_REQ(nspa_get_thread_queue) round-trip + SCM_RIGHTS receive + mmap + mlock – ~50 – 200 µs one-time, paid off the RT path.
Fall-back conditions (fast path returns FALSE -> caller falls back to server):
- NSPA_ENABLE_MSG_RING env var unset (default)
- dest_tid == current tid (same-thread PostMessage)
- hwnd == 0 (thread message; semantics require server handling)
- dest is in a different process
- WM_DDE_FIRST..WM_DDE_LAST
- MSG_CALLBACK / MSG_HOOK_LL types
- NSPA_ENABLE_OWN_BOOTSTRAP unset (opt-in gate)

The receiver side has two concerns: (a) learn that a ring message has arrived, and (b) pull it out and deliver it to the window proc.
NtUserGetQueueStatus, check_queue_bits, and the message pump’s local shmem check all need to report ring-pending activity alongside legacy wake_bits in the queue’s shmem. Phase 4.5 (1d18cb7c4e8) wires this through the Phase 4 TLS-cached own-ring mmap:
```c
const nspa_queue_bypass_shm_t *queue_bypass = nspa_get_own_bypass_shm_public();
if (queue_bypass) {
    UINT ring_total = __atomic_load_n(&queue_bypass->nspa_msg_ring.pending_count, ACQUIRE);
    UINT ring_send  = __atomic_load_n(&queue_bypass->nspa_msg_ring.pending_send_count, ACQUIRE);
    if (ring_total > ring_send) ring_bits |= QS_POSTMESSAGE | QS_ALLPOSTMESSAGE;
    if (ring_send) ring_bits |= QS_SENDMESSAGE;
    wake |= ring_bits;
}
```
Without this synthesis, check_queue_bits reports “nothing to do” even after the sender’s ntsync wake – the thread sleeps through ring deliveries until some unrelated event prompts it to call get_message, producing the 5 s dispatch-latency timeout seen pre-4.5.
Phase 4.6 (657c16691f5) adds nspa_try_pop_own_ring_send(), called from peek_message before the SERVER_START_REQ(get_message) block:
```c
if (signal_bits & QS_SENDMESSAGE &&
    nspa_try_pop_own_ring_send(hwnd, first, last, &pop_type, ..., &pop_win)) {
    /* Filled info struct directly; skip server RTT entirely */
}
else SERVER_START_REQ(get_message) {
    /* Legacy path: server arbitrates ring + legacy queue */
}
```
The client pop scans its own ring for a READY SEND-class slot, CAS-claims it (READY -> CONSUMED), decrements pending_count/pending_send_count, advances tail over any leading run of CONSUMED slots, and fills the received_message_info for the message pump. Same-process window filter (hwnd != 0) falls back to server because is_child_window needs server’s window tree.
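The consume side of that pop can be sketched as two small steps on a simplified ring layout (hypothetical names, same caveats as before): a CAS claim of READY -> CONSUMED that safely races with server arbitration, followed by the batched CONSUMED -> EMPTY tail advance:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified states and layout (hypothetical names). */
enum slot_state { SLOT_EMPTY, SLOT_WRITING, SLOT_READY, SLOT_CONSUMED };
#define RING_SLOTS 64

struct ring_slot {
    _Atomic unsigned state;
    unsigned  msg;
    uintptr_t hwnd, wparam, lparam;
};

struct msg_ring {
    _Atomic unsigned head;
    unsigned tail;                   /* single consumer: plain field */
    _Atomic unsigned pending_count;
    struct ring_slot slots[RING_SLOTS];
};

/* READY -> CONSUMED via CAS.  The server arbitration path may race
 * for the same slot; whoever wins the CAS delivers the message.
 * Returns the claimed slot index, or -1 if nothing is ready. */
static int ring_pop_ready(struct msg_ring *ring)
{
    for (unsigned i = 0; i < RING_SLOTS; i++)
    {
        unsigned idx = (ring->tail + i) % RING_SLOTS;
        unsigned expected = SLOT_READY;
        if (atomic_compare_exchange_strong_explicit(&ring->slots[idx].state,
                &expected, SLOT_CONSUMED,
                memory_order_acquire, memory_order_relaxed))
        {
            atomic_fetch_sub_explicit(&ring->pending_count, 1,
                                      memory_order_relaxed);
            return (int)idx;
        }
    }
    return -1;
}

/* CONSUMED -> EMPTY: advance tail over the leading run of consumed
 * slots in one batch, returning them to the producers. */
static void ring_advance_tail(struct msg_ring *ring)
{
    for (;;)
    {
        struct ring_slot *slot = &ring->slots[ring->tail % RING_SLOTS];
        unsigned expected = SLOT_CONSUMED;
        if (!atomic_compare_exchange_strong_explicit(&slot->state,
                &expected, SLOT_EMPTY,
                memory_order_release, memory_order_relaxed))
            break;
        ring->tail++;
    }
}
```

Deferring the EMPTY transition to the tail advance (rather than freeing each slot at consume time) keeps slot reuse ordered behind the consumer, so producers can never overwrite a slot the pump is still reading.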
When the window proc returns for a ring-origin SEND, reply_message() writes the result back:
```c
if (info->nspa_sender_tid && remove) {
    if (nspa_write_ring_reply(info->nspa_sender_tid, info->nspa_reply_slot,
                              result, NULL, 0))
        return; /* direct ring write + signal -- no server */
    /* fall through to server reply_message on stale-slot */
}
```
nspa_write_ring_reply resolves the sender via the peer cache (nspa_lookup_peer(sender_tid), which mmaps the sender’s ring on first use), writes the LRESULT into the sender’s reply slot, and signals the sender’s queue sync event. The sender’s nspa_wait_ring_reply wakes, reads the result, and returns to the caller.
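The stale-writeback guard from the reply-ring design (§ state machine: a sender that times out frees its slot, and a late writer that no longer finds PENDING must drop the reply without touching user memory) can be illustrated with a minimal generation-checked slot. All names here are hypothetical simplifications, not the real nspa_write_ring_reply:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum reply_state { REPLY_FREE, REPLY_PENDING, REPLY_WRITING, REPLY_DONE };

struct reply_slot {
    _Atomic unsigned state;
    _Atomic unsigned generation;  /* bumped each time the slot is re-armed */
    intptr_t result;
};

/* Receiver side: write the LRESULT back only if the slot still holds
 * the generation the sender armed it with AND is still PENDING.  A
 * sender that timed out has marked the slot FREE (or re-armed it with
 * a new generation), so a late writer drops the reply. */
static bool write_ring_reply(struct reply_slot *slot, unsigned gen,
                             intptr_t result)
{
    if (atomic_load_explicit(&slot->generation, memory_order_acquire) != gen)
        return false;                       /* slot reused: stale reply */
    unsigned expected = REPLY_PENDING;
    if (!atomic_compare_exchange_strong_explicit(&slot->state, &expected,
            REPLY_WRITING, memory_order_acquire, memory_order_relaxed))
        return false;                       /* not PENDING: drop reply */
    slot->result = result;                  /* slot is exclusively ours now */
    atomic_store_explicit(&slot->state, REPLY_DONE, memory_order_release);
    return true;  /* caller then signals the sender's queue sync event */
}
```

The intermediate WRITING state makes the result store safe: only the CAS winner may touch `result`, and the sender reads it only after observing DONE with acquire ordering.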
Each capability layers on the previous. Stricter configs trade capture coverage for safety during validation.
Recommended configurations by use case:
| Use case | Env vars |
|---|---|
| Production-safe baseline | none set |
| POST capture only | NSPA_ENABLE_MSG_RING=1 + NSPA_FORCE_SPECULATIVE_ALLOC=1 |
| Full capture (DAW workloads) | all four layers set |
| Regression debug | set NSPA_MSG_RING_EXCLUDE_MAIN=1 alongside any other config |
Exact A/B on a real DAW isn’t practical without a scripted identical workload. Ballpark from measured data: bypass OFF vs msg-ring full (all four opt-in layers enabled, 90 s active Ableton playback + GUI interaction).
| Metric | Bypass OFF | msg-ring full | Delta |
|---|---|---|---|
| Wineserver message RTTs / sec | 500 – 1000 | 250 – 500 | ~halved |
| Ring-arbitrated dispatches / sec | 0 | ~450 | ring active |
| Ring allocations | 0 | 14 thread rings | on demand |
| MainThread idle wait state | ntsync_schedule | ntsync_schedule | same |
| Stale-slot / timeout errors | n/a | 0 | clean dispatch |
Ballpark ~200 – 500 wineserver RTTs / sec eliminated on the hot message path during busy playback. Server-side CPU spent in send_message + get_posted_message handlers roughly halves.
The ring also shortens the hot path for latency-sensitive traffic:
| Operation | Bypass OFF | msg-ring |
|---|---|---|
| CAS slot claim | N/A (server allocates struct message) | ~20 ns |
| Message publish | 2 wineserver RTTs (~10 – 100 µs each) | 1 ntsync wake ioctl (~200 – 500 ns) |
| Message consume | 1 wineserver RTT | ~1 µs CAS + local read |
| SEND reply | 1 wineserver RTT | CAS + ntsync wake on sender |
For RT-critical senders (AudioCalc, DWM-Sync, winejack JACK callback), removing server round-trips from the message-send path directly shortens the hot loop, not just reduces total throughput.
MainThread stays parked in ntsync_schedule between events under active GUI stress.

| File | Role | Additions |
|---|---|---|
| server/queue.c | Memfd alloc/free, nspa_ensure_shared, nspa_get_thread_queue handler, nspa_ensure_own_bypass handler, ring arbitration | ~300 LOC |
| server/protocol.def | nspa_get_thread_queue + nspa_ensure_own_bypass requests, ring slot + ring struct definitions | ~100 LOC |
| server/mapping.c | Alloc-side-effect probe gates (NO_POISON, ID_STRIDE) | ~35 LOC |
| dlls/win32u/nspa_msg_ring.c | Client cache, try_post_ring, try_send_ring, wait_ring_reply, write_ring_reply, get_own_bypass_shm, try_pop_own_ring_send, tail advance | ~900 LOC (new file) |
| dlls/win32u/win32u_private.h | Msg-ring public prototypes | ~15 LOC |
| dlls/win32u/message.c | Integration in send_inter_thread_message, put_message_in_queue, peek_message, reply_message, SEH wrap | ~80 LOC |
| dlls/win32u/input.c | Wake-bit synthesis through own ring | ~10 LOC |
| dlls/win32u/winstation.c | get_queue_bypass_shm delegates to msg-ring public | ~20 LOC |
| dlls/ntdll/unix/server.c, include/wine/server.h | Export wine_server_receive_fd for cross-unixlib use | ~2 LOC |
| dlls/wow64/sync.c, dlls/ntdll/ntdll.spec | Wow64 thunks for Nt* condvar PI syscalls | ~30 LOC |
| nspa/docs/msg-ring-memfd-redesign.md | Implementation plan + phase notes | ~400 LOC |
| nspa/docs/msg-ring-architecture.md | This document | – |
Total msg-ring feature: ~1500 LOC new code, ~250 LOC edits in existing files.
| Commit | Phase / change | Outcome |
|---|---|---|
| c9daf83afe9 | alloc-side-effect isolation probes (NO_POISON, ID_STRIDE) | Ruled out poison fill, ID sensitivity |
| 924df6727db | opt-in NSPA_MSG_RING_EXCLUDE_MAIN gate | Workaround for library panel; first-thread specific |
| 54802c6351d | Memfd redesign implementation plan doc | Plan captured |
| 75de316f9ad | Phase 1 + 2 memfd alloc + client mmap | POST capture validated (~95 RTTs/s saved) |
| 4eaf876a118 | Phase 4 ensure_own_bypass protocol + client TLS | SEND infrastructure in place |
| 106735ff791 | Phase 4 opt-in gate (default off) | Avoided premature-default stale-slot storm |
| 1d18cb7c4e8 | Phase 4.5 wake-bit synthesis via memfd | Fixed client-side wake-bit blindness |
| 657c16691f5 | Phase 4.6 client-side ring-SEND dispatch | Full SEND bypass validated |
| 70ea71f8c7b | Design doc update with ballpark reduction | Docs current |
Remaining cleanups and known limits:

- Retire the bypass_locator.id != 0 sentinel in favor of a proper bypass_fd_valid protocol field. Pure cleanup, no functional change.
- Audit check_queue_bits for any remaining paths that short-circuit without the ring-aware bits (audit pending).
- MSG_CALLBACK / MSG_CALLBACK_RESULT stay on server (callback needs the server’s sender-callback state).

Today’s scope captures roughly 48 % of get_posted_message server calls on busy Ableton playback (the 43,559 / 90,681 ring-arbitrated fraction from §9). The remaining 52 % is split across several message categories the ring doesn’t yet handle. Ranked by expected server-load reduction:
Tier 1:

| # | Extension | Why it matters | Complexity |
|---|---|---|---|
| 1 | WM_TIMER local injection | DAWs + editors pump UI at 60+ Hz with timers. Each tick = 1 wineserver RTT today to post WM_TIMER to the target thread. Per-thread timers in a playing Ableton: hundreds of RTTs / sec. Biggest single server-load source we still hit. | High – client-side timer expiry tracking + direct ring injection; must stay consistent with server’s SetTimer / KillTimer state. |
| 2 | WM_PAINT / invalidation cascade | DWM-Sync-driven redraw loops + InvalidateRect chains fire many paint-chain messages. Common in refresh-intensive apps (DAWs, editors, video, games). Hundreds of RTTs / sec in busy UI. | High – paint message generation + coalescing currently server-side; extending the ring to signal invalidation between peer threads requires careful coordination. |
Tier 2:

| # | Extension | Why it matters | Complexity |
|---|---|---|---|
| 3 | Thread messages (hwnd == 0) | PostThreadMessage is used heavily by WebView2, auth pumps, COM apartments, custom event loops. Quiet in Ableton specifically but broadens coverage to browser / enterprise workloads. | Low – ring slots already carry sender/dest tid; mostly a matter of removing the “skip thread-msg” early-return and adjusting dispatch fallback. |
| 4 | Hook dispatch (WH_CALLWNDPROC, WH_GETMESSAGE, etc.) | Every message passing through a thread with an in-process hook = one server call to hook-dispatch today. Acts as a multiplier on other captures: the more messages hook, the more the existing capture compounds into further savings. | Medium – hook chain ordering + filter semantics must match exactly. Same-process hooks are ring-eligible. |
Tier 3:

| # | Extension | Why it matters | Complexity |
|---|---|---|---|
| 5 | Packed-data SEND (MSG_OTHER_PROCESS same-process) | Registered window messages with lparam pointing at variable-size structs. Used for cross-AS or serialised IPC. Rare in pure DAW workloads. | Medium – add a side-channel memfd for payloads; the slot holds (offset, size) into it. |
| 6 | Callback SEND (MSG_CALLBACK / MSG_CALLBACK_RESULT) | SendMessageCallbackA/W async-reply pattern. Not high-frequency in most apps. | Medium – extend the reply ring to invoke the callback locally on reply arrival; handle callback identity + security. |
| 7 | SendInput intra-process synthesis | Rare automation use cases where apps inject input into their own threads. | Low – small addition but very niche. |
thread_input is a different shared structure with server-owned state; it is out of scope for the per-thread bypass ring.

For raw wineserver-load reduction on Ableton-style workloads, Tier 1 (timer + paint) is the clear target. For broadening to non-DAW apps (WebView2, COM-heavy tools, browsers embedded in apps), Tier 2 item 3 (thread messages) is the cheapest high-leverage add. Tier 2–3 items are best adopted opportunistically as specific apps demand them.
Any of these extensions reuses the existing ring memory layout + cache discipline + fast-path atomics – the extension work is entirely in the “which message types get ring-delivered and how” layer, not in the memfd / seqlock / cross-thread plumbing.
The memfd design was not the initial plan. The first msg-ring implementation put the per-queue ring inside Wine’s session shmem via alloc_shared_object() – natural given the existing machinery. That produced a reliable Ableton Live regression: the library panel would not populate whenever the ring allocation happened for the process’s first thread (MainThread).
A systematic A/B matrix ruled out every runtime code path that reads or writes the ring. Gates tested (each with bypass on, each in isolation):
| Gate | Subsystem disabled | Library panel |
|---|---|---|
| NSPA_MSG_RING_SERVER_NO_RING_ARB | ring arbitration in get_posted / get_message | broken |
| NSPA_MSG_RING_SERVER_NO_WAKE_SYN | wake-bit synthesis in is_signaled | broken |
| NSPA_MSG_RING_SERVER_NO_SEQ | per-message post_seq / change_ack_seq atomics | broken |
| NSPA_MSG_RING_SERVER_NO_LOCATOR | zero the wire locator (keep alloc) | broken |
| NSPA_MSG_RING_SERVER_NO_POISON | skip mark_block_uninitialized 0x55 fill | broken |
| NSPA_MSG_RING_SERVER_ID_STRIDE=1 | bump last_object_id by 65536 (ID range) | broken |
| NSPA_CLIENT_IGNORE_LOCATOR | client never resolves ring (no reads) | broken |
| NSPA_MSG_RING_SERVER_NO_ALLOC | skip alloc_shared_object entirely | works |
| NSPA_MSG_RING_EXCLUDE_MAIN | block alloc only for first-thread queue | works |
Every identifiable runtime side-effect (poison fill, ID bump, locator publish, seqlock ops) was proven innocent. The bug sat in the mere presence of a session_object_t entry + its shared_object_t header inside the shared session for the process’s first thread. The specific mechanism was never isolated further – all named side-effects were ruled out, leaving only a memory-layout / seqlock-interaction class of cause.
Moving the ring to a per-queue memfd eliminates all of it: no session_object_t entry, no shared_object_t header bump, no queue_shm_t locator publish, no interaction with session shmem’s bump allocator. The ring protocol itself (slot layout, state machine, cache discipline, fast paths) was unchanged – only the allocation + discovery layer swapped. Library regression resolved end-to-end.
See also:
- nspa/docs/architecture.md – full Wine-NSPA architecture reference
- nspa/docs/msg-ring-memfd-redesign.md – implementation plan + phase notes
- nspa/docs/io_uring-architecture.md – orthogonal I/O bypass layer