Wine-NSPA – Message Bypass Architecture

Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-18 | Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Design Principles
  3. Vanilla Wine vs Wine-NSPA Message Ring
  4. Ring Layout in Shared Memory
  5. Memfd Lifecycle
  6. Client Fast Path (POST + SEND)
  7. Dispatch Path
  8. Opt-In Gating
  9. Results & Ballpark Reduction
  10. File Manifest
  11. Phase History
  12. Future Extensions (ranked)

Footnote: Why memfd and not session shmem


1. Overview

Wine’s windowing model routes every PostMessage / SendMessage call through the wineserver: the sender writes a request, the server allocates a struct message, inserts it into the receiver’s queue, the receiver polls via GetMessage / PeekMessage (another wineserver round-trip), and for synchronous sends a reply_message round-trip closes the loop. On a typical RT audio workload this costs hundreds to thousands of wineserver RTTs per second – the NSPA profiler captured 6,239 send_message RTTs / 60 s from Ableton Live’s AudioCalc thread alone during a single adversarial recording session.

Wine-NSPA message bypass replaces that round-trip chain for same-process cross-thread window messages with a direct shared-memory ring:

  1. Sender writes the message into the receiver thread’s ring and wakes the receiver via an NTSync event
  2. Receiver’s message pump reads the message out of the ring locally (no get_message server request)
  3. For synchronous SENDs the receiver writes the reply back into the sender’s reply ring and signals

The feature is invisible to Win32 applications – the same PostMessage / SendMessage API, the same delivery semantics, the same window procedure dispatch. It is same-process-only by design (cross-process messaging continues through the server because values such as HWND / WPARAM / LPARAM only make sense in the sender’s address space and handle table).
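
The public API is untouched; internally every call follows a try-fast-path-or-fall-back discipline. A minimal sketch of that decision, with illustrative names (peer_state and try_ring_publish are not Wine-NSPA symbols) – the fast path only ever declines, and correctness always has the server path to fall back to:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool same_process;   /* receiver thread lives in this process */
    bool ring_mapped;    /* peer's ring already mmap'd (warm cache) */
    bool slot_free;      /* forward ring not full */
} peer_state;

/* Returns true when the message can go on the ring; false means the
 * caller takes the normal wineserver send_message path. */
static bool try_ring_publish(const peer_state *p)
{
    if (!p->same_process) return false;  /* cross-process: server only */
    if (!p->ring_mapped)  return false;  /* cold cache: pay the RTT once */
    if (!p->slot_free)    return false;  /* ring full: server absorbs overflow */
    return true;                         /* fast path: CAS + ntsync wake */
}
```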

Motivating Profile

Source RTTs (bypass off)
AudioCalc threads (send_message) 6,239 / 60 s
DWM-Sync (posts + sync sends) several thousand / 60 s
Total (busy Ableton playback) ~500 – 1000 / sec

The bypass targets the AudioCalc + DWM-Sync -> MainThread hot path that dominates this profile.

Relationship to Existing NSPA Infrastructure

Component Interaction
Shmem IPC (v1.5) Orthogonal. Shmem IPC handles request/reply protocol for ntdll <-> wineserver. Msg bypass is a peer-to-peer window-message path that sidesteps the server entirely.
NTSync (/dev/ntsync) Direct wake. Sender calls wine_server_signal_internal_sync() on the receiver’s queue sync event – a ntsync ioctl, no wineserver round-trip. Receiver wakes via ntsync_schedule.
PI global_lock Load relief. Every bypass message is one fewer send_message request the server handles under global_lock. Reduces contention for shmem dispatchers.
CS-PI (FUTEX_LOCK_PI) No conflict. Bypass operates in client code only; no server locks are acquired on the fast path.
RT scheduling (SCHED_FIFO/RR) RT-safe fast path. After warm-up, a bypass POST/SEND is atomic CAS + memory reads/writes on mlock()-pinned memory. No syscalls, no page faults.
io_uring I/O bypass Compatible, independent. Different bottleneck, different ring.

2. Design Principles


3. Vanilla Wine vs Wine-NSPA Message Ring

Vanilla Wine (server-mediated):

  1. PostMessage / SendMessage (ntuser)
  2. SERVER: send_message request – alloc struct message + insert queue (global_lock held during insertion)
  3. set_queue_bits + sync wake
  4. receiver wakes (NtWaitForMultipleObjects)
  5. SERVER: get_message request – remove from queue, copy to reply
  6. dispatch window proc (SEND only)
  7. reply_message RTT

  Cost per send: 2 wineserver RTTs (POST), 3 (SEND); global_lock held during every insertion.

Wine-NSPA msg-ring (memfd):

  1. PostMessage / SendMessage (ntuser)
  2. nspa_try_post_ring() / try_send_ring()
  3. ring_reserve_slot (CAS head++); write fields to slot, state -> READY
  4. (SEND) reserve reply slot in own ring
  5. wine_server_signal_internal_sync() – ntsync ioctl -> rt_mutex wake (no RTT)
  6. receiver wakes (ntsync_schedule)
  7. nspa_try_pop_own_ring_send() – CAS READY -> CONSUMED, fill info
  8. dispatch window proc (SEND)
  9. nspa_write_ring_reply() – write to sender's reply slot + wake

  Cost per send (warm ring): 0 wineserver RTTs for POST or SEND; 1 ntsync wake ioctl (kernel fast path).

The reduction is not just in RTT count – vanilla Wine’s send_message handler acquires global_lock to insert the new message into the receiver’s queue. Under heavy traffic this contended mutex becomes a serialization point. The memfd ring sidesteps global_lock entirely: slot reservation is a lock-free CAS in shared memory.


4. Ring Layout in Shared Memory

Each msg_queue owns a nspa_queue_bypass_shm_t region consisting of a 64-slot forward-message ring and a 16-slot reply ring. Sizes are compile-time constants chosen to keep the whole region around 10 KB.

nspa_queue_bypass_shm_t (memfd, ~10 KB, mlock'd)

  nspa_msg_ring_t (forward, 64 slots):

    head                                monotonic
    tail                                consumer cursor
    active                              0 = ring disabled
    pending_count + pending_send_count
    change_seq / change_ack_seq         wake-bit change detection
    slots[0..63]                        nspa_msg_slot_t (~128 B each):
                                        state | type, msg | win (hwnd) | wparam / lparam |
                                        sender_tid | sender_pid | reply_slot | post_seq |
                                        time | x / y | data_size | __pad

    State machine: EMPTY -> WRITING -> READY -> CONSUMED -> EMPTY

  nspa_reply_ring_t (16 slots, per-queue):

    next_alloc                          monotonic reservation hint (CAS FREE -> PENDING)
    slots[0..15]                        nspa_reply_slot_t:
                                        state | result (LRESULT) | generation | data[inline]

    State: FREE -> PENDING (sender) -> READY (receiver writes) -> FREE (sender reads)
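
The forward-slot layout can be sketched as a compilable struct. Field names follow the diagram, but the ordering here is rearranged (64-bit fields first) so natural alignment pads the struct to the ~128 B the document quotes – the real header's exact layout may differ:

```c
#include <assert.h>
#include <stdint.h>

enum nspa_slot_state { SLOT_EMPTY, SLOT_WRITING, SLOT_READY, SLOT_CONSUMED };

typedef struct
{
    uint64_t win;           /* hwnd */
    uint64_t wparam, lparam;
    uint64_t post_seq;
    uint32_t state;         /* EMPTY -> WRITING -> READY -> CONSUMED */
    uint32_t type, msg;
    uint32_t sender_tid, sender_pid;
    int32_t  reply_slot;    /* index into sender's reply ring; -1 for POST */
    uint32_t time;
    uint32_t data_size;
    int16_t  x, y;
    uint8_t  pad[60];       /* inline data / reserved, pads to 128 B */
} nspa_msg_slot_sketch;

_Static_assert(sizeof(nspa_msg_slot_sketch) == 128, "one slot = two cache lines");
```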

The forward ring’s state transitions are CAS-claimed (multi-producer head, single-consumer tail). The reply ring’s generation field discriminates against stale writebacks: a sender that times out marks its slot FREE and a later writer that finds state != PENDING drops the reply without touching user memory.
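
The stale-writeback guard can be sketched as follows. Names are illustrative, and an intermediate WRITING state is added here to make the claim explicit – a sender that times out bumps generation and frees the slot, so a late receiver fails either check and drops the reply without touching the payload:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { REPLY_FREE, REPLY_PENDING, REPLY_WRITING, REPLY_READY };

typedef struct
{
    _Atomic uint32_t state;
    _Atomic uint32_t generation;   /* bumped by the sender on each reservation */
    int64_t result;                /* LRESULT */
} reply_slot_sketch;

/* Receiver-side writeback: returns 1 if the reply landed, 0 if dropped. */
static int write_reply(reply_slot_sketch *s, uint32_t gen, int64_t result)
{
    uint32_t expect = REPLY_PENDING;
    if (atomic_load_explicit(&s->generation, memory_order_acquire) != gen)
        return 0;                              /* slot recycled: stale */
    if (!atomic_compare_exchange_strong(&s->state, &expect, REPLY_WRITING))
        return 0;                              /* not pending: stale or raced */
    s->result = result;                        /* safe: we own the slot */
    atomic_store_explicit(&s->state, REPLY_READY, memory_order_release);
    return 1;
}
```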

Forward Slot State Machine

From To Actor Semantic
EMPTY WRITING sender ring_reserve_slot CAS on head allocates the slot; the transition happens before field fill
WRITING READY sender release store after all slot fields are written
READY CONSUMED receiver claimed via CAS in the client pump OR server arbitration – whichever wins
CONSUMED EMPTY receiver batched transition as tail advances over the consumed run

5. Memfd Lifecycle

Every queue’s bypass region is backed by an anonymous memfd_create() file. The fd’s lifetime follows the queue: created on first use, closed on queue destroy. Clients that need to talk to a peer receive the fd over the wineserver socket via SCM_RIGHTS and mmap it locally.

memfd allocation + fd passing + client mmap

  wineserver – nspa_alloc_bypass_shm():

    1. memfd_create(MFD_CLOEXEC)
    2. ftruncate(fd, sizeof(ring))
    3. mmap(fd, RW, SHARED)
    4. memset(map, 0); active = 1

    queue->nspa_bypass_fd = fd; queue->nspa_shared = map
    nspa_get_thread_queue handler: send_client_fd(fd, sync_handle)

  Client thread (ntdll Unix):

    1. SERVER_START_REQ(nspa_get_thread_queue): wine_server_call(req); check reply->bypass_locator.id
    2. wine_server_receive_fd(&token): recvmsg(..., SCM_RIGHTS); match token == sync_handle
    3. mmap(fd, RW, SHARED | MAP_POPULATE): prefault all pages
    4. mlock(map, size): pin in RAM, no RT page faults

  Kernel: anonymous shmem pages, single physical backing. The fd is reference-counted; pages unlink when the last mapping is released AND the fd is closed. The SCM_RIGHTS fd crosses the wineserver socket, so the server-side and client-side maps cover the same physical pages.

  Lifetime rules:

    1. Server holds the fd until msg_queue_destroy -> nspa_free_bypass_shm
    2. Each client holds one mmap (+ a kernel reference via its page tables)
    3. Clients close the fd immediately after mmap – the mapping holds the kernel reference
    4. On queue destroy: server closes fd + unmaps; client maps drain naturally
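
The server-side allocation steps can be demonstrated standalone (the real symbol is nspa_alloc_bypass_shm; this version only shows the memfd_create + ftruncate + mmap sequence and the "mapping holds the kernel reference" lifetime rule; requires Linux with glibc >= 2.27 for memfd_create):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_bypass_region(size_t size, int *fd_out)
{
    int fd = memfd_create("nspa-msg-ring", MFD_CLOEXEC);
    if (fd < 0) return NULL;
    if (ftruncate(fd, size) < 0) { close(fd); return NULL; }
    void *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    if (map == MAP_FAILED) { close(fd); return NULL; }
    memset(map, 0, size);   /* zero ring state; real code then sets active = 1 */
    mlock(map, size);       /* best effort: pin pages for the RT fast path */
    *fd_out = fd;           /* server keeps the fd for SCM_RIGHTS passing */
    return map;
}
```

A client that receives this fd can close it immediately after its own mmap – the page tables keep the backing alive until the last mapping is gone.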

Why memfd, Not Session Shmem

Wine’s session shmem is a single large mmap’d region shared across the entire wineserver + all client processes. alloc_shared_object() sub-allocates from that region via a bump allocator with per-object headers containing an id and a seqlock seq. This is great for small, long-lived shared records (process/thread metadata, desktop state).

The original msg-bypass implementation put the ring inside session shmem too. It produced a reliable, reproducible Ableton regression: the library panel would not populate whenever MainThread had a ring allocated, and every runtime gate tested showed the bug persisting independently of ring reads and writes. The allocation machinery itself was the trigger. See the footnote for the full story.

memfd gives each queue a private page-aligned region with zero interaction with session shmem’s allocator, seqlock, or layout. The library regression vanished on the redesign.


6. Client Fast Path (POST + SEND)

Client fast path: cache lookup + ring publish + ntsync wake

  Per-thread cache (pthread_key TLS), nspa_cache[32], open-addressed hash on tid:

    entry.tid            0 = empty slot
    entry.sync_handle    peer's queue sync event
    entry.mapped_ptr     peer's ring mmap (client view)
    entry.mapped_size    munmap on evict/clear

  Negative cache: tid set, ptr NULL – avoids an RTT on every subsequent call.
  Own-ring TLS slot: NULL / -1 / valid-ptr sentinel, bootstrapped via ensure_own_bypass.

  POST / SEND publish sequence (warm cache):

    1. nspa_lookup_peer(dest_tid) – cached entry yields ptr + sync_handle
    2. ring_reserve_slot: CAS head, ring->head += 1
    3. slot->type = type; slot->wparam = wp; slot->lparam = lp; slot->reply_slot = ...
    4. __atomic_store_n(&slot->state, READY, RELEASE)
    5. wine_server_signal_internal_sync(peer_sync_handle)
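
The warm-cache publish sequence can be sketched as a standalone multi-producer ring. Names are illustrative; the real path follows the release store with the ntsync wake ioctl (wine_server_signal_internal_sync), omitted here:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64
enum { S_EMPTY, S_WRITING, S_READY, S_CONSUMED };

typedef struct { _Atomic uint32_t state; uint32_t msg; uint64_t wparam, lparam; } slot_sk;
typedef struct { _Atomic uint32_t head, tail; slot_sk slots[RING_SLOTS]; } ring_sk;

/* CAS-claim an index off the monotonic head, fill the slot, then
 * release-store READY so the consumer's acquire load observes complete
 * fields. Returns 0 (ring full) to tell the caller to fall back. */
static int ring_publish(ring_sk *r, uint32_t msg, uint64_t wp, uint64_t lp)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    for (;;)
    {
        if (head - atomic_load_explicit(&r->tail, memory_order_acquire) >= RING_SLOTS)
            return 0;   /* full: caller takes the server path */
        if (atomic_compare_exchange_weak(&r->head, &head, head + 1)) break;
    }
    slot_sk *s = &r->slots[head % RING_SLOTS];
    s->msg = msg; s->wparam = wp; s->lparam = lp;
    atomic_store_explicit(&s->state, S_READY, memory_order_release);
    return 1;
}
```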

Warm-cache RT cost estimate (sender side, steady state):

Step Cost
TLS cache lookup (hash + probe) ~10 ns
ring_reserve_slot CAS on head ~20 ns
10 aligned stores (filling the slot) ~10 ns
Release store on state ~5 ns
wine_server_signal_internal_sync (ntsync ioctl) 200 – 500 ns kernel round-trip
Total per message (warm) < 1 µs

Cold-cache cost (first publish to a peer) adds one SERVER_START_REQ(nspa_get_thread_queue) round-trip + SCM_RIGHTS receive + mmap + mlock – ~50 – 200 µs one-time, paid off the RT path.
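
The TLS cache that separates the warm and cold paths can be sketched as a tiny open-addressed table (the trivial linear-probe hash and the names are illustrative, not the Wine-NSPA code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 32

/* An entry with tid set but ring == NULL is the negative cache: the peer
 * is known to have no ring, so subsequent calls skip the server RTT. */
typedef struct { uint32_t tid; void *ring; } cache_entry;

/* Returns the entry for tid, a free slot to install into, or NULL when
 * the table is full (caller evicts or falls back to the server). */
static cache_entry *cache_probe(cache_entry *tbl, uint32_t tid)
{
    for (unsigned i = 0; i < CACHE_SLOTS; i++)
    {
        cache_entry *e = &tbl[(tid + i) % CACHE_SLOTS];
        if (e->tid == tid || e->tid == 0) return e;   /* hit, or free slot */
    }
    return NULL;
}
```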

Fall-back conditions (fast path returns FALSE -> caller falls back to server):


7. Dispatch Path

The receiver side has two concerns: (a) learn that a ring message has arrived, and (b) pull it out and deliver it to the window proc.

Wake-bit Synthesis

NtUserGetQueueStatus, check_queue_bits, and the message pump’s local shmem check all need to report ring-pending activity alongside legacy wake_bits in the queue’s shmem. Phase 4.5 (1d18cb7c4e8) wires this through the Phase 4 TLS-cached own-ring mmap:

```c
const nspa_queue_bypass_shm_t *queue_bypass = nspa_get_own_bypass_shm_public();
if (queue_bypass)
{
    UINT ring_total = __atomic_load_n(&queue_bypass->nspa_msg_ring.pending_count, ACQUIRE);
    UINT ring_send  = __atomic_load_n(&queue_bypass->nspa_msg_ring.pending_send_count, ACQUIRE);
    if (ring_total > ring_send) ring_bits |= QS_POSTMESSAGE | QS_ALLPOSTMESSAGE;
    if (ring_send) ring_bits |= QS_SENDMESSAGE;
    wake |= ring_bits;
}
```

Without this synthesis, check_queue_bits reports “nothing to do” even after the sender’s ntsync wake – the thread sleeps through ring deliveries until some unrelated event prompts it to call get_message, producing the 5 s dispatch-latency timeout seen pre-4.5.

Client-Side Ring Pop

Phase 4.6 (657c16691f5) adds nspa_try_pop_own_ring_send(), called from peek_message before the SERVER_START_REQ(get_message) block:

```c
if (signal_bits & QS_SENDMESSAGE &&
    nspa_try_pop_own_ring_send(hwnd, first, last, &pop_type, ..., &pop_win))
{
    /* Filled info struct directly; skip server RTT entirely */
}
else SERVER_START_REQ(get_message)
{
    /* Legacy path: server arbitrates ring + legacy queue */
}
```

The client pop scans its own ring for a READY SEND-class slot, CAS-claims it (READY -> CONSUMED), decrements pending_count / pending_send_count, advances tail over any leading run of CONSUMED slots, and fills the received_message_info for the message pump. Pops with a window filter (hwnd != 0) fall back to the server, because is_child_window needs the server's window tree.
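
The pop-and-recycle sequence can be sketched standalone (illustrative names; the SEND-class filtering and pending-count bookkeeping of the real nspa_try_pop_own_ring_send are omitted):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64
enum { S_EMPTY, S_WRITING, S_READY, S_CONSUMED };

typedef struct { _Atomic uint32_t state; uint32_t msg; } slot_sk;
typedef struct { _Atomic uint32_t head, tail; slot_sk slots[RING_SLOTS]; } ring_sk;

/* CAS-claim the first READY slot (READY -> CONSUMED), then advance tail
 * over the leading run of CONSUMED slots, recycling each to EMPTY. */
static int ring_pop(ring_sk *r, uint32_t *msg_out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    for (uint32_t i = tail; i != head; i++)
    {
        slot_sk *s = &r->slots[i % RING_SLOTS];
        uint32_t expect = S_READY;
        if (!atomic_compare_exchange_strong(&s->state, &expect, S_CONSUMED))
            continue;   /* still being written, or claimed by server arbitration */
        *msg_out = s->msg;
        while (tail != head)   /* batched tail advance over the CONSUMED run */
        {
            expect = S_CONSUMED;
            if (!atomic_compare_exchange_strong(&r->slots[tail % RING_SLOTS].state,
                                                &expect, S_EMPTY))
                break;
            atomic_store_explicit(&r->tail, ++tail, memory_order_release);
        }
        return 1;
    }
    return 0;
}
```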

Reply Path

When the window proc returns for a ring-origin SEND, reply_message() writes the result back:

```c
if (info->nspa_sender_tid && remove)
{
    if (nspa_write_ring_reply(info->nspa_sender_tid, info->nspa_reply_slot,
                              result, NULL, 0))
        return;  /* direct ring write + signal -- no server */
    /* fall through to server reply_message on stale-slot */
}
```

nspa_write_ring_reply resolves the sender via the peer cache (nspa_lookup_peer(sender_tid), which mmaps the sender’s ring on first use), writes the LRESULT into the sender’s reply slot, and signals the sender’s queue sync event. The sender’s nspa_wait_ring_reply wakes, reads the result, and returns to the caller.


8. Opt-In Gating

Each capability layers on the previous. Stricter configs trade capture coverage for safety during validation.

Capability layering via env gates:

  Default                          msg-ring inert – vanilla Wine behaviour, zero-regression
  + ENABLE_MSG_RING                POST capture (when own ring exists or speculative on); wake-bit fix active
  + FORCE_SPECULATIVE_ALLOC        peer rings allocated on nspa_get_thread_queue call; aggressive peer coverage
  + ENABLE_OWN_BOOTSTRAP           caller's own reply ring via nspa_ensure_own_bypass; SEND path capable
  + ENABLE_CLIENT_RING_DISPATCH    client-side pop of SEND msgs in peek_message (no server RTT); FULL msg-ring active – library + MIDI + audio + UI all OK

Safety / diagnostic gates:

  NSPA_MSG_RING_EXCLUDE_MAIN    block alloc for the process's first thread
  NSPA_DISABLE_MSG_RING         force-off override
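
How the layers compose can be sketched as plain getenv checks. The full NSPA_ENABLE_* spellings of the inner layers and the exact precedence of the force-off override are assumptions inferred from the gate names in this section, not verified against the source:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int gate_on(const char *name)
{
    const char *v = getenv(name);
    return v && *v && strcmp(v, "0") != 0;
}

/* Illustrative: client-side ring dispatch requires every lower layer,
 * and the force-off override beats everything. */
static int client_ring_dispatch_allowed(void)
{
    if (gate_on("NSPA_DISABLE_MSG_RING")) return 0;       /* force-off wins */
    return gate_on("NSPA_ENABLE_MSG_RING")                /* base POST layer */
        && gate_on("NSPA_ENABLE_OWN_BOOTSTRAP")           /* SEND capable */
        && gate_on("NSPA_ENABLE_CLIENT_RING_DISPATCH");   /* client-side pop */
}
```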

Recommended configurations by use case:

Use case Env vars
Production-safe baseline none set
POST capture only NSPA_ENABLE_MSG_RING=1 + NSPA_FORCE_SPECULATIVE_ALLOC=1
Full capture (DAW workloads) all four layers set
Regression debug set NSPA_MSG_RING_EXCLUDE_MAIN=1 alongside any other config

9. Results & Ballpark Reduction

Exact A/B on a real DAW isn’t practical without a scripted identical workload. Ballpark from measured data: bypass OFF vs msg-ring full (all four opt-in layers enabled, 90 s active Ableton playback + GUI interaction).

Server message-path load

Metric Bypass OFF msg-ring full Delta
Wineserver message RTTs / sec 500 – 1000 250 – 500 ~halved
Ring-arbitrated dispatches / sec 0 ~450 ring active
Ring allocations 0 14 thread rings on demand
MainThread idle wait state ntsync_schedule ntsync_schedule same
Stale-slot / timeout errors n/a 0 clean dispatch

Ballpark ~200 – 500 wineserver RTTs / sec eliminated on the hot message path during busy playback. Server-side CPU spent in send_message + get_posted_message handlers roughly halves.

Latency-bound benefits (ns vs µs)

The ring also shortens the hot path for latency-sensitive traffic:

Operation Bypass OFF msg-ring
CAS slot claim N/A (server allocates struct message) ~20 ns
Message publish 2 wineserver RTTs (~10 – 100 µs each) 1 ntsync wake ioctl (~200 – 500 ns)
Message consume 1 wineserver RTT ~1 µs CAS + local read
SEND reply 1 wineserver RTT CAS + ntsync wake on sender

For RT-critical senders (AudioCalc, DWM-Sync, winejack JACK callback), removing server round-trips from the message-send path shortens every iteration of the hot loop directly, rather than merely lowering aggregate server load.

Qualitative wins


10. File Manifest

File Role Additions
server/queue.c Memfd alloc/free, nspa_ensure_shared, nspa_get_thread_queue handler, nspa_ensure_own_bypass handler, ring arbitration ~300 LOC
server/protocol.def nspa_get_thread_queue + nspa_ensure_own_bypass requests, ring slot + ring struct definitions ~100 LOC
server/mapping.c Alloc-side-effect probe gates (NO_POISON, ID_STRIDE) ~35 LOC
dlls/win32u/nspa_msg_ring.c Client cache, try_post_ring, try_send_ring, wait_ring_reply, write_ring_reply, get_own_bypass_shm, try_pop_own_ring_send, tail advance ~900 LOC (new file)
dlls/win32u/win32u_private.h Msg-ring public prototypes ~15 LOC
dlls/win32u/message.c Integration in send_inter_thread_message, put_message_in_queue, peek_message, reply_message, SEH wrap ~80 LOC
dlls/win32u/input.c Wake-bit synthesis through own ring ~10 LOC
dlls/win32u/winstation.c get_queue_bypass_shm delegates to msg-ring public ~20 LOC
dlls/ntdll/unix/server.c, include/wine/server.h Export wine_server_receive_fd for cross-unixlib use ~2 LOC
dlls/wow64/sync.c, dlls/ntdll/ntdll.spec Wow64 thunks for Nt* condvar PI syscalls ~30 LOC
nspa/docs/msg-ring-memfd-redesign.md Implementation plan + phase notes 400 LOC
nspa/docs/msg-ring-architecture.md This document

Total msg-ring feature: ~1500 LOC new code, ~250 LOC edits in existing files.


11. Phase History

Commit Phase Outcome
c9daf83afe9 alloc-side-effect isolation probes (NO_POISON, ID_STRIDE) Ruled out poison fill, ID sensitivity
924df6727db opt-in NSPA_MSG_RING_EXCLUDE_MAIN gate Workaround for library panel; first-thread specific
54802c6351d Memfd redesign implementation plan doc Plan captured
75de316f9ad Phase 1 + 2 memfd alloc + client mmap POST capture validated (~95 RTTs/s saved)
4eaf876a118 Phase 4 ensure_own_bypass protocol + client TLS SEND infrastructure in place
106735ff791 Phase 4 opt-in gate (default off) Avoided premature-default stale-slot storm
1d18cb7c4e8 Phase 4.5 wake-bit synthesis via memfd Fixed client-side wake-bit blindness
657c16691f5 Phase 4.6 client-side ring-SEND dispatch Full SEND bypass validated
70ea71f8c7b Design doc update with ballpark reduction Docs current

What’s next

Known scope limits


12. Future Extensions (ranked by wineserver-load reduction)

Today’s scope captures roughly 48 % of get_posted_message server calls on busy Ableton playback (the 43,559 / 90,681 ring-arbitrated fraction from §9). The remaining 52 % is split across several message categories the ring doesn’t yet handle. Ranked by expected server-load reduction:

Tier 1 – highest traffic, biggest win

# Extension Why it matters Complexity
1 WM_TIMER local injection DAWs + editors pump UI at 60+ Hz with timers. Each tick = 1 wineserver RTT today to post WM_TIMER to the target thread. Per-thread timers in a playing Ableton: hundreds of RTTs / sec. Biggest single server-load source we still hit. High – client-side timer expiry tracking + direct ring injection; must stay consistent with server’s SetTimer / KillTimer state.
2 WM_PAINT / invalidation cascade DWM-Sync-driven redraw loops + InvalidateRect chains fire many paint-chain messages. Common in refresh-intensive apps (DAWs, editors, video, games). Hundreds of RTTs / sec in busy UI. High – paint message generation + coalescing currently server-side; extending ring to signal invalidation between peer threads requires careful coordination.

Tier 2 – broader app coverage, medium win

# Extension Why it matters Complexity
3 Thread messages (hwnd == 0) PostThreadMessage used heavily by WebView2, auth pumps, COM apartments, custom event loops. Quiet in Ableton specifically but broadens coverage to browser / enterprise workloads. Low – ring slots already carry sender/dest tid; mostly a matter of removing the “skip thread-msg” early-return and adjusting dispatch fallback.
4 Hook dispatch (WH_CALLWNDPROC, WH_GETMESSAGE, etc.) Every message passing through a thread with an in-process hook = one server call to hook-dispatch today. Acts as a multiplier on other captures: the more messages hook, the more the existing capture compounds into further savings. Medium – hook chain ordering + filter semantics must match exactly. Same-process hooks are ring-eligible.

Tier 3 – targeted unlocks, smaller per-extension win

# Extension Why it matters Complexity
5 Packed-data SEND (MSG_OTHER_PROCESS same-process) Registered window messages with lparam pointing at variable-size structs. Used for cross-AS or serialised IPC. Rare in pure DAW workloads. Medium – add a side-channel memfd for payloads; slot holds (offset, size) into it.
6 Callback SEND (MSG_CALLBACK / MSG_CALLBACK_RESULT) SendMessageCallbackA/W async-reply pattern. Not high-frequency in most apps. Medium – extend reply ring to invoke callback locally on reply arrival; handle callback identity + security.
7 SendInput intra-process synthesis Rare automation use cases where apps inject input into their own threads. Low – small addition but very niche.

Out-of-scope (architectural mismatches)

Prioritization heuristic

For raw wineserver-load reduction on Ableton-style workloads, Tier 1 (timer + paint) is the clear target. For broadening to non-DAW apps (WebView2, COM-heavy tools, browsers embedded in apps), Tier 2 item 3 (thread messages) is the cheapest high-leverage add. Tiers 2-3 items are best adopted opportunistically as specific apps demand them.

Each of these extensions reuses the existing ring memory layout, cache discipline, and fast-path atomics – the extension work lives entirely in the “which message types get ring-delivered and how” layer, not in the memfd / cross-thread plumbing.


Footnote: Why memfd and not session shmem

The memfd design was not the initial plan. The first msg-ring implementation put the per-queue ring inside Wine’s session shmem via alloc_shared_object() – natural given the existing machinery. That produced a reliable Ableton Live regression: the library panel would not populate whenever the ring allocation happened for the process’s first thread (MainThread).

A systematic A/B matrix ruled out every runtime code path that reads or writes the ring. Gates tested (each with bypass on, each in isolation):

Gate Subsystem disabled Library panel
NSPA_MSG_RING_SERVER_NO_RING_ARB ring arbitration in get_posted / get_message broken
NSPA_MSG_RING_SERVER_NO_WAKE_SYN wake-bit synthesis in is_signaled broken
NSPA_MSG_RING_SERVER_NO_SEQ per-message post_seq / change_ack_seq atomics broken
NSPA_MSG_RING_SERVER_NO_LOCATOR zero the wire locator (keep alloc) broken
NSPA_MSG_RING_SERVER_NO_POISON skip mark_block_uninitialized 0x55 fill broken
NSPA_MSG_RING_SERVER_ID_STRIDE=1 bump last_object_id by 65536 (ID range) broken
NSPA_CLIENT_IGNORE_LOCATOR client never resolves ring (no reads) broken
NSPA_MSG_RING_SERVER_NO_ALLOC skip alloc_shared_object entirely works
NSPA_MSG_RING_EXCLUDE_MAIN block alloc only for first-thread queue works

Every identifiable runtime side-effect (poison fill, ID bump, locator publish, seqlock ops) was proven innocent. The bug sat in the mere presence of a session_object_t entry + its shared_object_t header inside the shared session for the process’s first thread. The specific mechanism was never isolated further – all named side-effects were ruled out, leaving only a memory-layout / seqlock-interaction class of cause.

Moving the ring to a per-queue memfd eliminates all of it: no session_object_t entry, no shared_object_t header bump, no queue_shm_t locator publish, no interaction with session shmem’s bump allocator. The ring protocol itself (slot layout, state machine, cache discipline, fast paths) was unchanged – only the allocation + discovery layer swapped. Library regression resolved end-to-end.


See also: - nspa/docs/architecture.md – full Wine-NSPA architecture reference - nspa/docs/msg-ring-memfd-redesign.md – implementation plan + phase notes - nspa/docs/io_uring-architecture.md – orthogonal I/O bypass layer