Wine-NSPA – Shmem IPC Architecture

Wine-NSPA 11.6 | Kernel 6.19.11-rt1-1-nspa (PREEMPT_RT) | 2026-04-15 | Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Upstream vs NSPA Comparison
  3. Dispatcher Architecture
  4. PI Boost Protocol (v2.5)
  5. Global Lock PI
  6. Appendix: Rejected FUTEX_LOCK_PI Redesign

1. Overview

Upstream Wine uses a single-threaded wineserver that communicates with client processes over Unix domain sockets. Every SERVER_START_REQ / SERVER_END_REQ pair requires a full round-trip: client writes request to socket, wineserver’s epoll loop wakes, dispatches, writes reply, client reads reply.

Wine-NSPA v1.5 (Torge Matthies forward-port) adds per-thread shared memory between each client thread and the wineserver. Instead of socket I/O, requests and replies are written to a shared page, and futexes signal readiness. The wineserver spawns a per-client dispatcher pthread that watches each thread’s futex and dispatches requests under global_lock.

This eliminates the socket round-trip but introduces two new challenges:

  1. The wineserver is now multi-threaded (dispatchers + main epoll loop), requiring global_lock serialization
  2. RT client threads can be blocked waiting for a reply from a normal-priority dispatcher, creating priority inversion


2. Upstream vs NSPA Comparison

Upstream Wine (socket IPC):

  Client: SERVER_START_REQ
    write() to Unix socket
  Wineserver (single-threaded):
    epoll_wait() -> fd ready
    dispatch request (no lock needed)
  Client:
    read() reply from socket
    SERVER_END_REQ

  Cost per server request:
    2 socket I/O syscalls (write + read)
    1 epoll wakeup + context switch to wineserver
  No multi-threading, no PI needed.
  But: every request pays the full socket round-trip.

Wine-NSPA (shmem IPC + PI):

  Client:
    write to shmem page
    CAS futex 0->1, wake
    PI boost dispatcher (v2.5)
  Dispatcher pthread (boosted to client's prio):
    global_lock.lock() (PI)
    dispatch + write reply
  Client:
    read from shmem
    PI unboost dispatcher

  Cost per server request:
    0 socket syscalls (shmem is mapped, no I/O)
    1 futex wake + 2 sched_setscheduler (PI boost/unboost)
  Aspect               | Upstream Wine                  | Wine-NSPA Shmem
  ---------------------|--------------------------------|-----------------------------------------------
  IPC mechanism        | Unix socket write/read         | Shared memory page + futex
  Server threading     | Single-threaded epoll loop     | Multi-threaded: epoll + per-client dispatchers
  Serialization        | None (single thread)           | global_lock (PI-aware pi_mutex_t)
  Syscalls per request | 2 socket I/O + epoll wake      | 1 futex wake + 2 sched_setscheduler
  Priority inversion   | Not applicable                 | Mitigated by PI boost (v2.5)
  Context switches     | Client -> wineserver -> client | Client -> dispatcher (same process)

3. Dispatcher Architecture

Each client thread that connects to the wineserver gets a dedicated dispatcher pthread on the server side. The dispatcher watches the thread’s shmem futex and processes requests under global_lock.

Wineserver Process – Per-Client Dispatcher Model

  Main epoll loop:
    epoll_pwait2() – fd events (file, socket), async lifecycle mgmt
    global_lock.lock() for each event

  Dispatcher pthread (thread 1):
    futex_wait(shmem->futex, 0)
    wakes -> global_lock.lock()
    dispatch(req) -> write reply
    CAS futex 1->0, wake client
  Dispatcher pthread (thread 2): same pattern, different shmem page
  Dispatcher pthread (thread N): one dispatcher per client thread

  global_lock (pi_mutex_t):
    Serializes all server state access
    FUTEX_LOCK_PI -> kernel rt_mutex
    PI: highest-prio dispatcher wins; holder boosted if contended

  Client processes:
    Client threads 1..N, each with its own shmem page + futex
    RT thread (SCHED_FIFO) PI boosts its dispatcher

Dispatcher Lifecycle

  1. Client thread calls wine_server_call() with a request
  2. Request data written to the thread’s shared memory page
  3. Client CAS’s the shmem futex from 0 -> 1, then futex_wake()
  4. Client PI-boosts the dispatcher (v2.5 protocol)
  5. Client futex_wait(futex, 1) – sleeps until reply
  6. Dispatcher wakes, acquires global_lock, dispatches the request
  7. Dispatcher writes reply to shmem, CAS futex 1 -> 0, futex_wake()
  8. Client wakes, reads reply, PI-unboosts the dispatcher

4. PI Boost Protocol (v2.5)

When an RT client thread (SCHED_FIFO) sends a request, it must boost the dispatcher pthread so the dispatcher runs at sufficient priority to process the request promptly. Without boosting, CFS could delay the dispatcher behind dozens of other normal-priority threads.

Protocol

Client (SCHED_FIFO:80):
  1. Write request to shmem
  2. CAS futex 0->1, futex_wake (wake dispatcher)
  3. Read dispatcher TID from shmem (atomic load, cached by dispatcher)
  4. Obtain original policy/prio – from TLS cache in v2.5 (sched_getscheduler + sched_getparam in v2.4)
  5. sched_setscheduler(TID, SCHED_FIFO, client_prio) – BOOST
  6. futex_wait(futex, 1) – sleep
Dispatcher (now boosted):
  7. Wakes at boosted priority
  8. global_lock.lock() (PI mutex – if contended, holder also boosted)
  9. Dispatch request, write reply
  10. CAS futex 1->0, futex_wake (wake client)
  11. global_lock.unlock()
Client (wakes):
  12. Read reply
  13. sched_setscheduler(TID, original_policy, original_prio) – UNBOOST

Syscall Cost: v2.4 vs v2.5

v2.4: 4 syscalls per RT request

  sched_getscheduler()        <- eliminated by v2.5
  sched_getparam()            <- eliminated by v2.5
  sched_setscheduler(BOOST)
    ... dispatch ...
  sched_setscheduler(UNBOOST)

  ~2-4us overhead (4 sched syscalls)

v2.5: 2 syscalls per RT request

  sched_setscheduler(BOOST)
    ... dispatch ...
  sched_setscheduler(UNBOOST)

  TLS cache: nspa_rt_cached_policy + nspa_rt_cached_prio, set once at
  thread RT init and read on every boost – eliminates the get* calls.

  ~1-2us overhead (2 sched syscalls)

Why not FUTEX_LOCK_PI? (attempted and REJECTED)

  The dispatcher sleeps on the notify futex, not a PI futex, so FUTEX_LOCK_PI
  on a separate word requires an unlock + re-acquire between dispatches. Under
  SMP contention this deadlocks: client A holds the PI lock, client B boosts
  the dispatcher, and the dispatcher blocks on the PI lock held by A. Manual
  boost avoids this by never creating lock dependencies between clients and
  the dispatcher.

2 syscalls per RT request: sched_setscheduler (boost) + sched_setscheduler (unboost). Down from 4 in v2.4 (v2.5 caches the scheduler state, eliminating sched_getscheduler + sched_getparam).

Race Window

Between steps 3 and 5, another client’s unboost could lower the dispatcher’s priority. The window is small (~100ns on modern hardware) and the consequence is a one-request delay (the next request re-boosts). Accepted as a practical trade-off vs kernel-managed PI (see appendix).


5. Global Lock PI

server/fd.c:global_lock serializes all wineserver state access between the main epoll loop and the per-client dispatcher pthreads. Converted from pthread_mutex_t to pi_mutex_t (FUTEX_LOCK_PI), providing kernel-managed priority inheritance.

When a boosted dispatcher (SCHED_FIFO:80) contends with a normal-priority thread holding global_lock, the kernel’s rt_mutex PI chain automatically boosts the holder. This is transitive: if the holder is itself blocked on another PI mutex, the boost propagates through the chain.

  Files Changed   | What
  ----------------|------------------------------------------------------
  server/fd.c     | pthread_mutex_t global_lock -> pi_mutex_t global_lock
  server/file.h   | Declaration + #include <rtpi.h>
  server/thread.c | All lock/unlock calls updated

6. Appendix: Rejected FUTEX_LOCK_PI Redesign

Status: Implemented and tested 2026-04-15. REJECTED – deadlocks on SMP.

Concept

Replace the manual sched_setscheduler PI boost with FUTEX_LOCK_PI on a shared pi_lock. The dispatcher would hold pi_lock while idle; the client’s futex_lock_pi would atomically boost the dispatcher through the kernel’s rt_mutex. Zero race window, zero sched_* syscalls.

Why It Failed

The dispatcher must unlock pi_lock (to wake the client) then re-acquire it (for the next request). On SMP, if the dispatcher is faster than the client:

  1. Dispatcher UNLOCK_PI – no waiters (client hasn’t blocked yet), futex cleared to 0
  2. Dispatcher LOCK_PI – re-acquires immediately (futex was 0)
  3. Dispatcher WAIT(notify) – sleeps, holding pi_lock
  4. Client LOCK_PI – blocks (dispatcher holds it)
  5. Deadlock: client waits for pi_lock, dispatcher waits for notify

Root cause: FUTEX_LOCK_PI can’t serve as both reply notification and PI mechanism. The unlock/re-acquire has a window where ownership transfer to the client isn’t guaranteed.

Conclusion

The v2.5 manual boost (2 syscalls per RT request) remains correct. A kernel-managed solution would require a combined notify+PI atomic operation that doesn’t exist in the Linux futex API.