Wine-NSPA – Hot-Path Optimizations

This page documents implementation-level optimizations that reduce Wine-side overhead without changing feature boundaries or public API shape, and the design choices behind them.

Table of Contents

  1. Overview
  2. Optimization classes
  3. Locality and published-state caching
  4. TEB-relative hot state
  5. Cache and slab layout
  6. Small-call removal on the wait path
  7. String and Unicode vectorization
  8. GUI and flush-path trims
  9. Current measured effect
  10. Related docs

1. Overview

Wine-NSPA carries several classes of optimizations that are not new bypass surfaces by themselves. They make existing fast paths cheaper: more local answers from already-published state, fewer libc TLS lookups, fewer cross-DSO helper calls, better cacheline and slab layout, and less pointless work on already-empty or already-local paths.

These optimizations matter because the remaining hot code paths are already short. Once a request or wait is mostly local, wrapper overhead becomes visible. That makes the “why this implementation choice?” question worth documenting, not just the “what got faster?” result.

Wine-NSPA also makes some deliberately narrower platform choices than upstream Wine. The project is Linux-only, and a subset of the newest hot-path carries is Linux-x86_64-specific. That is visible here: some optimizations exploit Linux futex, io_uring, and ntsync behavior generally, while others rely on the x86_64 TEB / GS-base setup specifically.


2. Optimization classes

| Class | Current use | Why this choice fits |
| --- | --- | --- |
| Locality and published-state caching | hook Tier 1+2, paint cache, the get_message empty-poll cache, thread/process shared-state readers, and zero-time wait polls answer locally once state is already published | these paths already had an authoritative shared state block, so the win is to reuse it instead of inventing another transport |
| TEB-relative state access | unix-side NtCurrentTeb is inline on Linux x86_64, get_thread_data() reads through a TEB backpointer, common thread/process/PEB helpers read from the TEB, and win32u msg-ring per-thread caches read through TEB->Win32ClientInfo | repeated thread-local helper calls were pure wrapper cost, so direct TEB reads preserve ownership while removing libc / PLT overhead |
| Cacheline and slab layout | struct inproc_sync entries are padded to one cacheline each, the LFH struct bin is cacheline-shaped, ntsync hot structs use dedicated caches, and the production kernel keeps those caches isolated with SLAB_NO_MERGE | the work was already concurrent and hot, so layout and allocator shaping reduce coherence and slab noise without changing behavior |
| Heap grow/shrink hysteresis | non-hugetlb heap commit and tail-decommit paths widen from 64 KiB to 1 MiB under the RT-keyed huge-arenas gate | these syscalls run under heap->cs, so amortizing them trims lock-held VM work without changing the Windows heap contract |
| Batching and burst drain | gamma TRY_RECV2 drains bursts after one aggregate-wait wake instead of paying one kernel round-trip per entry | request bursts are real, so the right optimization is to amortize wake cost rather than only shave single-request overhead |
| Small helper removal | ntdll_io_uring_flush_deferred() folds to an inline empty check when no deferred completions exist, the ring eventfd getter is inline, and NtGetTickCount() folds to one KUSER_SHARED_DATA load | once a helper becomes “usually empty” or “just return one TLS value,” the abstraction cost outweighs the abstraction value on the hot path |
| SIMD ASCII-burst loops | memicmp_strW, hash_strW, utf8_wcstombs, and utf8_mbstowcs use x86_64 AVX2 fast paths for all-ASCII windows while keeping a scalar fallback for mixed or non-ASCII windows | filenames, registry names, and object names are often ASCII-dominant, so vectorizing the common window harvests real cost without changing the Unicode contract |
| GUI / memory-copy tightening | flush throttling and the AVX2 X11 alpha-bit flush loop cut repeated GUI flush overhead without changing surface semantics | these are stable high-frequency loops, so throttling and vectorization fit better than architectural rewrites |

These are intentionally distinct from larger architectural features such as gamma, local-file, or shared-state readers. The feature pages explain the surfaces. This page explains the recurring optimization patterns that make those surfaces cheaper once they are already in place.


3. Locality and published-state caching

One optimization class shows up throughout Wine-NSPA: publish a small, authoritative, read-mostly state block once, then answer the common case locally until that state changes.

| Surface | Published state | Local win |
| --- | --- | --- |
| Hook cache | queue-shared hook metadata | no per-dispatch hook lookup RPC on the common path |
| Paint cache | queue-shared redraw state | repeated paint probes avoid needless server work |
| get_message empty-poll cache | filter tuple + queue_shm->nspa_change_seq | the same empty poll does not pay the same RPC twice |
| Thread / process shared-state | shared object snapshots with seqlock discipline | 7 thread query classes, 6 process query classes, and zero-time waits answer locally |
| Gamma thread-token return | kernel returns the registered sender token | the dispatcher avoids a second userspace thread lookup on each request |

The shapes differ, but the pattern is the same: publish the authoritative state once, answer the common case locally, and fall back to the server only when a change counter or seqlock shows the local answer is stale.

This is why so many of the project’s measurable wins come from “small” caches. The point is not speculative behavior. It is to stop paying for the same answer over and over when the state already exists locally.
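A minimal sketch of the read discipline behind these caches, assuming a seqlock-style change counter like the ones named above; the struct and field names here are illustrative, not the project’s actual declarations:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct published_state
{
    atomic_uint seq;      /* even = stable, odd = writer in progress */
    unsigned    payload;  /* the already-published answer */
};

/* Try to answer locally; returns false when the caller must fall back
 * to the server round-trip because the snapshot raced a writer. */
static bool read_published( struct published_state *st, unsigned *out )
{
    unsigned before = atomic_load_explicit( &st->seq, memory_order_acquire );
    if (before & 1) return false;               /* writer active right now */

    *out = st->payload;                         /* snapshot the payload */

    atomic_thread_fence( memory_order_acquire );
    unsigned after = atomic_load_explicit( &st->seq, memory_order_relaxed );
    return after == before;                     /* unchanged -> snapshot valid */
}
```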


4. TEB-relative hot state

The current x86_64 Unix-side hot path avoids repeated libc TLS lookups by reading thread-local Wine state from the TEB and adjacent per-thread structs.

TEB-relative hot state replaces repeated TLS helper calls:

- hot Wine-side callers: msg-ring, wait path, queue access, and signal helpers used to bounce through `NtCurrentTeb()` / `pthread_getspecific()`
- current x86_64 path: inline `NtCurrentTeb()` is a single `mov %gs:0x30, %reg` on Linux x86_64
- unix-side thread data backpointer: `get_thread_data()` reads via the `TEB->GdiTebBatch` extension
- win32u follow-on: the msg-ring per-thread caches `nspa_msg_cache` and `nspa_own_bypass` live in `Win32ClientInfo`; hot reads stay inside the TEB, and the slow path still registers destructor state when needed
- measured effect on the playback path: `NtCurrentTeb` function calls `9,961,441 -> 566` per 30 s; cumulative cycles after both carries `257.8B -> 212.4B` (`-17.6%`)

4.1 Inline NtCurrentTeb() on x86_64

On Linux x86_64, Wine-NSPA keeps GS_BASE = teb from thread startup, so most Unix-side callers can inline NtCurrentTeb() instead of paying a cross-DSO pthread_getspecific() call chain. This is intentionally narrower than upstream Wine’s portability envelope: it trades portability for a cheaper thread anchor on the platform Wine-NSPA actually targets.
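A minimal sketch of the inline anchor under that assumption (GS base already holds the TEB, and TEB->Self sits at offset 0x30 on x86_64); this shows the shape of the optimization, not a drop-in copy of the project’s header:

```c
typedef struct _TEB TEB;   /* full layout comes from Wine's winternl.h */

static inline TEB *nt_current_teb_inline( void )
{
    TEB *teb;
    /* one GS-relative load replaces the cross-DSO pthread_getspecific()
     * call chain; 0x30 is the x86_64 TEB->Self slot */
    __asm__( "movq %%gs:0x30, %0" : "=r" (teb) );
    return teb;
}
```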

Measured on a 30-second Ableton playback capture:

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| CPU cycles | 257.8B | 220.9B | -14.3% |
| Instructions | 309.1B | 269.7B | -12.7% |
| IPC | 1.20 | 1.221 | +1.7% |
| iTLB-load-misses | 242M | 185M | -23.4% |
| LLC-load-misses | 537M | 482M | -10.2% |
| NtCurrentTeb function calls / 30 s | 9,961,441 | 566 | -99.994% |

This is Linux-x86_64-specific by design. The public point is not the assembly detail. It is that a foundational thread-state accessor is no longer hot, and that Wine-NSPA is willing to use a Linux-x86_64-specific setup when the win is load-bearing on its target platform.

4.2 Msg-ring per-thread caches via the TEB

The msg-ring hot path reads both of its per-thread caches, `nspa_msg_cache` and `nspa_own_bypass`, from struct user_thread_info inside TEB->Win32ClientInfo.

That removes repeated pthread_getspecific() reads from the message path while keeping the destructor-bearing slow path intact for the peer cache. The choice here was not “invent a new cache.” It was “keep the current cache model, but move the hot lookup into the TEB.”
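A sketch of that shape, following the win32u convention of overlaying a per-thread struct on TEB->Win32ClientInfo; the field types below are placeholders, and the real struct carries the upstream user_thread_info fields as well:

```c
#include <winternl.h>   /* Wine's TEB with the Win32ClientInfo scratch area */

struct user_thread_info_sketch
{
    /* ... upstream user_thread_info fields elided ... */
    void *nspa_msg_cache;    /* msg-ring per-thread cache (placeholder type) */
    void *nspa_own_bypass;   /* own-queue bypass cache (placeholder type) */
};

static inline struct user_thread_info_sketch *get_user_thread_info_sketch( void )
{
    /* one TEB-relative read; no pthread_getspecific() on the hot path */
    return (struct user_thread_info_sketch *)NtCurrentTeb()->Win32ClientInfo;
}
```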

Measured on the same workload, on top of the inline NtCurrentTeb() carry:

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| CPU cycles | 220.9B | 212.4B | -3.84% |
| Instructions | 269.7B | 262.8B | -2.55% |
| iTLB-load-misses | 185M | 181M | -2.21% |
| pthread_getspecific self time | 0.46% | 0.09% | -80% |
| nspa_get_own_bypass_shm | 0.26% | 0.20% | -23% |
| get_shared_queue | 1.70% | 1.61% | -5% |

4.3 Inline process/thread/PEB/tick helpers on x86_64

The same x86_64 TEB foundation also shrinks a second layer of helper cost on the Unix side.

| Helper family | Current path | Why this works |
| --- | --- | --- |
| PsGetCurrentProcessId() / PsGetCurrentThreadId() | inline TEB-relative read via unix_private.h | ClientId and thread-local unix state are already published in the TEB path |
| RtlGetCurrentPeb() | inline TEB-relative read on the Unix side | avoids a separate out-of-line helper for a fixed per-thread pointer |
| GetCurrentProcessId() / GetCurrentThreadId() in WINE_UNIX_LIB | macro parity with the PE-side ClientId read | removes an extra unix-thread-data load from hot ntsync and server-call sites |
| NtGetTickCount() | one KUSER_SHARED_DATA::TickCount.LowPart load | avoids a PLT thunk and function frame on a call site measured at ~3.08M calls / 30 s |

These carries are small individually, but they all fit the same rule: once the TEB and KUSER_SHARED_DATA are already the authoritative source, the hot Unix path should read them directly instead of wrapping the same answer in another function call.
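A sketch of two of those folds, assuming Wine’s usual headers and layouts (ClientId already published in the TEB, KUSER_SHARED_DATA mapped at its fixed 0x7ffe0000 user address with TickCount at offset 0x320); the real NtGetTickCount may also apply the TickCountMultiplier scaling, elided here:

```c
#include <windows.h>
#include <winternl.h>

static inline DWORD get_current_thread_id_inline( void )
{
    /* the TID is already published in TEB->ClientId; no helper call needed */
    return (DWORD)(ULONG_PTR)NtCurrentTeb()->ClientId.UniqueThread;
}

/* minimal stand-in for the one KUSER_SHARED_DATA field used here */
struct kuser_shared_min
{
    char pad[0x320];
    volatile ULONG TickCountLow;   /* KUSER_SHARED_DATA::TickCount.LowPart */
};

static inline ULONG nt_get_tick_count_inline( void )
{
    /* one load from the always-mapped shared page instead of a PLT thunk
     * and a function frame */
    const struct kuser_shared_min *shared = (const void *)0x7ffe0000;
    return shared->TickCountLow;
}
```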


5. Cache and slab layout

The current inproc_sync cache is optimized for concurrent hot waits and signals, not just for compact storage. The same optimization class also shows up in the kernel overlay: the hot ntsync allocation classes live in dedicated caches, and the production kernel keeps those caches isolated.

`inproc_sync` cache: one entry per cacheline, original handle capacity restored:

- old layout: 16-byte entries packed 4 per 64-byte cacheline; unrelated waits/signals still shared one line; every refcount `LOCK` op invalidated peers on other CPUs; the hot cost showed up as distributed coherence pressure
- current layout: 64-byte aligned entries, one cacheline each; different handles no longer false-share refcount traffic; the same layout was retained after the capacity restore, with the block size widened to keep `524288` cacheable handles
- current public contract: faster concurrent wait/signal traffic with the same behavior; capacity still stays at `524288` handles after the block-size increase

5.1 Userspace inproc_sync layout

The first layout carry padded struct inproc_sync to one cacheline so refcount LOCK traffic no longer ping-ponged unrelated handles on the same line. The follow-on widened each cache block from 64 KiB to 256 KiB so the total cached handle capacity stayed at 524288 after the padding change.

This is a pure internal layout change. It does not alter the handle protocol or the wait/signal API surface.
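A sketch of the layout rule, with illustrative field names rather than the project’s actual struct: align each entry to one cacheline so per-handle refcount LOCK traffic cannot false-share, then size blocks so the total capacity stays at 524288 handles:

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE 64

struct inproc_sync_entry
{
    int32_t  refcount;   /* the hot LOCK inc/dec target */
    int32_t  type;
    uint64_t state;
    /* alignment pads the rest of the line */
} __attribute__((aligned(CACHELINE)));

static_assert( sizeof(struct inproc_sync_entry) == CACHELINE,
               "one entry per cacheline" );

/* 256 KiB blocks hold 4096 one-line entries each, so 128 blocks keep the
 * original 524288-handle capacity after the padding change */
#define BLOCK_SIZE        (256 * 1024)
#define ENTRIES_PER_BLOCK (BLOCK_SIZE / sizeof(struct inproc_sync_entry))
```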


5.2 Kernel-side ntsync cache shaping

The same “shape the allocator around the hot object” pattern also exists in the kernel overlay:

| Kernel-side optimization | Effect |
| --- | --- |
| dedicated kmem_caches for hot ntsync objects | hot small allocations stop competing with unrelated slab users |
| SLAB_HWCACHE_ALIGN | hot fields land on cacheline-friendly boundaries |
| dedicated ntsync_wait_q cache | common wait objects stop using the generic path |
| SLAB_NO_MERGE on all four ntsync caches | cache isolation remains true on the production kernel, not just in theory |

These are not user-visible features, but they matter to the same workloads the userspace cacheline work targets: lots of short waits, signals, and channel operations on PREEMPT_RT under real contention.
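A kernel-side sketch of that shaping; the cache name matches the table above, the object fields are placeholders, and SLAB_HWCACHE_ALIGN plus SLAB_NO_MERGE are the flags doing the real work:

```c
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

struct ntsync_wait_q_entry
{
    /* hot wait-queue fields elided */
    int placeholder;
};

static struct kmem_cache *ntsync_wait_q_cache;

static int __init ntsync_cache_init(void)
{
    ntsync_wait_q_cache = kmem_cache_create(
        "ntsync_wait_q",
        sizeof(struct ntsync_wait_q_entry),
        0,                                   /* natural alignment */
        SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE,  /* cacheline-friendly, never merged */
        NULL);                               /* no constructor */
    return ntsync_wait_q_cache ? 0 : -ENOMEM;
}
```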


5.3 LFH cachelines and heap commit/decommit shaping

The same layout-first rule also applies to Wine’s internal heap machinery.

Heap-path shaping: less false sharing, fewer VM syscalls under `heap->cs`:

- LFH bin counters: `count_alloc`, `count_freed`, `enabled`, and group counters; adjacent bins used to false-share atomic updates
- cacheline-shaped bins: `struct bin` is `DECLSPEC_ALIGN(64)` and `sizeof % 64 == 0`, so different size classes stop cross-invalidating one another
- VM syscall amortization: under the RT huge-arenas gate, commit/decommit hysteresis widens from `64 KiB` to `1 MiB` on non-hugetlb subheaps
- common result: less coherence traffic on concurrent LFH metadata and fewer `MEM_COMMIT` / `MEM_DECOMMIT` syscalls while the heap lock is held

Two carries landed together here:

| Carry | Current behavior | Why it matters |
| --- | --- | --- |
| LFH bin cacheline padding | struct bin is DECLSPEC_ALIGN(64), so adjacent LFH size classes stop false-sharing atomic counters | the heap hot path has the same “small counters, many threads” shape as inproc_sync, so cacheline isolation helps for the same reason |
| Commit/decommit hysteresis | under the RT-keyed huge-arenas gate, non-hugetlb subheaps widen commit and tail-decommit hysteresis from 64 KiB to 1 MiB | NtAllocateVirtualMemory(MEM_COMMIT) and NtFreeVirtualMemory(MEM_DECOMMIT) run under heap->cs; a larger grain amortizes those syscalls across more alloc/free traffic |

This does not change the Windows heap contract. It changes the internal grain at which Wine amortizes VM work on paths that were already legal to over-keep.
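A sketch of the grain rule, with hypothetical helper names: requests round up to a hysteresis grain, and the grain widens from 64 KiB to 1 MiB only on non-hugetlb subheaps under the RT-keyed gate:

```c
#include <stdbool.h>
#include <stddef.h>

#define GRAIN_DEFAULT   (64 * 1024)     /* historical hysteresis grain */
#define GRAIN_RT_ARENAS (1024 * 1024)   /* widened grain under the gate */

static size_t commit_grain( bool rt_huge_arenas_gate, bool hugetlb_subheap )
{
    /* hugetlb subheaps keep their own sizing; only non-hugetlb paths widen */
    if (rt_huge_arenas_gate && !hugetlb_subheap) return GRAIN_RT_ARENAS;
    return GRAIN_DEFAULT;
}

static size_t round_commit( size_t want, size_t grain )
{
    /* grain is a power of two; over-keep up to one grain so fewer
     * MEM_COMMIT / MEM_DECOMMIT calls run while heap->cs is held */
    return (want + grain - 1) & ~(grain - 1);
}
```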


6. Small-call removal on the wait path

Some of the remaining hot-path cost was not algorithmic at all. It was helper overhead on paths that were already almost always empty or already local.

Small helper overhead removed from the steady-state wait path:

- `ntdll_io_uring_flush_deferred()`: the current steady state has no deferred-queue users, so an inline check folds away the empty fast path; wait-path result: audio-path no-op helper cost removed, measured at `0.82%` of audio-thread time before the carry
- `ntdll_io_uring_get_eventfd()`: the getter reads the TLS ring eventfd inline; measured helper self-time `0.15%` before the carry
- common theme: once the real work is already local, tiny empty or one-value helpers become visible enough to inline away

These are small on their own. Together they keep the wait path from carrying old scaffold cost after the architectural reason for that scaffold has gone.
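A sketch of the fold pattern, with hypothetical internals for the deferred list: the empty check moves inline, so the steady-state wait path pays one load and a predictable branch instead of an out-of-line call:

```c
/* hypothetical names; only the inline-empty-check shape reflects the carry */
extern struct deferred_completion *io_uring_deferred_head;
extern void ntdll_io_uring_flush_deferred_slow( void );

static inline void ntdll_io_uring_flush_deferred( void )
{
    /* steady state: no deferred completions, so this folds to one load
     * and a not-taken branch at every wait-path call site */
    if (!io_uring_deferred_head) return;
    ntdll_io_uring_flush_deferred_slow();
}
```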


7. String and Unicode vectorization

The current x86_64 AVX2 carries also trim a different class of hot loop: short, repeated string and Unicode helpers that sit on the path-resolution, registry, object-name, and locale-conversion surfaces. These are not new features. They are implementation-level reductions in per-call cost on paths that are already semantically local.

x86_64 AVX2 fast paths keep ASCII bursts vectorized and edge cases scalar:

- server/unicode hot loops: `memicmp_strW` uses a 16-WCHAR ASCII compare + fold window, and `hash_strW` uses an 8-WCHAR weighted Horner window; common call sites: object names, registry traversal, handle lookup; non-ASCII windows fall back to scalar `to_lower()` logic
- ntdll locale helpers: `utf8_wcstombs` does a 16 WCHAR -> 16 byte ASCII burst, and `utf8_mbstowcs` does a 16 byte -> 16 WCHAR ASCII burst; common call sites: NT path conversion, registry names, section names; multi-byte UTF-8 and surrogate cases stay on the scalar path
- common fast-path rule: if the window is all ASCII and large enough, vectorize the whole block and advance
- contract rule: mixed/non-ASCII windows do not reinterpret semantics; they reuse the existing scalar path unchanged

7.1 Server Unicode compare and hash

The wineserver name path now has two x86_64 AVX2 ASCII-window carries in server/unicode.c:

| Helper | AVX2 fast window | Scalar reuse |
| --- | --- | --- |
| memicmp_strW | 16 WCHARs at a time, ASCII-only window, SIMD case-fold and compare | short strings and any non-ASCII window reuse the scalar to_lower() compare |
| hash_strW | 8 WCHARs at a time, ASCII-only window, weighted Horner unroll with vector multiply | short strings and any non-ASCII window reuse the scalar Horner loop |

These helpers are hot because object-name and registry paths repeatedly compare and hash short Unicode names. The vectorization is deliberately narrower than “SIMD all Unicode”: it only harvests the ASCII-dominant windows and preserves the older scalar path for the rest.
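A sketch of the ASCII-window technique for the case-insensitive compare, under the same rules the table describes (16 WCHARs per step, fold only 'A'..'Z', bail to scalar on any non-ASCII lane or mismatch); this illustrates the pattern, not the project’s actual memicmp_strW:

```c
#include <immintrin.h>
#include <stddef.h>

typedef unsigned short WCHAR16;   /* Wine-style 16-bit WCHAR */

/* Consume equal all-ASCII 16-WCHAR windows; returns the count left for the
 * scalar to_lower() path, advancing *pa / *pb past the proven-equal prefix. */
static size_t memicmp_ascii_burst( const WCHAR16 **pa, const WCHAR16 **pb, size_t len )
{
    const __m256i non_ascii_bits = _mm256_set1_epi16( (short)0xff80 );
    const __m256i below_A        = _mm256_set1_epi16( 'A' - 1 );
    const __m256i above_Z        = _mm256_set1_epi16( 'Z' + 1 );
    const __m256i fold           = _mm256_set1_epi16( 0x20 );

    while (len >= 16)
    {
        __m256i va = _mm256_loadu_si256( (const __m256i *)*pa );
        __m256i vb = _mm256_loadu_si256( (const __m256i *)*pb );

        /* all-ASCII check: every lane must satisfy (w & 0xff80) == 0 */
        __m256i bad = _mm256_or_si256( _mm256_and_si256( va, non_ascii_bits ),
                                       _mm256_and_si256( vb, non_ascii_bits ) );
        if (!_mm256_testz_si256( bad, bad )) break;   /* mixed window: scalar */

        /* fold 'A'..'Z' to lowercase by adding 0x20 where in range */
        __m256i ua = _mm256_and_si256( _mm256_cmpgt_epi16( va, below_A ),
                                       _mm256_cmpgt_epi16( above_Z, va ) );
        __m256i ub = _mm256_and_si256( _mm256_cmpgt_epi16( vb, below_A ),
                                       _mm256_cmpgt_epi16( above_Z, vb ) );
        va = _mm256_add_epi16( va, _mm256_and_si256( ua, fold ) );
        vb = _mm256_add_epi16( vb, _mm256_and_si256( ub, fold ) );

        /* any lane mismatch: let the scalar path find and report it */
        __m256i eq = _mm256_cmpeq_epi16( va, vb );
        if ((unsigned)_mm256_movemask_epi8( eq ) != 0xffffffffu) break;

        *pa += 16; *pb += 16; len -= 16;
    }
    return len;   /* remainder runs on the existing scalar compare */
}
```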

Synthetic ASCII-path measurements recorded with the carry:

| Helper | Before | After | Delta |
| --- | --- | --- | --- |
| memicmp_strW (50 WCHAR ASCII) | ~250 cycles | ~12 cycles | ~20x |
| hash_strW (50 WCHAR ASCII) | ~150 cycles | ~38 cycles | ~4x |

7.2 Locale conversion helpers

The Unix-side locale helpers in dlls/ntdll/locale_private.h now have matching x86_64 AVX2 ASCII-burst paths:

| Helper | AVX2 fast window | Scalar reuse |
| --- | --- | --- |
| utf8_wcstombs | 16 WCHARs detected as ASCII, packed to 16 bytes and stored in one burst | short buffers, non-ASCII WCHARs, and surrogate pairs remain scalar |
| utf8_mbstowcs | 16 source bytes detected as ASCII, zero-extended to 16 WCHARs and stored in one burst | short buffers, multi-byte UTF-8, and invalid UTF-8 remain scalar |

These conversions sit on every PE-to-Unix and Unix-to-PE path/name boundary. The hot AVX2 carry is intentionally Unix-side only: dlls/ntdll/unix/env.c keeps the vectorized path, while PE-side dlls/ntdll/locale.c remains scalar because the PE build cannot use the runtime __builtin_cpu_supports("avx2") probe without dragging in an unresolved __cpu_model dependency. That still matches the hot path shape: the expensive callers are on the Unix side.
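A sketch of the 16-byte-to-16-WCHAR ASCII burst, gated the way the carry is (a runtime __builtin_cpu_supports("avx2") probe on the Unix side); this illustrates the window rule rather than reproducing the project’s utf8_mbstowcs:

```c
#include <immintrin.h>
#include <stddef.h>

typedef unsigned short WCHAR16;

/* Widen proven-ASCII 16-byte windows in single bursts; returns how many
 * units were converted so the scalar UTF-8 decoder can continue from there. */
static size_t utf8_ascii_burst_to_wchar( WCHAR16 *dst, const char *src, size_t len )
{
    size_t done = 0;
    while (len - done >= 16)
    {
        __m128i bytes = _mm_loadu_si128( (const __m128i *)(src + done) );

        /* any high bit set means multi-byte UTF-8: stop, scalar takes over */
        if (_mm_movemask_epi8( bytes )) break;

        /* zero-extend 16 bytes to 16 WCHARs, one 32-byte store per window */
        _mm256_storeu_si256( (__m256i *)(dst + done),
                             _mm256_cvtepu8_epi16( bytes ) );
        done += 16;
    }
    return done;
}
```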

Synthetic ASCII-path measurements recorded with the carries:

| Helper | Before | After | Delta |
| --- | --- | --- | --- |
| utf8_wcstombs (200-byte ASCII path) | ~500 cycles | ~25 cycles | ~20x |
| utf8_mbstowcs (200-byte ASCII path) | ~1000 cycles | ~40 cycles | ~25x |

The architectural point is the same as the TEB carries: once the remaining hot path is dominated by wrapper work on an already-local operation, a narrow platform-specific implementation can be the right trade.


8. GUI and flush-path trims

Two GUI-side optimizations remain part of the current baseline:

| Optimization | Before | After | Delta |
| --- | --- | --- | --- |
| x11drv_surface_flush throttle | 8.23% | 4.74% | -43% |
| copy_rect_32 memmove | 4.38% | 2.49% | -43% |
| x11drv_surface_flush AVX2 | 6.72% | 2.39% | -4.33pp / -64% |
| total winex11.so after AVX2 | 6.76% | 2.43% | -4.33pp |

These are feature-adjacent because they live on the GUI path, but they are still optimizations rather than new surfaces. The relevant architecture is unchanged; the hot implementation is just cheaper.
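The throttle half of that table can be sketched as a simple rate gate; the interval policy and names here are hypothetical, and only the shape (skip redundant flushes inside a budget, keep the damage pending) reflects the carry:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Returns true when this caller should perform the real flush; otherwise
 * the damage stays pending and coalesces into the next flush. */
static bool flush_due( uint64_t *last_flush_ns, uint64_t min_interval_ns )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    uint64_t now = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;

    if (now - *last_flush_ns < min_interval_ns)
        return false;            /* throttle: a flush already ran recently */

    *last_flush_ns = now;
    return true;
}
```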


9. Current measured effect

The current optimization stack is measured with a three-part profiling pass on the same workload window: a hardware-counter diff, an Nt* dispatcher-entry distribution, and a resolved callgraph capture (sections 9.1 through 9.3 below).

For the most recent x86_64 inline + AVX2 bundle, the comparison window is:

| Capture | Baseline | After |
| --- | --- | --- |
| workload | Ableton project + browse + plugin + steady playback | same workload shape |
| window | 2026-05-09 12:06 | 2026-05-10 18:00 |
| bundle scope | n/a (pre-bundle) | inline current-thread/current-process/PEB/tick helpers plus AVX2 memicmp_strW, hash_strW, and Unix-side utf8_wcstombs / utf8_mbstowcs |
2026-05-10 inline + AVX2 bundle: why the carries matter together:

- bundle inputs: inline current-thread/current-process/PEB/tick helpers, the inline `NtGetTickCount()` and TEB-relative helper collapse, and the AVX2 ASCII-window compare, hash, and UTF conversion loops, all on the already-local hot paths
- locality effect: fewer helper frames and indirect branches; hot code and hot state stay in tighter cache windows; same contracts, less wrapper work around them
- measured result: user samples `97K -> 86K` (`-11.3%`); iTLB `229.7M -> 180.8M` (`-21.3%`); dTLB `51.5M -> 42.4M` (`-17.7%`); branch-miss `348.3M -> 308.4M` (`-11.45%`)
- load-bearing read: this bundle is not "one fast function"; it is a compound locality win across the same call graph, and the counter signature is tighter instruction-TLB use, tighter data-TLB use, fewer branches, and lower user CPU

9.1 Current triplet-diff result

The newest bundle confirms the main point of this page: once the dominant work is already local, shaving helper layers and tightening hot loops compounds.

| Counter | Baseline | Post-bundle | Delta |
| --- | --- | --- | --- |
| cpu-cycles | 227.0B | 223.1B | -1.73% |
| instructions | 273.3B | 269.1B | -1.55% |
| iTLB-load-misses | 229.7M | 180.8M | -21.30% |
| dTLB-load-misses | 51.5M | 42.4M | -17.69% |
| dTLB-store-misses | 16.9M | 14.8M | -12.49% |
| branch-misses | 348.3M | 308.4M | -11.45% |
| cache-references | 3.87B | 3.49B | -9.73% |
| cache-misses | 2.11B | 1.99B | -5.80% |
| LLC-load-misses | 499.3M | 491.7M | -1.52% |
| LLC-store-misses | 519.1M | 550.0M | +5.97% |
| context-switches | 1,463K | 1,417K | -3.19% |
| cpu-migrations | 144,103 | 133,448 | -7.39% |
| page-faults | 130,349 | 71,754 | -44.95% |
| IPC | 1.204 | 1.206 | flat |

The key read is not any single micro-benchmark number. It is the compound signature:

- cycles and instructions fall modestly (-1.73% and -1.55%) while IPC stays flat
- both TLB sides tighten sharply: iTLB -21.30%, dTLB loads -17.69%, dTLB stores -12.49%
- branch misses fall -11.45% and cache references fall -9.73%
- page faults nearly halve (-44.95%) on the same workload shape

That is what “same work in less of everything” looks like for this kind of bundle.

9.2 Dispatcher-entry confirmation

The Nt* distribution capture confirms that the helper inlining is visible at the entrypoint level, not only in synthetic loops.

| Nt entry | Baseline | Post-bundle | Delta |
| --- | --- | --- | --- |
| NtGetTickCount | 3,081,551 | 0 (absent) | -100% |
| NtSetEvent | 3,269,629 | 5,586,449 | +70.9% |
| NtQueryPerformanceCounter | 3,075,360 | 5,392,023 | +75.3% |
| NtWaitForMultipleObjects | 3,071,442 | 5,387,977 | +75.4% |
| NtResetEvent | 175,029 | 174,835 | flat |
| NtWaitForSingleObject | 170,050 | 170,172 | flat |
| NtQuerySystemTime | 58,676 | 58,997 | flat |
| NtFlushInstructionCache | 13,764 | 16,380 | +19.0% |
| NtCurrentTeb fallback | 448 | 548 | both ~negligible |

NtGetTickCount dropping from 3,081,551 to 0 is the clearest end-to-end confirmation in the set: the inline path is not theoretical, it has removed the dispatcher-visible entry entirely on this workload.

The large rises on NtSetEvent, NtQueryPerformanceCounter, and NtWaitForMultipleObjects are most likely workload-phase differences between the two captures. If they do reflect real extra traffic, the fact that cycles and user-mode samples still fall means the per-entry cost is lower, not higher.

9.3 Callgraph read

The callgraph view turns the same result into a user-CPU number: user-mode samples fall from 97K to 86K (-11.3%) over the same capture window.

Post-bundle, the NSPA-local fast-path surface is easier to see directly in the resolved top symbols.

That matters because the relative percentages on untouched shared symbols can be misleading once the denominator falls. Symbols such as libc bulk-copy helpers, apply_alpha_bits_avx2, or entry_SYSCALL_64 may rise in share even when their absolute weight is flat, simply because the total user sample pool got smaller.

9.4 Bundle interpretation

Taken together, the newest hot-path carries changed three important things:

- foundational per-thread accessors (current TEB, ClientId, PEB, tick count) are no longer hot functions; they are inline reads of state the TEB and KUSER_SHARED_DATA already publish
- the remaining hot string and Unicode loops now harvest their ASCII-dominant windows with AVX2 while the scalar contract path stays unchanged
- the counter signature (iTLB, dTLB, branch misses, user samples) shows a compound locality win across the call graph rather than one fast function

That is why these changes belong together even though they touch different files. The common result is lower wrapper cost around work that was already largely local.