This page documents implementation-level optimizations that reduce Wine-side overhead without changing feature boundaries or public API shape, and the design choices behind them.
Wine-NSPA carries several classes of optimizations that are not new bypass surfaces by themselves. They make existing fast paths cheaper: more local answers from already-published state, fewer libc TLS lookups, fewer cross-DSO helper calls, better cacheline and slab layout, and less pointless work on already-empty or already-local paths.
These optimizations matter because the remaining hot code paths are already short. Once a request or wait is mostly local, wrapper overhead becomes visible. That makes the “why this implementation choice?” question worth documenting, not just the “what got faster?” result.
Wine-NSPA also makes some deliberately narrower platform choices than upstream
Wine. The project is Linux-only, and a subset of the newest hot-path carries is
Linux-x86_64-specific. That is visible here: some optimizations exploit Linux
futex, io_uring, and ntsync behavior generally, while others rely on the
x86_64 TEB / GS-base setup specifically.
| Class | Current use | Why this choice fits |
|---|---|---|
| Locality and published-state caching | hook Tier 1+2, paint cache, get_message empty-poll cache, thread/process shared-state readers, and zero-time wait polls answer locally once state is already published | these paths already had an authoritative shared state block, so the win is to reuse it instead of inventing another transport |
| TEB-relative state access | unix-side NtCurrentTeb is inline on Linux x86_64, get_thread_data() reads through a TEB backpointer, common thread/process/PEB helpers read from the TEB, and win32u msg-ring per-thread caches read through TEB->Win32ClientInfo | repeated thread-local helper calls were pure wrapper cost, so direct TEB reads preserve ownership while removing libc / PLT overhead |
| Cacheline and slab layout | struct inproc_sync entries are padded to one cacheline each, LFH struct bin is cacheline-shaped, ntsync hot structs use dedicated caches, and the production kernel keeps those caches isolated with SLAB_NO_MERGE | the work was already concurrent and hot, so layout and allocator shaping reduce coherence and slab noise without changing behavior |
| Heap grow/shrink hysteresis | non-hugetlb heap commit and tail-decommit paths widen from 64 KiB to 1 MiB under the RT-keyed huge-arenas gate | these syscalls run under heap->cs, so amortizing them trims lock-held VM work without changing the Windows heap contract |
| Batching and burst drain | gamma TRY_RECV2 drains bursts after one aggregate-wait wake instead of paying one kernel round-trip per entry | request bursts are real, so the right optimization is to amortize wake cost rather than only shave single-request overhead |
| Small helper removal | ntdll_io_uring_flush_deferred() folds to an inline empty check when no deferred completions exist, the ring eventfd getter is inline, and NtGetTickCount() folds to one KUSER_SHARED_DATA load | once a helper becomes “usually empty” or “just return one TLS value,” the abstraction cost outweighs the abstraction value on the hot path |
| SIMD ASCII-burst loops | memicmp_strW, hash_strW, utf8_wcstombs, and utf8_mbstowcs use x86_64 AVX2 fast paths for all-ASCII windows while keeping scalar fallback for mixed or non-ASCII windows | filenames, registry names, and object names are often ASCII-dominant, so vectorizing the common window harvests real cost without changing the Unicode contract |
| GUI / memory-copy tightening | flush throttling and the AVX2 X11 alpha-bit flush loop cut repeated GUI flush overhead without changing surface semantics | these are stable high-frequency loops, so throttling and vectorization fit better than architectural rewrites |
These are intentionally distinct from larger architectural features such as gamma, local-file, or shared-state readers. The feature pages explain the surfaces. This page explains the recurring optimization patterns that make those surfaces cheaper once they are already in place.
One optimization class shows up throughout Wine-NSPA: publish a small, authoritative, read-mostly state block once, then answer the common case locally until that state changes.
| Surface | Published state | Local win |
|---|---|---|
| Hook cache | queue-shared hook metadata | no per-dispatch hook lookup RPC on the common path |
| Paint cache | queue-shared redraw state | repeated paint probes avoid needless server work |
| get_message empty-poll cache | filter tuple + queue_shm->nspa_change_seq | same empty poll does not pay the same RPC twice |
| Thread / process shared-state | shared object snapshots with seqlock discipline | 7 thread query classes, 6 process query classes, and zero-time waits answer locally |
| Gamma thread-token return | kernel returns the registered sender token | dispatcher avoids a second userspace thread lookup on each request |
The shapes differ, but the pattern is the same: publish the state once, answer locally while it is unchanged, and fall back to the server only when the published state says the local answer may be stale.
This is why so many of the project’s measurable wins come from “small” caches. The point is not speculative behavior. It is to stop paying for the same answer over and over when the state already exists locally.
The current x86_64 Unix-side hot path avoids repeated libc TLS lookups by reading thread-local Wine state from the TEB and adjacent per-thread structs.
### `NtCurrentTeb()` on x86_64

On Linux x86_64, Wine-NSPA keeps `GS_BASE = teb` from thread startup, so most Unix-side callers can inline `NtCurrentTeb()` instead of paying a cross-DSO `pthread_getspecific()` call chain. This is intentionally narrower than upstream Wine’s portability envelope: it trades portability for a cheaper thread anchor on the platform Wine-NSPA actually targets.
Measured on a 30-second Ableton playback capture:
| Metric | Before | After | Delta |
|---|---|---|---|
| CPU cycles | 257.8B | 220.9B | -14.3% |
| Instructions | 309.1B | 269.7B | -12.7% |
| IPC | 1.20 | 1.221 | +1.7% |
| iTLB-load-misses | 242M | 185M | -23.4% |
| LLC-load-misses | 537M | 482M | -10.2% |
| NtCurrentTeb function calls / 30 s | 9,961,441 | 566 | -99.994% |
This is Linux-x86_64-specific by design. The public point is not the assembly detail. It is that a foundational thread-state accessor is no longer hot, and that Wine-NSPA is willing to use a Linux-x86_64-specific setup when the win is load-bearing on its target platform.
The msg-ring hot path reads both of its per-thread caches from `struct user_thread_info` inside `TEB->Win32ClientInfo`:

- `nspa_msg_cache`
- `nspa_own_bypass`

That removes repeated `pthread_getspecific()` reads from the message path while keeping the destructor-bearing slow path intact for the peer cache. The choice here was not “invent a new cache.” It was “keep the current cache model, but move the hot lookup into the TEB.”
Measured on the same workload, on top of the inline NtCurrentTeb() carry:
| Metric | Before | After | Delta |
|---|---|---|---|
| CPU cycles | 220.9B | 212.4B | -3.84% |
| Instructions | 269.7B | 262.8B | -2.55% |
| iTLB-load-misses | 185M | 181M | -2.21% |
| pthread_getspecific self time | 0.46% | 0.09% | -80% |
| nspa_get_own_bypass_shm | 0.26% | 0.20% | -23% |
| get_shared_queue | 1.70% | 1.61% | -5% |
The same x86_64 TEB foundation also shrinks a second layer of helper cost on the Unix side.
| Helper family | Current path | Why it fits |
|---|---|---|
| PsGetCurrentProcessId() / PsGetCurrentThreadId() | inline TEB-relative read via unix_private.h | ClientId and thread-local unix state are already published in the TEB path |
| RtlGetCurrentPeb() | inline TEB-relative read on the Unix side | avoids a separate out-of-line helper for a fixed per-thread pointer |
| GetCurrentProcessId() / GetCurrentThreadId() in WINE_UNIX_LIB | macro parity with the PE-side ClientId read | removes an extra unix-thread-data load from hot ntsync and server-call sites |
| NtGetTickCount() | one KUSER_SHARED_DATA::TickCount.LowPart load | avoids a PLT thunk and function frame on a call site measured at ~3.08M calls / 30 s |
These carries are small individually, but they all fit the same rule: once the
TEB and KUSER_SHARED_DATA are already the authoritative source, the hot Unix
path should read them directly instead of wrapping the same answer in another
function call.
The current inproc_sync cache is optimized for concurrent hot waits and
signals, not just for compact storage. The same optimization class also shows
up in the kernel overlay: the hot ntsync allocation classes live in
dedicated caches, and the production kernel keeps those caches isolated.
### `inproc_sync` layout

The first layout carry padded `struct inproc_sync` to one cacheline so refcount LOCK traffic no longer ping-ponged unrelated handles on the same line. The follow-on widened each cache block from 64 KiB to 256 KiB so the total cached handle capacity stayed at 524288 after the padding change.
This is a pure internal layout change. It does not alter the handle protocol or the wait/signal API surface.
The same “shape the allocator around the hot object” pattern also exists in the kernel overlay:
| Kernel-side optimization | Effect |
|---|---|
| dedicated kmem_caches for hot ntsync objects | hot small allocations stop competing with unrelated slab users |
| SLAB_HWCACHE_ALIGN | hot fields land on cacheline-friendly boundaries |
| dedicated ntsync_wait_q cache | common wait objects stop using the generic path |
| SLAB_NO_MERGE on all four ntsync caches | cache isolation remains true on the production kernel, not just in theory |
These are not user-visible features, but they matter to the same workloads the userspace cacheline work targets: lots of short waits, signals, and channel operations on PREEMPT_RT under real contention.
The same layout-first rule also applies to Wine’s internal heap machinery.
Two carries landed together here:
| Carry | Current behavior | Why it matters |
|---|---|---|
| LFH bin cacheline padding | struct bin is DECLSPEC_ALIGN(64) so adjacent LFH size classes stop false-sharing atomic counters | the heap hot path has the same “small counters, many threads” shape as inproc_sync, so cacheline isolation helps for the same reason |
| Commit/decommit hysteresis | under the RT-keyed huge-arenas gate, non-hugetlb subheaps widen commit and tail-decommit hysteresis from 64 KiB to 1 MiB | NtAllocateVirtualMemory(MEM_COMMIT) and NtFreeVirtualMemory(MEM_DECOMMIT) run under heap->cs; a larger grain amortizes those syscalls across more alloc/free traffic |
This does not change the Windows heap contract. It changes the internal grain at which Wine amortizes VM work on paths that were already legal to over-keep.
Some of the remaining hot-path cost was not algorithmic at all. It was helper overhead on paths that were already almost always empty or already local: ntdll_io_uring_flush_deferred() folding to an inline empty check, the inline ring eventfd getter, and the single-load NtGetTickCount(). These are small on their own. Together they keep the wait path from carrying old scaffold cost after the architectural reason for that scaffold has gone.
The current x86_64 AVX2 carries also trim a different class of hot loop: short, repeated string and Unicode helpers that sit on the path-resolution, registry, object-name, and locale-conversion surfaces. These are not new features. They are implementation-level reductions in per-call cost on paths that are already semantically local.
The wineserver name path now has two x86_64 AVX2 ASCII-window carries in
server/unicode.c:
| Helper | AVX2 fast window | Scalar reuse |
|---|---|---|
| memicmp_strW | 16 WCHARs at a time, ASCII-only window, SIMD case-fold and compare | short strings and any non-ASCII window reuse the scalar to_lower() compare |
| hash_strW | 8 WCHARs at a time, ASCII-only window, weighted Horner unroll with vector multiply | short strings and any non-ASCII window reuse the scalar Horner loop |
These helpers are hot because object-name and registry paths repeatedly compare and hash short Unicode names. The vectorization is deliberately narrower than “SIMD all Unicode”: it only harvests the ASCII-dominant windows and preserves the older scalar path for the rest.
Synthetic ASCII-path measurements recorded with the carry:
| Helper | Before | After | Delta |
|---|---|---|---|
| memicmp_strW (50 WCHAR ASCII) | ~250 cycles | ~12 cycles | ~20x |
| hash_strW (50 WCHAR ASCII) | ~150 cycles | ~38 cycles | ~4x |
The Unix-side locale helpers in dlls/ntdll/locale_private.h now have matching
x86_64 AVX2 ASCII-burst paths:
| Helper | AVX2 fast window | Scalar reuse |
|---|---|---|
| utf8_wcstombs | 16 WCHARs detected as ASCII, packed to 16 bytes and stored in one burst | short buffers, non-ASCII WCHARs, and surrogate pairs remain scalar |
| utf8_mbstowcs | 16 source bytes detected as ASCII, zero-extended to 16 WCHARs and stored in one burst | short buffers, multi-byte UTF-8, and invalid UTF-8 remain scalar |
These conversions sit on every PE-to-Unix and Unix-to-PE path/name boundary.
The hot AVX2 carry is intentionally Unix-side only: dlls/ntdll/unix/env.c
keeps the vectorized path, while PE-side dlls/ntdll/locale.c remains scalar
because the PE build cannot use the runtime __builtin_cpu_supports("avx2")
probe without dragging in an unresolved __cpu_model dependency. That still
matches the hot path shape: the expensive callers are on the Unix side.
Synthetic ASCII-path measurements recorded with the carries:
| Helper | Before | After | Delta |
|---|---|---|---|
| utf8_wcstombs (200-byte ASCII path) | ~500 cycles | ~25 cycles | ~20x |
| utf8_mbstowcs (200-byte ASCII path) | ~1000 cycles | ~40 cycles | ~25x |
The architectural point is the same as the TEB carries: once the remaining hot path is dominated by wrapper work on an already-local operation, a narrow platform-specific implementation can be the right trade.
Two GUI-side optimizations remain part of the current baseline:
| Optimization | Before | After | Delta |
|---|---|---|---|
| x11drv_surface_flush throttle | 8.23% | 4.74% | -43% |
| copy_rect_32 memmove | 4.38% | 2.49% | -43% |
| x11drv_surface_flush AVX2 | 6.72% | 2.39% | -4.33pp / -64% |
| total winex11.so after AVX2 | 6.76% | 2.43% | -4.33pp |
These are feature-adjacent because they live on the GUI path, but they are still optimizations rather than new surfaces. The relevant architecture is unchanged; the hot implementation is just cheaper.
The current optimization stack is measured with a three-part profiling pass on the same workload window:
- `perf stat` hardware counters over 30 seconds
- `perf record` DWARF callgraph over 30 seconds
- `bpftrace` Nt* entry distribution over 30 seconds

For the most recent x86_64 inline + AVX2 bundle, the comparison window is:
| Capture | Baseline | After |
|---|---|---|
| workload | Ableton project + browse + plugin + steady playback | same workload shape |
| baseline window | 2026-05-09 12:06 | |
| post-bundle window | | 2026-05-10 18:00 |
| bundle scope | | inline current-thread/current-process/PEB/tick helpers plus AVX2 memicmp_strW, hash_strW, and Unix-side utf8_wcstombs / utf8_mbstowcs |
The newest bundle confirms the main point of this page: once the dominant work is already local, shaving helper layers and tightening hot loops compounds.
| Counter | Baseline | Post-bundle | Delta |
|---|---|---|---|
| cpu-cycles | 227.0B | 223.1B | -1.73% |
| instructions | 273.3B | 269.1B | -1.55% |
| iTLB-load-misses | 229.7M | 180.8M | -21.30% |
| dTLB-load-misses | 51.5M | 42.4M | -17.69% |
| dTLB-store-misses | 16.9M | 14.8M | -12.49% |
| branch-misses | 348.3M | 308.4M | -11.45% |
| cache-references | 3.87B | 3.49B | -9.73% |
| cache-misses | 2.11B | 1.99B | -5.80% |
| LLC-load-misses | 499.3M | 491.7M | -1.52% |
| LLC-store-misses | 519.1M | 550.0M | +5.97% |
| context-switches | 1,463K | 1,417K | -3.19% |
| cpu-migrations | 144,103 | 133,448 | -7.39% |
| page-faults | 130,349 | 71,754 | -44.95% |
| IPC | 1.204 | 1.206 | flat |
The key read is not any single micro-benchmark number. It is the compound signature:
- iTLB-load-misses: -21.30%
- dTLB-load-misses: -17.69%
- branch-misses: -11.45%
- user-mode samples: -11.3%

That is what “same work in less of everything” looks like for this kind of bundle.
The Nt* distribution capture confirms that the helper inlining is visible at the entrypoint level, not only in synthetic loops.
| Nt entry | Baseline | Post-bundle | Delta |
|---|---|---|---|
| NtGetTickCount | 3,081,551 | 0 (absent) | -100% |
| NtSetEvent | 3,269,629 | 5,586,449 | +70.9% |
| NtQueryPerformanceCounter | 3,075,360 | 5,392,023 | +75.3% |
| NtWaitForMultipleObjects | 3,071,442 | 5,387,977 | +75.4% |
| NtResetEvent | 175,029 | 174,835 | flat |
| NtWaitForSingleObject | 170,050 | 170,172 | flat |
| NtQuerySystemTime | 58,676 | 58,997 | flat |
| NtFlushInstructionCache | 13,764 | 16,380 | +19.0% |
| NtCurrentTeb fallback | 448 | 548 | both ~negligible |
NtGetTickCount dropping from 3,081,551 to 0 is the clearest
end-to-end confirmation in the set: the inline path is not theoretical, it has
removed the dispatcher-visible entry entirely on this workload.
The large rises on NtSetEvent, NtQueryPerformanceCounter, and
NtWaitForMultipleObjects are most likely workload-phase differences between
the two captures. If they do reflect real extra traffic, the fact that cycles
and user-mode samples still fall means the per-entry cost is lower, not higher.
The callgraph view turns the same result into a user-CPU number: total user-mode samples fell from 97K to 86K (-11.3%).

Post-bundle, the NSPA-local fast-path surface is easier to see directly in the resolved top symbols:

- `inproc_wait`
- `get_cached_inproc_sync`
- `nspa_try_pop_own_ring_post`
- `nspa_try_pop_own_timer_ring`
- `nspa_try_pop_own_ring_send`
- `nspa_get_own_bypass_shm`
- `nspa_getmsg_cache_record_empty`
- `nspa_getmsg_cache_lookup`

That matters because the relative percentages on untouched shared symbols can
inproc_waitget_cached_inproc_syncnspa_try_pop_own_ring_postnspa_try_pop_own_timer_ringnspa_try_pop_own_ring_sendnspa_get_own_bypass_shmnspa_getmsg_cache_record_emptynspa_getmsg_cache_lookupThat matters because the relative percentages on untouched shared symbols can
be misleading once the denominator falls. Symbols such as libc bulk-copy
helpers, apply_alpha_bits_avx2, or entry_SYSCALL_64 may rise in share even
when their absolute weight is flat, simply because the total user sample pool
got smaller.
Taken together, the newest hot-path carries changed three important things:

- thread-state access: `NtCurrentTeb()` and the adjacent per-thread helpers became inline TEB reads instead of cross-DSO calls
- ASCII-dominant string and Unicode work: the hot comparison, hashing, and conversion windows became AVX2 bursts with scalar fallback
- time queries: `NtGetTickCount()` became direct `KUSER_SHARED_DATA` reads on Linux x86_64

That is why these changes belong together even though they touch different files. The common result is lower wrapper cost around work that was already largely local.