
We present a highly scalable, hardware-optimized implementation of Wilson $SU(N_c)$ lattice gauge theory that resolves the memory-bandwidth bottleneck inherent in GPU-accelerated Markov Chain Monte Carlo (MCMC) simulations. By deploying a novel Block-Stride Weyl mixing hash as a register-level pseudorandom number generator (PRNG), we eliminate the need for pre-allocated random arrays in global memory, trading abundant arithmetic logic unit (ALU) cycles for scarce VRAM bandwidth. Using commercial off-the-shelf hardware (NVIDIA RTX 4060), we achieve a sustained simulation throughput of $\sim\!511$ million updates per second (MUPS). We also detail critical CPU-side architectural optimizations, including preventing implicit 64-bit promotion, which breaks read/write dependencies and restores 8-bit single instruction, multiple data (SIMD) vectorization. We demonstrate that this register-forced stochastic engine strictly preserves detailed balance, gauge invariance, and ergodicity, maintaining thermodynamic equilibrium at extreme scales, including $SU(256)$ criticality sweeps on $512^3$ spatial lattices.
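To illustrate the general idea of replacing a pre-allocated random array with a value computed on demand, the following is a minimal sketch of a stateless, counter-based generator. It is not the Block-Stride Weyl mixing hash itself, whose construction is not given here. As a stand-in it combines a Weyl-sequence step (an odd additive constant) with the well-known MurmurHash3 32-bit finalizer, and the names `mix32`, `site_hash`, and `site_uniform` are hypothetical. Python is used only for readability; in a GPU kernel the same integer arithmetic would live entirely in registers.

```python
MASK32 = 0xFFFFFFFF
WEYL = 0x9E3779B9  # odd constant close to 2**32 / golden ratio (assumed choice)


def mix32(x: int) -> int:
    """MurmurHash3 32-bit finalizer: a bijective bit mixer on 32-bit integers."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK32
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK32
    x ^= x >> 16
    return x


def site_hash(seed: int, sweep: int, site: int) -> int:
    """Return a 32-bit pseudorandom word for (seed, sweep, site).

    No state is stored: the word is recomputed from its coordinates,
    so no random array has to be read from memory.
    """
    # Weyl step: advance the per-sweep key by an odd constant, then mix it.
    key = mix32((seed + (sweep + 1) * WEYL) & MASK32)
    # Combine the key with the lattice site index and mix again.
    return mix32(key ^ (site & MASK32))


def site_uniform(seed: int, sweep: int, site: int) -> float:
    """Map the 32-bit word to a float in [0, 1), e.g. for a Metropolis test."""
    return site_hash(seed, sweep, site) / 4294967296.0


if __name__ == "__main__":
    # Draw the acceptance variates for the first few sites of sweep 0.
    for site in range(4):
        print(site, site_uniform(seed=12345, sweep=0, site=site))
```

Because `mix32` is a bijection, distinct site indices within one sweep always map to distinct 32-bit words in this sketch. Whether a generator of this kind is statistically adequate for a given simulation has to be checked separately, and this example makes no claim about the quality or throughput of the method described above.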
