
Performance evaluation of 8x NVIDIA A100-SXM4-80GB GPUs interconnected via NVSwitch (NV12) on UCL ARC's Kubeflow platform. The benchmarking suite includes: (1) NVBandwidth point-to-point GPU transfer measurements comparing bare metal vs Kubeflow, NVLink enabled vs disabled, and A100-SXM4 vs A100-PCIe configurations; (2) NCCL collective communication benchmarks (all-reduce, all-gather, broadcast, reduce-scatter, send-recv), with analysis of bus bandwidth scaling, GPU count scaling, thread count impact, and protocol/algorithm variants; (3) P2P bandwidth and latency tests via the CUDA samples, across both NVLink and PCIe. Statistical analysis using z-scores identifies minor per-GPU performance asymmetries, which are attributable to NVSwitch topology rather than to systemic bottlenecks. NVLink provides a 14-15x bandwidth improvement over PCIe-only communication.
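The two quantitative ideas in the summary above, NCCL bus bandwidth and per-GPU z-scores, can be illustrated with a minimal Python sketch. It is not the evaluation's actual analysis code. The function names `busbw_factor` and `z_scores` are hypothetical, the correction factors follow the convention documented for the nccl-tests suite, the use of the population standard deviation is an assumption, and the sample bandwidth values are made-up placeholders rather than measured results.

```python
from statistics import mean, pstdev


def busbw_factor(collective: str, n_ranks: int) -> float:
    """Factor that converts algorithm bandwidth to bus bandwidth.

    Follows the convention documented for nccl-tests; the collective
    names used as keys here are illustrative, not tool output.
    """
    if n_ranks < 2:
        raise ValueError("need at least two ranks")
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "reduce_scatter"):
        return (n_ranks - 1) / n_ranks
    if collective in ("broadcast", "sendrecv"):
        return 1.0
    raise ValueError(f"unknown collective: {collective}")


def z_scores(values: list[float]) -> list[float]:
    """Standardise per-GPU measurements against their own mean.

    Assumption: population standard deviation (pstdev), since the
    GPUs in the node are the whole population of interest.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Placeholder per-GPU bandwidths in GB/s: NOT measured data.
    per_gpu_bw = [100.0, 101.0, 99.0, 100.0, 100.0, 102.0, 98.0, 100.0]

    for gpu, z in enumerate(z_scores(per_gpu_bw)):
        # |z| > 2 is an arbitrary illustrative threshold.
        flag = "  <- check" if abs(z) > 2 else ""
        print(f"GPU {gpu}: z = {z:+.2f}{flag}")

    print("all-reduce busbw factor, 8 ranks:", busbw_factor("all_reduce", 8))
```

A z-score expresses each GPU's result in standard deviations from the node-wide mean, so small asymmetries stand out without any assumption about absolute bandwidth. The bus-bandwidth factor makes results comparable across different GPU counts, which is what the GPU count scaling analysis relies on.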
