
We present XRFv2 Plus, a synchronized multimodal dataset for sensor-vision-language action understanding. Built from the XRFv2 recording corpus, XRFv2 Plus reorganizes 853 valid continuous action sequences around a common cropped-video timeline and releases aligned WiFi CSI, five-position IMU, AirPods IMU, RGB video embeddings, Kinect depth videos, Kinect infrared videos, 2D pose, depth-assisted 3D pose, SMPL mesh, and DensePose-style human-surface information. The dataset further provides annotations for temporal action localization (in relative time), action captioning, and action question answering. This paper does not introduce a new recording campaign; instead, it defines a new public benchmark with its own release contract, modality set, annotation set, and task scope. The release contract is unified and video-aligned: it specifies standardized tensor shapes, a fixed device order, per-second sensor resampling, privacy-aware public packaging without RGB video, and explicit handling of shortened Kinect-video cases. We describe the dataset construction, alignment protocol, modality formats, annotation schemas, and public release organization.
Project page: https://github.com/airslab2020/XRFV2
Dataset: https://www.kaggle.com/datasets/airslab2020/xrfv2-multimodal-tal-caption-qa-no-rgb
