XRFv2 Plus: A Multimodal Sensor-Vision-Language Dataset for Action Understanding

Fei, Wang


Publication: Preprint, under curation. Publisher: Zenodo

Authors: Fei, Wang

doi: 10.5281/zenodo.20564312

Abstract

We present XRFv2 Plus, a synchronized multimodal dataset for sensor-vision-language action understanding. Built from the XRFv2 recording corpus, XRFv2 Plus reorganizes 853 valid continuous action sequences around a common cropped-video timeline and releases aligned WiFi CSI, five-position IMU, AirPods IMU, RGB video embeddings, Kinect depth videos, Kinect infrared videos, 2D pose, depth-assisted 3D pose, SMPL mesh, and DensePose-style human-surface information. The dataset further provides relative-time temporal action localization annotations, action captioning annotations, and action question answering annotations. This paper does not introduce a new recording campaign; instead, it defines a new public benchmark built on a different release contract, modality set, annotation set, and task scope. XRFv2 Plus defines a unified video-aligned benchmark contract: standardized tensor shapes, fixed device order, per-second sensor resampling, privacy-aware no-RGB public packaging, and explicit handling of shortened Kinect-video cases. This paper describes the dataset construction, alignment protocol, modality formats, annotation schemas, and public release organization.

Project page: https://github.com/airslab2020/XRFV2
Dataset: https://www.kaggle.com/datasets/airslab2020/xrfv2-multimodal-tal-caption-qa-no-rgb
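As a rough illustration of what per-second sensor resampling onto a video-aligned timeline and relative-time annotations could look like, here is a minimal Python sketch. It is not the dataset's actual pipeline: the abstract does not state the sampling rates, the interpolation method, or any file layout, so the linear interpolation, the function names `resample_per_second` and `to_relative`, and all numeric values below are assumptions made for illustration only.

```python
from bisect import bisect_right


def resample_per_second(timestamps, values, start, duration_s, rate_hz):
    """Linearly interpolate one sensor channel onto a uniform grid.

    The grid holds `rate_hz` samples for each whole second of the clip and
    begins at `start`, taken here to be the first frame of the cropped video.
    Linear interpolation is an assumption, not the documented method.
    """
    if not timestamps or len(timestamps) != len(values):
        raise ValueError("timestamps and values must be non-empty and equal length")
    out = []
    for k in range(duration_s * rate_hz):
        t = start + k / rate_hz
        # Hold the edge value instead of extrapolating outside the recording.
        if t <= timestamps[0]:
            out.append(values[0])
            continue
        if t >= timestamps[-1]:
            out.append(values[-1])
            continue
        i = bisect_right(timestamps, t)  # timestamps[i - 1] <= t < timestamps[i]
        t0, t1 = timestamps[i - 1], timestamps[i]
        w = (t - t0) / (t1 - t0)
        out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out


def to_relative(segment, video_start):
    """Shift an (absolute_start, absolute_end, label) segment so that
    time zero is the start of the cropped video."""
    seg_start, seg_end, label = segment
    return (seg_start - video_start, seg_end - video_start, label)


# Hypothetical irregularly sampled channel, resampled to 2 samples per second.
ts = [10.0, 10.5, 11.0, 12.0]
vals = [0.0, 1.0, 2.0, 4.0]
grid = resample_per_second(ts, vals, start=10.0, duration_s=2, rate_hz=2)
segment = to_relative((12.5, 15.0, "drink"), video_start=10.0)
```

Every sequence resampled this way has exactly `duration_s * rate_hz` samples per channel, which is one way a fixed tensor shape per second of video could be obtained; the real release may use a different scheme.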
