
Automated multi-dancer tracking is a critical yet challenging task in Dance Quality Assessment (DanceQA), requiring precise motion estimation to evaluate synchronization, formation transitions, and rhythmic accuracy. Traditional Multi-Object Tracking (MOT) frameworks predominantly rely on appearance-based features and Kalman Filter-based motion models, which struggle with complex, non-linear motion patterns exhibited in dance performances. These conventional approaches often suffer from identity fragmentation, occlusion-related failures, and inaccurate motion predictions due to their inherent assumption of constant velocity. Although recent deep learning-based trackers incorporating recurrent architectures and transformers have improved motion modeling, they still lack adaptability to highly dynamic motion variations and remain heavily reliant on large-scale training datasets. To bridge this gap, we propose the Multi-Dancer Spatio-Temporal Tracker (MDSTT), a novel transformer-based framework that exclusively leverages historical motion cues for robust and identity-consistent tracking. Unlike conventional tracking methods that integrate appearance features, MDSTT processes historical bounding box trajectories through a transformer encoder, capturing both long-range and short-term spatio-temporal dependencies while mitigating occlusion-induced identity switches. The proposed framework introduces a Historical Trajectory Embedding module to enhance motion-based representation learning, an Adaptable Motion Predictor with a learnable prediction token for improved trajectory continuity, and a refined Hungarian Matching strategy incorporating Intersection-over-Union (IoU), motion direction difference, and L1 distance to optimize object association. Additionally, probabilistic masked token augmentation is incorporated to simulate real-world occlusion scenarios, improving resilience against missing detections. 
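The probabilistic masked token augmentation mentioned above can be pictured with a minimal sketch. The abstract does not give the masking rate, the box layout, or how the mask token is represented, so the names `mask_trajectory` and `p_mask`, the `(x, y, w, h)` layout, and the use of `None` as a stand-in for a learnable mask token are all illustrative assumptions rather than the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

# Assumed box layout (x, y, w, h); the abstract does not specify one.
Box = Tuple[float, float, float, float]

# Placeholder standing in for a learnable mask token (assumption).
MASK: Optional[Box] = None


def mask_trajectory(
    boxes: List[Box], p_mask: float, rng: random.Random
) -> List[Optional[Box]]:
    """Replace each historical box with MASK independently with probability p_mask.

    This imitates frames in which the detector missed the dancer, so a model
    trained on the result sees trajectories with gaps.
    """
    return [MASK if rng.random() < p_mask else box for box in boxes]


# Hypothetical five-frame history of one dancer moving right.
history = [(10.0 + 2 * t, 20.0, 30.0, 80.0) for t in range(5)]
augmented = mask_trajectory(history, p_mask=0.3, rng=random.Random(0))
```

The sequence length is preserved, so a masked step still occupies its position in the history and only its content is hidden.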
Extensive evaluations on the DanceTrack dataset demonstrate that MDSTT achieves state-of-the-art (SoTA) tracking performance. Compared to SoTA transformer-based MOT models, it yields a 22.3% relative improvement in HOTA (77.4 vs. 63.3), 7.6% higher detection accuracy (86.4 vs. 80.3), and 26.6% better identity association accuracy (63.4 vs. 50.1).
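The association step described in the abstract combines IoU, motion direction difference, and L1 distance into one matching cost. The sketch below shows one way such a cost could be assembled and minimized. The weights `w_iou`, `w_dir`, and `w_l1`, the `(x1, y1, x2, y2)` box layout, the normalization of the angle by pi, and all function names are assumptions, not values from the paper. To stay dependency-free, the sketch swaps the Hungarian algorithm for a brute-force search over permutations, which finds the same minimum-cost assignment but is only practical for a handful of objects; a solver such as `scipy.optimize.linear_sum_assignment` would normally replace it.

```python
import math
from itertools import permutations

# Boxes are assumed to be (x1, y1, x2, y2) corner coordinates.


def iou(a, b):
    """Intersection-over-Union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(b):
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)


def direction_diff(prev, pred, det):
    """Angle in [0, pi] between predicted motion (prev -> pred) and the
    motion implied by accepting the detection (prev -> det)."""
    c0, c1, c2 = center(prev), center(pred), center(det)
    v1 = (c1[0] - c0[0], c1[1] - c0[1])
    v2 = (c2[0] - c0[0], c2[1] - c0[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # no motion, so no direction to compare
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))


def cost(prev, pred, det, w_iou=1.0, w_dir=0.5, w_l1=0.01):
    """Weighted sum of the three cues; the weights are illustrative only."""
    l1 = sum(abs(p - q) for p, q in zip(pred, det))
    return (w_iou * (1.0 - iou(pred, det))
            + w_dir * direction_diff(prev, pred, det) / math.pi
            + w_l1 * l1)


def associate(tracks, detections):
    """tracks: list of (previous_box, predicted_box) pairs.
    Returns (track_index, detection_index) pairs with minimum total cost."""
    if len(tracks) > len(detections):
        raise ValueError("sketch assumes at least as many detections as tracks")
    best_total, best_perm = None, ()
    for perm in permutations(range(len(detections)), len(tracks)):
        total = sum(cost(prev, pred, detections[j])
                    for (prev, pred), j in zip(tracks, perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm))


# Two hypothetical dancers; the detections arrive in swapped order.
tracks = [((0, 0, 10, 10), (2, 0, 12, 10)),
          ((50, 50, 60, 60), (50, 52, 60, 62))]
detections = [(50, 53, 60, 63), (3, 0, 13, 10)]
matches = associate(tracks, detections)
```

In this toy case each track is paired with the detection that overlaps its predicted box and continues its direction of travel, even though the detection list is ordered differently from the track list.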
DanceSports, tracking-by-detection, Deep learning, occlusion, Electrical engineering. Electronics. Nuclear engineering, vision transformer, multiple object tracking, TK1-9971
