Spatial–Temporal Heatmap Masked Autoencoder for Skeleton-Based Action Recognition

Keywords: masked autoencoder, Chemical technology, self-supervised learning, spatial–temporal heatmap, TP1-1185, visual transformer, skeleton-based action recognition, Article

Cunling Bian; Yang Yang; Tao Wang; Weigang Lu

Sensors. Article, 2025, peer-reviewed. License: CC BY. Data sources: Crossref

PubMed Central. Other literature type, 2025. License: CC BY. Data sources: PubMed Central

Sensors. Article, 2025. Data sources: DOAJ

Publication: Article, Other literature type. 16 May 2025. English. Publisher: MDPI AG. Journal: Sensors, volume 25, page 3146 (eISSN: 1424-8220). Publicly funded.

doi: 10.3390/s25103146

Abstract

Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial–temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to acquire. In this work, we propose the Spatial–Temporal Heatmap Masked Autoencoder (STH-MAE), a novel self-supervised framework tailored for skeleton-based action recognition. Unlike coordinate-based methods, STH-MAE adopts heatmap volumes as its primary representation, mitigating noise inherent in pose estimation while capitalizing on advances in Vision Transformers. The framework constructs a spatial–temporal heatmap (STH) by aggregating 2D joint heatmaps across both spatial and temporal axes. This STH is partitioned into non-overlapping patches to facilitate local feature learning, with a masking strategy applied to randomly conceal portions of the input. During pre-training, a Vision Transformer-based autoencoder equipped with a lightweight prediction head reconstructs the masked regions, fostering the extraction of robust and transferable skeletal representations. Comprehensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate the superiority of STH-MAE, achieving state-of-the-art performance under multiple evaluation protocols.
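The input pipeline described in the abstract (joint heatmaps, aggregation into a spatial–temporal heatmap, non-overlapping patches, random masking) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not specify how heatmaps are aggregated, so the Gaussian rendering, the per-frame max over joints, the grid tiling of frames, the patch size, and the 75% mask ratio are all assumptions, and every function name here is hypothetical. The Vision Transformer encoder and prediction head are omitted.

```python
import numpy as np


def joint_heatmaps(joints, height, width, sigma=2.0):
    """Render (T, J, 2) pixel coordinates (x, y) as (T, J, H, W) Gaussian heatmaps.

    The Gaussian form and sigma are assumptions made for illustration.
    """
    ys = np.arange(height, dtype=np.float64)[:, None]   # (H, 1)
    xs = np.arange(width, dtype=np.float64)[None, :]    # (1, W)
    x = joints[..., 0][..., None, None]                 # (T, J, 1, 1)
    y = joints[..., 1][..., None, None]                 # (T, J, 1, 1)
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))


def build_sth(heatmaps, grid_cols):
    """Aggregate (T, J, H, W) heatmaps into one 2D image.

    Assumed scheme: max over joints per frame (spatial axis), then tile the
    T frames row by row into a grid (temporal axis).
    """
    t, _, h, w = heatmaps.shape
    if t % grid_cols != 0:
        raise ValueError("number of frames must be divisible by grid_cols")
    rows = t // grid_cols
    frames = heatmaps.max(axis=1)                        # (T, H, W)
    grid = frames.reshape(rows, grid_cols, h, w).transpose(0, 2, 1, 3)
    return grid.reshape(rows * h, grid_cols * w)


def patchify(image, patch):
    """Split a 2D image into non-overlapping, flattened patch-by-patch tiles."""
    h, w = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    tiles = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(gh * gw, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask where True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


# Toy example: 8 frames, 17 joints, 32x32 heatmaps (all values are arbitrary).
rng = np.random.default_rng(0)
joints = rng.uniform(0, 32, size=(8, 17, 2))
sth = build_sth(joint_heatmaps(joints, 32, 32), grid_cols=4)
patches = patchify(sth, patch=16)
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]   # what an encoder would see
targets = patches[mask]    # what a decoder would be asked to reconstruct
```

In this sketch only the visible patches would be passed to the encoder, and the reconstruction objective would be computed on the masked ones, which follows the usual masked-autoencoder convention rather than anything stated in the abstract.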

Impact by BIP!

- Selected citations: 2
- Popularity: Top 10%
- Influence: Average
- Impulse: Average

Open access routes: Green, gold