CFTformer: End-to-End Cross-Frame Multi-Object Tracking With Transformer

Keywords: positional encoding, trajectory refinement, CFTformer, transformer network, Electrical engineering. Electronics. Nuclear engineering, multi-object tracking (MOT), TK1-9971

Abdollah Amirkhani; Seyed Alireza Khoshnevis

IEEE Access

Article . 2025 . Peer-reviewed

License: CC BY NC ND

Data sources: Crossref

Data sources: DOAJ

Publication: Article, 01 Jan 2025
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Access, volume 13, pages 80587-80600 (eISSN: 2169-3536)

doi: 10.1109/access.2025.3567349

Abstract

Multi-object tracking (MOT) is central to numerous applications, from autonomous vehicles (AVs) to surveillance and retail analytics. Traditional MOT methods typically rely on motion-based and appearance-based similarity information to associate detections across frames. Transformer attention-based approaches to MOT, however, remove the need for complex post-processing steps such as graph optimization, allowing end-to-end query tracking across frames. While these transformer-based approaches offer many advantages, most of them consider the temporal dimension of the sequence only in the iterative processing of the frames or in the memory of the queries. The proposed cross-frame multi-object tracking transformer (CFTformer) aims to improve one of the challenging areas of tracking: association across different frames in the temporal dimension. In the proposed approach, the temporal identities of the frames are included in the positional encoding of the patches, which allows the encoder-decoder to track the queries more efficiently across frames. Scalable deformable-attention layers were used to design the encoder and decoder in order to decrease the computational cost. CFTformer also employs the proposed attention-based trajectory refinement (ATR) scheme to improve tracking performance in blurred frames. The three-dimensional positional encoding of the patches helps the ATR module better capture the trajectories of the queries and generate smoother predictions. Overall, the model achieved improvements of 1.7% and 0.6% in the identification F1 score (IDF1) on the MOT17 and MOT20 datasets, respectively, with approximately 0-15% fewer identity switches than other transformer-based approaches. More accurate tracking and fewer identity switches make this algorithm better suited to autonomous driving. To assess the model's performance in AV applications, the BDD100K dataset was used for training and evaluation, where the proposed approach achieved a 1.9% improvement in IDF1 compared to other transformer-based models.
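The abstract does not say how the frame identity enters the positional encoding, so the following is only an illustrative sketch of one common way to build a three-dimensional (frame, row, column) encoding. It assumes a sinusoidal form with the channels split equally among the three axes and concatenated; the function names `sinusoid_1d` and `positional_encoding_3d` and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def sinusoid_1d(positions, dim, temperature=10000.0):
    """Sine/cosine encoding of 1-D positions into `dim` channels (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # One frequency per sine/cosine channel pair.
    freqs = temperature ** (-np.arange(0, dim, 2) / dim)   # shape (dim/2,)
    angles = positions[:, None] * freqs[None, :]            # shape (n, dim/2)
    enc = np.empty((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


def positional_encoding_3d(num_frames, height, width, dim):
    """Return a (T, H, W, dim) encoding.

    ASSUMPTION: channels are split into three equal groups for the
    frame index t, the patch row y, and the patch column x.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6")
    d = dim // 3
    t_enc = sinusoid_1d(np.arange(num_frames, dtype=float), d)  # (T, d)
    y_enc = sinusoid_1d(np.arange(height, dtype=float), d)      # (H, d)
    x_enc = sinusoid_1d(np.arange(width, dtype=float), d)       # (W, d)

    out = np.empty((num_frames, height, width, dim))
    out[..., 0:d] = t_enc[:, None, None, :]       # temporal identity of the frame
    out[..., d:2 * d] = y_enc[None, :, None, :]   # vertical patch position
    out[..., 2 * d:] = x_enc[None, None, :, :]    # horizontal patch position
    return out
```

With this layout, the same patch location in two different frames shares its spatial channels and differs only in the temporal channels, which is the property that lets attention layers distinguish frames while still relating nearby positions.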

Related Organizations

Florida Southern College, United States
Iran University of Science and Technology, Iran (Islamic Republic of)
University of South Florida, United States
University of Florida, United States

Impact by BIP!

Selected citations: 0
Popularity: Average
Influence: Average
Impulse: Average

Open access route: gold

Funded by

UKRI | Production Capable Additive Manufacturing of Polymers (ProAMP)