
Multi-object tracking (MOT) is central to numerous applications, from autonomous vehicles (AVs) to surveillance and retail analytics. Traditional MOT methods typically rely on motion-based and appearance-based similarity to associate detections across frames. Transformer attention-based approaches to MOT have removed the need for complex post-processing steps such as graph optimization, allowing end-to-end query tracking across frames. While these transformer-based approaches offer many advantages, most of them consider the temporal dimension of the sequence only through the iterative processing of frames or the memory of the queries. The proposed cross-frame multi-object tracking transformer (CFTformer) aims to improve one of the challenging aspects of tracking: association across frames in the temporal dimension. In the proposed approach, the temporal identities of the frames are included in the positional encoding of the patches, which allows the encoder-decoder to track the queries more efficiently across frames. To decrease the computational cost, the encoder and decoder are built from scalable deformable-attention layers. CFTformer also employs the proposed attention-based trajectory refinement (ATR) scheme to improve tracking performance in blurred frames. The three-dimensional positional encoding of the patches helps the ATR module better capture the trajectories of the queries and generate smoother predictions. Overall, the model achieved 1.7% and 0.6% improvements in the identification F1 score (IDF1) on the MOT17 and MOT20 datasets, respectively, with roughly 0-15% fewer identity switches than other transformer-based approaches. More accurate tracking and fewer identity switches make this algorithm better suited to autonomous driving.
To assess the model's performance in AV applications, the BDD100K dataset was used for training and evaluation, where the proposed approach achieved a 1.9% improvement in IDF1 compared to other transformer-based models.
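The idea of including the frame identity in the positional encoding of the patches can be illustrated with a minimal sketch. The abstract does not specify the form of the encoding, so the code below assumes a standard sinusoidal encoding applied separately to the frame index and to the patch row and column, with the three parts concatenated. The function names and the `dim_per_axis` parameter are hypothetical and are not taken from CFTformer.

```python
import math


def sinusoidal_1d(position, dim):
    """Encode one scalar position as `dim` interleaved sin/cos values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def positional_encoding_3d(t, y, x, dim_per_axis=4):
    """Assumed layout: [frame index | patch row | patch column]."""
    return (
        sinusoidal_1d(t, dim_per_axis)
        + sinusoidal_1d(y, dim_per_axis)
        + sinusoidal_1d(x, dim_per_axis)
    )


def encode_clip(num_frames, rows, cols, dim_per_axis=4):
    """Map every (frame, row, col) patch of a clip to its encoding."""
    return {
        (t, y, x): positional_encoding_3d(t, y, x, dim_per_axis)
        for t in range(num_frames)
        for y in range(rows)
        for x in range(cols)
    }


if __name__ == "__main__":
    clip = encode_clip(num_frames=2, rows=2, cols=3)
    same_patch_frame0 = clip[(0, 1, 2)]
    same_patch_frame1 = clip[(1, 1, 2)]
    # Only the temporal part differs for the same patch in two frames.
    print(same_patch_frame0[:4] != same_patch_frame1[:4])
    print(same_patch_frame0[4:] == same_patch_frame1[4:])
```

Under this assumption, a patch at the same spatial location in two different frames shares its spatial components and differs only in the temporal component, which is what would let attention layers distinguish frames without relying solely on iterative processing or query memory.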
positional encoding, trajectory refinement, CFTformer, transformer network, Electrical engineering. Electronics. Nuclear engineering, multi-object tracking (MOT), TK1-9971
