On Segment-Aware Monocular Depth Estimation Using Vision Transformers

Vasileios Arampatzakis; George Pavlidis; Nikolaos Mitianoudis; Nikos Papamarkos

Found an issue? Give us feedback

Informationarrow_drop_down

Information

Article . 2026 . Peer-reviewed

License: CC BY

Data sources: Crossref

On Segment-Aware Monocular Depth Estimation Using Vision Transformers

descriptionPublicationkeyboard_double_arrow_right Article 02 Feb 2026 English Publisher:MDPI AGJournal:Information, volume 17, page 145 (eissn: 2078-2489,

Copyright policy )Funded by:EC | ARGUS

Authors: Vasileios Arampatzakis; George Pavlidis; Nikolaos Mitianoudis; Nikos Papamarkos;

doi: 10.3390/info17020145

On Segment-Aware Monocular Depth Estimation Using Vision Transformers

- Summary
- Metrics

Abstract

Monocular Depth Estimation (MDE) infers per-pixel scene geometry from a single RGB image. Despite recent progress, global MDE models often blur depth discontinuities at object boundaries and fail to capture object-level structure. Segment-aware depth estimation addresses this limitation by exploiting semantic segmentation to decompose depth prediction into simpler, class-specific subproblems. In this work, we study semantic-aware MDE in a multi-branch design where each semantic class is handled by a lightweight Vision Transformer (ViT) branch that predicts dense depth for its class while suppressing interference from other regions. We further examine fusion strategies that merge the branch outputs into a single prediction: (i) a learnable cross-attention fusion module that predicts depth from the stack of per-class proposals and masks, and (ii) a parameter-free stitched summation that sums mask-gated outputs. The proposed architecture is simple, scalable, end-to-end trainable, and compatible with arbitrary transformer backbones. Experiments on Virtual KITTI 2, where ground-truth depth and semantic labels are available, show that segment-aware modeling produces sharper depth boundaries and improves standard error metrics compared to a single-branch baseline (AbsRel 0.243→0.152; RMSE 11.952→9.101). Finally, we find that the parameter-free summation matches, and in most cases improves upon, the accuracy of learned fusion while adding no computational overhead.

Related Organizations

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

gold

Funded by

EC| ARGUS