
doi: 10.3390/info17020145
Monocular Depth Estimation (MDE) infers per-pixel scene geometry from a single RGB image. Despite recent progress, global MDE models often blur depth discontinuities at object boundaries and fail to capture object-level structure. Segment-aware depth estimation addresses this limitation by exploiting semantic segmentation to decompose depth prediction into simpler, class-specific subproblems. In this work, we study semantic-aware MDE in a multi-branch design where each semantic class is handled by a lightweight Vision Transformer (ViT) branch that predicts dense depth for its class while suppressing interference from other regions. We further examine fusion strategies that merge the branch outputs into a single prediction: (i) a learnable cross-attention fusion module that predicts depth from the stack of per-class proposals and masks, and (ii) a parameter-free stitched summation that sums mask-gated outputs. The proposed architecture is simple, scalable, end-to-end trainable, and compatible with arbitrary transformer backbones. Experiments on Virtual KITTI 2, where ground-truth depth and semantic labels are available, show that segment-aware modeling produces sharper depth boundaries and improves standard error metrics compared to a single-branch baseline (AbsRel 0.243→0.152; RMSE 11.952→9.101). Finally, we find that the parameter-free summation matches, and in most cases improves upon, the accuracy of learned fusion while adding no computational overhead.
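As a rough illustration of the parameter-free stitched summation and the two reported metrics, the following minimal sketch fuses per-class depth maps by summing mask-gated outputs and then scores the result. It is not the authors' implementation: the function names (`stitched_sum`, `abs_rel`, `rmse`), the toy 2×2 scene, and the use of plain Python lists instead of tensors are all assumptions made for readability. The sketch also assumes binary, mutually exclusive masks that cover every pixel, and the conventional definitions AbsRel = mean(|d − d*| / d*) and RMSE = sqrt(mean((d − d*)²)) over pixels with valid ground truth.

```python
import math


def stitched_sum(branch_depths, masks):
    """Fuse K per-class depth maps by summing mask-gated outputs.

    branch_depths: list of K depth maps, each an H x W nested list of floats.
    masks: list of K binary masks (1 where the pixel belongs to class k).
    Assumption: masks are mutually exclusive and cover every pixel.
    """
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for depth, mask in zip(branch_depths, masks):
        for y in range(height):
            for x in range(width):
                # Each branch contributes only inside its own class mask.
                fused[y][x] += mask[y][x] * depth[y][x]
    return fused


def _valid_pairs(pred, target):
    # Keep only pixels with positive (valid) ground-truth depth.
    return [(p, t)
            for pred_row, target_row in zip(pred, target)
            for p, t in zip(pred_row, target_row)
            if t > 0]


def abs_rel(pred, target):
    """Mean absolute relative error over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def rmse(pred, target):
    """Root mean squared error over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


# Hypothetical 2x2 scene with two classes ("road" and "car").
road_mask = [[1, 0], [1, 1]]
car_mask = [[0, 1], [0, 0]]
# Values outside a branch's own mask (99.0 here) are gated out by the sum.
road_depth = [[10.0, 99.0], [20.0, 30.0]]
car_depth = [[99.0, 5.0], [99.0, 99.0]]

fused = stitched_sum([road_depth, car_depth], [road_mask, car_mask])
ground_truth = [[10.0, 4.0], [20.0, 30.0]]

print(fused)                         # [[10.0, 5.0], [20.0, 30.0]]
print(abs_rel(fused, ground_truth))  # 0.0625
print(rmse(fused, ground_truth))     # 0.5
```

Because the gating is a plain elementwise multiply and sum, this fusion step has no learnable parameters, which is consistent with the abstract's description of it as adding no computational overhead relative to a learned fusion module.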
