StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth

ICCV 2025 (Highlight)

¹The University of Hong Kong  ²DAMO Academy, Alibaba Group  ³The Chinese University of Hong Kong  ⁴Hupan Lab
StableDepth performs efficient online monocular depth estimation, producing scene-consistent and scale-invariant predictions frame by frame.



Abstract

Recent advances in monocular depth estimation have significantly improved its robustness and accuracy. Despite these improvements, relative depth models, which offer strong generalization capability, fail to provide real-world depth measurements. Notably, these models exhibit severe flickering and 3D inconsistency when applied to video data, limiting their applicability to 3D reconstruction. To address these challenges, we introduce StableDepth, a scene-consistent and scale-invariant depth estimation method that achieves stable predictions with scene-level 3D consistency. We propose a dual-decoder structure that learns smooth depth under supervision from large-scale unlabeled video data. Our approach not only enhances generalization capability but also reduces flickering during video depth estimation. By leveraging vast amounts of unlabeled video data, our method gains strong temporal stability and scales up easily at low cost. Unlike previous methods that require full video sequences, StableDepth supports online inference at 13× faster speed, while achieving significant accuracy improvements (6.4%-86.8%) across multiple benchmarks and delivering temporal consistency comparable to video diffusion models.

Framework

StableDepth employs a shared encoder with dual decoders to process both labeled images and unlabeled videos. The upper branch handles labeled images with direct metric depth supervision ("Sup"), while the lower branch processes unlabeled video sequences using temporally coherent pseudo labels generated by a pre-trained video diffusion prior ("Bake"). This architecture enables scene-consistent and scale-invariant (SCSI) depth prediction by combining the benefits of metric depth supervision with temporal-consistency learning from video sequences. Solid arrows indicate the primary data flow, while dashed arrows represent the auxiliary training path for unlabeled data.
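
To make the shared-encoder, dual-decoder idea concrete, here is a minimal PyTorch sketch. All module names, shapes, and loss choices below are illustrative assumptions; the paper's actual backbone, decoder architectures, and training objectives are not specified on this page.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDecoderDepth(nn.Module):
    """Hypothetical sketch: one shared encoder feeds two depth decoders."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Shared encoder: a stand-in CNN; the real model would use a
        # much stronger backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Two lightweight heads sharing the encoder features.
        self.metric_decoder = nn.Conv2d(feat_dim, 1, 1)  # "Sup" branch
        self.video_decoder = nn.Conv2d(feat_dim, 1, 1)   # "Bake" branch

    def forward(self, x: torch.Tensor, branch: str = "metric") -> torch.Tensor:
        feats = self.encoder(x)
        head = self.metric_decoder if branch == "metric" else self.video_decoder
        depth = F.softplus(head(feats))  # keep depth strictly positive
        # Upsample the prediction back to the input resolution.
        return F.interpolate(depth, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)

model = DualDecoderDepth()

# "Sup" path: a labeled image with metric ground-truth depth.
img = torch.randn(2, 3, 64, 64)
gt_depth = torch.rand(2, 1, 64, 64) * 10.0
sup_loss = F.l1_loss(model(img, branch="metric"), gt_depth)

# "Bake" path: unlabeled video frames with pseudo labels from a frozen
# video diffusion prior; random tensors stand in for both here.
frames = torch.randn(4, 3, 64, 64)       # a short clip
pseudo = torch.rand(4, 1, 64, 64) * 10.0 # temporally coherent targets
bake_loss = F.l1_loss(model(frames, branch="video"), pseudo)

(sup_loss + bake_loss).backward()

Because both losses backpropagate through the shared encoder, the metric branch can inherit the temporal smoothness distilled into the video branch, which is the intuition behind the dual-decoder design.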

Comparison with Depth Pro (SOTA Metric Depth) on Open-World Videos

Comparison with Depth-Anything-V2 (Metric Depth Version) on Open-World Videos

Direct Point Cloud Reconstruction without Any Scaling
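
Because the predictions are scene-consistent and share a single scale, a predicted depth map can be unprojected into a point cloud directly, with no per-frame scale or shift alignment. Below is a minimal NumPy sketch assuming a pinhole camera model; the intrinsics (fx, fy, cx, cy) are placeholder values, not the paper's.

import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud
    under a pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example with a dummy depth map and assumed intrinsics.
depth = np.full((480, 640), 2.0)
points = depth_to_pointcloud(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3)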

BibTeX

@inproceedings{zhang2025-stabledepth,
    author      = {Zhang, Zheng and Yang, Lihe and Yang, Tianyu and Yu, Chaohui and Guo, Xiaoyang and Lao, Yixing and Zhao, Hengshuang},
    title       = {StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth},
    booktitle   = {ICCV},
    year        = {2025}
}