DVD: Deterministic Video Depth Estimation with Generative Priors

Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He et al. (EnVision Research)
Published: 2026-03-12 (arXiv: 2603.12250)
Project: https://dvd-project.github.io/
Code: https://github.com/EnVision-Research/DVD

Problem Definition

The fundamental dilemma in video depth estimation:

→ DVD is the first framework to deterministically convert a pretrained video diffusion model into a single-pass depth regressor.

Metric	Value
Performance	State-of-the-art zero-shot results across benchmarks
Data efficiency	Trained on 367K frames, about 163× less data than the leading discriminative model (VDA: 60M frames)
Base model	WanV2.1 video diffusion model
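
As a quick sanity check on the data-efficiency figure, the ratio can be recomputed from the two frame counts quoted in the table. The variable names below are illustrative only.

```python
# Frame counts as quoted in the table above.
vda_frames = 60_000_000   # VDA: 60M frames
dvd_frames = 367_000      # DVD: 367K frames

# How many times more training frames VDA used than DVD.
ratio = vda_frames / dvd_frames
print(f"VDA used about {ratio:.1f}x more frames than DVD")
```

The result rounds to the 163× stated in the table.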

The full training pipeline is released: a complete training suite for SOTA video depth estimation.

Item	Details
GitHub	https://github.com/EnVision-Research/DVD
Pre-trained weights	Download from HuggingFace (ckpt/ directory)
Inference results	inference_results/: depth maps and visualizer videos
Config	train_config/: YAML hyperparameters
Core code	`diffsynth/pipelines/wan_video_new_determine.py`

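Item	Status
Code	✅ github.com/EnVision-Research/DVD (Apache-2.0)
Model	✅ Released on HuggingFace (CC-BY-NC-4.0, non-commercial)
Data	367K training frames (public availability to be confirmed)
License	Code: Apache-2.0 / Model: CC-BY-NC-4.0 (non-commercial)
Requirements	Based on WanV2.1, so an A100 80GB is presumably recommended (an estimate); inference on an RTX 4090 may be possible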

Research directly relevant to our depth estimation project:

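Deterministic diffusion-to-depth conversion: earlier diffusion-based depth models relied on stochastic sampling, which caused consistency problems; DVD addresses this at the root with single-pass regression. The same approach could carry over to bit depth expansion, as a way to guarantee deterministic output in diffusion-based HDR restoration.
163× data efficiency: exploiting the deep geometric prior of a video foundation model reaches SOTA with a small amount of data. In our depth estimation project, this could serve as a transfer-learning strategy for the shortage of HDR training data.
LMR (Latent Manifold Rectification): this technique for preventing over-smoothing is directly applicable to removing false contours in bit depth expansion. Its differential constraints preserve gradient edges while removing quantization noise.
Global Affine Coherence: achieves temporal consistency over long videos without complex alignment → essential for consistent depth/HDR estimation across an entire video.
Also relevant to our rendering research project: depth estimation is key to initializing Gaussian Splatting reconstruction. DVD's high-quality zero-shot depth can serve as a geometric prior for sparse-view 3DGS reconstruction.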
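Code and model are both public (code under Apache-2.0) → experiments can start immediately. The model is CC-BY-NC, so research (non-commercial) use is permitted with attribution, but commercial use is not.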
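
To make the Global Affine Coherence point above more concrete, here is a minimal sketch of stitching per-window depth predictions with a single scale and shift per window. It assumes that "global affine" means each window's depth differs from its neighbor's by one scale `s` and one shift `t`. This is an interpretation for illustration, not DVD's actual code, and the helper names `fit_scale_shift` and `stitch_windows` are hypothetical.

```python
import numpy as np

def fit_scale_shift(src, ref):
    """Least-squares (s, t) such that s * src + t approximates ref."""
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return float(s), float(t)

def stitch_windows(windows, overlap):
    """Align each window to the previous one using their overlapping frames.

    windows: list of arrays shaped (frames, H, W); consecutive windows
             share `overlap` frames (hypothetical layout for this sketch).
    """
    aligned = [windows[0]]
    for w in windows[1:]:
        prev_tail = aligned[-1][-overlap:]          # already-aligned frames
        s, t = fit_scale_shift(w[:overlap], prev_tail)
        aligned.append(s * w + t)                   # one affine map per window
    # Concatenate, dropping the duplicated overlap frames.
    return np.concatenate([aligned[0]] + [a[overlap:] for a in aligned[1:]], axis=0)

# Synthetic check: the second window has the same geometry but a
# different scale and shift.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(10, 4, 4))
win_a = true_depth[:6]
win_b = 2.0 * true_depth[4:] - 0.5
video = stitch_windows([win_a, win_b], overlap=2)
print(video.shape, np.allclose(video, true_depth))
```

If the per-window predictions really are affine-consistent, two scalars per window are enough to make a long video coherent, with no per-pixel or per-frame optimization. On this synthetic data the stitched result should match the reference depth.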