AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors

Xiaozhen Qiao1,* Wenjia Wang2,*,† Zhiyuan Zhao1 Jiacheng Sun3 Ping Luo2 Hongyuan Zhang1,2,‡ Xuelong Li1,‡

1 Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China 2 University of Hong Kong 3 Huawei Technologies Co., Ltd.

* Equal Contribution. † Project Lead. ‡ Corresponding Authors.

Teaser image
\textbf{(a)} AHAP achieves 180$\times$ speedup over optimization-based HSfM while maintaining competitive accuracy. \textbf{(b)} Results on EgoHumans. \textbf{(c)} Results on EgoExo4D.

Abstract

Reconstructing 3D humans from images captured from multiple perspectives typically requires pre-calibration, e.g., with checkerboards or multi-view stereo (MVS) algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present \textbf{AHAP} (Reconstructing \textbf{A}rbitrary \textbf{H}umans from \textbf{A}rbitrary \textbf{P}erspectives), a feed-forward framework that reconstructs arbitrary humans from arbitrary camera perspectives without requiring camera calibration. Our core idea is the effective fusion of multi-view geometry to assist human association, reconstruction, and localization. Specifically, a cross-view identity association module resolves human identities across views through learnable person queries and soft assignment, supervised with a contrastive learning objective. A human head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses that enforce body-pose consistency. In addition, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, enabling more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180$\times$ faster than optimization-based approaches.
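The abstract describes matching detections to shared identities via learnable person queries with a differentiable soft assignment. As a rough illustration of that idea only (the function name, shapes, and temperature `tau` are our assumptions, not the paper's implementation), a temperature-scaled softmax over cosine similarities yields a row-stochastic assignment matrix:

```python
import numpy as np

def soft_assign(queries, tokens, tau=0.1):
    """Soft assignment of per-view detection tokens to shared person queries.

    queries: (Q, D) array of learnable person queries (one per identity)
    tokens:  (N, D) array of human features detected in one view
    Returns an (N, Q) row-stochastic matrix: each detection distributes
    probability mass over identities, so the matching stays differentiable
    and can be supervised with a contrastive objective.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ q.T / tau                      # temperature-scaled cosine similarity
    sim -= sim.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    w = np.exp(sim)
    return w / w.sum(axis=1, keepdims=True)  # softmax over identities
```

At training time such a matrix can be pushed toward one-hot rows by a contrastive loss; at inference a hard assignment is recovered with an argmax per row.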

Method Overview

Method overview
\textbf{Overall pipeline of AHAP.} Given multi-view images, the scene encoder estimates scene geometry and camera poses, while the human encoder extracts human-centric features. Our cross-view identity association module matches the same person across views via learnable queries. The human head fuses scene tokens, aggregated tokens, and reference-view tokens through a multi-view fusion decoder to predict SMPL parameters. Finally, we align the human and scene point clouds via scale alignment and multi-view triangulation for precise human localization.
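The final localization step relies on multi-view triangulation. A minimal sketch of standard linear (DLT) triangulation is shown below; it is a generic textbook method, not the paper's exact procedure, and the function name and interface are our assumptions:

```python
import numpy as np

def triangulate_point(projs, points_2d):
    """Linear (DLT) triangulation of one 3D point from N calibrated views.

    projs:     list of 3x4 projection matrices P_i = K_i [R_i | t_i]
    points_2d: list of (x, y) observations in normalized image coordinates
    Returns the least-squares 3D point in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(projs, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X:
        # x * (P[2] @ X) = P[0] @ X  and  y * (P[2] @ X) = P[1] @ X
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With two or more views, this removes the depth ambiguity a single camera cannot resolve, which is why a multi-view setup can localize humans more precisely than monocular prediction.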

Video Demonstrations

Demo 1.

Demo 2.

Qualitative Results

Qualitative results overview
\textbf{Qualitative results.} Visualization of human-scene reconstruction on EgoHumans and EgoExo4D. AHAP produces accurate human meshes within reconstructed scenes, maintaining consistent identity association across views.

BibTeX

@misc{qiao2026ahap,
  title      = {AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors},
  author     = {Xiaozhen Qiao and Wenjia Wang and Zhiyuan Zhao and Jiacheng Sun and Ping Luo and Hongyuan Zhang and Xuelong Li},
  year       = {2026}
}