SIMS: Simulating Stylized Human-Scene Interactions
with Retrieval-Augmented Script Generation
(arXiv 2025)

Wenjia Wang1        Liang Pan1, 2        Zhiyang Dou1        Jidong Mei1
Zhouyingcheng Liao1        Yuke Lou1        Yifan Wu1        Lei Yang2        Jingbo Wang2, †        Taku Komura1, †

1The University of Hong Kong      2Shanghai AI Laboratory    

(†: equal advising.)

Interesting long demos generated by our method
(the emojis and sound effects were added in post-production to improve immersion):
Long-term demo 1:
Long-term demo 2:
Long-term demo 3:

Long-term demo 4:
Long-term demo 5:
Long-term demo 6:

We prompt the LLM again to rewrite the long script of Demo 6 so that it has a happy ending, which demonstrates the story-editing capability of our approach:



Single skill demos:
Carry
Idle
Walk
Sit
Lie



ViconStyle dataset demos (SMPL-X skeleton):
char1_LieDown_Side_01
char1_LieDown_Side_06
char2_Idle_Angry_03
char2_Idle_Drunk_01

char3_Carry_Careful_01
char3_Carry_Tired_02
char3_Idle_Happy_02
char3_Idle_Relax_02

Abstract

Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short in achieving both diverse style and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues, simultaneously perceiving environmental geometries and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotions and interactions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.
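The retrieval step behind retrieval-augmented script generation can be illustrated with a minimal sketch. It is not the SIMS implementation: the toy corpus, the bag-of-words `embed` function (a stand-in for a learned text encoder), and the names `retrieve` and `build_prompt` are all assumptions made for illustration. The idea is simply to rank reference scripts by similarity to a query and place the top matches in the prompt handed to a language model.

```python
import math
from collections import Counter

# Hypothetical toy corpus of reference scripts; not taken from the SIMS dataset.
REFERENCE_SCRIPTS = [
    "He walks tiredly to the sofa and sits down.",
    "She carries the box carefully across the room.",
    "He lies down on the bed after a long day.",
]


def embed(text):
    """Bag-of-words term counts, standing in for a learned text encoder."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=2):
    """Assemble a prompt that places retrieved scripts before the request."""
    lines = ["Reference scripts:"]
    lines += [f"- {example}" for example in retrieve(query, corpus, k)]
    lines.append(f"Write a long-form script for: {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("carries the box", REFERENCE_SCRIPTS, k=1))
```

In a real system the bag-of-words vectors would presumably be replaced by dense text embeddings, and the assembled prompt would be sent to an LLM; both steps are omitted here.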

Pipeline Overview
(Figure: overview of the SIMS pipeline.)

BibTeX

@article{wang2025sims,
  author    = {Wang, Wenjia and Pan, Liang and Dou, Zhiyang and Mei, Jidong and Liao, Zhouyingcheng and Lou, Yuke and Wu, Yifan and Yang, Lei and Wang, Jingbo and Komura, Taku},
  title     = {SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented
    Script Generation},
  journal   = {arXiv},
  year      = {2025},
}