Democratizing High-Fidelity Co-Speech Gesture Video Generation

Xu Yang1*‡, Shaoli Huang2*, Shenbo Xie1*, Xuelin Chen2, Yifei Liu1, Changxing Ding1✝
1South China University of Technology    2Tencent AI Lab   
*Equal Contribution   
✝Corresponding Author    ‡Part of this work was done during an internship at Tencent AI Lab.

Showcases

Click the video to view it in full screen.

Abstract

Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405—the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method outperforms state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts. Code, models, and CSG-405 will be publicly released.

Method


Overview of our co-speech gesture video generation framework. We concatenate the 2D skeleton of the reference image R with the noisy skeleton sequence xT along the frame dimension, providing the speaker's body-shape cue. We then concatenate the skeleton embeddings and the audio-segment embeddings along the feature dimension as the input to the diffusion model, enforcing strict temporal synchronization. Finally, we employ an off-the-shelf human video generation model to produce the co-speech gesture video V, using the synthesized skeleton sequence as an auxiliary condition.
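
Below is a minimal PyTorch-style sketch of how the conditioning described above could be assembled. The tensor shapes, embedding widths, and the Transformer stand-in for the diffusion backbone are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

B, T, J, C = 2, 64, 133, 2          # batch, frames, 2D keypoints, (x, y) -- assumed sizes
D_SKEL, D_AUDIO = 256, 256          # embedding widths -- assumed

skel_embed  = nn.Linear(J * C, D_SKEL)     # per-frame skeleton embedding (assumed)
audio_embed = nn.Linear(1024, D_AUDIO)     # per-segment audio-feature embedding (assumed)
denoiser    = nn.Transformer(d_model=D_SKEL + D_AUDIO, batch_first=True)  # stand-in for the diffusion backbone
proj_out    = nn.Linear(D_SKEL + D_AUDIO, J * C)

def denoise_step(x_t, ref_skeleton, audio_feats):
    """One denoising step over a noisy skeleton sequence x_t.

    x_t:          (B, T, J, C)  noisy skeleton sequence
    ref_skeleton: (B, 1, J, C)  2D skeleton extracted from the reference image R
    audio_feats:  (B, T, 1024)  fine-grained per-frame audio features (assumed extractor)
    """
    # 1) Concatenate the reference skeleton with the noisy sequence along the frame
    #    dimension, giving the model the speaker's body-shape cue.
    x = torch.cat([ref_skeleton, x_t], dim=1)                # (B, T+1, J, C)

    # 2) Embed skeletons and audio, then concatenate along the feature dimension so
    #    each frame's skeleton token is paired with its audio segment (strict sync).
    s = skel_embed(x.flatten(2))                             # (B, T+1, D_SKEL)
    a = audio_embed(audio_feats)                             # (B, T,   D_AUDIO)
    a = torch.cat([torch.zeros_like(a[:, :1]), a], dim=1)    # pad audio for the reference frame
    h = torch.cat([s, a], dim=-1)                            # (B, T+1, D_SKEL + D_AUDIO)

    # 3) Predict the denoised skeletons; the reference slot is dropped from the output.
    out = proj_out(denoiser(h, h))                           # (B, T+1, J * C)
    return out[:, 1:].reshape(x_t.shape)                     # (B, T, J, C)

# Toy usage: one denoising step on random inputs.
pred = denoise_step(torch.randn(B, T, J, C), torch.randn(B, 1, J, C), torch.randn(B, T, 1024))

At inference, the fully denoised skeleton sequence and the reference image are passed to an off-the-shelf pose-conditioned human video generation model (not sketched here) to render the final video V.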

CSG-405 Database Introduction


Table 1. Statistical comparison between our CSG-405 dataset and existing public datasets for co-speech gesture video generation.


Figure 2. More details of CSG-405. (a) Proportion of clips for each speech type. (b) Distribution of speaker attributes: gender, ethnicity, age, and emotion.

CSG-405 Data Construction Pipeline

Figure 3. The CSG-405 data collection and filtering pipeline.

We collected a large amount of high-resolution video data from the Internet and applied meticulous filtering at every stage, ultimately obtaining the CSG-405 dataset, which contains 405 hours of high-resolution videos across 71 speech types.
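
As a rough illustration of this filtering stage, the sketch below applies a few example quality checks to candidate clips. The specific criteria and thresholds (minimum resolution, clip duration, a single visible speaker) are hypothetical and do not reproduce the exact filters used to build CSG-405.

from dataclasses import dataclass

@dataclass
class ClipMeta:
    path: str
    height: int             # vertical resolution in pixels
    duration_s: float       # clip length in seconds
    num_people: int         # people detected by an off-the-shelf detector (assumed)

def keep_clip(clip: ClipMeta,
              min_height: int = 720,          # hypothetical "high-resolution" threshold
              min_duration_s: float = 3.0,    # hypothetical minimum clip length
              max_duration_s: float = 60.0) -> bool:
    """Return True if the clip passes the illustrative quality filters."""
    if clip.height < min_height:
        return False
    if not (min_duration_s <= clip.duration_s <= max_duration_s):
        return False
    if clip.num_people != 1:                  # keep clips showing a single speaker
        return False
    return True

# Usage: filter a list of candidate clips gathered from the Internet.
candidates = [ClipMeta("clip_001.mp4", 1080, 12.4, 1),
              ClipMeta("clip_002.mp4", 360, 8.0, 1)]
kept = [c for c in candidates if keep_clip(c)]   # only clip_001 survives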

Comparisons on CSG-405

Click the video to view it in full screen.

Comparisons on AI-Generated Images

Click the video to view it in full screen.

Comparisons on PATS

Click the video to view it in full screen.

Comparisons on TED-talks

Click the video to view it in full screen.

Comparisons on CyberHost

Click the video to view it in full screen.

Comparisons on VLOGGER

Click the video to view it in full screen.

Ablation study

Click the video to view it in full screen.

BibTeX


@inproceedings{yang2025demo,
    title={Democratizing High-Fidelity Co-Speech Gesture Video Generation},
    author={Xu Yang and Shaoli Huang and Shenbo Xie and Xuelin Chen and Yifei Liu and Changxing Ding},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year={2025}
}