Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. The task is challenging due to the inherent one-to-many mapping between audio and visual content, and is further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that uses 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals and visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body-shape consistency. The generated skeletons are then fed, together with the speaker's reference image, into an off-the-shelf human video generation model to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset of its kind, with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and covering diverse speaker demographics. Experiments show that our method outperforms state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts. Code, models, and CSG-405 will be publicly released.
Overview of our co-speech gesture video generation framework. We concatenate the 2D skeleton of the reference image R with the noisy skeleton sequence x_T along the frame dimension, providing the body-shape cue of the speaker. We then concatenate the skeleton embeddings with the audio-segment embeddings along the feature dimension as the input of the diffusion model, enforcing strict temporal synchronization. Finally, we employ an off-the-shelf human video generation model to produce the co-speech gesture video V, with the synthesized skeleton sequence as an auxiliary condition.
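To make the conditioning scheme above concrete, here is a minimal PyTorch sketch of how the diffusion-model input could be assembled. The tensor shapes, the 1024-dim audio features, and the linear embedders `skel_embed` / `audio_embed` are illustrative assumptions, not the actual modules used in the paper.

```python
import torch
import torch.nn as nn

# Assumed sizes (illustrative only): J keypoints with (x, y) coordinates,
# T noisy frames, and embedding widths D_SKEL / D_AUDIO.
J, T, D_SKEL, D_AUDIO = 133, 64, 256, 256

skel_embed = nn.Linear(J * 2, D_SKEL)     # placeholder per-frame skeleton embedder
audio_embed = nn.Linear(1024, D_AUDIO)    # placeholder per-segment audio embedder

ref_skeleton = torch.randn(1, 1, J, 2)    # 2D skeleton of the reference image R
noisy_seq = torch.randn(1, T, J, 2)       # noisy skeleton sequence x_T
audio_feats = torch.randn(1, T, 1024)     # fine-grained audio features, one per frame

# 1) Concatenate the reference skeleton with the noisy sequence along the
#    frame dimension, providing the speaker's body-shape cue.
skel_in = torch.cat([ref_skeleton, noisy_seq], dim=1)             # (1, T+1, J, 2)

# 2) Embed both streams, then concatenate along the feature dimension so that
#    skeleton frame t is fused with audio segment t (strict temporal alignment).
skel_tokens = skel_embed(skel_in.flatten(2))                      # (1, T+1, D_SKEL)
audio_tokens = audio_embed(audio_feats)                           # (1, T,   D_AUDIO)
# Zero-pad one audio token for the prepended reference frame (an assumption;
# how this position is handled is not specified here).
audio_tokens = torch.cat([torch.zeros(1, 1, D_AUDIO), audio_tokens], dim=1)
diffusion_input = torch.cat([skel_tokens, audio_tokens], dim=-1)  # (1, T+1, D_SKEL+D_AUDIO)

# The denoised skeleton sequence would then be passed, together with the
# reference image, to an off-the-shelf human video generation model to
# render the final co-speech gesture video V.
```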
Table 1. Statistical comparison between our CSG-405 dataset and existing public datasets for co-speech gesture video generation.
Figure 2. Further details of CSG-405. (a) Proportion of clips for each speech type. (b) Distributions of speaker attributes: gender, ethnicity, age, and emotion.
We collected a substantial amount of high-resolution video data from the Internet and applied meticulous filtering throughout the process, ultimately obtaining the CSG-405 dataset, which contains 405 hours of high-resolution videos spanning 71 speech types.
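For orientation, the snippet below sketches how a CSG-405 clip and its 2D skeleton annotation might be loaded. The directory layout and file names (`video.mp4`, `audio.wav`, `skeleton.json`, `meta.json`) are hypothetical placeholders; the actual release format may differ.

```python
import json
from pathlib import Path

def load_clip(clip_dir: str) -> dict:
    """Load one clip under a hypothetical CSG-405 layout (illustrative only)."""
    clip = Path(clip_dir)
    skeleton = json.loads((clip / "skeleton.json").read_text())  # per-frame 2D full-body keypoints
    meta = json.loads((clip / "meta.json").read_text())          # e.g. speech type, gender, age, emotion
    return {
        "video": clip / "video.mp4",   # high-resolution speaker video
        "audio": clip / "audio.wav",   # corresponding speech audio
        "skeleton": skeleton,
        "meta": meta,
    }
```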
@inproceedings{yang2025demo,
title={Democratizing High-Fidelity Co-Speech Gesture Video Generation},
author={Xu Yang and Shaoli Huang and Shenbo Xie and Xuelin Chen and Yifei Liu and Changxing Ding},
booktitle={Proceedings of the 2025 International Conference on Computer Vision (ICCV)},
year={2025}
}