CCTalker: When Compensatory Control Theory Meets 3D Emotional Talking Head Animation

Demos and comparisons with FaceDiffuser, ProbTalk3D, CodeTalker, FaceFormer...

Abstract

Speech-driven 3D talking-head animation often exhibits uncontrolled phenomena--cross-modal misalignment, speech–expression mismatches, and abrupt transitions--that degrade emotional naturalness and user experience. Compensatory Control Theory (CCT) in psychology holds that when individuals perceive a loss of control, they actively seek to restore order to alleviate anxiety and regain psychological balance. Inspired by this insight, we reinterpret these uncontrolled phenomena as "loss-of-control" events and design compensatory mechanisms to restore "control". Based on this perspective, we propose CCTalker, a diffusion-based facial animation framework that emulates the full CCT cycle of "loss-of-control -> compensation -> order restoration" to enhance expressivity in generated animations. Specifically, CCTalker integrates CCT into three collaborative modules. Control–Sense Modeling quantifies the target emotion category and discrete intensity, or infers a continuous control strength from audio dynamics; it monitors deviations in the generated expressions and applies corrective adjustments to ensure visual-emotion alignment. Compensatory Dynamic Enhancement uses dynamic region detection and a Strength Combiner to identify critical vertices, routing them to the EnhancementNetwork to strengthen under-articulated motion and to the SmoothingNetwork to suppress abrupt deformations, thereby recovering lost geometric detail and maintaining animation coherence. Information Order Modeling enforces CCT-based constraints that realign cross-modal emotional cues and restore the temporal synchronization and overall sequence order between speech and facial motion. Experiments on the 3DMEAD and BIWI datasets demonstrate that CCTalker outperforms state-of-the-art baselines, validating the effectiveness of CCT mechanisms for generating natural and expressive 3D facial animations.
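To make the Control–Sense Modeling idea above concrete, the following is a minimal PyTorch sketch of how an emotion category, an optional discrete intensity, and frame-level audio dynamics could be turned into a per-frame gate on the predicted mesh residuals (the gamma_t gate shown in the method figure). The module name, layer sizes, and the way the gate is applied are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class ControlSenseGate(nn.Module):
    # Maps emotion category, optional discrete intensity, and frame-level audio
    # features to a per-frame gate gamma_t in (0, 1) on the mesh residuals.
    def __init__(self, num_emotions=8, num_levels=3, audio_dim=768, hidden=128):
        super().__init__()
        self.emo_emb = nn.Embedding(num_emotions, hidden)    # target emotion category
        self.lvl_emb = nn.Embedding(num_levels, hidden)      # discrete emotion intensity
        self.audio_proj = nn.Linear(audio_dim, hidden)       # continuous strength from audio dynamics
        self.gate = nn.Sequential(
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, audio_feat, emo_id, lvl_id=None):
        # audio_feat: (B, T, audio_dim); emo_id, lvl_id: (B,)
        cond = self.audio_proj(audio_feat) + self.emo_emb(emo_id)[:, None, :]
        if lvl_id is not None:
            cond = cond + self.lvl_emb(lvl_id)[:, None, :]
        return self.gate(cond)                                # (B, T, 1)

# Corrective adjustment: re-weight the predicted per-frame mesh residuals so that
# frames whose expression drifts from the target emotion are scaled back.
gate = ControlSenseGate()
audio_feat = torch.randn(2, 100, 768)                         # e.g. frame-level HuBERT features
emo_id = torch.tensor([0, 3])
gamma = gate(audio_feat, emo_id)                              # (2, 100, 1)
residual = torch.randn(2, 100, 5023 * 3)                      # predicted vertex offsets (vertex count is an example)
adjusted = gamma * residual                                   # gated residuals added to the template mesh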



Proposed Method

CCTalker


Illustration of the proposed CCTalker. The left half shows a diffusion-based denoiser, conditioned on HuBERT audio features, a style embedding, and a timestep embedding (MLP), built on a Transformer+GRU backbone. The right half shows the three CCT-inspired modules that implement the "loss-of-control -> compensation -> order restoration" cycle: (1) Control–Sense Modeling gates mesh residuals via $\gamma_t$; (2) Compensatory Dynamic Enhancement uses per-vertex speed masks and Strength Combiner weights to route vertices through high-frequency enhancement or low-frequency smoothing; (3) Information Order Modeling enforces semantic alignment, temporal coherence, and physical plausibility.
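As a rough illustration of the routing described in the caption, the PyTorch sketch below computes per-vertex speeds, feeds them to a Strength Combiner to obtain routing weights, and blends an enhancement branch with a smoothing branch. The network shapes, the residual blending, and the convention that fast vertices lean toward smoothing are assumptions for illustration, not the paper's exact definitions.

import torch
import torch.nn as nn

class DynamicRouter(nn.Module):
    # Routes each vertex between an enhancement branch and a smoothing branch
    # according to its speed, with a Strength Combiner predicting the blend weight.
    def __init__(self, vert_dim=3, hidden=64):
        super().__init__()
        self.enhance = nn.Sequential(                          # strengthens under-articulated motion
            nn.Linear(vert_dim, hidden), nn.SiLU(), nn.Linear(hidden, vert_dim))
        self.smooth = nn.Sequential(                           # suppresses abrupt deformations
            nn.Linear(vert_dim, hidden), nn.SiLU(), nn.Linear(hidden, vert_dim))
        self.combiner = nn.Sequential(                         # Strength Combiner: speed -> weight in (0, 1)
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, verts):
        # verts: (B, T, V, 3) predicted vertex trajectories over T frames
        speed = (verts[:, 1:] - verts[:, :-1]).norm(dim=-1)    # (B, T-1, V) per-vertex speed
        speed = torch.cat([speed[:, :1], speed], dim=1)        # pad first frame to length T
        w = self.combiner(speed.unsqueeze(-1))                  # (B, T, V, 1) routing weight
        # Assumed convention: fast (abrupt) vertices lean toward smoothing,
        # slow (under-articulated) vertices toward enhancement.
        return verts + w * self.smooth(verts) + (1.0 - w) * self.enhance(verts)

router = DynamicRouter()
out = router(torch.randn(2, 100, 5023, 3))                     # refined vertex sequence, same shape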

BibTeX

@article{2025cctalker,
  title={CCTalker: When Compensatory Control Theory Meets 3D Emotional Talking Head Animation},
  author={},
  journal={arXiv preprint arXiv},
  year={2025}
}