AnyTop: Character Animation Diffusion with Any Topology

Anonymous Author(s)

AnyTop generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input.

Abstract

Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing.

Overview

The input to AnyTop is a noised motion \(X_t\) and the skeleton \(\mathcal{S} = \{ \mathcal{P}_\mathcal{S}, \mathcal{R}_\mathcal{S}, \mathcal{D}_\mathcal{S}, \mathcal{N}_\mathcal{S} \}\), where \(\mathcal{P}_\mathcal{S}\) is the rest pose, \(\mathcal{R}_\mathcal{S}\) denotes the relations between joints, \(\mathcal{D}_\mathcal{S}\) holds the topological distance between each pair of joints, and \(\mathcal{N}_\mathcal{S}\) denotes the joint names. The Enrichment Block incorporates the skeletal features into the noised motion by concatenating the embedded \(\mathcal{P}_\mathcal{S}\) to the sequence as an additional temporal token and adding a T5-embedded name to each joint. The enriched motion is then passed through a stack of \(L\) Skeletal Temporal Transformer layers. In each layer, we first apply skeletal attention along the joint axis to capture interactions between all joints, incorporating the topology information \(\mathcal{R}_\mathcal{S}\) and \(\mathcal{D}_\mathcal{S}\) into the attention maps. Next, we apply temporal attention along the frame axis. Finally, the output is projected back to the motion feature dimension to reconstruct the motion sequence.
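To make the topological distances \(\mathcal{D}_\mathcal{S}\) concrete, here is a minimal sketch that counts the hops between every pair of joints in a tree-shaped skeleton. The parent-array encoding, the function name `topological_distances`, and the toy skeleton are assumptions made for illustration. This is not necessarily how AnyTop stores or computes these distances.

```python
from collections import deque


def topological_distances(parents):
    """Pairwise hop counts between joints of a tree-shaped skeleton.

    parents[j] is the index of joint j's parent, or -1 for the root
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(parents)

    # Build an undirected adjacency list from the parent array.
    neighbors = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbors[child].append(parent)
            neighbors[parent].append(child)

    # Breadth-first search from every joint; -1 marks "not visited yet".
    dist = [[-1] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            joint = queue.popleft()
            for nxt in neighbors[joint]:
                if dist[source][nxt] < 0:
                    dist[source][nxt] = dist[source][joint] + 1
                    queue.append(nxt)
    return dist


# Toy skeleton: root(0) -> spine(1) -> head(2), and root(0) -> tail(3).
toy_parents = [-1, 0, 1, 0]
toy_dist = topological_distances(toy_parents)
```

A matrix like `toy_dist` has one entry per joint pair, which is the same shape as a skeletal attention map. That makes it a natural candidate for conditioning attention on topology, although the exact mechanism is not spelled out on this page.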

Correspondence in Latent Space

AnyTop's cross-skeleton manifold enables the capture of both \(\textit{spatial}\) and \(\textit{temporal}\) correspondences, as semantically similar body parts and analogous poses across different skeletons exhibit similar latent representations.
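One simple way to read correspondences out of such a latent space is nearest-neighbor matching. The sketch below assigns each target item (a joint for spatial correspondence, a frame for temporal correspondence) to the reference item with the most similar feature vector. The choice of cosine similarity, the function names, and the toy feature vectors are assumptions for illustration. The page does not state which matching rule is actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def match_to_reference(reference_feats, target_feats):
    """For each target item, return the index of its most similar reference item."""
    matches = []
    for target in target_feats:
        sims = [cosine_similarity(target, ref) for ref in reference_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


# Hypothetical 2-D latent features: two reference items, three target items.
reference = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.1, 0.9], [0.8, 0.2], [0.0, 2.0]]
matches = match_to_reference(reference, target)
```

The resulting indices can then be mapped to colors, so that each target joint or frame takes the color of its matched reference item.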

Spatial Correspondence

The monkey (top left) is the reference skeleton, while the fox, scorpion, and bird are target skeletons. Target skeleton joints are color-coded to match their corresponding joints in the reference. For better visualization, we color the bones to match their adjacent joints. Note the correspondence in limbs, spine, and tail.

Temporal Correspondence

The monkey (top row) shows the reference motion, while the crab and lynx show two target motions. The frames of the targets are color-coded to align with their corresponding reference frames. Note the correspondence: aggressive motion segments are pink, idle frames are blue, and transitional frames are green.

In-skeleton Generalization

Generalization within a specific skeleton takes two forms: \(\textit{temporal composition}\), which combines motion segments from dataset instances, and \(\textit{spatial composition}\), which introduces novel poses by combining skeletal parts of ground-truth poses.

Source Motion

The chicken walks

Source Motion

The chicken pecks

Synthesized Motion

The chicken walks and pecks

Cross-skeleton Generalization

A form of generalization expressed through shared motion motifs across different characters, allowing the adaptation of motion behaviors originally performed by other skeletons.

Flamingo

Raptor

Unseen-skeleton Generalization

Zero-shot inference of skeletons not encountered during training.

Cat

Komodo Dragon

Motion Editing

We demonstrate our method's versatility through two motion editing applications: \(\textit{in-betweening}\) for temporal manipulation and \(\textit{body-part editing}\) for spatial modifications.
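A common way to realize such edits with motion diffusion models is to keep part of the input fixed and let the model synthesize the rest. Whether AnyTop follows exactly this recipe is not stated here. The sketch below only shows how the two edits differ in what they keep: in-betweening fixes frames (the temporal axis), while body-part editing fixes joints (the spatial axis). All function names are hypothetical, and each frame-joint entry is reduced to a single number for brevity.

```python
def inbetweening_mask(num_frames, num_joints, prefix, suffix):
    """True = keep the input; False = synthesize.

    Keeps the first `prefix` and last `suffix` frames for all joints.
    """
    return [
        [frame < prefix or frame >= num_frames - suffix] * num_joints
        for frame in range(num_frames)
    ]


def body_part_mask(num_frames, num_joints, edited_joints):
    """True = keep the input; False = synthesize.

    Keeps every joint except those listed in `edited_joints`, in all frames.
    """
    edited = set(edited_joints)
    return [
        [joint not in edited for joint in range(num_joints)]
        for _ in range(num_frames)
    ]


def blend(input_motion, synthesized, keep_mask):
    """Take input values where the mask is True and synthesized values elsewhere."""
    return [
        [inp if keep else syn for inp, syn, keep in zip(row_in, row_syn, row_keep)]
        for row_in, row_syn, row_keep in zip(input_motion, synthesized, keep_mask)
    ]


# Toy example: 5 frames, 2 joints, one scalar per frame-joint entry.
input_motion = [[1, 1] for _ in range(5)]
synthesized = [[9, 9] for _ in range(5)]

temporal_edit = blend(input_motion, synthesized, inbetweening_mask(5, 2, 1, 1))
spatial_edit = blend(input_motion, synthesized, body_part_mask(5, 2, [1]))
```

In an inpainting-style sampler, a blend of this kind would typically be applied at each denoising step, so that the synthesized region stays consistent with the fixed input.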

In-betweening

(Pink=Input, Orange=Synthesis)

Ostrich

Tyrannosaurus

Centipede



Body-part Editing

(Pink=Input, Orange=Synthesis)

Ostrich

Tyrannosaurus

Fire Ant