With the rapid development of cross-modal generative models, video-to-audio generation has become an active research direction. Some previous works process the video as a whole, which loses dynamic features and results in coarse-grained generation, while others use autoregressive models that generate audio for each video frame and then splice the segments together, which is inefficient and can produce discontinuous audio. In this work, we propose Flow-Audio-Synth, a model for video-guided audio generation in the latent space. We build a robust and efficient model by: 1) combining the CLIP image encoder with an optical flow algorithm, capturing both the rich static semantics and the dynamic features of videos to enable higher-fidelity, fine-grained audio generation; 2) using slow and fast channels to capture both overall scene changes and dynamic features over localized time windows; 3) generating mel-spectrograms rather than raw waveforms, improving the efficiency and accuracy of the model. Evaluation results show that Flow-Audio-Synth outperforms state-of-the-art (SOTA) methods in terms of both correlation and fidelity. Moreover, ablation experiments show that the optical flow algorithm and the slow and fast channels both improve the model.
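To make the conditioning pipeline described above concrete, the following is a minimal sketch (not the authors' code) of extracting static semantics with a CLIP image encoder and dynamic features with optical flow, sampled on a slow and a fast temporal stride. The specific strides, the Farneback flow variant, the mean-magnitude summary, and the function names are illustrative assumptions.

```python
# Hypothetical sketch of video conditioning: CLIP features (static semantics)
# plus optical-flow statistics (dynamics), on slow/fast temporal strides.
import cv2
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_features(frames_rgb):
    """Static semantics: one CLIP image embedding per sampled frame."""
    inputs = processor(images=list(frames_rgb), return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs)  # (T_slow, 512)

def flow_features(frames_rgb):
    """Dynamics: mean optical-flow magnitude between consecutive frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames_rgb]
    mags = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mags.append(mag.mean())
    return torch.tensor(mags, dtype=torch.float32)  # (T_fast - 1,)

def video_condition(frames_rgb, slow_stride=8, fast_stride=2):
    """Slow channel: sparse frames for scene-level semantics.
    Fast channel: densely sampled frames for local motion."""
    slow = clip_features(frames_rgb[::slow_stride])
    fast = flow_features(frames_rgb[::fast_stride])
    return slow, fast
```

In such a setup, `frames_rgb` would be a list of H x W x 3 uint8 frames, and the slow/fast outputs would serve as the conditioning signals for the latent audio generator.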
[Figure: qualitative comparison of GT, Ours, MM-Diffusion, and IM2WAV across five video examples.]