2Department of Informatics and Telecommunications, University of Athens, GR | 3Archimedes Unit, Athena Research Center, GR
[Code] | [Dataset]
In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We examine the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.
We start from a pre-trained text-to-audio diffusion model, i.e. AudioLDM, and, to the best of our knowledge, are the first to investigate the ability to personalize its outputs for newly learned musical concepts in a few-shot manner. Motivated by the computer vision literature, we explore the application of two established methods, i.e. Textual Inversion and DreamBooth. We adapt these methods for music personalization by employing a set of audio-specific data augmentation methods. We evaluate the capacity of the model to learn new concepts along two dimensions: reconstruction, i.e. the ability to faithfully reconstruct input examples, and editability, i.e. the ability to manipulate the generation through textual prompts. To this end, we construct a new dataset of various instruments and playing styles. Our evaluation protocol consists of a) embedding distance-based metrics, b) music-specific metrics, and c) an A/B testing user study comparing the two adaptation approaches. Finally, we adapt AudioLDM to perform text-guided style transfer for newly learned concepts. Our key contributions are a) the personalization of AudioLDM's generation and style-transfer abilities for new concepts, b) the exploration of audio-specific augmentations and evaluation metrics, and c) the construction of a new dataset for text-to-music personalization methods.
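To make the augmentation step concrete, below is a minimal sketch of audio-specific augmentations of the kind mentioned above, implemented with librosa and NumPy. The specific transforms and parameter ranges are illustrative assumptions rather than our exact training configuration.

```python
import numpy as np
import librosa

def augment_clip(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Apply one random audio-specific augmentation to a mono training clip."""
    choice = rng.integers(0, 3)
    if choice == 0:
        # Pitch shift by up to +/- 2 semitones.
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(rng.uniform(-2, 2)))
    elif choice == 1:
        # Time stretch between 0.9x and 1.1x of the original speed.
        y = librosa.effects.time_stretch(y, rate=float(rng.uniform(0.9, 1.1)))
    else:
        # Random gain between -6 dB and +6 dB.
        y = y * 10 ** (rng.uniform(-6, 6) / 20)
    return y

# Usage:
# y, sr = librosa.load("concept_clip.wav", sr=16000)
# y_aug = augment_clip(y, sr, np.random.default_rng(0))
```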
Given a musical concept, the objective of personalization methods is to inject it into the model such that it can be reconstructed with a unique identifier, e.g. S*. In the following we demonstrate: (a) Reference Audio: an audio file from the concept's training set; (b), (c) audio clips generated by the personalized model with the prompt "A recording of a S*"; (d) an audio clip generated by the base model, using a prompt that roughly describes the concept.
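As a rough sketch of how such an identifier can be learned, the Textual Inversion-style training step below optimizes only the embedding of the S* placeholder token against the usual denoising loss, keeping the diffusion model frozen. The text_encoder, unet, and noise_scheduler arguments are placeholders for a generic cross-attention-conditioned latent diffusion model with diffusers-style interfaces; AudioLDM's own conditioning pathway differs in detail, so this is illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(latents, token_ids, placeholder_id, text_encoder,
                           unet, noise_scheduler, optimizer):
    """One optimization step for the S* embedding; all model weights stay frozen,
    only the embedding row of the placeholder token receives gradient updates."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    cond = text_encoder(token_ids)[0]            # prompt embedding containing S*
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample

    loss = F.mse_loss(noise_pred, noise)         # standard denoising objective
    loss.backward()

    # Zero out gradients for every embedding row except the S* placeholder,
    # so only the new concept embedding is updated.
    emb = text_encoder.get_input_embeddings().weight
    mask = torch.zeros_like(emb)
    mask[placeholder_id] = 1.0
    emb.grad = emb.grad * mask

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```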
Here we demonstrate our method's ability to manipulate concepts using different text prompts. |
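As an illustration of what such prompt-based edits look like in practice, the snippet below loads a personalized AudioLDM checkpoint through the diffusers AudioLDMPipeline and renders the learned concept S* in different textual contexts. The checkpoint path is a placeholder; the pipeline class and call signature follow the public diffusers API.

```python
from diffusers import AudioLDMPipeline

# Placeholder path: an AudioLDM pipeline fine-tuned (e.g. with DreamBooth) on the concept S*.
pipe = AudioLDMPipeline.from_pretrained("path/to/personalized-audioldm").to("cuda")

prompts = [
    "A recording of a S*",                       # plain reconstruction
    "A recording of a S* in a fast tempo",       # rhythm edit
    "A recording of a S* with heavy distortion", # timbre / style edit
]
clips = [pipe(p, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
         for p in prompts]
```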
Some examples of personalized style transfer. By running a shallow reverse (denoising) process and conditioning on a learned concept via the textual prompt S*, we can transfer the learned concept's style to the input audio.
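A minimal sketch of this shallow reverse process, assuming diffusers-style scheduler and UNet interfaces (the component names are placeholders and AudioLDM's conditioning details differ slightly): the source audio's latents are noised up to an intermediate timestep and then denoised under the S* text condition, so only part of the diffusion trajectory is re-generated.

```python
import torch

@torch.no_grad()
def style_transfer(input_latents, cond_emb, unet, scheduler, strength=0.6):
    """SDEdit-style transfer: noise the source latents part-way, then denoise them
    under the S* text condition (cond_emb: text-encoder output for a prompt such as
    "A recording of a S*"), keeping the source structure but adopting the concept's style."""
    scheduler.set_timesteps(200)
    timesteps = scheduler.timesteps
    # Skip the earliest (noisiest) part of the schedule: a higher strength adds
    # more noise to the source and re-generates a larger part of the trajectory.
    start = int(len(timesteps) * (1.0 - strength))
    t_start = timesteps[start]

    noise = torch.randn_like(input_latents)
    latents = scheduler.add_noise(input_latents, noise, t_start.unsqueeze(0))

    for t in timesteps[start:]:
        noise_pred = unet(latents, t, encoder_hidden_states=cond_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents  # decode with the VAE and vocoder to obtain audio
```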
Here we demonstrate the ability of the methods to learn a concept's tempo, harmonic properties such as scale and key, and dynamics. While both methods (and especially DreamBooth) can, to some extent, maintain the original tempo and key, they fail to maintain extreme dynamics, as the model has a normalizing effect on audio. All features were extracted automatically using Essentia.
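For reference, features of this kind can be pulled out with Essentia's standard-mode algorithms as sketched below; the choice of RhythmExtractor2013, KeyExtractor, and DynamicComplexity is a reasonable reading of "tempo, key/scale, and dynamics" rather than our exact extractor configuration.

```python
import essentia.standard as es

def describe_clip(path: str) -> dict:
    """Extract tempo, key/scale, and a dynamics estimate from an audio file."""
    audio = es.MonoLoader(filename=path, sampleRate=44100)()

    bpm, _, _, _, _ = es.RhythmExtractor2013(method="multifeature")(audio)
    key, scale, strength = es.KeyExtractor()(audio)
    dyn_complexity, loudness = es.DynamicComplexity()(audio)

    return {"bpm": bpm, "key": key, "scale": scale,
            "key_strength": strength,
            "dynamic_complexity": dyn_complexity,
            "loudness_db": loudness}

# Usage: print(describe_clip("generated_clip.wav"))
```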
Here we present additional reconstruction and editability examples for some of the concepts used in the paper, this time using a different base model, AudioLDM2. All examples were generated with DreamBooth.
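On the generation side, switching to AudioLDM2 is essentially a drop-in change in diffusers; the sketch below assumes a DreamBooth-finetuned checkpoint at a placeholder path and uses the public AudioLDM2Pipeline interface.

```python
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Placeholder path: an AudioLDM2 pipeline fine-tuned with DreamBooth on the concept S*.
pipe = AudioLDM2Pipeline.from_pretrained("path/to/dreambooth-audioldm2").to("cuda")

audio = pipe("A recording of a S*",
             negative_prompt="low quality",
             num_inference_steps=200,
             audio_length_in_s=10.0).audios[0]

# AudioLDM2 generates waveforms at 16 kHz.
scipy.io.wavfile.write("reconstruction.wav", rate=16000, data=audio)
```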