Investigating Personalization Methods in Text-to-Music Generation

Manos Plitsis*1,2
Theodoros Kouzelis*1
Georgios Paraskevopoulos1
Vassilis Katsouros1
Yannis Panagakis1,3

1Institute of Language and Speech Processing, Athena Research Center, GR
2Department of Informatics and Telecommunications, University of Athens, GR
3Archimedes Unit, Athena Research Center, GR

[ArXiv]
[Code] [Dataset]

Abstract

In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.



Text-to-Audio Personalization

We start from a pre-trained text-to-audio diffusion model, AudioLDM, and, to the best of our knowledge, are the first to investigate personalizing its outputs for newly learned musical concepts in a few-shot manner. Motivated by the computer vision literature, we explore the application of two established methods, Textual Inversion and DreamBooth, which we adapt to music personalization with a set of audio-specific data augmentations. We evaluate the capacity of the model to learn new concepts along two dimensions: reconstruction, i.e. the ability to faithfully reconstruct input examples, and editability, i.e. the ability to manipulate the generation through textual prompts. To this end, we construct a new dataset covering various instruments and playing styles. Our evaluation protocol consists of a) embedding distance-based metrics, b) music-specific metrics, and c) an A/B-testing user study comparing the two adaptation approaches. Finally, we adapt AudioLDM to perform text-guided style transfer for newly learned concepts. Our key contributions are a) the personalization of AudioLDM's generation and style-transfer abilities for new concepts, b) the exploration of audio-specific augmentations and evaluation metrics, and c) the construction of a new dataset for evaluating text-to-music personalization methods.
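To give a concrete sense of the audio-specific augmentations, the sketch below shows a typical augmentation chain built with the audiomentations library. The choice of transforms and parameter ranges here is an illustrative assumption, not the exact configuration used in our experiments.

```python
import numpy as np
from audiomentations import Compose, Gain, PitchShift, Shift, TimeStretch

# Illustrative augmentation chain (assumed transforms/ranges, not the exact
# configuration from the paper): each transform fires with probability p,
# producing varied views of the few available training clips.
augment = Compose([
    Gain(p=0.5),                                           # random gain (default dB range)
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # small pitch perturbation
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),        # tempo perturbation
    Shift(p=0.5),                                          # random shift in time
])

# Stand-in clip: 5 seconds of noise at 16 kHz in place of a real training example.
samples = np.random.uniform(-1.0, 1.0, 16000 * 5).astype(np.float32)
augmented = augment(samples=samples, sample_rate=16000)
```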



Main Results

Concept Reconstruction Examples

Given a musical concept, the objective of personalization methods is to inject it into the model such that it can be reconstructed with a unique identifier, e.g. S*.

In the following we demonstrate:

(a) Reference Audio: an audio file from the concept's training set

(b), (c) Audio clips generated by the personalized models (DreamBooth and Textual Inversion, respectively) with a prompt of the form "A recording of a S*"

(d) An audio clip generated by the base model, using a prompt that roughly describes the concept



| (a) Reference Audio (concept) | (b) DreamBooth | (c) Textual Inversion | (d) AudioLDM (base model) |
| --- | --- | --- | --- |
| Amen break | "A recording of a S*" | "A recording of a S*" | "A recording of an Amen break" |
| Oud | "A recording of a S*" | "A recording of a S*" | "A recording of an oud, a traditional turkish string instrument" |
| Eminem rapping | "A recording of a S*" | "A recording of a S*" | "A recording of Eminem rapping" |
| Hip hop drum beat | "A recording of a S*" | "A recording of a S*" | "A recording of a hip hop drum beat" |
| Polyphonic singing from Epirus | "A recording of S*" | "A recording of a S*" | "A recording of polyphonic singing from Epirus" |

Editability Examples

Here we demonstrate the ability of the personalized models to manipulate learned concepts through different text prompts.

| (a) Reference Audio (concept) | (b) DreamBooth | (c) Textual Inversion | (d) AudioLDM (base model) |
| --- | --- | --- | --- |
| Oud | "A disco song with a S*" | "A disco song with a S*" | "A recording of a disco song with a turkish traditional oud melody" |
| Free jazz saxophone | "A recording of a S* song with a hip hop drum beat accompaniment" | "A recording of a S* song with a hip hop drum beat accompaniment" | "A recording of a free jazz saxophone solo with a hip hop drum beat" |
| Eminem rapping | "A recording of a S* song with a jazz drum beat accompaniment" | "A recording of a S* song with a jazz drum beat accompaniment" | "A recording of the rapper Eminem with a jazz drum beat accompaniment" |
| Amen break | "A jazz song with a S*" | "A jazz song with a S*" | "A jazz song with an Amen break drum beat" |
| Hip hop drum beat | "A recording of a S* in a cathedral" | "A recording of a S* in a cathedral" | "A recording of a hip hop drum beat in a cathedral" |

Personalized Style Transfer

Below are some examples of personalized style transfer. By adding noise to the input audio and denoising it through a shallow reverse diffusion process, conditioned on a learned concept via the textual prompt S*, we can transfer the concept's style to the input audio.

[Audio examples omitted: (a) Input Audio, (b) Target Concept, (c) Style Transfer.]
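The sketch below illustrates this procedure at the level of a generic latent diffusion model, using the diffusers DDIMScheduler. Here `vae`, `unet`, and `encode_text` are stand-ins for the corresponding AudioLDM components (whose exact conditioning interface differs), so treat it as a schematic rather than our exact implementation.

```python
import torch
from diffusers import DDIMScheduler

# Minimal sketch of shallow-reverse-process (SDEdit-style) transfer, assuming
# a latent diffusion model personalized on the concept S*. `vae`, `unet`, and
# `encode_text` are stand-in components with assumed generic interfaces.

@torch.no_grad()
def style_transfer(mel, vae, unet, encode_text,
                   prompt="A recording of a S*",
                   strength=0.5, num_steps=50, guidance_scale=3.0):
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)

    # Encode the input audio (as a mel spectrogram) into the VAE latent space.
    latents = vae.encode(mel).latent_dist.sample() * vae.config.scaling_factor

    # Noise the latents only up to an intermediate timestep: the larger
    # `strength`, the more the concept's style overwrites the input.
    init_step = max(1, int(num_steps * strength))
    t_start = scheduler.timesteps[-init_step]
    latents = scheduler.add_noise(latents, torch.randn_like(latents), t_start)

    cond, uncond = encode_text(prompt), encode_text("")
    for t in scheduler.timesteps[-init_step:]:
        # Classifier-free guidance: blend conditional and unconditional scores.
        eps_c = unet(latents, t, encoder_hidden_states=cond).sample
        eps_u = unet(latents, t, encoder_hidden_states=uncond).sample
        eps = eps_u + guidance_scale * (eps_c - eps_u)
        latents = scheduler.step(eps, t, latents).prev_sample

    # Decode back to a mel spectrogram (a vocoder then renders the waveform).
    return vae.decode(latents / vae.config.scaling_factor).sample
```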

Can the methods learn musical properties?

Here we demonstrate the ability of the methods to learn a concept's tempo, harmonic properties such as scale and key, and dynamics. While both methods (and especially DreamBooth) can, to some extent, maintain the original tempo and key, they fail to maintain extreme dynamics, as the model tends to normalize the loudness of generated audio.

All features were extracted automatically using Essentia.
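The snippet below sketches how such features can be extracted with Essentia's standard extractors; the file name is a placeholder, and the extractors and settings shown are common defaults rather than the paper's exact configuration.

```python
import essentia.standard as es

# "clip.wav" is a placeholder path; the extractors use common defaults.
audio = es.MonoLoader(filename="clip.wav", sampleRate=44100)()

# Tempo (BPM) via the RhythmExtractor2013 beat tracker.
bpm, beats, beats_conf, _, intervals = es.RhythmExtractor2013(method="multifeature")(audio)

# Key and scale, e.g. ("A", "major").
key, scale, strength = es.KeyExtractor()(audio)

# Integrated loudness (LUFS) via the EBU R128 meter (expects a stereo signal).
stereo, sr, _, _, _, _ = es.AudioLoader(filename="clip.wav")()
_, _, integrated_lufs, _ = es.LoudnessEBUR128()(stereo)

print(f"{bpm:.0f} BPM | {key} {scale} | {integrated_lufs:.2f} LUFS")
```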

| Property | (a) Reference Audio | (b) DreamBooth | (c) Textual Inversion |
| --- | --- | --- | --- |
| Tempo | 118 BPM | 122 BPM | 160 BPM |
| Scale and key | A major | D major | Bb major |
| Dynamics (integrated loudness) | -11.73 LUFS | -23.93 LUFS | -16.32 LUFS |



Additional Examples using AudioLDM2

Here we present additional reconstruction and editability examples for some of the concepts used in the paper, using a different base model, AudioLDM2. All examples were generated with DreamBooth.
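The same sampling interface carries over to AudioLDM2 through diffusers' AudioLDM2Pipeline; as before, the fine-tuned checkpoint path below is a hypothetical placeholder.

```python
import torch
from diffusers import AudioLDM2Pipeline

# Hypothetical local path to an AudioLDM2 checkpoint fine-tuned with
# DreamBooth; the public base model is "cvssp/audioldm2".
pipe = AudioLDM2Pipeline.from_pretrained(
    "./audioldm2-dreambooth-concept", torch_dtype=torch.float16
).to("cuda")

# Editability: place the learned identifier S* in a novel musical context.
audio = pipe(
    "A rock song with a S*",
    negative_prompt="low quality",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]
```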

Each row pairs a reference audio clip (a) with a reconstruction (b) and an edited generation (c), using the prompts below:

| (b) Reconstruction | (c) Editability |
| --- | --- |
| "A recording of a S*" | "A rock song with a S*" |
| "A recording of a S*" | "A techno song with a S*" |
| "A recording of a S*" | "A heavy metal song with a S*" |
| "A recording of a S*" | "A drum'n'bass song with a S*" |
| "A recording of a S*" | "An industrial gabber techno song with a S*" |
| "A recording of a S*" | "A rock song with a S* bass riff" |
| "A recording of a S*" | "A techno track with S*" |
| "A recording of a S*" | "A drum'n'bass beat with S* in the background" |
| "A recording of a S*" | "A recording of S* in heavy metal style" |