Palindromic-VC

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data
Overview figure
Abstract

How It Works

During pretraining, we use KNN over WavLM features to align triplets consisting of synthetic features $\hat{B}_1\in\mathbb{R}^{1024\times{L}_1}$, real reference features $A_2\in\mathbb{R}^{1024\times{L}_2}$, and a real target waveform segment $a_1\in\mathbb{R}^{1\times{l}_1}$.
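To make the KNN step above more concrete, here is a minimal NumPy sketch of frame-level nearest-neighbour matching. It assumes, in the style of kNN-based conversion, that each frame of one utterance is replaced by the mean of its closest frames from a pool belonging to another speaker. The function name `knn_match`, the cosine similarity metric, and `k=4` are illustrative assumptions, not details stated here.

```python
import numpy as np

def knn_match(query, pool, k=4):
    """Replace each query frame with the mean of its k nearest pool frames.

    query: (D, Lq) feature matrix, pool: (D, Lp) feature matrix.
    Cosine similarity and k=4 are assumptions made for illustration.
    """
    # Normalise every frame (column) to unit length for cosine similarity.
    q = query / (np.linalg.norm(query, axis=0, keepdims=True) + 1e-8)
    p = pool / (np.linalg.norm(pool, axis=0, keepdims=True) + 1e-8)
    sim = q.T @ p                              # (Lq, Lp) similarity matrix
    idx = np.argsort(-sim, axis=1)[:, :k]      # (Lq, k) indices of best matches
    # pool[:, idx] has shape (D, Lq, k); average over the k neighbours.
    return pool[:, idx].mean(axis=2)           # (D, Lq)
```

Under this reading, the output keeps the frame count of the query while drawing every frame from the pool, which is what would make the synthetic features time-aligned with the real target segment.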

As a preliminary stage, we train a vocoder to reconstruct audio segments from WavLM features using a reconstruction loss (MR-STFT; Yamamoto et al.) and adversarial losses (MPD, MSD; Kong et al.). We reuse this vocoder during the training stage to compute losses at the waveform level.
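As a rough illustration of the MR-STFT reconstruction loss, the sketch below sums a spectral convergence term and a log-magnitude L1 term over several STFT resolutions, following the general form described by Yamamoto et al. The specific FFT and hop sizes, the Hann window, and the helper names `stft_mag` and `mr_stft_loss` are assumptions chosen for brevity. The adversarial (MPD, MSD) terms are not shown.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram of a 1-D signal using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # A small floor keeps the logarithm finite for silent bins.
    return np.abs(np.fft.rfft(frames, axis=1)) + 1e-7

def mr_stft_loss(pred, target,
                 resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average of (spectral convergence + log-magnitude L1) over resolutions.

    The (n_fft, hop) pairs are illustrative, not the values used by the authors.
    """
    total = 0.0
    for n_fft, hop in resolutions:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        sc = np.linalg.norm(t - p) / np.linalg.norm(t)   # Frobenius norms
        mag = np.mean(np.abs(np.log(t) - np.log(p)))
        total += sc + mag
    return total / len(resolutions)
```

Both signals must have the same length and be at least as long as the largest FFT size. In an actual training loop this would be written in an autodiff framework so that gradients reach the vocoder.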

At the training stage, we train a transformer using reconstruction losses at the feature level (L1) and a speaker loss at the waveform level. The speaker loss uses a pretrained speaker verification model (ECAPA-TDNN; Desplanques et al.). We extract speaker embeddings and hidden representations from both the ground-truth reference waveform $a_2$ and the converted waveform $\tilde{a}_1$, and compute the cosine similarity and L1 distance between them.
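The speaker loss could be sketched as follows. The text does not say which distance applies to which quantity, so this sketch assumes cosine similarity on the speaker embeddings and L1 distance on the hidden representations. It also assumes the hidden representations are mean-pooled over time, since $a_2$ and $\tilde{a}_1$ generally differ in length. The function name `speaker_loss` and the `l1_weight` parameter are hypothetical.

```python
import numpy as np

def speaker_loss(emb_ref, emb_conv, hid_ref, hid_conv, l1_weight=1.0):
    """Cosine term on embeddings plus L1 term on time-pooled hidden features.

    emb_ref, emb_conv: 1-D speaker embeddings.
    hid_ref, hid_conv: lists of (C, T) hidden activations, one per layer;
    T may differ between reference and converted audio.
    The split of cosine vs. L1 and the pooling are assumptions.
    """
    denom = np.linalg.norm(emb_ref) * np.linalg.norm(emb_conv) + 1e-8
    cos_term = 1.0 - np.dot(emb_ref, emb_conv) / denom   # 0 when aligned
    l1_terms = [
        np.mean(np.abs(r.mean(axis=1) - c.mean(axis=1)))  # pool over time
        for r, c in zip(hid_ref, hid_conv)
    ]
    return cos_term + l1_weight * float(np.mean(l1_terms))
```

In practice the embeddings and hidden activations would come from the frozen speaker verification model applied to the vocoded output, so that the gradient flows back through the vocoder into the transformer.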

Finally, to improve naturalness and reduce artifacts in the final synthesized speech, we train a vocoder on the converted features. The vocoder is optimized with the same objectives as in the first (vocoder pretraining) stage.

Quantitative Results


Samples

Ablation Study

Stage 2: without vocoder post-training
Stage 3: with vocoder post-training