Palindromic-VC

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

INTERSPEECH 2026

Moshe Mandel, Shlomo E. Chazan



Abstract


How It Works

During pretraining, we use kNN over WavLM features to align triplets consisting of synthetic features $\hat{B}_1\in\mathbb{R}^{1024\times{L}_1}$, real reference features $A_2\in\mathbb{R}^{1024\times{L}_2}$, and a real target waveform segment $a_1\in\mathbb{R}^{1\times{l}_1}$.
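The kNN step can be pictured with a small sketch. It assumes a kNN-VC-style matching rule, in which each source frame is replaced by the mean of its k nearest frames from another speaker's pool under cosine similarity; the page does not state the exact rule. The function names, the value of `k`, and the time-major layout (a list of frames rather than a $1024\times L$ matrix) are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature frames given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Guard against zero-norm frames (assumption: treat them as dissimilar).
    return dot / max(norm_u * norm_v, 1e-8)


def knn_convert(source_frames, pool_frames, k=4):
    """Replace each source frame by the mean of its k nearest pool frames.

    source_frames: frames of one utterance (hypothetically, speaker A).
    pool_frames:   frames collected from another speaker (hypothetically, B).
    k:             number of neighbours; 4 is an illustrative default.
    """
    converted = []
    for frame in source_frames:
        # Rank pool frames from most to least similar to the current frame.
        ranked = sorted(
            pool_frames,
            key=lambda candidate: cosine_similarity(frame, candidate),
            reverse=True,
        )
        neighbours = ranked[:k]
        # Average the selected neighbours dimension by dimension.
        mean_frame = [
            sum(n[d] for n in neighbours) / len(neighbours)
            for d in range(len(frame))
        ]
        converted.append(mean_frame)
    return converted
```

Under this reading, applying `knn_convert` to the frames of an utterance from speaker A with a pool from speaker B would give synthetic features playing the role of $\hat{B}_1$, while the original waveform serves as the target $a_1$.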

As a preliminary stage, we train a vocoder to reconstruct audio segments from WavLM features using a reconstruction loss (MR-STFT; Yamamoto et al.) and adversarial losses (MPD, MSD; Kong et al.). During training, we use this vocoder to compute losses at the waveform level.

Let $G$ denote the vocoder, $\tilde{a}_1=G(A_1)$, and $\mathcal{D}=\mathcal{D}_{\mathrm{MPD}}\cup\mathcal{D}_{\mathrm{MSD}}=\{D_k\}_{k=1}^{K}$ be the set of MPD and MSD sub-discriminators.

\[ \mathcal{L}_{\mathrm{voc}}^{G} = \lambda_{\mathrm{stft}}\mathcal{L}_{\mathrm{MR\text{-}STFT}}(a_1,\tilde{a}_1) +\lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}}^{G}(G;\mathcal{D}), \]
\[ \mathcal{L}_{\mathrm{MR\text{-}STFT}}(a,\tilde{a}) = \frac{1}{M}\sum_{m=1}^{M} \left( \mathcal{L}_{\mathrm{sc}}^{(m)}(a,\tilde{a}) +\mathcal{L}_{\mathrm{mag}}^{(m)}(a,\tilde{a}) \right), \]
\[ \mathcal{L}_{\mathrm{sc}}^{(m)}(a,\tilde{a}) = \frac{ \sqrt{\sum_{t,f}\left(|S_m(a)_{t,f}|-|S_m(\tilde{a})_{t,f}|\right)^2} }{ \sqrt{\sum_{t,f}|S_m(a)_{t,f}|^2} }, \]
\[ \mathcal{L}_{\mathrm{mag}}^{(m)}(a,\tilde{a}) = \frac{1}{T_mN_m} \sum_{t,f} \left| \log |S_m(a)_{t,f}|-\log |S_m(\tilde{a})_{t,f}| \right|, \]
\[ \mathcal{L}_{\mathrm{adv}}^{G}(G;\mathcal{D}) = \sum_{k=1}^{K} \mathbb{E}_{A_1}\left[\left(D_k(G(A_1))-1\right)^2\right] = \sum_{k=1}^{K} \mathbb{E}_{\tilde{a}_1}\left[\left(D_k(\tilde{a}_1)-1\right)^2\right], \]
\[ \mathcal{L}_{\mathrm{adv}}^{D}(\mathcal{D};G) = \sum_{k=1}^{K} \mathbb{E}_{(a_1,A_1)} \left[ \left(D_k(a_1)-1\right)^2 +D_k(G(A_1))^2 \right]. \]
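To make the vocoder objectives concrete, the following is a minimal sketch of the formulas above in plain Python. It assumes the magnitude spectrograms $|S_m(\cdot)|$ have already been computed at each of the $M$ resolutions and are passed in as nested lists (time by frequency), and that each sub-discriminator's outputs are given as a flat list of scores. The small `eps` inside the logarithm is an added assumption to avoid `log(0)`; the loss weights are left as required arguments because their values are not given here.

```python
import math


def spectral_convergence(mag_ref, mag_est):
    """L_sc: norm of the magnitude difference divided by the reference norm."""
    diff_sq = sum(
        (r - e) ** 2
        for row_r, row_e in zip(mag_ref, mag_est)
        for r, e in zip(row_r, row_e)
    )
    ref_sq = sum(r ** 2 for row in mag_ref for r in row)
    return math.sqrt(diff_sq) / math.sqrt(ref_sq)


def log_magnitude(mag_ref, mag_est, eps=1e-7):
    """L_mag: mean absolute log-magnitude difference over all T_m * N_m bins."""
    total, count = 0.0, 0
    for row_r, row_e in zip(mag_ref, mag_est):
        for r, e in zip(row_r, row_e):
            total += abs(math.log(r + eps) - math.log(e + eps))
            count += 1
    return total / count


def mr_stft_loss(mags_ref, mags_est):
    """Average of (L_sc + L_mag) over the M STFT resolutions."""
    per_resolution = [
        spectral_convergence(r, e) + log_magnitude(r, e)
        for r, e in zip(mags_ref, mags_est)
    ]
    return sum(per_resolution) / len(per_resolution)


def mean_squared_gap(scores, target):
    """Empirical expectation of (score - target)^2 over a batch of scores."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def generator_adv_loss(fake_scores):
    """L_adv^G: push every sub-discriminator's output on generated audio to 1."""
    return sum(mean_squared_gap(scores, 1.0) for scores in fake_scores)


def discriminator_adv_loss(real_scores, fake_scores):
    """L_adv^D: real audio scored towards 1, generated audio towards 0."""
    return sum(
        mean_squared_gap(real, 1.0) + mean_squared_gap(fake, 0.0)
        for real, fake in zip(real_scores, fake_scores)
    )


def vocoder_generator_loss(mags_ref, mags_est, fake_scores,
                           lambda_stft, lambda_adv):
    """L_voc^G: weighted sum of the MR-STFT and adversarial generator terms."""
    return (lambda_stft * mr_stft_loss(mags_ref, mags_est)
            + lambda_adv * generator_adv_loss(fake_scores))
```

In practice these terms would be computed on tensors inside an autodiff framework; the list-based version is only meant to mirror the equations term by term.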

At the training stage, we train a transformer with L1 reconstruction losses at the feature level and a speaker loss at the waveform level. The speaker loss uses a pretrained speaker verification model (ECAPA-TDNN; Desplanques et al.). We extract speaker embeddings and hidden representations from both the ground-truth reference waveform $a_2$ and the converted waveform $\tilde{a}_1$, then penalize low cosine similarity between the embeddings and the L1 distance between the hidden representations.

\[ \mathcal{L}_{\mathrm{spk}} = 1-\cos\left(e(a_2),e(\tilde{a}_1)\right) +\lambda_{\mathrm{hid}}\sum_{\ell\in\mathcal{H}} \frac{1}{N_\ell} \left\lVert h_\ell(a_2)-h_\ell(\tilde{a}_1)\right\rVert_1, \]

where $e(\cdot)$ is the ECAPA-TDNN speaker embedding and $h_\ell(\cdot)$ is the hidden representation extracted from layer $\ell$.
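As an illustration of how the two terms combine, here is a minimal sketch of $\mathcal{L}_{\mathrm{spk}}$ in plain Python. It assumes the embeddings and hidden representations have already been extracted by the speaker model and flattened into lists of floats, and that $N_\ell$ is the number of elements in layer $\ell$'s representation (an interpretation, since $N_\ell$ is not defined above). The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Guard against zero-norm vectors (assumption).
    return dot / max(norm_u * norm_v, 1e-8)


def speaker_loss(emb_ref, emb_conv, hid_ref, hid_conv, lambda_hid):
    """L_spk = 1 - cos(e(a_2), e(a~_1)) + lambda_hid * sum_l mean-L1(h_l).

    emb_ref, emb_conv: speaker embeddings of the reference and converted audio.
    hid_ref, hid_conv: one flattened hidden representation per selected layer.
    lambda_hid:        weight of the hidden-representation term.
    """
    cosine_term = 1.0 - cosine(emb_ref, emb_conv)
    hidden_term = 0.0
    for h_ref, h_conv in zip(hid_ref, hid_conv):
        # L1 distance normalized by N_l, taken here as the element count.
        l1 = sum(abs(a - b) for a, b in zip(h_ref, h_conv))
        hidden_term += l1 / len(h_ref)
    return cosine_term + lambda_hid * hidden_term
```

The loss is zero when both the embeddings and every selected hidden representation match, and the cosine term alone reaches 1 for orthogonal embeddings.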

Finally, to improve naturalness and reduce artifacts in the final synthesized speech, we train a vocoder on the converted features. The vocoder is optimized using the same objectives as in the first stage.

Quantitative Results


Samples


Ablation Study

Stage 2: without vocoder post-training
Stage 3: with vocoder post-training
