Introduction to Voice Conversion

Voice Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture
December 11, 2012
E.Godoy Biography
I. United States (Rhode Island): Native
   - Hometown: Middletown, RI
   - Undergrad & Masters at MIT (Boston area)
II. France (Lannion): 2007-2011
   - Worked on my PhD at Orange Labs
III. Iraklio: Current
   - Work at FORTH with Prof. Stylianou

E.Godoy, Voice Conversion
December 11, 2012
Professional Background
I. B.S. & M.Eng. from MIT, Electrical Engineering
   - Specialty in Signal Processing
   - Underwater acoustics: target physics, environmental modeling, torpedo homing
   - Antenna beamforming (Masters): wireless networks
II. PhD in Signal Processing at Orange Labs
   - Speech Processing: Voice Conversion
   - Speech Synthesis Team (Text-to-Speech)
   - Focus on Spectral Envelope Transformation
III. Post-Doctoral Research at FORTH
   - LISTA: Speech in Noise & Intelligibility
   - Analyses of Human Speaking Styles (e.g. Lombard, Clear)
   - Speech Modifications to Improve Intelligibility
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
IV. Summary & Conclusions
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
Voice Conversion (VC)
Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.

[Illustration: source speaker says "This is awesome!" → Voice Conversion → converted speech "This is awesome!"; target reacts: "He sounds like me!"]
Context: Speech Synthesis
- Increase in applications using speech technologies
  - Cell phones, GPS, video gaming, customer service apps… ("Turn left!", "Insert your card.", "Next Stop: Lannion", "This is Abraham Lincoln speaking…")
- Information communicated through speech → Text-to-Speech!
- Text-to-Speech (TTS) Synthesis
  - Generate speech from a given text
Text-to-Speech (TTS) Systems
TTS Approaches
1. Concatenative: speech synthesized from recorded segments
   - Unit-Selection: parts of speech chosen from corpora & strung together
   - High-quality synthesis, but need to record & process corpora
2. Parametric: speech generated from model parameters
   - HMM-based: speaker models built from speech using linguistic info
   - Limited quality due to simplified speech modeling & statistical averaging
Concatenative or Parametric?
Text-to-Speech (TTS) Example
[Audio demo: synthesized TTS voice]
Voice Conversion: TTS Motivation
- Concatenative speech synthesis
  - High-quality speech
  - But, need to record & process a large corpus for each voice
- Voice Conversion
  - Create different voices by speech-to-speech transformation
  - Focus on acoustics of voices
What gives a voice an identity?
- “Voice” notion of identity (voice rather than speech)
- Characterize speech based on different levels:
1. Segmental
   - Pitch: fundamental frequency
   - Timbre: distinguishes between different types of sounds
2. Supra-Segmental
   - Prosody: intonation & rhythm of speech
Goals of Voice Conversion
1. Synthesize High-Quality Speech
   - Maintain quality of source speech (limit degradations)
2. Capture Target Speaker Identity
   - Requires learning mappings between source & target features
Difficult task! Significant modifications of the source speech are needed, which risk severely degrading speech quality…
Stages of Voice Conversion
1) Analysis, 2) Learning, 3) Transformation

[Block diagram: CORPORA (SOURCE speech & TARGET speech) → PARAMETER EXTRACTION → LEARNING → TRANSFORMATION → SYNTHESIS → Converted Speech]
- LEARNING:
  - Generate acoustic feature spaces
  - Establish mappings between source & target parameters
- TRANSFORMATION:
  - Classify source parameters in feature space
  - Apply transformation function

Key parameter: the spectral envelope (related to timbre)
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
The Spectral Envelope
- Spectral Envelope: curve approximating the DFT magnitude
[Figure: spectral envelope (dB) overlaid on the DFT magnitude, 0-8000 Hz]
- Related to voice timbre, plays a key role in many speech applications:
  - Coding, Recognition, Synthesis, Voice transformation/conversion
  - Voice Conversion: important for both speech quality and voice identity
Spectral Envelope Parameterization
- Two common methods:
1) Cepstrum
   - Discrete Cepstral Coefficients
   - Mel-Frequency Cepstral Coefficients (MFCC): change the frequency scale to reflect bands of human hearing
2) Linear Prediction (LP)
   - Line Spectral Frequencies (LSF)
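As a small illustration of the cepstrum idea above, the following sketch computes a cepstrally smoothed log-magnitude envelope of one frame with plain NumPy. It uses simple real-cepstrum liftering rather than the discrete cepstral estimation used later in the lecture; the frame length, FFT size, and order 40 are illustrative assumptions.

```python
import numpy as np

def cepstral_envelope(frame, order=40, n_fft=1024):
    """Cepstrally smoothed log-magnitude spectral envelope of one frame."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) + 1e-10   # magnitude spectrum
    log_spec = np.log(spec)
    cepstrum = np.fft.irfft(log_spec, n_fft)           # real cepstrum
    cepstrum[order + 1:-order] = 0.0                   # lifter: keep low quefrencies
    return np.fft.rfft(cepstrum, n_fft).real           # smoothed log envelope

# usage: one 25 ms frame at 16 kHz (400 samples)
frame = np.random.randn(400)
env = cepstral_envelope(frame)
```

Keeping only the low-quefrency coefficients discards the fine harmonic structure and retains the slowly varying envelope, which is exactly the timbre-related curve the slides describe.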
Standard Voice Conversion
[Block diagram: Corpora (Parallel) → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → Converted speech synthesis]

Focus: Learning & Transforming the Spectral Envelope
- Parallel corpora: source & target utter the same sentences
- Parameters are spectral features (e.g. vectors of cepstral coefficients)
- Alignment of speech frames in time
→ Standard: Gaussian Mixture Model
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
     1. Formulation
     2. Limitations
        - Acoustic mappings between source & target parameters
        - Over-smoothing of the spectral envelope
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
Gaussian Mixture Model for VC
- Origins:
  - Evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification)
  - Originally proposed by [Stylianou et al; 98]
  - Joint learning of GMM (most common) by [Kain et al; 98]
- Underlying principle:
  - Exploit joint statistics exhibited by aligned source & target frames
- Methodology:
  - Represent distributions of spectral feature vectors as a mixture of Q Gaussians
  - Transformation function then based on the MMSE criterion
GMM-based Spectral Transformation
1) Align N spectral feature vectors in time (discrete cepstral coefficients):
source: $X = \{x_1,\ldots,x_N\}$, target: $Y = \{y_1,\ldots,y_N\}$, joint: $Z = [X; Y]$
2) Represent the PDF of the joint vectors as a mixture of Q multivariate Gaussians:
$$p(z) = \sum_{q=1}^{Q} \alpha_q \, \mathcal{N}(z; \mu_q, \Sigma_q), \qquad \sum_{q=1}^{Q} \alpha_q = 1, \quad \alpha_q \geq 0$$
Learn $\{\alpha_q, \mu_q, \Sigma_q\}$, $q = 1{:}Q$, from Expectation Maximization (EM) on Z.
3) Transform source vectors using the weighted mixture of Maximum Likelihood (ML) estimators for each component:
$$\hat{y}_n(x_n) = \sum_{q=1}^{Q} w_q^x(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} (x_n - \mu_q^x) \right]$$
$w_q^x(x_n)$: probability that the source frame belongs to the acoustic class described by component q (calculated in Decoding)
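The transformation in step 3 can be sketched directly in NumPy. This is a minimal illustration assuming the mixture parameters (per-component means and the cross- and source-covariance blocks of the joint model) and the decoding posteriors are already available; the function and argument names are hypothetical.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, sigma_yx, sigma_xx):
    """GMM-based conversion of one source vector x.

    weights  : (Q,) posteriors w_q(x) from the decoding step
    mu_x, mu_y : (Q, d) component means of source / target features
    sigma_yx, sigma_xx : (Q, d, d) cross- and source-covariance blocks
    """
    y_hat = np.zeros_like(mu_y[0])
    for q in range(len(weights)):
        # per-class ML estimate: mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x)
        y_q = mu_y[q] + sigma_yx[q] @ np.linalg.solve(sigma_xx[q], x - mu_x[q])
        y_hat += weights[q] * y_q
    return y_hat
```

With a single component and identity covariances, the conversion reduces to shifting the source vector by the difference of the class means, which is a useful sanity check.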
GMM-Transformation Steps
1) Source frame $x_n$ → want to estimate the target vector $\hat{y}_n$
2) Classify $x_n$ → calculate $w_q^x(x_n)$: probability that the source frame belongs to the acoustic class described by component q (Decoding step)
3) Apply the transformation function:
$$\hat{y}_n(x_n) = \sum_{q=1}^{Q} w_q^x(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} (x_n - \mu_q^x) \right]$$
(weighted sum of the ML estimator for each class)
Acoustically Aligning Source & Target Speech
- First question: are the acoustic events from the source & target speech appropriately associated?
  - Time alignment = Acoustic alignment ???
- One-to-Many Problem
  - Occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al; 07]
  - Problem: cannot distinguish distinct target events given only source information
  - Source: one single event As; Target: two distinct events Bt, Ct → joint frames [As; Bt] and [As; Ct]
[Figure: acoustic space showing the single source event As aligned to two distinct target events Bt and Ct]
Cluster by Phoneme (“Phonetic GMM”)
- Motivation: eliminate mixtures to alleviate the one-to-many problem!
- Introduce contextual information
  - Phoneme (unit of speech describing a linguistic sound): /a/, /e/, /n/, etc.
- Formulation: cluster frames according to phoneme label
- Each Gaussian class q then corresponds to a phoneme:
$$\mu_q = \frac{1}{N_q} \sum_{l=1}^{N_q} z_l, \qquad \Sigma_q = \frac{1}{N_q} \sum_{l=1}^{N_q} (z_l - \mu_q)(z_l - \mu_q)^T, \qquad \alpha_q = \frac{N_q}{N}$$
where $N_q$ = number of frames for phoneme q.
- Outcome: error indeed decreases using classification by phoneme
  - Specifically, errors from one-to-many mappings are reduced!
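The per-phoneme class statistics above are simple to compute once each frame carries a phoneme label. A minimal sketch (hypothetical function name; frames stacked row-wise):

```python
import numpy as np

def phonetic_classes(Z, labels):
    """Per-phoneme Gaussian parameters from joint frames.

    Z      : (N, d) joint source-target feature vectors
    labels : (N,) phoneme label per frame
    Returns a dict mapping phoneme -> (alpha_q, mu_q, Sigma_q).
    """
    N = len(Z)
    params = {}
    for q in np.unique(labels):
        Zq = Z[labels == q]
        mu = Zq.mean(axis=0)
        diff = Zq - mu
        sigma = diff.T @ diff / len(Zq)       # class covariance
        params[q] = (len(Zq) / N, mu, sigma)  # class prior alpha_q = N_q / N
    return params
```

Because the classes come from labels rather than EM, every source frame maps deterministically to one phoneme class, which is what removes the one-to-many ambiguity.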
Still GMM Limitations for VC
- Unfortunately, converted speech quality is poor 
  [Audio: original vs. converted]
- What about the joint statistics in the GMM?
- Difference between GMM and VQ-type (no joint stats) transformation?
GMM (joint statistics): $\sum_{q=1}^{Q} w_q^x(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} (x_n - \mu_q^x) \right]$
VQ-type: $\sum_{q=1}^{Q} w_q^x(x_n) \, \mu_q^y$
→ No significant difference! (originally shown [Chen et al; 03])
"Over-Smoothing" in GMM-based VC
- The Over-Smoothing Problem: (explains the poor quality!)
  - Transformed spectral envelopes using the GMM are "overly smooth"
  - Loss of spectral details → converted speech sounds "muffled", "loss of presence"
- Cause: (goes back to those joint statistics…)
  - Low inter-speaker parameter correlation (weak statistical link exhibited by aligned source & target frames):
$$\hat{y}_n(x_n) = \sum_{q=1}^{Q} w_q^x(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} (x_n - \mu_q^x) \right] \approx \sum_{q=1}^{Q} w_q^x(x_n) \, \mu_q^y \quad \text{(class mean)}$$
→ Frame-level alignment of source & target parameters is not effective!
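The collapse to class means can be shown numerically. The toy sketch below (synthetic data, not the lecture's corpora) assumes the cross-covariance term contributes nothing, so the converted frames are soft-weighted class means; their variance is then far below the target's, which is exactly the over-smoothing symptom.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy target frames scattered around two class means
mu_y = np.array([-2.0, 2.0])
labels = rng.integers(0, 2, 1000)
target = mu_y[labels] + rng.normal(0.0, 1.0, 1000)

# GMM output when Sigma_yx ~ 0: only soft-weighted class means remain
w = rng.uniform(0.3, 0.7, 1000)                 # soft posterior for class 0
converted = w * mu_y[0] + (1 - w) * mu_y[1]

vr = converted.var() / target.var()             # variance ratio << 1
```

The variance ratio here comes out far below 1, mirroring the VR metric used later in the lecture's objective evaluation.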
“Spectrogram” Example (GMM over-smoothing)
[Figure: spectral envelopes across a sentence (frames vs. 0-6000 Hz) for X (source), Phonetic GMM, DFWA, and Y (target); the GMM envelopes are visibly over-smoothed]
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
     - Related work
     - DFWA description
Dynamic Frequency Warping (DFW)
- Dynamic Frequency Warping [Valbret; 92]
  - Warp the source spectral envelope in frequency to resemble that of the target
- Alternative to GMM
  - No transformation with joint statistics!
- Maintains spectral details → higher quality speech 
- Spectral amplitude not adjusted explicitly → poor identity transformation 
Spectral Envelopes of a Frame: DFW
[Figure: spectral envelopes (dB vs. 0-8000 Hz) of X (source), Y (target), and the DFW-warped source for one frame]
Hybrid DFW-GMM Approaches
- Goal: adjust the spectral amplitude after DFW
- Combine DFW- and GMM-transformed envelopes
- Rely on arbitrary smoothing factors [Toda; 01], [Erro; 10]
- Impose a compromise between
  1. maintaining spectral details (via DFW)
  2. respecting average trends (via GMM)
Proposed Approach: DFWA
- Observation: the GMM is not needed to adjust amplitude!
- Alternatively, can use a simple amplitude correction term
  → Dynamic Frequency Warping with Amplitude Scaling (DFWA)
$$S_{dfwa,n}(f) = A_q(f) \, S_{x_n}\!\left(W_q^{-1}(f)\right)$$
$S_{x_n}(W_q^{-1}(f))$ contains the spectral details; $A_q(f)$ respects the average trends.
Levels of Source-Target Parameter Mappings
(1) Frame-level: Standard (GMM)
(3) Class-level: Global (DFWA)
Recall Standard Voice Conversion
[Block diagram: Corpora (Parallel) → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → Converted speech synthesis]
- Parallel corpora: source & target utter the same sentences
- Alignment of individual source & target frames in time
- Mappings between acoustic spaces determined on the frame level
DFW with Amplitude Scaling (DFWA)
- Associate source & target parameters based on class statistics
- Unlike typical VC, no frame alignment required in DFWA!
1. Define Acoustic Classes
2. DFW
3. Amplitude Scaling
[Block diagram: Corpora → extract source & target parameters → clustering → DFW → amplitude scaling → converted speech synthesis]
Defining Acoustic Classes
- Acoustic classes built using clustering, which serves 2 purposes:
  1. Classify individual source or target frames in acoustic space
  2. Associate source & target classes
- Choice of clustering approach depends on available information:
  - Acoustic information:
    - With aligned frames: joint clustering (e.g. joint GMM or VQ)
    - Without aligned frames: independent clustering & associate classes (e.g. with closest means)
  - Contextual information: use symbolic information (e.g. phoneme labels)
- Outcome: q = 1, 2, …, Q acoustic classes in a one-to-one correspondence
DFW Estimation
- Compares distributions of observed source & target spectral envelope peak frequencies (e.g. formants)
  - Global vision of spectral peak behavior (for each class)
  - Only peak locations (and not amplitudes) considered
- DFW function: piecewise linear
  - Intervals defined by aligning maxima in the spectral peak frequency distributions
  - Dijkstra algorithm: minimize the sum of absolute differences between the target & warped source distributions
$$W_q(f) = B_{q,m} f + C_{q,m}, \qquad f \in \left[ f_{q,i_m}^x, f_{q,i_{m+1}}^x \right]$$
with aligned peak pairs $(f_{q,i_m}^x, f_{q,j_m}^y)$, $m = 1,\ldots,M_q$, and
$$B_{q,m} = \frac{f_{q,j_{m+1}}^y - f_{q,j_m}^y}{f_{q,i_{m+1}}^x - f_{q,i_m}^x}, \qquad C_{q,m} = f_{q,j_m}^y - B_{q,m} f_{q,i_m}^x$$
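Once the anchor pairs have been chosen (by the Dijkstra alignment in the slides), applying the piecewise-linear warp is just linear interpolation between aligned anchor frequencies. A minimal sketch using `np.interp` (the anchor values below are illustrative, not from the lecture's data):

```python
import numpy as np

def warp_frequency(f, src_anchors, tgt_anchors):
    """Piecewise-linear DFW: map source frequencies onto target frequencies.

    src_anchors, tgt_anchors : aligned peak frequencies (f_x, f_y) for one
    class, both including the endpoints 0 and fs/2 so the warp covers the
    whole band.
    """
    return np.interp(f, src_anchors, tgt_anchors)

# example: shift a 1 kHz source peak toward a 1.2 kHz target peak
src = [0.0, 1000.0, 8000.0]
tgt = [0.0, 1200.0, 8000.0]
warped = warp_frequency(np.array([500.0, 1000.0, 4500.0]), src, tgt)
```

Each interval between consecutive anchors realizes one linear segment $B_{q,m} f + C_{q,m}$, so this reproduces the warping function defined above.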
DFW Estimation
[Figure: peak occurrence distributions over 0-8000 Hz; top: source; bottom: target vs. warped source]
- Clear global trends (peak locations)
- Avoid sporadic frame-to-frame comparisons
- Dijkstra algorithm selects pairs
- DFW statistically aligns the most probable spectral events (peak locations)
- Amplitude scaling then adjusts differences between the target & warped source spectral envelopes
Spectral Envelopes of a Frame
[Figure: spectral envelopes (dB vs. 0-8000 Hz) of X (source), Y (target), and the DFW-warped source]
Amplitude Scaling
- Goal of amplitude scaling: adapt frequency-warped source envelopes to better resemble those of the target
- Estimation using statistics of the target and warped source data
- For class q, the amplitude scaling function $A_q(f)$ is defined by:
$$\log(A_q(f)) = \overline{\log\!\left(S_q^y(f)\right)} - \overline{\log\!\left(S_q^x(W_q^{-1}(f))\right)}$$
  - Comparison between source & target data strictly on the acoustic class level
  - Difference in average log spectra
- No arbitrary weighting or smoothing!
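The class-level correction above is just the difference of average log spectra, exponentiated. A minimal sketch, assuming the warped source and target envelopes for one class have already been sampled on a common frequency grid (the function name and the small epsilon guard are assumptions):

```python
import numpy as np

def amplitude_scaling(warped_src_frames, tgt_frames):
    """Per-class amplitude correction A_q(f) on a common frequency grid.

    warped_src_frames, tgt_frames : (N, K) magnitude spectral envelopes of
    the frequency-warped source and of the target, for one acoustic class q.
    """
    log_src = np.log(warped_src_frames + 1e-10).mean(axis=0)  # avg log spectrum
    log_tgt = np.log(tgt_frames + 1e-10).mean(axis=0)
    return np.exp(log_tgt - log_src)   # log A_q = avg log S_y - avg log S_x(W^-1)

# applying it to one warped source envelope:
# converted = amplitude_scaling(warped_class_frames, target_class_frames) * warped_envelope
```

Note there is no per-frame weighting at all; the correction is one fixed curve per class, which is the "no arbitrary weighting or smoothing" point of the slide.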
DFWA Transformation
- Transformation (for source frame n in class q):
$$S_{dfwa,n}(f) = A_q(f) \, S_{x_n}\!\left(W_q^{-1}(f)\right), \qquad \log(A_q(f)) = \overline{\log\!\left(S_q^y(f)\right)} - \overline{\log\!\left(S_q^x(W_q^{-1}(f))\right)}$$
$S_{x_n}(W_q^{-1}(f))$ contains the spectral details; $A_q(f)$ respects the average trends.
- In the discrete cepstral domain (log spectrum parameters):
$$\hat{y}_n^{dfwa} = \mu_q^y + \left( \gamma_n - \mu_q^\gamma \right)$$
where $\gamma_n$ = cepstral coefficients for $S_{x_n}(W_q^{-1}(f))$ and $\mu_q^\gamma$ = mean of $\gamma_n$, representing $\overline{\log(S_q^x(W_q^{-1}(f)))}$
- Average target envelope respected ("unbiased estimator") → captures timbre!
- Spectral details maintained (difference between frame realization & average) → ensures quality!
Spectral Envelopes of a Frame
[Figure: spectral envelopes (dB vs. 0-8000 Hz) of X (source), Y (target), DFW, DFWA, and Phonetic GMM for one frame]
Spectral Envelopes across a Sentence
[Figure: spectral envelopes across a sentence (frames vs. 0-6000 Hz) for X, Phonetic GMM, DFWA, and Y, with a zoomed region]
Spectral Envelopes across a Sound
[Figure: spectral envelope evolution over frames 265-285 (0-3500 Hz) for X, Phonetic GMM, DFWA, and Y]
- DFWA looks like natural speech
- Can see the warping appropriately shifting source events
- Overall, DFWA more closely resembles the target
- Important to note: examining the spectral envelope evolution in time is very informative!
- Can see the poor quality with GMM right away (less evident within a frame)
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
Formal Evaluations
- Speech Corpora & Analysis
  - CMU ARCTIC, US English speakers (2 male, 2 female)
  - Parallel annotated corpora, speech sampled at 16 kHz
  - 200 sentences for learning, 100 sentences for testing
  - Speech analysis & synthesis: Harmonic model
  - All spectral envelopes → discrete cepstral coefficients (order 40)
- Spectral envelope transformation methods:
  - Phonetic & Traditional GMMs (diagonal covariance matrices)
  - DFWA (classification by phoneme)
  - DFWE (Hybrid DFW-GMM, E = energy correction)
  - source (no transformation)
Objective Metrics for Evaluation
- Mean Squared Error (standard) alone is not sufficient!
  - Does not adequately indicate converted speech quality
  - Does not indicate variance in the transformed data
- Variance Ratio (VR)
  - Average ratio of the transformed-to-target data variance
  - More global indication of behavior
  - Does the transformed data behave like the target data? Is the transformed data varying from the mean?
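Both metrics are short one-liners over frame-stacked features. A sketch under the assumption that converted and target envelopes are `(N, d)` arrays of cepstral (or log-spectral) coefficients; the exact normalization used in the lecture's evaluation may differ:

```python
import numpy as np

def mse(converted, target):
    """Mean squared error between converted and target feature frames."""
    return np.mean((converted - target) ** 2)

def variance_ratio(converted, target):
    """Average ratio of transformed-to-target per-dimension variance.

    VR near 1 means the converted data varies as much as the target;
    VR << 1 signals over-smoothing.
    """
    return np.mean(converted.var(axis=0) / target.var(axis=0))
```

The complementarity is visible from the definitions: predicting the class mean everywhere can minimize MSE while driving VR toward 0.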
Objective Results
- GMM-based methods yield the lowest MSE but no variance
  - Confirms over-smoothing
- DFWA yields higher MSE but more natural variations in speech
  - VR of DFWA comes directly from variations in the warped source envelopes (not model adaptations!)
- Hybrid (DFWE) in between DFWA & GMM (as expected)
  - Compared to DFWA, VR notably decreased after the energy correction filter
- Different indications from MSE & VR: both important, complementary
DFWA for VC with Nonparallel Corpora
- Parallel corpora: a big constraint!
  - Limits the data that can be used in VC
- DFWA for nonparallel corpora
  - Nonparallel learning data: 200 disjoint sentences

               MSE (dB)   VR
  Parallel     -7.74      0.930
  Nonparallel  -7.71      0.936

→ DFWA equivalent for parallel or nonparallel corpora!
Implementation in a VC System
[Block diagram: source speech s_x(t) → Speech Analysis (Harmonic) → Spectral Envelope Transformation → Speech Frame Synthesis → TD-PSOLA (pitch modification) → converted speech s_ŷ(t)]
- Converted speech synthesis
- Pitch modification using TD-PSOLA (Time-Domain Pitch Synchronous Overlap Add)
- Pitch modification factor determined by the classic linear transformation:
$$\hat{f}_0^{\,y} = \frac{\sigma_f^y}{\sigma_f^x} \left( f_0^x - \mu_f^x \right) + \mu_f^y$$
- Frame synthesis
  - voiced frames: harmonic amplitudes from the transformed spectral envelopes; harmonic phases from nearest-neighbor source sampling
  - unvoiced frames: white noise through an AR filter
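The classic linear F0 transformation above simply matches the target's pitch mean and standard deviation. A one-function sketch (the speaker statistics below are illustrative values, not measurements from the lecture's corpora):

```python
import numpy as np

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Linear F0 conversion: normalize by source stats, rescale to target stats.

    f0_src : source F0 value(s) in Hz (scalar or array)
    mu_x, sigma_x : source speaker's F0 mean and standard deviation
    mu_y, sigma_y : target speaker's F0 mean and standard deviation
    """
    return sigma_y / sigma_x * (np.asarray(f0_src) - mu_x) + mu_y

# example: male source (mean 100 Hz, std 20) to female target (mean 200 Hz, std 40)
f0_converted = convert_f0(120.0, 100.0, 20.0, 200.0, 40.0)
```

The resulting ratio of converted to source F0 is what TD-PSOLA receives as its pitch modification factor; as the Perspectives slide notes, large factors are where quality starts to degrade.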
Subjective Testing
- Listeners evaluate:
  1. Quality of speech
  2. Similarity of voice to the target speaker
- Mean Opinion Score (MOS)
  - MOS scale from 1-5
Subjective Test Results
[Figure: MOS results; ~equal success for similarity; DFWA yields better quality than the other methods]
- No observable differences between acoustic decoding & phonetic classification
- Converted speech quality consistently higher for DFWA than GMM-based methods
- Equal (if not slightly more) success for DFWA in capturing identity
→ No compromise between quality & capturing identity for DFWA!
Conversion Examples
[Audio samples: Source, Target, GMM, DFWA, DFWE for slt → clb (F→F) and bdl → clb (M→F); target analysis-synthesis with converted spectral envelopes]
- DFWA consistently highest quality
- GMM-based methods suffer a “loss of presence”
- Key observation: DFWA can deliver high-quality VC!
VC Perspectives
- Conversion within gender gives better results (e.g. F→F)
  - Speakers share similar voice characteristics
  - Large pitch modification factors degrade speech
- For inter-gender conversion, the spectral envelope is not enough
  - Prosody important
  - Need to address the phase spectrum & glottal source too…
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Typical: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
IV. Summary & Conclusions
Summary
- Text-to-Speech (TTS)
  1. Concatenative (unit-selection)
  2. Parametric (HMM-based speaker models)
- Acoustic Parameters in Speech
  1. Segmental (pitch, spectral envelope)
  2. Supra-segmental (prosody, intonation)
- Voice Conversion
  - Modify speech to transform speaker identity
  - Can create new voices!
  - Focus: spectral envelope → related to timbre
Summary (continued)
- Spectral Envelope Transformation
  1. Spectral envelope parameterization (cepstral, LPC, …)
  2. Level of source-target acoustic mappings
  3. Method for learning & transformation
1. Standard VC Approach
  - Parallel corpora & source-target frame alignment
  - GMM-based transformation
2. Proposed VC Approach
  - Source-target acoustic mapping on the class level
  - Dynamic Frequency Warping + Amplitude Scaling (DFWA)
Important Considerations
- Spectral Envelope Transformation
  1. Speaker identity → need to capture average speaker timbre
  2. Speech quality → need to maintain spectral details
- Source-Target Acoustic Mappings
  1. Contextual information (phoneme) helpful
  2. Mapping features on a more global class level is effective
     - Better conversion quality & more flexible (no parallel corpora)
     - (Frame-level alignment/mappings too narrow & restrictive)
- Speaker Characteristics (prosody, voice quality)
  - Success of Voice Conversion can depend on the speakers
  - Many features of speech to consider & treat…
Thank You!
Questions?