Speaker Identification and Verification

Dan Burnett, Nuance
58th IETF
Terminology
• Speaker identification -- using utterances from a
speaker, determine who the caller is out of a set of
known speakers
• Speaker verification -- using utterances from a
speaker, determine whether the caller is who
he/she claims to be (requires an identity claim)
• Training -- using utterances from a speaker to train
a unique voiceprint that can later be used to
identify/verify a speaker. Applies to both
identification (SI) and verification (SV).
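The difference between the two tasks can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: voiceprints are modeled as plain feature vectors, matching uses cosine similarity, and the names `similarity`, `identify`, `verify` and the 0.8 threshold are hypothetical. None of this comes from the draft or from any vendor's engine.

```python
import math

def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed scoring)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify(utterance, voiceprints):
    """Identification: choose the best match out of a set of known speakers."""
    return max(voiceprints, key=lambda name: similarity(utterance, voiceprints[name]))

def verify(utterance, claimed, voiceprints, threshold=0.8):
    """Verification: accept or reject one identity claim (threshold is hypothetical)."""
    return similarity(utterance, voiceprints[claimed]) >= threshold

# Toy voiceprints produced by some earlier training step (made-up values).
voiceprints = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
utterance = [0.9, 0.1]

who = identify(utterance, voiceprints)            # no identity claim needed
accepted = verify(utterance, "bob", voiceprints)  # requires an identity claim
```

Identification searches the whole set of trained voiceprints, while verification compares against only the claimed speaker's voiceprint.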
draft-burnett-mrcpext-00.txt
• Created by Nuance and Intervoice
• Proposes extensions to MRCP v1
(draft-shanmugham-mrcp-04.txt)
• Based originally on Nuance functionality,
modified to be more general
• Starting point for MRCP v2 functionality
discussions
• Also adds extensions for speaker-enrolled grammars,
hotword recognition, and the recognition
resource
Proposed SI/SV process
(simplified, see section 6.7)
VER-BUFFERING-START
VER-START-SESSION
VER-SET-VOICEPRINT
VER-DELETE-VOICEPRINT
VER-ROLLBACK
GET-PARAMS
SET-PARAMS
VER-BUFFERING-CONTROL
VERIFY
VER-FROM-BUFFER*
VER-FROM-BUFFER*
VER-BUFFERING-STOP
VER-END-SESSION
* Requires active buffering and ver/id sessions.
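The footnote's precondition can be sketched as a small state tracker. This is only an illustration under stated assumptions: the class `VerifierState` and its boolean return value are hypothetical, only the method names are taken from the slide, and the actual request/response semantics are defined in section 6.7 of the draft.

```python
class VerifierState:
    """Tracks whether buffering and a ver/id session are active (hypothetical model)."""

    def __init__(self):
        self.buffering = False
        self.session = False

    def handle(self, method):
        """Return True if the method is allowed in the current state."""
        if method == "VER-BUFFERING-START":
            self.buffering = True
        elif method == "VER-BUFFERING-STOP":
            self.buffering = False
        elif method == "VER-START-SESSION":
            self.session = True
        elif method == "VER-END-SESSION":
            self.session = False
        elif method == "VER-FROM-BUFFER":
            # The footnote: requires active buffering AND an active ver/id session.
            return self.buffering and self.session
        return True

state = VerifierState()
too_early = state.handle("VER-FROM-BUFFER")   # nothing is active yet
state.handle("VER-BUFFERING-START")
state.handle("VER-START-SESSION")
allowed = state.handle("VER-FROM-BUFFER")     # both preconditions now hold
```

Modeling the two conditions as independent flags reflects that buffering and the session are started and stopped by separate methods.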
Discussion points
• Why buffering?
• Registry for return info
• Anything else before I convert to MRCPv2?
Voice/Text Grammar Enrollment
(simplified, see section 5.5)
• Extension to existing
recognition resource
• Creates speaker-produced
grammar entries
• E.g., voice-enrolled entries
for voice dialing
• Both speech and text can
be used to create grammar
entries
START-ENROLLMENT-SESSION
PAUSE/RESUME-ENROLLMENT-SESSION
ENROLLMENT-ROLLBACK
RECOGNIZE/STOP*
ADD/DELETE/MODIFY-PHRASE
END/ABORT-ENROLLMENT-SESSION
* These methods already exist in the recognizer resource
Hotword
(see section 7)
• New recognition resource
• Instead of listening for a set time period,
listens continuously until it matches a
grammar
• Non-matching speech is ignored and does
not affect the state of the recognizer
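The hotword behavior described above can be sketched as a loop over incoming speech. This is a simplified illustration, not the draft's definition: `hotword_listen` is a hypothetical name, utterances are assumed to arrive as already-transcribed strings, and the grammar is reduced to a set of exact phrases.

```python
def hotword_listen(utterances, grammar):
    """Consume utterances until one matches the grammar; skip everything else."""
    for text in utterances:
        if text in grammar:
            return text          # first grammar match ends listening
        # Non-matching speech is ignored: no state is updated, no error is raised.
    return None                  # input ended without a match

grammar = {"call home", "call work"}
stream = ["um", "hello there", "call home", "call work"]
match = hotword_listen(stream, grammar)
```

Because non-matching speech leaves no trace, the loop carries no state between utterances other than its position in the stream.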
Other Extensions
• Record method (sec. 4.4)
– Allows end-pointed recording of an audio
stream
• Interpret method (sec. 4.5)
– Behaves like a recognition, except that text input
is given instead of an audio stream. It returns a
standard recognition result minus any
audio-specific values.
