Speaker Identification and Verification
Dan Burnett, Nuance
58th IETF

Terminology
• Speaker identification -- using utterances from a speaker, determine who the caller is out of a set of known speakers
• Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)
• Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify a speaker. Applies to both SI/SV.

draft-burnett-mrcpext-00.txt
• Created by Nuance and Intervoice
• Proposes extensions to MRCP v1 (draft-shanmugham-mrcp-04.txt)
• Based originally on Nuance functionality, modified to be more general
• Starting point for MRCP v2 functionality discussions
• Also extensions for speaker-enrolled grammars, hotword recognition, and to the recognition resource

Proposed SI/SV process (simplified, see section 6.7)
Methods in the flow diagram:
  VER-BUFFERING-START
  VER-START-SESSION
  VER-SET-VOICEPRINT
  VER-DELETE-VOICEPRINT
  VER-ROLLBACK
  GET-PARAMS
  SET-PARAMS
  VER-BUFFERING-CONTROL
  VERIFY
  VER-FROM-BUFFER*
  VER-FROM-BUFFER*
  VER-BUFFERING-STOP
  VER-END-SESSION
* Requires active buffering and ver/id sessions.

Discussion points
• Why buffering?
• Registry for return info
• Anything else before I convert to MRCPv2?
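The footnote on the SI/SV process slide (VER-FROM-BUFFER requires active buffering and an active ver/id session) can be illustrated with a small client-side state tracker. This is only a sketch: the class `VerifierState` and its `apply` method are hypothetical and are not part of the draft or of any MRCP library. It also assumes, from the method names alone, that VER-BUFFERING-START/VER-BUFFERING-STOP switch buffering on and off and that VER-START-SESSION/VER-END-SESSION open and close the session. No other ordering rule from section 6.7 is modeled.

```python
class VerifierState:
    """Hypothetical tracker for the footnote rule only; not an MRCP API."""

    def __init__(self) -> None:
        self.buffering = False  # assumed: set by VER-BUFFERING-START/STOP
        self.session = False    # assumed: set by VER-START-SESSION/END-SESSION

    def apply(self, method: str) -> bool:
        """Return True if `method` is allowed now, updating state if it is."""
        if method == "VER-FROM-BUFFER":
            # Footnote rule: needs both active buffering and an active session.
            return self.buffering and self.session
        if method == "VER-BUFFERING-START":
            self.buffering = True
        elif method == "VER-BUFFERING-STOP":
            self.buffering = False
        elif method == "VER-START-SESSION":
            self.session = True
        elif method == "VER-END-SESSION":
            self.session = False
        # All other methods are accepted without checks in this sketch.
        return True


state = VerifierState()
print(state.apply("VER-FROM-BUFFER"))   # rejected: nothing is active yet
state.apply("VER-BUFFERING-START")
state.apply("VER-START-SESSION")
print(state.apply("VER-FROM-BUFFER"))   # allowed: both are active
```

A client could use a check like this to reject a VER-FROM-BUFFER request locally before sending it, though whether that is worthwhile depends on how the server reports the error.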

Voice/Text Grammar Enrollment (simplified, see section 5.5)
• Extension to existing recognition resource
• Creates speaker-produced grammar entries
• E.g., voice-enrolled entries for voice dialing
• Both speech and text can be used to create grammar entries
Methods in the flow diagram:
  START-ENROLLMENT-SESSION
  PAUSE/RESUME-ENROLLMENT-SESSION
  ENROLLMENT-ROLLBACK
  RECOGNIZE/STOP*
  ADD/DELETE/MODIFY-PHRASE
  END/ABORT-ENROLLMENT-SESSION
* These methods already exist in the recognizer resource

Hotword (see section 7)
• New recognition resource
• Instead of listening for a set time period, listens continuously until it matches a grammar
• Non-matching speech is ignored and does not affect the state of the recognizer

Other Extensions
• Record method (sec. 4.4)
  – Allows end-pointed recording of an audio stream
• Interpret method (sec. 4.5)
  – Behaves as a recognition except that text input is given instead of an audio stream. It returns a standard recognition result minus any audio-specific values.
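
The behaviour described on the Hotword slide (listen continuously, ignore non-matching speech, stop at the first grammar match) can be sketched as a simple loop. This is an illustration under strong assumptions: a real recognizer works on audio, whereas here each utterance is already a text string and the grammar is just a set of phrases. The function name `hotword_listen` is hypothetical and does not come from the draft.

```python
from typing import Iterable, Optional, Set


def hotword_listen(utterances: Iterable[str], grammar: Set[str]) -> Optional[str]:
    """Return the first utterance that matches the grammar, or None.

    Non-matching utterances are skipped and leave no state behind,
    mirroring the "ignored" behaviour on the slide.
    """
    for text in utterances:
        phrase = text.strip().lower()  # assumed normalization, for the toy grammar
        if phrase in grammar:
            return phrase
        # No match: keep listening; nothing is recorded about this utterance.
    return None  # the stream ended without a match


stream = ["um", "what time is it", "Operator", "never reached"]
print(hotword_listen(stream, {"operator", "main menu"}))
```

Because the function returns on the first match, utterances after the hotword are never consumed, which contrasts with a recognition that listens for a set time period.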