WO1999023643A1 - Model adaptation system and method for speaker verification - Google Patents

Model adaptation system and method for speaker verification

Info

Publication number
WO1999023643A1
Authority
WO
WIPO (PCT)
Prior art keywords
adaptable
speaker
adaptation
model
utterance
Prior art date
1997-11-03
Application number
PCT/US1998/023477
Other languages
French (fr)
Inventor
Kevin Farrell
William Mistretta
Original Assignee
T-Netix, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by T-Netix, Inc. filed Critical T-Netix, Inc.
Priority to AU13057/99A priority Critical patent/AU1305799A/en
Priority to EP98956559A priority patent/EP1027700A4/en
Publication of WO1999023643A1 publication Critical patent/WO1999023643A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G10L 15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches

Definitions

  • the DTW adaptation method can be better explained with reference to Figure 2.
  • the first step 100 is to retrieve the stored number of utterances (referred to as M) used to compute the DTW template.
  • the incoming feature data is then warped onto the DTW template, as described in step 104.
  • a result of warping feature data onto a DTW template is that the new feature data is of the same length as the DTW template; the incoming data now has the same number of feature vectors as the DTW template.
  • Each feature vector of the DTW template is then scaled (meaning multiplied) by the number of utterances M. The warped feature data is added to the scaled DTW feature data, and the sum is divided by M+1 to form the updated template.
  • the probability at each leaf of the NTN 22 is then updated: the new number of speaker observations is divided by the total number of observations to obtain an updated probability, as set forth in step 46.
  • the NTN adaptation method can be better understood with reference to Figure 4.
  • the original speaker target vectors are designated as "1" in the figure.
  • Imposter vectors are designated by a "0".
  • the adaptation vectors based on the updated voice utterance are those within the dashed circles 70, 74.
  • the original probability is computed as 0.6, by dividing the number of original speaker target vectors (i.e., three) by the total number of vectors (i.e., five).
  • the adapted probability is determined to be 0.67, by dividing the number of speaker target vectors (i.e., four) by the total number of vectors (i.e., six).
  • Advantages can also be obtained by applying more weight to the new observations.
  • since the NTN 22 also retains imposter counts at each leaf, it can also be adapted with an imposter utterance. This is accomplished in a similar manner to how the speaker counts are added: the feature vectors for an imposter utterance are applied to the NTN 22 and the leaf imposter counts are updated to reflect the imposter data that arrives at each leaf.
  • the NTN 22 is unique in this sense (as compared to the DTW and GMM models) in that it can be adapted with imposter data.
  • the GMM 26 modules are also adapted individually using sub-word data acquired from the blind segmentation.
  • the adaptation of a single sub-word GMM module 26 is described since the process is identical for each sub-word.
  • the adaptation method for a single sub-word GMM is shown in Figure 5. A clustering of the adaptation data is performed as the first step in the individual GMM adaptation, as shown in step 82; the component distribution means are used to partition the data.
  • the verification model retains information on the number of utterances used to train it, as shown in step 86. The algorithm also makes the assumption that the prior utterances all contain N training vectors, since the true sizes of the previous training and adaptation sets are not retained. The adapted component distribution parameters (i.e., mixture weight, mean and covariance) are then computed.
  • the first set contains two data collections separated by a six-month period. The two data collections are referred to as the "recent" set and the "aged" set.
  • the first three training repetitions are kept fixed, while the second set of three repetitions are varied using a resampling scheme.
  • For each training, three new repetitions are used. This allows for three independent training sequences for the ten available true-speaker repetitions.
  • the fixed training repetitions used for scenarios 2 and 3 are the same as those used in scenario 1.
  • the first scenario provides a baseline system performance
  • the second shows the benefit of adding speaker information to the original training
  • the third shows the benefit of adapting the model using the additional speaker information.
  • a set of three experiments are initially performed for each training scenario. This includes testing the GMM 26 and NTN 22 models individually along with testing a combined model. DTW adaptation was not examined for this example. All testing repetitions are taken from the recent collection set. For the baseline training scenario, ten true-speaker repetitions and 45 impostor repetitions are tested for each speaker model. Equal error rates (EER) are then calculated for the system by collecting performance across speakers. For scenarios 2 and 3, three resampling tests are performed for each individual experiment. For each test, the appropriate three true-speaker repetitions are excluded from the experiment. This results in 7 true-speaker and 45 impostor repetitions for each test, or 21 true-speaker and 135 impostor repetitions for each speaker.
  • Table 1 displays the performance of these experiments. Several observations can be made when inspecting the table. First, the additional speech data provides a performance benefit when the model is trained on all the data. Second, adapting on the additional training data also improves performance to some degree. The GMM adaptation does a better job at matching the training performance than the NTN adaptation. Although the NTN does not adapt as well as the GMM, it still helps reduce the EER when applying adaptation to the combined model.
  • a second set of experiments are performed for the combined verification model. For this set, true-speaker testing repetitions are taken from the aged collection set. All other training and testing conditions are kept the same as the previous experiments. These results are displayed in Table 2. The table shows that all training scenarios suffer when evaluating the aged true-speaker repetitions. This is to be expected, since the verification model is trained on data collected over a short period of time. There is still improvement when the model is trained on additional data from the recent set. As with the previous experiments, the adaptation also improves the performance, but not as much as the full training.
  • the present model adaptation method and system has been described with respect to a text-dependent speaker verification system.
  • the invention could also be applied to text-independent systems, as well.
  • Because there is no temporal ordering of feature data, preferably only a single NTN or GMM is trained.
  • the DTW template is omitted in this case, since a text-independent system does not rely on the temporal ordering of the feature data.
  • the adaptation procedures described above can be applied to any such models.
  • although the adaptation method and procedure has been described above with respect to a multiple model system, the present invention can clearly be applied to enhance the performance of template-based speaker verification models (DTW), neural tree network based speaker verification systems, or statistical speaker verification models (GMM), individually.
  • adaptation is an effective method for improving the performance of a speaker verification model.
  • an important consideration is the criteria that determine when adaptation should take place. Adapting a model with utterances that are not from the speaker for which the model was trained can have negative performance implications. Hence, a strategy must be used for recommending which data should be used for adaptation and which data should be discarded.
  • Three recommendations for adaptation criteria are as follows. One is to compare the composite model score to a threshold and deem the utterance acceptable for adaptation if it passes some threshold criteria. Another method is to analyze the model scores individually and, if the majority of the models recommend adaptation (by evaluating threshold criteria), then use the data to adapt all models, as in the sketch below.
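A minimal sketch of the majority-vote criterion just described (the function and variable names are illustrative assumptions, not from the patent):

```python
def should_adapt(model_scores, model_thresholds):
    """Majority-vote adaptation criterion: recommend adapting all models
    only when most of the individual models accept the utterance."""
    votes = sum(score >= threshold
                for score, threshold in zip(model_scores, model_thresholds))
    return votes > len(model_scores) / 2

# e.g. DTW, NTN and GMM scores against their individual thresholds:
# should_adapt([0.82, 0.71, 0.55], [0.6, 0.6, 0.6]) -> True (2 of 3 accept)
```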
  • the threshold is computed as follows. During model training, estimates are made for the average speaker score and average imposter score. The average speaker score is obtained by evaluating the trained model with the original training utterances and recording the scores. The average score is then computed from these scores and scaled to account for the bias in the data; this compensates for the fact that the data used to train a model will always score higher than data that is independent of the model training.
  • the average imposter score is obtained by applying imposter data to the trained model and computing the average of the resulting scores.
  • Imposter attempts on the speaker model can be synthesized by accessing feature data from the antispeaker database that is similar to the subword data used to train a model. This data can be pieced together into an imposter attempt and then applied to a speaker model.
  • the threshold is currently computed by selecting a value between the average imposter score and average speaker score. Adaptation can be applied to the threshold component of the model as follows. First, the number of utterances used to compute the imposter average (referred to as N) and the speaker average (referred to as M) are retrieved. The imposter mean is then multiplied by N, the adaptation score is added, and the sum is divided by N+1; the speaker mean is updated analogously using M. A sketch of this update follows.
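The following sketch illustrates this count-weighted threshold update; it is a hedged reading of the passage above, and the interpolation point `mix` and all names are assumptions:

```python
def adapt_threshold_stats(imposter_avg, n_imp, speaker_avg, n_spk,
                          new_score, from_imposter, mix=0.5):
    """Fold a new score into the appropriate running average, then
    re-derive the threshold as a point between the two averages."""
    if from_imposter:
        imposter_avg = (imposter_avg * n_imp + new_score) / (n_imp + 1)
        n_imp += 1
    else:
        speaker_avg = (speaker_avg * n_spk + new_score) / (n_spk + 1)
        n_spk += 1
    threshold = imposter_avg + mix * (speaker_avg - imposter_avg)
    return threshold, imposter_avg, n_imp, speaker_avg, n_spk
```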
  • the adaptable speaker recognition system of the present invention can be employed in applications such as the following.
  • in addition to phone services, the system can also be used for account validation for information system access.
  • model adaptation techniques of the present invention can be combined with fusion adaptation, channel adaptation and threshold adaptation techniques.

Abstract

The model adaptation system of the present invention is a speaker verification system that embodies the capability to adapt models learned during the enrollment component to track aging of a user's voice. The system has the advantage of only requiring a single enrollment for the speaker. The adaptation can be applied to several types of recognition models, including neural tree networks (22), Gaussian mixture models (26), and dynamic time warping (16), or to multiple models (30) (i.e., combinations of neural tree networks (22), Gaussian mixture models (26) and dynamic time warping (16)). Moreover, the present invention can be applied to text-dependent or text-independent systems.

Description

MODEL ADAPTATION SYSTEM AND METHOD FOR SPEAKER VERIFICATION
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority from Provisional Application 60/064,069, filed November 3, 1997, entitled Model Adaptation System and Method for Speaker Verification.
Background of the Invention
1. Field of the Invention
The present invention relates to a system and method for adapting speaker verification models to achieve enhanced performance during verification and particularly, to a subword
based speaker verification system having the capability of adapting a neural tree network
(NTN), Gaussian mixture model (GMM), dynamic time warping template (DTW), or
combinations of the above, without requiring additional time consuming retraining of the
models.
The invention relates to the fields of digital speech processing and speaker verification.
2. Description of the Related Art
Speaker verification is a speech technology in which a person's identity is verified
using a sample of his or her voice. In particular, speaker verification systems attempt to match
the voice of the person whose identity is undergoing verification with a known voice. It
provides an advantage over other security measures such as personal identification numbers
(PINs) and personal information, because a person's voice is uniquely tied to his or her
identity. Speaker verification provides a robust method for security enhancement that can be
applied in many different application areas including computer telephony.
Within speaker recognition, the two main areas are speaker identification and verification. A speaker identification system attempts to determine the identity of a person
within a known group of people using a sample of his or her voice. In contrast, a speaker verification system attempts to determine if a person's claimed identity (whom the person
claims to be) is valid using a sample of his or her voice. Speaker verification consists of determining whether or not a speech sample provides a
sufficient match to a claimed identity. The speech sample can be text dependent or text independent. Text dependent speaker verification systems verify the speaker after the
utterance of a specific password phrase. The password phrase is determined by the system or
by the user during enrollment and the same password is used in subsequent verification.
Typically, the password phrase is constrained within a fixed vocabulary, such as a limited
number of numerical digits. The limited number of password phrases gives the imposter a
higher probability of discovering a person's password, reducing the reliability of the system.
A text independent speaker verification system does not require that the same text be
used for enrollment and testing as in a text dependent speaker verification system. Hence,
there is no concept of a password and a user will be recognized regardless of what he or she
speaks.
Speech identification and speaker verification tasks may involve large vocabularies in
which the phonetic content of different vocabulary words may overlap substantially. Thus,
storing and comparing whole word patterns can be unduly redundant, since the constituent
sounds of individual words are treated independently regardless of their identifiable
similarities. For these reasons, conventional vocabulary speech recognition and text-
dependent speaker verification systems build models based on phonetic subword units.
Conventional approaches to performing text-dependent speaker verification include statistical modeling, such as hidden Markov models (HMM), or template-based modeling, such as dynamic time warping (DTW) for modeling speech. For example, subword models, as
described in A.E. Rosenberg, C.H. Lee and F.K. Soong, "Subword Unit Talker Verification Using Hidden Markov Models", Proceedings ICASSP, pages 269-272 (1990), and whole word
models, as described in A.E. Rosenberg, C.H. Lee and S. Gokcen, "Connected Word Talker
Recognition Using Whole Word Hidden Markov Models", Proceedings ICASSP, pages 381-384 (1991), have been considered for speaker verification and speech recognition systems.
HMM techniques have the limitation of generally requiring a large amount of data to
sufficiently estimate the model parameters. Other approaches include the use of Neural Tree Networks (NTN). The NTN is a
hierarchical classifier that combines the properties of decision trees and neural networks, as
described in A. Sankar and R.J. Mammone, "Growing and Pruning Neural Tree Networks", IEEE Transactions on Computers, C-42:221-229, March 1993. For speaker recognition,
training data for the NTN consists of data for the desired speaker and data from other
speakers. The NTN partitions feature space into regions that are assigned probabilities which
reflect how likely a speaker is to have generated a feature vector that falls within the speaker's
region.
The above described modeling techniques rely on speech being segmented into
subwords. Modeling at the subword level expands the versatility of the system. Moreover, it
is also conjectured that the variations in speaking styles among different speakers can be better
captured by modeling at the subword level. Traditionally, segmentation and labeling of speech data was performed manually by a trained phonetician using listening and visual cues.
However, there are several disadvantages to this approach, including the time consuming nature of the task and the highly subjective nature of decision-making required by these manual processes.
One solution to the problem of manual speech segmentation is to use automatic speech segmentation procedures. Conventional automatic speech segmentation processing has used hierarchical and nonhierarchical approaches.
Hierarchical speech segmentation involves a multi-level, fine-to-coarse segmentation which can be displayed in a tree-like fashion called a dendrogram. The initial segmentation is at a fine level, with the limiting case being one vector per segment. Thereafter, a segment is chosen to be merged with either its left or right neighbor using a similarity measure. This process is repeated until the entire utterance is described by a single segment.
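As an illustration of this merge procedure, the following sketch implements fine-to-coarse segmentation with an assumed similarity measure (Euclidean distance between neighboring segment means); the patent does not prescribe a particular measure, and all names are illustrative:

```python
import numpy as np

def hierarchical_segmentation(vectors):
    """Fine-to-coarse merging: start with one vector per segment, then
    repeatedly merge the most similar adjacent pair until a single
    segment remains, recording every level (the dendrogram)."""
    segments = [[np.asarray(v)] for v in vectors]   # finest level
    levels = [[len(s) for s in segments]]
    while len(segments) > 1:
        means = [np.mean(s, axis=0) for s in segments]
        dists = [np.linalg.norm(means[i] - means[i + 1])
                 for i in range(len(means) - 1)]
        i = int(np.argmin(dists))                   # most similar neighbors
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
        levels.append([len(s) for s in segments])
    return levels                                   # segment sizes per level
```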
Non-hierarchical speech segmentation attempts to locate the optimal segment boundaries by using a knowledge engineering-based rule set or by extremizing a distortion or score metric. The techniques for hierarchical and non-hierarchical speech segmentation have the limitation of needing prior knowledge of the number of speech segments and corresponding segment models.
A technique not requiring prior knowledge of the number of clusters is defined as "blind" clustering. This method is disclosed in U.S. patent application no. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing Systems", filed on April 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, both of which are herein incorporated by reference. In blind clustering, the number of clusters is unknown when the clustering is initiated. In the aforementioned application, an estimate of the range of the minimum number of clusters and maximum number of clusters of a data sample is determined. A clustering data sample includes objects having a common homogeneity property. An optimality criterion is defined for the estimated number of clusters. The optimality criterion determines how optimal
the fit is for the estimated number of clusters to the given clustering data samples. The optimal number of clusters in the data sample is determined from the optimality criterion. The speech sample is segmented based on the optimal boundary locations between segments and
the optimal number of segments.
The blind segmentation method can be used in text-dependent speaker verification
systems. The blind segmentation method is used to segment an unknown password phrase
into subword units. During enrollment in the speaker verification system, the repetition of the
speaker's password is used by the blind segmentation module to estimate the number of
subwords in the password and locate optimal subword boundaries. For each subword segment of the speaker, a subword segmentator model, such as a neural tree network or a
Gaussian mixture model can be used to model the data of each subword.
Further, there are many multiple model systems that combine the results of different
models to further enhance performance.
One critical aspect of any of the above-described speaker verification systems that can
directly affect its success is robustness to intersession variability and aging. Intersession
variability refers to the situation where a person's voice can experience subtle changes when
using a verification system from one day to the next. A user can anticipate the best
performance of a speaker verification system when performing a verification immediately after
enrollment. However, over time the user may experience some difficulty when using the
system. For substantial periods of time, such as several months to years, the effects of aging
may also degrade system performance. Whereas the spectral variation of a speaker may be small when measured over a several week period, as time passes this variance will grow, as described in S. Furui, "Comparison of Speaker Recognition Methods using Statistical Features and Dynamic Features", IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29:342-350, April 1981. For some users, the effects of aging may render the original voice model unusable.
What is needed is an adaptation system and method for speaker verification systems, and in particular for discriminant and multiple model-based systems, that requires minimal computational and storage resources. What is needed is an adaptation system that compensates for the effects of intersession variability and aging.
Summary of the Invention
Briefly described, the present invention relates to new model adaptation schemes for speaker verification systems. Model adaptation changes the models learned during the enrollment component dynamically over time, to track aging of the user's voice. The speaker adaptation system of the present invention has the advantage of only requiring a single enrollment for the speaker. Typically, if a person is merely enrolled in a single session, performance of the speaker verification system will degrade due to voice distortions as a consequence of the aging process as well as intersession variability. Consequently, performance of a speaker verification system may become so degraded that the speaker is required to re-enroll, thus, requiring the user to repeat his or her enrollment process. Generally, this process must be repeated every few months.
With the model adaptation system and method of the present invention, re-enrollment sessions are not necessary. The adaptation process is completely transparent to the user. For example, a user may telephone into his or her "Private Branch Exchange" to gain access to an unrestricted outside line. As is customary with a speaker verification system, the user may be
requested to state his or her password. With the adaptation system of the present invention,
this one updated utterance can be used to adapt the speaker verification model. For example,
every time a user is successfully verified, the test data may be considered as enrollment data, and the models trained and modeled using the steps following segmentation. If the password
is accepted by the system, the adapted system uses the updated voice features to update the
particular speaker recognition model almost instantaneously. Model adaptation effectively
increases the number of enrollment samples and improves the accuracy of the system.
Preferably, the adaptation schemes of the present invention can be applied to several types of speaker recognition systems including neural tree networks (NTN), Gaussian Mixture
Models (GMMs), and dynamic time warping (DTW) or to multiple models (i.e., combinations
of NTNs, GMMs and DTW). Moreover, the present invention can be applied to text-
dependent or text-independent systems. For example, the present invention provides an adaptation system and process that
adapts neural tree network (NTN) modules. The NTN is a hierarchical classifier that
combines the properties of decision trees and feed-forward Neural Networks. During initial
enrollment, the neural tree network learns to distinguish regions of feature space that belong
to the target speaker from those that are more likely to belong to an imposter. These regions
of feature space correspond to "leaves" in the neural tree network that contain probabilities.
The probabilities represent the likelihood of the target speaker having generated data that falls
within that region of feature space. Speaker observations within each region are determined
by the number of "target vectors" landing within the region. The probability at each leaf of the NTN is computed as the ratio of speaker observations to total observations encountered at
that leaf during enrollment.
During the adaptation method of the present invention, the number of targeted vectors,
or speaker observations, is updated based on the new utterance at a leaf. Each vector of the adaptation utterance is applied to the NTN, and the speaker observation count of the leaf at which
the vector arrives is incremented. By maintaining the original number of speaker observations
and imposter observations at each leaf, the probability can be updated in this manner. The
probabilities are then computed with new leaf counts. In this manner, the discriminant model
can be updated to offset the degraded performance of the model due to aging and intersession
variability.
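The leaf-count bookkeeping described above can be sketched as follows; this is a minimal illustration, where the class and function names are assumptions and `route_to_leaf` stands in for the tree's sequential decision path:

```python
class NTNLeaf:
    """Leaf of the neural tree network, holding observation counts."""
    def __init__(self, speaker_count, imposter_count):
        self.speaker_count = speaker_count     # "1"-labeled vectors seen here
        self.imposter_count = imposter_count   # "0"-labeled vectors seen here

    @property
    def probability(self):
        # Ratio of speaker observations to total observations at the leaf.
        return self.speaker_count / (self.speaker_count + self.imposter_count)

def adapt_ntn(route_to_leaf, adaptation_vectors, is_imposter=False):
    """Increment per-leaf counts for each vector of an adaptation utterance."""
    for vector in adaptation_vectors:
        leaf = route_to_leaf(vector)           # descend the tree to a leaf
        if is_imposter:
            leaf.imposter_count += 1           # the NTN can also absorb imposter data
        else:
            leaf.speaker_count += 1
    # Leaf probabilities are recomputed from the updated counts.
```

For the example of Figure 4, a leaf holding three speaker and two imposter vectors scores 3/5 = 0.6; one adaptation vector arriving at that leaf raises it to 4/6 ≈ 0.67.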
In another embodiment of the present invention, statistical models such as a Gaussian
mixture model (GMM) can be adapted based on new voice utterances. In the GMM, a region
of feature space for a target speaker is represented by a set of multivariate Gaussian
distributions. During initial enrollment, certain component distribution parameters are
determined including the mean, covariance and mixture weights corresponding to the
observations. Essentially, each of these parameters is updated during the adaptation process
based on the added observations obtained with the updated voice utterance. For example, the
mean is updated by first scaling the mean by the number of original observations. This value is
then added to a new mean based on the updated utterance, and the sum of these mean values
is divided by the total number of observations. In a similar manner, the covariance and
mixture weights can also be updated.
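A sketch of this count-weighted update for a single mixture component follows; the names are illustrative, and the covariance update follows the same scale-add-renormalize pattern and is omitted here:

```python
import numpy as np

def adapt_component_mean(mean, n_old, new_vectors):
    """Running-average mean update: scale the old mean by its observation
    count, add the new observations, divide by the new total."""
    n_new = len(new_vectors)
    updated = (np.asarray(mean) * n_old + np.sum(new_vectors, axis=0)) \
              / (n_old + n_new)
    return updated, n_old + n_new

def reestimate_mixture_weights(component_counts):
    """Mixture weights re-estimated as each component's share of all observations."""
    counts = np.asarray(component_counts, dtype=float)
    return counts / counts.sum()
```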
In another embodiment of the present invention, template-based approaches, such as
dynamic time warping (DTW), can be adapted using new voice utterances. Given a DTW template that has been trained with the features for N utterances, the features for a new
utterance can be averaged into this template. For example, the data for the original data template can be scaled by multiplying it by the number of utterances used to train it, or in this case, N. The data for the new utterance is then added to this scaled data and then the sum is
divided by the new number of utterances used in the model, N+1. This technique is very
similar to that used to update the mean component of the Gaussian mixture model.
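In code, the template update reads as follows; this is a minimal sketch under the stated assumption that the new utterance has already been warped to the template's length:

```python
import numpy as np

def adapt_dtw_template(template, warped_features, n_utterances):
    """Average one time-aligned adaptation utterance into a DTW template
    trained from n_utterances utterances."""
    scaled = np.asarray(template) * n_utterances     # undo prior averaging
    updated = (scaled + warped_features) / (n_utterances + 1)
    return updated, n_utterances + 1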
Although not necessary, the adaptive modeling approach used in the present invention
is preferably based on subword modeling for the NTN and GMM models. The adaptation method occurs during verification. For adapting the DTW template, it is preferred that whole-
word modeling be used. As part of verification, features are first extracted for an adaptation
utterance according to any conventional feature extraction method. The features are then
matched, or "warped", onto a DTW template. This provides 1) a modified set of features that best matches the DTW template and 2) a distance, or "distortion", value that can be used as a
measurement for speaker authenticity. The modified set of features output by the DTW
warping has been found to remedy the negative effects of noise or speech that precedes or
follows a spoken password. At this point, the warped features are used to adapt the DTW
template.
Next, the feature data is segmented into sub-words for input into the NTN and GMM
models. While several types of segmentation schemes can be used with the present invention,
including hierarchical and nonhierarchical speech segmentation schemes, it is preferred that the
spectral features be applied to a blind segmentation algorithm, such as that disclosed in U.S.
patent application no. 08/827, 562 entitled "Blind Clustering of Data With Application to
Speech Processing Systems", filed on April 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, both
of which are herein incorporated by reference. During enrollment in the speaker verification system, the repetition in the speaker's voice is used by the blind segmentation module to estimate the number of subwords in the password, and to locate the optimal subword
boundaries.
The data at each sub-word is then modeled preferably with a first and second modeling
module. For example, the first modeling module can be a neural tree network (NTN) and the
second modeling module can be a Gaussian mijcture model (GMM). In this embodiment, the
adaptive method and system of the present invention is applied to both of these subword
models individually in addition to the DTW template to achieve enhanced overall performance.
The outputs of these models, namely the NTN, GMM and DTW, are then combined,
according to any one of several multiple model combination algorithms known in the art, to make a decision with respect to the speaker.
The resulting performance after adaptation is comparable to that obtained by retraining
the model with the addition of new speech utterances. However, while retraining is time- consuming, the adaptation process, can conveniently be performed following a verification,
while consuming minimal computational resources. Further, the adaptation is transparent to
the speaker. An additional benefit of adaptation is that the original training data does not need
to be stored, which can be burdensome for systems deployed within large populations. The invention can be used with a number of other adaptation techniques, in addition to
the model adaptation described and claimed herein. These techniques include fusion adaptation,
channel adaptation and threshold adaptation. The invention will be more fully described by reference to the following drawings.
Brief Description of the Drawings Figure 1 is a schematic diagram of the speaker verification system in accordance with
the teachings of the present invention.
Figure 2 is a flow diagram illustrating the adaptation of the dynamic time warping
(DTW) template during speaker verification.
Figure 3 is a flow diagram of the neural tree network adaptation system during speaker
verification.
Figure 4 is a diagram illustrating the adaptation of the neural tree network (NTN)
module according to the teachings of the present invention.
Figure 5 is a flow diagram illustrating the adaptation of the Gaussian mixture model (GMM) during speaker verification.
Description of the Preferred Embodiments
Figure 1 illustrates a schematic diagram of a multiple model speaker recognition
system 10. Preferably, the model is a text-dependent speaker recognition system comprising a
dynamic time warping component 16, neural tree network (NTN) components 22 and Gaussian mixture model (GMM) components 26. Alternatively, the present invention may be
used to adapt models comprising combinations including DTW with NTN models, GMM and
NTN models, and DTW with GMM models, or individual models.
Sub-word processing is performed by the segmentor 18, with each sub-word output
feeding into an NTN 22 and GMM 26 module. The adaptive modeling system and method of the present invention is described in detail below with reference to the speaker verification
system shown in Figure 1.
As part of verification, features must first be extracted for an adaptation utterance. Thus, the speech sample is applied as a speech signal to pre-processing and feature extraction
modules 14 for converting the speech signal into spectral feature vectors. The pre-processing
includes analog to digital conversion of the speech signal. The analog to digital conversion
can be performed with standard telephony boards such as those manufactured by Dialogic. A
speech encoding method such as the ITU G.711 standard (μ-law and A-law) can be used to encode the speech samples. Preferably, a sampling rate of 8000 Hz is used. Alternatively, the speech may
be obtained in digital format, such as from an ISDN transmission. In such a case, a telephony board is used to handle the Telco signaling protocol.
In the preferred embodiment, the computer processing unit for the speaker verification
system is an Intel Pentium platform general purpose computer processing unit (CPU) of at
least 100 MHz having approximately 10 MB of associated RAM memory and a hard or fixed
drive as storage. Alternatively, an additional embodiment could be the Dialogic Antares card.
Pre-processing can include mean removal of the DC offset in the signal, pre-emphasis
to normalize the spectral tilt in the speech spectrum, as well as the removal of background
silence in the speech signal. Background silence in the speech signal can be removed using conventional methods such as speech and silence discrimination techniques using energy
and/or zero crossings. Thereafter, the pre-processed speech is Hamming windowed and analyzed, for example, in 30 millisecond analysis frames with a 10 millisecond shift between
consecutive frames.
After preprocessing, feature extraction is performed on the processed speech in module 14. Spectral features are represented by speech feature vectors determined within
each frame of the processed speech signal. In the feature vector module 14, spectral feature
vectors can be obtained with conventional methods such as linear predictive (LP) analysis to determine LP cepstral coefficients, Fourier transform analysis and filter bank analysis. One method of feature extraction is disclosed in U.S. Patent 5,522,012, entitled "Speaker
Identification and Verification System," issued on May 28, 1996 and incorporated herein by
reference. A preferred method for obtaining spectral feature vectors is a 12th order LP analysis to determine 12 cepstral coefficients.
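A sketch of this front end under the stated parameters (8 kHz sampling, mean removal, pre-emphasis, 30 ms Hamming-windowed frames with a 10 ms shift, 12th-order LP analysis yielding 12 cepstral coefficients) might look as follows; the constants, the 0.95 pre-emphasis factor, and the helper names are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

FS = 8000                   # 8000 Hz sampling rate
FRAME = int(0.030 * FS)     # 30 ms analysis frame
SHIFT = int(0.010 * FS)     # 10 ms shift between frames
ORDER = 12                  # 12th-order LP analysis

def lp_cepstra(frame, order=ORDER):
    """LP coefficients via the autocorrelation method, then LP cepstra."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    c = np.zeros(order)
    for n in range(1, order + 1):   # standard LPC-to-cepstrum recursion
        c[n - 1] = a[n - 1] + sum((k / n) * c[k - 1] * a[n - k - 1]
                                  for k in range(1, n))
    return c

def extract_features(speech):
    speech = speech - speech.mean()                       # DC-offset removal
    speech = np.append(speech[0],
                       speech[1:] - 0.95 * speech[:-1])   # pre-emphasis
    window = np.hamming(FRAME)
    return np.array([lp_cepstra(window * speech[s:s + FRAME])
                     for s in range(0, len(speech) - FRAME + 1, SHIFT)])
```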
The result of the feature extraction module is that vectors representing a template of the utterance are generated. Preferably, the template is stored in a database. Following
storage of the template, the speech undergoes dynamic time warping.
Next, the feature data is warped using a dynamic time warping template 16. This removes extraneous noise or speech that precedes or follows the spoken password. The
warped feature data is used for the subsequent segmentation and model evaluation.
Additionally, a score is computed and stored during this warping process. This score provides
a similarity measure between the spoken utterance and DTW template that can be used as a
speaker verification score. This score, referred to as "x", represents a distance value ranging
between 0 and infinity. The score can be mapped onto the scale of a probability by raising its
negative to an exponential, i.e., exp(-x). At this point it can be combined with the scores of
the NTN and GMM to provide a third score component towards the overall model score.
Next, the speech is preferably segmented into sub-words using a blind segmentation
module 18. The preferred technique for subword generation is automatic blind speech segmentation, or "Blind Clustering", such as that disclosed in U.S. patent application no. 08/827, 562 entitled "Blind Clustering of Data With Application to Speech Processing Systems", filed on April 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, both of which are herein incorporated by reference and assigned to the assignees of the present invention. During enrollment in the speaker verification system, the automatic blind speech segmentation determines the number of subwords in the password and the location of optimal subword boundaries. Additionally, the subword durations are normalized by the total duration of the voice phrase and stored in a database for subsequent use during verification.
Alternative approaches to subword generation may be used with the present invention. A first alternative is the traditional approach, where segmentation and labeling of speech data is performed manually by a trained phonetician using listening and visual cues.
A second alternative to subword generation is automatic hierarchical speech segmentation, which involves a multi-level, fine-to-course segmentation. This segmentation can be displayed in a tree-like fashion called dendogram. The initial segmentation is a fine level with the limiting case being a vector equal to one segment. Thereafter, a segment is chosen to be merged with either its left or right neighbor using a similarity measure. This process is repeated until the entire utterance is described by a single segment.
A third alternative to subword generation is automatic non-hierarchical speech segmentation. This segmentation method attempts to locate the optimal segment boundaries by using a knowledge engineering-based rule set or by extremizing a distortion or score
metric.
.After subwords are obtained, the data at each sub-word is then modeled preferably with one or more combinations of a first and second modeling module, as shown in Figure 1. For example, the first modeling module can be a neural tree network (NTN) 22 and the second
modeling module can be a Gaussian mixture model (GMM) 26. The NTN 22 provides a
discriminative-based speaker score and the GMM 26 provides a speaker score that is based on a statistical measure. Figure 1 shows N models for the NTN classifier 22 and N models for the GMM classifier 26. Both modules 22, 26 can determine a score for each spectral vector of
a subword segment.
The scores of the NTN 22 and GMM 26 modules can be combined to obtain a
composite score for the subword in block 30. In the preferred embodiment, the results of the
dynamic time warping 16, neural tree network 22 and the Gaussian mixture models 26 are
combined using a linear opinion pool, as described below. Other ways of combining the data, however, can be used with the present invention including a log opinion pool or a "voting"
mechanism, wherein hard decisions from the DTW 16, NTN 22 and GMM 26 are considered
in the voting process. Since these three modeling approaches tend to have errors that are uncorrelated, performance improvements can be obtained by combining the model outputs.
NTN modules 22 are used to model the subword segments of the user password. The
NTN 22 is a hierarchical classifier that uses a tree architecture to implement a sequential linear
decision strategy. Specifically, the training data for an NTN 22 consists of data from a target
speaker, labeled as one, along with data from other speakers that are labeled as zero. The data
from other speakers is preferably stored in a database which may be RAM, ROM, EPROM,
EEPROM, hard disk, CD ROM, a file server, or other storage device.
The NTN 22 learns to distinguish regions of feature space that belong to the target
speaker from those that are more likely to belong to an impostor. These regions of feature
space correspond to leaves in the NTN 22 that contain probabilities. These probabilities represent the likelihood of the target speaker having generated data that falls within that region of feature space, as described in K.R. Farrell, R.J. Mammone, and K.T. Assaleh, "Speaker Recognition using Neural Networks and Conventional Classifiers", IEEE Trans. Speech and Audio Processing, 2(1), part 2 (1994). The functioning of NTN networks with respect to speaker recognition is also disclosed in U.S. Patent Application 08/159,397, filed November 29, 1993, entitled "Rapidly Trainable Neural Tree Network", U.S. Patent Application Serial No. 08/479,012 entitled "Speaker Verification System", U.S. Patent Application no. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing Systems", filed on April 1, 1997, and its corresponding U.S. Provisional Application no. 60/014,537 entitled "Blind Speech Segmentation", filed on April 2, 1996, each of which is incorporated herein by reference in its entirety. The adaptation of the NTN 22 model is described in detail below.
As discussed above, a Gaussian mixture model (GMM) 26 is also used to model each of the subwords. In the GMM 26, a region of feature space for a target speaker is represented by a set of multivariate Gaussian distributions. In the preferred embodiment, the mean vector and covariance matrix of the subword segments are obtained as a by-product of the blind segmentation module 18 and are saved as part of the GMM modules 26, as described in U.S. Patent Application Serial No. 08/827,562, entitled "Blind Clustering of Data With Application to Speech Processing Systems," filed on April 1, 1997, and its corresponding U.S. Provisional Application No. 60/014,537, entitled "Blind Speech Segmentation," filed on April 2, 1996, both of which are herein incorporated by reference. The GMM probability distribution function is expressed as

$$p(\mathbf{x} \mid \theta) = \sum_{i=1}^{C} P(w_i)\, p(\mathbf{x} \mid \mu_i, \sigma_i^2).$$
Each of the C mixture components is defined by a mixture weight $P(w_i)$ and a normal distribution function $p(\mathbf{x} \mid \mu_i, \sigma_i^2)$, where $\mu_i$ is the mean vector and $\sigma_i^2$ is the covariance matrix. In the preferred embodiment, the normal distribution is constrained to have a diagonal covariance matrix defined by the vector $\sigma_i^2$. The PDF is used to produce the subword GMM score.
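As an illustration of how this PDF yields a subword score, the following sketch evaluates a diagonal-covariance GMM in the log domain. The function name and the log-domain evaluation are choices made for the sketch; the text above only specifies the PDF itself.

```python
import numpy as np

def gmm_log_score(x, weights, means, variances):
    """log p(x|theta) for p(x|theta) = sum_i P(w_i) N(x; mu_i, sigma_i^2),
    with diagonal covariances, evaluated stably in the log domain."""
    x = np.asarray(x, dtype=float)
    log_terms = []
    for P_wi, mu, var in zip(weights, means, variances):
        # Log-density of a diagonal-covariance normal distribution.
        log_gauss = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        log_terms.append(np.log(P_wi) + log_gauss)
    return np.logaddexp.reduce(log_terms)

# Two-component, two-dimensional example with made-up parameters.
score = gmm_log_score(x=[0.1, -0.2],
                      weights=[0.6, 0.4],
                      means=[np.zeros(2), np.ones(2)],
                      variances=[np.ones(2), 2.0 * np.ones(2)])
```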
A scoring algorithm is used for each of the NTN and GMM models. The output score (an estimated a-posteriori probability) of each subword model is combined across all the subwords of the password phrase, so as to yield a composite score for the utterance.
The scoring algorithm for combining the scores of the subword models 22, 26 can be based on any of the following schemes:
(a) PHRASE-AVERAGE: average the output scores for the vectors over the entire phrase;
(b) SUBWORD-AVERAGE: average the scores of the vectors within each subword, before averaging the (averaged) subword scores; and
(c) SUBWORD-WEIGHTING: same as (b) subword-average scoring, but the (averaged) subword scores are weighted in the final averaging process.
Transitional (or durational) probabilities between the subwords can also be used while computing the composite score for the password phrase. The preferred embodiment is (a) phrase-average scoring. The result of scoring provides a GMM 26 score and an NTN 22 score, which must then be combined.
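The three schemes can be sketched as follows; the function names and the per-vector scores are illustrative assumptions.

```python
import numpy as np

def phrase_average(subword_scores):
    """(a) Average every vector score over the entire phrase."""
    return float(np.mean(np.concatenate(subword_scores)))

def subword_average(subword_scores, weights=None):
    """(b) Average the vector scores within each subword first, then
    average the subword means; (c) the same, with weights applied in
    the final averaging."""
    means = np.array([np.mean(s) for s in subword_scores])
    return float(np.average(means, weights=weights))

# Per-vector scores for a two-subword password phrase (made-up numbers).
scores = [np.array([0.8, 0.7]), np.array([0.6, 0.9, 0.5])]
print(phrase_average(scores))                # scheme (a)
print(subword_average(scores))               # scheme (b)
print(subword_average(scores, [2.0, 1.0]))   # scheme (c)
```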
In the preferred embodiment, a linear opinion pool method is used to combine the output scores from the DTW 16, NTN 22 and GMM 26. The linear opinion pool method computes the final score as a weighted sum of the outputs of each model:

$$S = \sum_{i=1}^{n} \alpha_i S_i,$$

where $S_i$ is the output score of the $i$-th model and $\alpha_i$ is its combination weight.
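A minimal sketch of such a pool follows, assuming the weights $\alpha_i$ form a convex combination; the scores and the equal weighting are made up for illustration.

```python
def linear_opinion_pool(scores, weights):
    """Final score S = sum_i alpha_i * S_i over the n model outputs."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to one"
    return sum(a * s for a, s in zip(weights, scores))

# Equal weighting of hypothetical DTW, NTN and GMM scores.
final_score = linear_opinion_pool([0.74, 0.61, 0.80], [1/3, 1/3, 1/3])
```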
Once the variables in the above equation are known, a threshold value is output and stored in the database. The stored threshold value is compared to a "final score" in the testing component to determine whether a test user's voice matches the model closely enough to conclude that the two voices are from the same person.
Now that the model 10 has been disclosed in general, the adaptation methods applied to the aforementioned DTW 16, NTN 22 and GMM 26 modules are disclosed in detail. The adaptation occurs during verification. First, features are extracted from an adaptation utterance. These features are warped onto a DTW template 16 and then segmented into subword partitions in the segmentor 18 so that they can be processed by the corresponding NTN 22 and GMM 26 models at each subword. The preferred method of DTW adaptation is shown in Figure 2. In summary, the DTW 16 warps the feature data for subsequent use by the segmentor 18. The DTW template 16 can be adapted by averaging the warped feature data into the original DTW template 16. The resulting template is then updated in the model.
The DTW adaptation method can be better explained with reference to Figure 2. The first step 100 is to retrieve the stored number of utterances (referred to as M) used to compute the current DTW template. The incoming feature data is then warped onto the DTW template, as described in step 104. A result of warping feature data onto a DTW template is that the new feature data is of the same length as the DTW template. In other words, the incoming data now has the same number of feature vectors as the DTW template. Each feature vector of the DTW template is then scaled (i.e., multiplied) by the number of utterances used to compute the original template, as shown in step 108. Then, referring to step 112, the warped feature data is added to the scaled DTW feature data. This is accomplished by adding each element of each warped feature vector to the corresponding element of the scaled feature vector in the DTW template. Then, as shown in item 116, the sum of the scaled and warped feature data is normalized by dividing it by the new number of utterances, which is M+1.
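Putting steps 100 through 116 together, the template update might be sketched as follows; the array shapes and names are assumptions made for the sketch.

```python
import numpy as np

def adapt_dtw_template(template, warped, m):
    """Fold one adaptation utterance into a DTW template.

    template: (T, D) array built from m utterances (step 100 retrieves m)
    warped:   (T, D) adaptation features already warped onto the template
              (step 104), so the two arrays have the same length T
    Returns the updated template and the new utterance count m + 1.
    """
    scaled = template * m             # step 108: scale by utterance count
    summed = scaled + warped          # step 112: element-wise addition
    return summed / (m + 1), m + 1    # step 116: normalize by the new count
```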
The preferred method of NTN adaptation is shown in Figure 3. The NTN 22 determines the speaker score for a given vector by traversing the tree and retrieving the probability at the leaf at which the vector arrives. The probability at each leaf of the NTN 22 is computed as the ratio of speaker observations (i.e., target vectors) to total observations (i.e., total vectors) encountered during training. By maintaining the number of speaker observations and impostor observations at each leaf, as set forth in step 34, the probability update is straightforward. Each vector of the adaptation utterance is applied to the NTN 22, as set forth in block 38, and the number of speaker observations within each leaf is counted, as set forth in block 42. The new number of speaker observations and total observations are stored in memory. This concludes the verification process for the NTN. When testing, however, the new number of speaker observations is divided by the total number of observations to obtain an updated probability, as set forth in step 46.
The NTN adaptation method can be better understood with reference to Figure 4. The original speaker target vectors are designated as "1" in the figure. Impostor vectors are designated by a "0". The adaptation vectors based on the updated voice utterance are those within the dashed circles 70, 74. For the left-most leaf 71 in Figure 4, the original probability is computed as 0.6, by dividing the number of original speaker target vectors (i.e., three) by the total number of vectors (i.e., five). After applying the updated speech utterance, the adapted probability is determined to be 0.67, by dividing the number of speaker target vectors (i.e., four) by the total number of vectors (i.e., six). Advantages can also be obtained by applying more weight to the new observations.
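Reusing the hypothetical NTNNode structure from the earlier sketch, the count-based update for this example might read:

```python
def adapt_ntn_leaf(leaf, new_speaker_vectors):
    """Add speaker observations that arrived at this leaf during
    adaptation; the probability is re-derived from the stored counts."""
    leaf.speaker_count += new_speaker_vectors
    leaf.total_count += new_speaker_vectors
    return leaf.speaker_count / leaf.total_count

leaf = NTNNode(speaker_count=3, total_count=5)   # original p = 3/5 = 0.6
print(adapt_ntn_leaf(leaf, 1))                   # adapted p = 4/6, about 0.67
```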
Since the NTN 22 also retains impostor counts at each leaf, it can also be adapted with an impostor utterance. This is accomplished in a manner similar to how the speaker counts were added. Specifically, the feature vectors for an impostor utterance are applied to the NTN 22, and the leaf impostor counts are updated to reflect the impostor data that arrived at each leaf. The NTN 22 is unique in this sense (as compared to the DTW and GMM models) in that it can be adapted with impostor data.
Since only the leaves of the NTN 22 are modified during adaptation, there is the implicit assumption that the feature space partitions do not have to change. Adapting the
discriminant boundaries is not feasible as the nodes and leaves only retain information regarding the weight vectors and observation counts, respectively.
In the preferred embodiment, the GMM 26 modules are also adapted individually using subword data acquired from the blind segmentation. The adaptation of a single subword GMM module 26 is described, since the process is identical for each subword. The adaptation method for a single subword GMM is shown in Figure 5. Referring to the first equation above, the adaptation process, under processor control, produces an updated set of GMM parameters $\{P'(w_i), \mu_i', \sigma_i'^2;\ i = 1 \ldots C\}$ for the GMM PDF that reflects the contribution of the adaptation phrase, as described below.
A clustering of the adaptation data is performed as the first step in the individual GMM adaptation, as shown in step 82. If the adaptation features are defined by $X$ with $N$ vectors, the clustering groups the data into $C$ subsets $X_i$, $i = 1 \ldots C$, where $X_i$ contains $N_i$ vectors. A simple Euclidean distance between the input vector and the component distribution means is used to partition the data.
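A sketch of this nearest-mean partitioning, with names and array shapes assumed:

```python
import numpy as np

def partition_adaptation_data(X, means):
    """Assign each adaptation vector to the component whose mean is
    nearest in Euclidean distance, yielding the subsets X_1 ... X_C."""
    X, means = np.asarray(X), np.asarray(means)
    # (N, C) matrix of distances from every vector to every component mean.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return [X[labels == i] for i in range(len(means))]   # subsets X_i
```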
The verification model retains information on the number of utterances used to train
the GMM along with the number of prior adaptations. The sum of these values is used to
scale the mixture weights, means and variances before adding new statistics, as set forth in
step 86. The algorithm also makes the assumption that the prior utterances all contain N training vectors. It does this because the true sizes of the previous training and adaptation
utterances are not retained as part of the verification model. Given these assumptions, the
adapted component distribution parameters (i.e., mixture weight, mean and covariance) can
be determined in steps 88, 90 and 92 as follows:
$$P'(w_i) = \frac{M N P(w_i) + N_i}{(M+1)\,N} \qquad (3)$$

$$\mu_i' = \frac{M N P(w_i)\,\mu_i + \sum_{\mathbf{x} \in X_i} \mathbf{x}}{M N P(w_i) + N_i} \qquad (4)$$

$$\sigma_i'^2 = \frac{M (N-1) P(w_i)\,\sigma_i^2 + \sum_{\mathbf{x} \in X_i} (\mathbf{x} - \mu_i')^2}{M (N-1) P(w_i) + N_i - 1} \qquad (5)$$

This approach to adapting the distribution parameters weights all training and
adaptation utterances equally. This means that each new adaptation phrase has less effect on the GMM. By limiting M to a maximum value, a simple forgetting factor can be incorporated
into the adaptation.
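A sketch tying the update equations and the forgetting factor together follows. Because equations (3) through (5) are reconstructed above from a garbled source, the exact formulas in this sketch should be read as a plausible rendering rather than the verbatim disclosed algorithm.

```python
import numpy as np

def adapt_gmm_component(P_wi, mu, var, X_i, M, N, M_max=None):
    """Update one component's weight, mean and diagonal variance with the
    N_i adaptation vectors X_i, treating each of the M prior utterances
    as N vectors (a rendering of equations (3)-(5) above)."""
    if M_max is not None:
        M = min(M, M_max)              # forgetting factor: cap the history
    X_i = np.asarray(X_i)
    N_i = len(X_i)
    prior = M * N * P_wi               # effective prior vector count
    new_P = (prior + N_i) / ((M + 1) * N)                        # eq. (3)
    new_mu = (prior * mu + X_i.sum(axis=0)) / (prior + N_i)      # eq. (4)
    prior_v = M * (N - 1) * P_wi
    new_var = (prior_v * var + ((X_i - new_mu) ** 2).sum(axis=0)) / \
              (prior_v + N_i - 1)                                # eq. (5)
    return new_P, new_mu, new_var
```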
Examples:
Example 1:
All results presented herein are produced from experiments conducted on a verification
database that contains nine enrolled speakers. Additionally, there are 80 separate speakers
that are used as the development speakers for training the neural tree network. The database
contains two data sets with collections separated by a six-month period. The first set contains
13 repetitions of each person speaking their full name and five repetitions of them speaking each other person's name. This amounts to 58 recordings for each speaker. The second set
contains ten more repetitions of each person speaking their own name. We refer to a repetition of a person saying their own name as a true-speaker repetition and a person saying
another person's name as an impostor repetition. The two data collections are referred to as the recent set and the aged set, respectively.
Three training scenarios are examined. In each case, all training repetitions were taken
from the recent collection set. The scenarios are outlined below.
1. Train a verification model with three true-speaker repetitions. (TR3)
2. Train a verification model with six true-speaker repetitions. (TR6)
3. Train a verification model with three true-speaker repetitions and adapt on three true-speaker repetitions. (TR3AD3)
For the second and third training scenarios, the first three training repetitions are kept fixed, while the second set of three repetitions is varied using a resampling scheme. The resampling technique is based on a leave-M-out data partitioning where M=3. For each training, three new repetitions are used. This allows for three independent training sequences for the ten available true-speaker repetitions. The fixed training repetitions used for scenarios 2 and 3 are the same as those used in scenario 1. The first scenario provides a baseline system performance, the second shows the benefit of adding speaker information to the original training, while the third shows the benefit of adapting the model using the additional speaker information.
A set of three experiments is initially performed for each training scenario. This includes testing the GMM 26 and NTN 22 models individually, along with testing a combined model. DTW adaptation was not examined for this example. All testing repetitions are taken from the recent collection set. For the baseline training scenario, ten true-speaker repetitions and 45 impostor repetitions are tested for each speaker model. Equal error rates (EER) are then calculated for the system by collecting performance across speakers. For scenarios 2 and 3, three resampling tests are performed for each individual experiment. For each test, the appropriate three true-speaker repetitions are excluded from the experiment. This results in 7 true-speaker and 45 impostor repetitions for each test, or 21 true-speaker and 135 impostor repetitions for each speaker.
Table 1 displays the performance of these experiments. Several observations can be made when inspecting the table. First, the additional speech data provides a performance benefit when the model is trained on all the data. Second, adapting on the additional training data also improves performance to some degree. The GMM adaptation does a better job of matching the training performance than the NTN adaptation. Although the NTN does not adapt as well as the GMM, it still helps reduce the EER when applying adaptation to the combined model.
Table 1. Comparative Data: Verification EER performance for several training scenarios and verification model types. All experiments evaluated with the recent collection data. [The table values are not reproduced in this text.]
Example 2:
A second set of experiments is performed for the combined verification model. For this set, true-speaker testing repetitions are taken from the aged collection set. All other training and testing conditions are kept the same as in the previous experiments. These results are displayed in Table 2. The table shows that all training scenarios suffer when evaluating the aged true-speaker repetitions. This is to be expected, since the verification model is trained on data collected over a short period of time. There is still improvement, however, when the model is trained on additional data from the recent set. As with the previous experiments, the adaptation also improves the performance, but not as much as the full training.
Table 2. Comparative Data: Verification EER performance for several training scenarios and the combined model type. All experiments evaluated with the aged collection data. [The table values are not reproduced in this text.]
It was shown above that the GMM error rate was reduced from 5.3% to 1.7% and the NTN performance improved from 6.0% to 4.3% when adapting on additional training data. A classifier that combines these two models shows similar improvement and performs better than either classifier in isolation. In addition, when testing the combined classifier on aged data, the error rate was reduced from 12.% to 7.2%. The overall system performance using adaptation is comparable to that achieved by training the model with the adaptation information.
The present model adaptation method and system has been described with respect to a text-dependent speaker verification system. The invention, however, could be applied to text-independent systems as well. Because there is no temporal ordering of feature data in that case, preferably only a single NTN or GMM is trained, and the DTW template is omitted, since it relies on the temporal ordering of the feature data. The adaptation procedures described above can be applied to any such models. Although the adaptation method and procedure has been described above with respect to a multiple-model system, the present invention can clearly be applied to enhance the performance of template-based speaker verification models (DTW), neural tree network based speaker verification systems, or statistical speaker verification models (GMM),
individually. It has been shown that adaptation is an effective method for improving the performance of a speaker verification model. However, it is also important to discuss the criteria that determine when adaptation should take place. Adapting a model with utterances that are not from the speaker for which the model was trained can have negative performance implications. Hence, a strategy must be used for recommending which data should be used for adaptation and which data should be discarded. Three recommendations for adaptation criteria are as follows. One is to compare the composite model score to a threshold and deem the utterance acceptable for adaptation if it passes some threshold criteria. Another method is to analyze the model scores individually and, if the majority of the models recommend adaptation (by evaluating threshold criteria), then use the data to adapt all models. Finally, another scenario may be where data is known to belong to the speaker whose model is to be adapted. In this case, the criteria checks can be bypassed and the model can be updated with the data.

In addition to adapting the model components of a model, one can also adapt the threshold component. In the preferred embodiment of the invention, the threshold is computed as follows. During model training, estimates are made for the average speaker score and the average impostor score. The average speaker score is obtained by evaluating the trained model with the original training utterances and recording the scores. The average score is then computed from these scores and scaled to account for the bias in the data. This is done to compensate for the fact that data used to train a model will always score higher than data that is independent of the model training. The average impostor score is obtained by applying impostor data to the trained model and computing the average of the resulting scores. Impostor attempts on the speaker model can be synthesized by accessing feature data from the antispeaker database that is similar to the subword data used to train the model. This data can be pieced together into an impostor attempt and then applied to the speaker model. The threshold is then computed by selecting a value between the average impostor score and the average speaker score.

Adaptation can be applied to the threshold component of the model as follows. First, the number of utterances used to compute the impostor average (referred to as N) and the speaker
average (referred to as M) must be included as part of the model and retrieved upon
adaptation. When adapting the threshold with the score from the valid speaker, the speaker mean is multiplied by M and the adaptation score is added to this quantity. The resulting sum
is then divided by (M+1), and this denotes the new speaker mean. Similarly, when adapting the threshold with an impostor score, the impostor mean is multiplied by N and the adaptation score is added to this quantity. The resulting sum is then divided by (N+1), and this denotes the new impostor mean. Future threshold positions would then use the modified speaker and impostor means.
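These running-mean updates can be sketched as follows; the midpoint threshold is an illustrative choice, as the text above says only that the threshold lies between the two means.

```python
def adapt_score_mean(mean, count, score):
    """Running-mean update: scale by the old count, add the new score,
    divide by the new count."""
    return (mean * count + score) / (count + 1), count + 1

# A valid-speaker adaptation score updates the speaker mean (count M);
# an impostor score would update the impostor mean (count N) the same way.
speaker_mean, M = adapt_score_mean(0.82, 5, 0.78)
impostor_mean, N = 0.35, 12                        # unchanged in this case
threshold = (speaker_mean + impostor_mean) / 2     # illustrative midpoint
```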
The adaptable speaker recognition system of the present invention can be employed for
user validation for telephone services such as cellular phone services and bill-to-third-party
phone services. It can also be used for account validation for information system access.
The model adaptation techniques of the present invention can be combined with fusion
adaptation and threshold adaptation, as described in copending Patent Application Serial No.
08/976,280, entitled "Voice Print System and Method," filed on November 21, 1997 by
Sharma et al., herein incorporated by reference. All of the adaptation techniques may affect the number and probability of false-negative and false-positive results, and so should be used with caution. These adaptive techniques may be used in combination with channel
adaptation, or each other, either simultaneously or at different authorization occurrences.
The foregoing description of the present invention has been presented for purposes of illustration and description and is not intended to limit the invention to the specific embodiments described. Consequently, variations and modifications commensurate with the above teachings, and within the skill and knowledge of the relevant art, are part of the scope of the present invention. It is intended that the appended claims be construed to include alternative embodiments to the extent permitted by law. We claim:

Claims

1. An adaptable speaker verification system with model adaptation, the system comprising: a receiver, the receiver obtaining a voice utterance; a means, connected to the receiver, for extracting predetermined features of the voice utterance; a means, operably connected to the extracting means, for segmenting the predetermined features of the voice utterance, wherein the features are segmented into a plurality of subwords; and at least one adaptable model, connected to the segmenting means, wherein the model models the plurality of subwords and outputs one or more scores, and the models are updated dynamically based on the received voice utterance to incorporate the changing characteristics of a user's voice.
2. The adaptable speaker verification system of claim 1, further comprising: an analog to digital converter, connected to the receiver, for providing the obtained voice utterance in a digital format.
3. The adaptable speaker verification system of claim 1, further comprising: a means, connected to the extracting means, for warping the voice utterance onto a dynamic warping template, the warping means providing a DTW score; wherein the warping means is adapted based on the voice utterance.
4. The adaptable speaker verification system of claims 1 or 3, wherein the adaptable models comprise at least one adaptable Gaussian mixture model, the adaptable Gaussian mixture model resulting in a GMM score.
5. The adaptable speaker verification system of claims 1 or 3, wherein the adaptable models comprise at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score.
6. The adaptable speaker verification system of claims 1 or 3, wherein the adaptable models comprise: at least one adaptable Gaussian mixture model, the adaptable Gaussian mixture model resulting in a GMM score; and at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score.
7. The adaptable speaker verification system of claim 1, further comprising a means, connected to the models, for combining the model scores, the combining means resulting in a final score for the combined system.
8. The adaptable speaker verification system of claim 3, further comprising a means, connected to the models and the warping means, for combining the DTW score and the model scores, the combining means resulting in a final score for the combined system.
9. The adaptable speaker verification system of claim 1, wherein the segmenting means generates subwords based on automatic blind speech segmentation.
10. The adaptable speaker verification system of claim 7, wherein the combining means is a linear opinion pool.
11. An adaptable speaker verification method, including the steps of: obtaining enrollment speech from a known individual; receiving test speech from a user; extracting predetermined features of the test speech; warping the predetermined features using a dynamic time warping template, wherein the dynamic time warping template is adapted based on the predetermined features of the test speech, resulting in the creation of warped feature data and a dynamic time warping score from the adapted dynamic time warping template; generating subwords from the warped feature data; scoring the subwords using a plurality of adaptable models, wherein the adaptable models are adapted based on the subwords derived from the test speech; combining the results of each model score and the dynamic time warping score to generate a final score; and comparing the final score to a threshold value to determine whether the test speech and enrollment speech are from the known individual.
12. The adaptable speaker verification method of claim 11, further comprising the steps of: digitizing the obtained test speech; and preprocessing the digitized test speech.
13. The adaptable speaker verification method of claim 11, wherein the step of scoring comprises the step of scoring at least one adaptable neural tree network model.
14. The adaptable speaker verification method of claim 11, wherein the step of scoring comprises the step of scoring at least one adaptable Gaussian mixture model.
15. The adaptable speaker verification method of claim 11, wherein the step of scoring further comprises the steps of: scoring at least one adaptable Gaussian mixture model, the adaptable Gaussian mixture model resulting in a GMM score; and scoring at least one adaptable neural tree network model, the adaptable neural tree network model resulting in an NTN score.
16. The adaptable speaker verification method of claim 11, wherein the step of generating comprises generating subwords using automatic blind speech segmentation.
17. The adaptable speaker verification method of claim 11, wherein the step of combining comprises combining the scores using a linear opinion pool.
18. An adaptable speaker verification method, wherein at least one neural tree network model is adapted based on an adaptation utterance, comprising the following steps: storing the number of speaker observations, the number of imposter observations and the total number of observations from previous enrollments or verifications; obtaining an adaptation utterance from a speaker; extracting predetermined features from the speaker adaptation utterance; segmenting the predetermined features into a plurality of subwords; applying the plurality of subwords to at least one neural tree network model; counting the number of updated speaker observations within each leaf of the neural tree network; storing the number of updated speaker observations in memory; and updating probabilities by dividing the number of updated speaker observations by a total number of observations at each leaf, thereby resulting in an adapted neural tree network model.
19. The adaptable speaker verification method of claim 18, further comprising the steps of: digitizing the obtained adaptation speaker utterance; and preprocessing the digitized speaker utterance.
20. The adaptable speaker verification method of claim 18, wherein the step of segmenting comprises generating subwords using automatic blind speech segmentation.
21. The adaptable speaker verification method of claim 18, further comprising the step of: warping the predetermined features from the speaker adaptation utterance using a dynamic time warping template, wherein the dynamic warping template is adapted based on
the predetermined features of the test speech, resulting in the creation of warped feature data;
and wherein the step of segmenting segments the warped feature data into a plurality of
subwords.
22. An adaptable speaker verification method, wherein a dynamic time warping model can be adapted using adaptation voice utterances, the method comprising the steps of: creating an original dynamic time warping template for a particular user, resulting in
original dynamic time warping template data;
storing a number of utterances used to compute the original dynamic time warping
template;
obtaining an adaptation voice utterance; warping the adaptation voice utterance onto the original dynamic time warping
template, resulting in warped adaptation data;
scaling the original dynamic time warping template data, wherein the template data is
scaled by multiplying the template data by the number of utterances used in training the
original template;
adding the warped adaptation data to the scaled original template data to create a
summed value; and
normalizing the summed value by dividing the summed value by the new total number
of utterances used in the model to create an adapted model.
23. The adaptable speaker verification method of claim 22, further comprising the step of
extracting predetermined features from the adaptation voice utterance.
24. The adaptable speaker verification method of claim 22, further comprising the steps of:
digitizing the obtained adaptation voice utterance; and
preprocessing the digitized voice utterance.
25. An adaptable speaker verification method, wherein at least one Gaussian mixture model is adapted based on an adaptation utterance, comprising the following steps: storing the number of utterances used to train the Gaussian mixture model and the number of
prior adaptation utterances;
obtaining an adaptation utterance from a speaker;
extracting predetermined features from the speaker adaptation utterance;
segmenting the predetermined features into a plurality of subwords;
applying a subword to each of the Gaussian mixture models;
determining a scaling value, the scaling value related to a sum of the number of training
utterances and the number of prior adaptation utterances; determining one or more adapted component distribution parameters using the scaling
value, thereby resulting in an adapted Gaussian mixture model, wherein the adapted
component distribution parameters reflect the contribution of the speaker adaptation
utterance.
26. The adaptable speaker verification method of claim 25, further comprising the steps of: digitizing the obtained adaptation speaker utterance; and preprocessing the digitized speaker utterance.
27. The adaptable speaker verification method of claim 25, wherein the step of segmenting comprises generating subwords using automatic blind speech segmentation.
28. The adaptable speaker verification method of claim 25, further comprising the step of: warping the predetermined features from the speaker adaptation utterance using a dynamic time warping template, wherein the dynamic warping template is adapted based on the predetermined features of the speech, resulting in the creation of warped feature data; and wherein the step of segmenting segments the warped feature data into a plurality of subwords.
29. The speaker verification method of claim 25, wherein the adapted component distribution parameters comprise a mixture weight, adapted mean and adapted covariance.
30. An adaptable speaker verification method, wherein at least one neural tree network model is adapted based on an adaptation utterance, comprising the following steps: storing the number of speaker observations, the number of imposter observations and the total number of observations from previous enrollments or verifications; obtaining an adaptation utterance from an imposter; extracting predetermined features from the imposter adaptation utterance; segmenting the predetermined features into a plurality of subwords; applying the plurality of subwords to at least one neural tree network model; counting the number of updated imposter observations within each leaf of the neural tree network; storing the number of updated imposter observations in memory; and updating probabilities by dividing the number of updated speaker observations by a total number of observations at each leaf, thereby resulting in an adapted neural tree network
model.
PCT/US1998/023477 1997-11-03 1998-11-03 Model adaptation system and method for speaker verification WO1999023643A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU13057/99A AU1305799A (en) 1997-11-03 1998-11-03 Model adaptation system and method for speaker verification
EP98956559A EP1027700A4 (en) 1997-11-03 1998-11-03 Model adaptation system and method for speaker verification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US6406997P 1997-11-03 1997-11-03
US60/064,069 1997-11-03

Publications (1)

Publication Number Publication Date
WO1999023643A1 true WO1999023643A1 (en) 1999-05-14

Family

ID=22053368

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/023477 WO1999023643A1 (en) 1997-11-03 1998-11-03 Model adaptation system and method for speaker verification

Country Status (5)

Country Link
US (1) US6519561B1 (en)
EP (1) EP1027700A4 (en)
CN (1) CN1302427A (en)
AU (1) AU1305799A (en)
WO (1) WO1999023643A1 (en)


Families Citing this family (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735563B1 (en) * 2000-07-13 2004-05-11 Qualcomm, Inc. Method and apparatus for constructing voice templates for a speaker-independent voice recognition system
US20040190688A1 (en) * 2003-03-31 2004-09-30 Timmins Timothy A. Communications methods and systems using voiceprints
US20020138274A1 (en) * 2001-03-26 2002-09-26 Sharma Sangita R. Server based adaption of acoustic models for client-based speech systems
EP1395803B1 (en) * 2001-05-10 2006-08-02 Koninklijke Philips Electronics N.V. Background learning of speaker voices
US7437289B2 (en) * 2001-08-16 2008-10-14 International Business Machines Corporation Methods and apparatus for the systematic adaptation of classification systems from sparse adaptation data
CN1409527A (en) * 2001-09-13 2003-04-09 松下电器产业株式会社 Terminal device, server and voice identification method
AU2002336341A1 (en) * 2002-02-20 2003-09-09 Planar Systems, Inc. Light sensitive display
US7509257B2 (en) * 2002-12-24 2009-03-24 Marvell International Ltd. Method and apparatus for adapting reference templates
US7734025B2 (en) * 2003-02-28 2010-06-08 Grape Technology Group, Inc. Methods and systems for providing on-line bills for use in communications services
US7386448B1 (en) 2004-06-24 2008-06-10 T-Netix, Inc. Biometric voice authentication
KR100679042B1 (en) * 2004-10-27 2007-02-06 삼성전자주식회사 Method and apparatus for speech recognition, and navigation system using for the same
US7962327B2 (en) * 2004-12-17 2011-06-14 Industrial Technology Research Institute Pronunciation assessment method and system based on distinctive feature analysis
US7706510B2 (en) * 2005-03-16 2010-04-27 Research In Motion System and method for personalized text-to-voice synthesis
US9300790B2 (en) * 2005-06-24 2016-03-29 Securus Technologies, Inc. Multi-party conversation analyzer and logger
CN100502463C (en) * 2005-12-14 2009-06-17 浙江工业大学 Method for collecting characteristics in telecommunication flow information video detection
CN101051463B (en) * 2006-04-06 2012-07-11 株式会社东芝 Verification method and device identified by speaking person
WO2007131530A1 (en) * 2006-05-16 2007-11-22 Loquendo S.P.A. Intersession variability compensation for automatic extraction of information from voice
US8386248B2 (en) * 2006-09-22 2013-02-26 Nuance Communications, Inc. Tuning reusable software components in a speech application
US8041571B2 (en) * 2007-01-05 2011-10-18 International Business Machines Corporation Application of speech and speaker recognition tools to fault detection in electrical circuits
US8099288B2 (en) * 2007-02-12 2012-01-17 Microsoft Corp. Text-dependent speaker verification
US7953216B2 (en) * 2007-05-04 2011-05-31 3V Technologies Incorporated Systems and methods for RFID-based access management of electronic devices
US8050919B2 (en) 2007-06-29 2011-11-01 Microsoft Corporation Speaker recognition via voice sample based on multiple nearest neighbor classifiers
US8006291B2 (en) 2008-05-13 2011-08-23 Veritrix, Inc. Multi-channel multi-factor authentication
US8468358B2 (en) 2010-11-09 2013-06-18 Veritrix, Inc. Methods for identifying the guarantor of an application
US8516562B2 (en) 2008-05-13 2013-08-20 Veritrix, Inc. Multi-channel multi-factor authentication
US8536976B2 (en) * 2008-06-11 2013-09-17 Veritrix, Inc. Single-channel multi-factor authentication
US8166297B2 (en) * 2008-07-02 2012-04-24 Veritrix, Inc. Systems and methods for controlling access to encrypted data stored on a mobile device
US8886663B2 (en) * 2008-09-20 2014-11-11 Securus Technologies, Inc. Multi-party conversation analyzer and logger
EP2353125A4 (en) * 2008-11-03 2013-06-12 Veritrix Inc User authentication for social networks
EP2216775B1 (en) * 2009-02-05 2012-11-21 Nuance Communications, Inc. Speaker recognition
US8474014B2 (en) 2011-08-16 2013-06-25 Veritrix, Inc. Methods for the secure use of one-time passwords
US9367612B1 (en) * 2011-11-18 2016-06-14 Google Inc. Correlation-based method for representing long-timescale structure in time-series data
AU2012265559B2 (en) * 2011-12-23 2018-12-20 Commonwealth Scientific And Industrial Research Organisation Verifying a user
US10438591B1 (en) 2012-10-30 2019-10-08 Google Llc Hotword-based speaker recognition
US9437195B2 (en) * 2013-09-18 2016-09-06 Lenovo (Singapore) Pte. Ltd. Biometric password security
US9344419B2 (en) 2014-02-27 2016-05-17 K.Y. Trix Ltd. Methods of authenticating users to a site
US9621713B1 (en) 2014-04-01 2017-04-11 Securus Technologies, Inc. Identical conversation detection method and apparatus
US10237399B1 (en) 2014-04-01 2019-03-19 Securus Technologies, Inc. Identical conversation detection method and apparatus
US10373711B2 (en) 2014-06-04 2019-08-06 Nuance Communications, Inc. Medical coding system with CDI clarification request notification
US10754925B2 (en) 2014-06-04 2020-08-25 Nuance Communications, Inc. NLU training with user corrections to engine annotations
US9922048B1 (en) 2014-12-01 2018-03-20 Securus Technologies, Inc. Automated background check via facial recognition
CN104616655B (en) 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction
US10366687B2 (en) * 2015-12-10 2019-07-30 Nuance Communications, Inc. System and methods for adapting neural network acoustic models
US10013973B2 (en) * 2016-01-18 2018-07-03 Kabushiki Kaisha Toshiba Speaker-adaptive speech recognition
CN109313902A (en) 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface
US10141009B2 (en) 2016-06-28 2018-11-27 Pindrop Security, Inc. System and method for cluster-based audio event detection
US20180018973A1 (en) 2016-07-15 2018-01-18 Google Inc. Speaker verification
US9824692B1 (en) 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US10325601B2 (en) 2016-09-19 2019-06-18 Pindrop Security, Inc. Speaker recognition in the call center
CA3036561C (en) 2016-09-19 2021-06-29 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
US10553218B2 (en) 2016-09-19 2020-02-04 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
US10949602B2 (en) 2016-09-20 2021-03-16 Nuance Communications, Inc. Sequencing medical codes methods and apparatus
US10755347B2 (en) 2016-09-21 2020-08-25 Coinbase, Inc. Corrective action realignment and feedback system for a compliance determination and enforcement platform
US11625769B2 (en) * 2016-09-21 2023-04-11 Coinbase, Inc. Multi-factor integrated compliance determination and enforcement platform
US10614813B2 (en) * 2016-11-04 2020-04-07 Intellisist, Inc. System and method for performing caller identity verification using multi-step voice analysis
US10755718B2 (en) * 2016-12-07 2020-08-25 Interactive Intelligence Group, Inc. System and method for neural network based speaker classification
US10397398B2 (en) 2017-01-17 2019-08-27 Pindrop Security, Inc. Authentication using DTMF tones
CN109102812B (en) * 2017-06-21 2021-08-31 北京搜狗科技发展有限公司 Voiceprint recognition method and system and electronic equipment
GB2563952A (en) * 2017-06-29 2019-01-02 Cirrus Logic Int Semiconductor Ltd Speaker identification
US11133091B2 (en) 2017-07-21 2021-09-28 Nuance Communications, Inc. Automated analysis system and method
US10896673B1 (en) * 2017-09-21 2021-01-19 Wells Fargo Bank, N.A. Authentication of impaired voices
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
US11024424B2 (en) 2017-10-27 2021-06-01 Nuance Communications, Inc. Computer assisted coding systems and methods
US10715522B2 (en) * 2018-01-31 2020-07-14 Salesforce.Com Voiceprint security with messaging services
US11355103B2 (en) 2019-01-28 2022-06-07 Pindrop Security, Inc. Unsupervised keyword spotting and word discovery for fraud analytics
WO2020163624A1 (en) 2019-02-06 2020-08-13 Pindrop Security, Inc. Systems and methods of gateway detection in a telephone network
US10891318B2 (en) * 2019-02-22 2021-01-12 United States Of America As Represented By The Secretary Of The Navy Temporal logic fusion of real time data
US11646018B2 (en) 2019-03-25 2023-05-09 Pindrop Security, Inc. Detection of calls from voice assistants
US11170783B2 (en) * 2019-04-16 2021-11-09 At&T Intellectual Property I, L.P. Multi-agent input coordination
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
KR20210050884A (en) * 2019-10-29 2021-05-10 삼성전자주식회사 Registration method and apparatus for speaker recognition
US11841932B2 (en) * 2020-04-14 2023-12-12 Nice Ltd. System and method for updating biometric evaluation systems
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium


Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US5638486A (en) * 1994-10-26 1997-06-10 Motorola, Inc. Method and system for continuous speech recognition using voting techniques
US5596679A (en) * 1994-10-26 1997-01-21 Motorola, Inc. Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic
US5835890A (en) * 1996-08-02 1998-11-10 Nippon Telegraph And Telephone Corporation Method for speaker adaptation of speech models recognition scheme using the method and recording medium having the speech recognition method recorded thereon

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
US5687287A (en) * 1995-05-22 1997-11-11 Lucent Technologies Inc. Speaker verification method and apparatus using mixture decomposition discrimination

Non-Patent Citations (7)

Title
CHIWEI CHE, QIGUANG LIN, DONG-SUK YUK: "AN HMM APPROACH TO TEXT-PROMPTED SPEAKER VERIFICATION", 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING - PROCEEDINGS. (ICASSP). ATLANTA, MAY 7 - 10, 1996., NEW YORK, IEEE., US, vol. 02, 1 March 1996 (1996-03-01), US, pages 673 - 676, XP002917152, ISBN: 978-0-7803-3193-8, DOI: 10.1109/ICASSP.1996.543210 *
HAN-SHENG LIOU, MAMMONE R J: "A SUBWORD NEURAL TREE NETWORK APPROACH TO TEXT-DEPENDENT SPEAKER VERIFICATION", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP). DETROIT, MAY 9 - 12, 1995. SPEECH., NEW YORK, IEEE., US, vol. 01, 1 May 1995 (1995-05-01), US, pages 357 - 360, XP002917151, ISBN: 978-0-7803-2432-9, DOI: 10.1109/ICASSP.1995.479595 *
HAN-SHENG LIOU, MAMMONE R J: "SPEAKER VERIFICATION USING PHONEME-BASED NEURAL TREE NETWORKS AND PHONETIC WEIGHTING SCORING METHOD", NEURAL NETWORKS FOR SIGNAL PROCESSING. PROCEEDINGS OF THE IEEESIGNAL PROCESSING SOCIETY WORKSHOP, XX, XX, 1 January 1995 (1995-01-01), XX, pages 213 - 222, XP002917150 *
MISTRETTA W, FARRELL K: "MODEL ADAPTATION METHODS FOR SPEAKER VERIFICATION", PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING. ICASSP '98. SEATTLE, WA, MAY 12 - 15, 1998., NEW YORK, NY : IEEE., US, vol. 01, 1 May 1998 (1998-05-01), US, pages 113 - 116, XP002917153, ISBN: 978-0-7803-4429-7, DOI: 10.1109/ICASSP.1998.674380 *
NAIK J M: "SPEAKER VERIFICATION: A TUTORIAL", IEEE COMMUNICATIONS MAGAZINE., IEEE SERVICE CENTER, PISCATAWAY., US, 1 January 1990 (1990-01-01), US, pages 42 - 48, XP002917154, ISSN: 0163-6804, DOI: 10.1109/35.46670 *
See also references of EP1027700A4 *
TOMOKO MATSUI, SADAOKI FURUI: "SPEAKER ADAPTATION OF TIED-MIXTURE-BASED PHONEME MODELS FOR TEXT-PROMPTED SPEAKER RECOGNITION", PROCEEDINGS OF ICASSP '94. IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING; 19-22 APRIL 1994; ADELAIDE, SA, AUSTRALIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, vol. 01, 1 January 1994 (1994-01-01), Piscataway, NJ, pages I125 - I128, XP002917149, ISBN: 978-0-7803-1775-8 *

Cited By (15)

Publication number Priority date Publication date Assignee Title
WO2002029785A1 (en) * 2000-09-30 2002-04-11 Intel Corporation Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
EP1431959A2 (en) * 2002-12-18 2004-06-23 Matsushita Electric Industrial Co., Ltd. Gaussian model-based dynamic time warping system and method for speech processing
EP1431959A3 (en) * 2002-12-18 2005-04-20 Matsushita Electric Industrial Co., Ltd. Gaussian model-based dynamic time warping system and method for speech processing
GB2465782B (en) * 2008-11-28 2016-04-13 Univ Nottingham Trent Biometric identity verification
US9311546B2 (en) 2008-11-28 2016-04-12 Nottingham Trent University Biometric identity verification for access control using a trained statistical classifier
GB2465782A (en) * 2008-11-28 2010-06-02 Univ Nottingham Trent Biometric identity verification utilising a trained statistical classifier, e.g. a neural network
US10257191B2 (en) 2008-11-28 2019-04-09 Nottingham Trent University Biometric identity verification
GB2517952A (en) * 2013-09-05 2015-03-11 Barclays Bank Plc Biometric verification using predicted signatures
GB2517952B (en) * 2013-09-05 2017-05-31 Barclays Bank Plc Biometric verification using predicted signatures
US9830440B2 (en) 2013-09-05 2017-11-28 Barclays Bank Plc Biometric verification using predicted signatures
WO2020035085A3 (en) * 2019-10-31 2020-08-20 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
US10997980B2 (en) 2019-10-31 2021-05-04 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
US11031018B2 (en) 2019-10-31 2021-06-08 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for personalized speaker verification
US11244689B2 (en) 2019-10-31 2022-02-08 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
US11611581B2 (en) 2020-08-26 2023-03-21 ID R&D, Inc. Methods and devices for detecting a spoofing attack

Also Published As

Publication number Publication date
EP1027700A4 (en) 2001-01-31
CN1302427A (en) 2001-07-04
EP1027700A1 (en) 2000-08-16
US6519561B1 (en) 2003-02-11
AU1305799A (en) 1999-05-24

Similar Documents

Publication Publication Date Title
US6519561B1 (en) Model adaptation of neural tree networks and other fused models for speaker verification
US6539352B1 (en) Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
EP0870300B1 (en) Speaker verification system
US5862519A (en) Blind clustering of data with application to speech processing systems
EP1399915B1 (en) Speaker verification
EP0501631B1 (en) Temporal decorrelation method for robust speaker verification
EP0744734B1 (en) Speaker verification method and apparatus using mixture decomposition discrimination
AU2002311452A1 (en) Speaker recognition system
CA2609247A1 (en) Automatic text-independent, language-independent speaker voice-print creation and speaker recognition
EP1417677A1 (en) Voice registration method and system, and voice recognition method and system based on voice registration method and system
WO2002029785A1 (en) Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
KR100917419B1 (en) Speaker recognition systems
Olsson Text dependent speaker verification with a hybrid HMM/ANN system
WO1999059136A1 (en) Channel estimation system and method for use in automatic speaker verification systems
Melin et al. Voice recognition with neural networks, fuzzy logic and genetic algorithms
Fakotakis et al. High performance text-independent speaker recognition system based on voiced/unvoiced segmentation and multiple neural nets.
Thakur et al. Speaker Authentication Using GMM-UBM
WO2005038774A1 (en) Adaptive sound and image learning system and method
Pandit Voice and lip based speaker verification
Jayarathna Speaker Recognition using Voice Biometrics
VALSAMAKIS Speaker Identification and Verification Using Gaussian Mixture Models
Anitha et al. PASSWORD SECURED SPEAKER RECOGNITION USING TIME AND FREQUENCY DOMAIN FEATURES
MXPA97009615A (en) High verification system

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 98812890.X

Country of ref document: CN

AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
NENP Non-entry into the national phase

Ref country code: KR

WWE Wipo information: entry into national phase

Ref document number: 1998956559

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1998956559

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: CA

WWW Wipo information: withdrawn in national office

Ref document number: 1998956559

Country of ref document: EP