US20070239441A1 - System and method for addressing channel mismatch through class specific transforms - Google Patents


Info

Publication number
US20070239441A1
Authority
US
United States
Prior art keywords
speaker
condition state
recited
utterance
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/391,891
Inventor
Jiri Navratil
Jason Pelecanos
Ganesh Ramaswamy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US11/391,891 priority Critical patent/US20070239441A1/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NAVRATIL, JIRI, PELECANOS, JASON, RAMASWAMY, GANESH N.
Publication of US20070239441A1 publication Critical patent/US20070239441A1/en
Priority to US12/132,079 priority patent/US8024183B2/en
Abandoned legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present invention relates to audio classification and more particularly to systems and methods for addressing mismatch in utterances due to equipment or transmission media differences.
  • Speaker recognition and verification is an important part of many current systems for security or other applications.
  • under mismatched channel conditions, for example when a person enrolls for a service or attempts to access an account using an electret handset but is verified while using a cell phone, there is significant mismatch between these audio environments. This results in severe performance degradation.
  • SMS Speaker Model Synthesis
  • FM Feature Mapping
  • ISV Intersession Variation Modeling
  • the SMS technique was a model transformation technique.
  • the SMS technique performed speaker model transformations according to the parameter differences between MAP adapted speaker background models of different handset types.
  • Embodiments of the present systems and methods address the problem of speaker verification under mismatched channel conditions, and further address the shortfalls of the prior art by directly optimizing a target function for the various discrete handsets and channels.
  • a method and system for speaker recognition and identification includes transforming features of a speaker utterance in a first condition state to match a second condition state and provide a transformed utterance.
  • a discriminative criterion is used to determine the transformation that is applied to the utterance to obtain a computed result.
  • the discriminative criterion is maximized over a plurality of speakers to obtain a best transform function for one of recognizing speech and identifying a speaker under the second condition state.
  • Speech recognition and speaker identity may be determined by employing the best transform for decoding speech to reduce channel mismatch.
  • a system/method for audio classification includes transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance.
  • a discriminative criterion is maximized over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
  • Another system/method for audio classification includes providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types and applying one of the transforms to a speaker based upon the input type.
  • the transforms are precomputed by transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance, and maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
  • the audio class modeling may include speaker recognition and/or speaker identification.
  • a condition state may include a neutralized channel condition which counters effects of a first condition state.
  • FIG. 1 is a block/flow diagram showing a system/method for adjusting models and determining transforms to reduce channel mismatch in accordance with one illustrative embodiment
  • FIG. 2 is a block/flow diagram showing a system/method for identifying a speaker or recognizing speech in accordance with another illustrative embodiment
  • FIG. 3 is a block diagram showing a device which implements features in accordance with the present embodiments.
  • Embodiments of the present disclosure provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device and transmission media mismatches.
  • the criterion is naturally optimized and is preferably suited to a Log-Likelihood-Ratio (LLR) scoring approach commonly used for speaker recognition.
  • LLR Log-Likelihood-Ratio
  • the LLR algorithm combined with the transformation approach attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using a transform.
  • the transform attempts to directly maximize posterior probabilities and is targeted to reduce mismatch between handsets, microphones, input equipment and/or transmission media accordingly.
  • Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements.
  • the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • a computer-usable or computer-readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
  • Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
  • the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution.
  • I/O devices including but not limited to keyboards, displays, pointing devices, etc. may be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
  • Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
  • Preferred embodiments provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device mismatch.
  • GMMs Gaussian Mixture Models
  • the criterion is naturally optimized and is suited to the Log-Likelihood-Ratio (LLR) scoring approach commonly used in GMMs for speaker recognition.
  • LLR Log-Likelihood-Ratio
  • the algorithm attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using the transform (preferably class specific transforms).
  • the transform attempts to maximize the posterior probability of the speech observations across a set of speaker models.
  • the present approach addresses the channel mismatch issue through the direct transformation of features using a discriminative criterion.
  • the present disclosure performs a transformation dependent upon the channel type of a test recording and a desired target channel type that the features are to be mapped to. It may also be optimized in a manner that does not require explicit knowledge of the channel itself.
  • mapping optimization function is trained by maximizing a simplified version of a joint likelihood ratio scoring metric using held out data from many speakers.
  • the goal of the mapping function is to obtain a transformation that maximizes the joint log-likelihood-ratio of observing the utterances from many speakers against their corresponding target speaker and background speaker models.
  • a discriminative design framework is formulated by optimizing joint model probabilities.
  • the discriminative design framework includes a system useful in non-speech differences and distortion that may be present on a microphone or other input device between training and/or different use sessions.
  • An illustrative example will be employed to demonstrate principles of the present embodiments.
  • a speaker recognition system is given a transformed utterance Y for a speaker, and performs the following evaluation to determine if the test utterance belongs to the target speaker model, λ_s. If the speaker score Λ_s is above a specified threshold, the speaker claim is accepted, otherwise the claim is rejected. It is also the same criterion used for optimizing for the mismatch between audio sessions, giving a natural optimization result.
  • Λ_s = Pr(λ_s | Y) = P(λ_s) p(Y | λ_s) / Σ_h P(λ_h) p(Y | λ_h)   (1)-(2)
  • P( ⁇ s ) and P( ⁇ h ) are the prior probabilities of an utterance being from speaker s and h correspondingly, where s and h are speaker indexes.
  • the posterior probability of speaker model λ_s, given the speaker's utterance Y, is indicated by Pr(λ_s | Y)
  • the likelihood of the observations Y given the model λ_h is given by p(Y | λ_h)
  • the model was trained using audio data from one channel (say, e.g., an electret type landline handset) while the test utterance was recorded under different channel conditions (say, e.g., a carbon button type landline handset), it may prove useful to transform the features of the test utterance to match the channel conditions of the model training component.
  • the calculation of a Jacobian matrix is not required as the optimization function is a ratio of densities.
  • the denominator of equation (4) may be represented by a single Universal Background Model (UBM) or a model representative of all speaker classes. Note that in a similar manner the most competitive impostor model for each speaker utterance could be substituted in place for the UBM in the denominator of (4).
  • UBM Universal Background Model
  • One important point to consider is that depending on the functional form of the numerator and denominator pair of equation (4), the final optimization function may become too complex or may not deliver an optimization problem with a stationary point.
  • the denominator of (4) will be represented as a collection of speaker models (e.g., class specific models) then the optimization function will become more complex.
  • An alternative to using many speaker models in the denominator of (4) is to consider that these speaker models have parameters that follow a particular distribution, p( ⁇ ).
  • a Bayesian predictive estimate (known in the art) may be given for the denominator of (4). With speaker class prior probabilities being equal, this gives the following result.
  • let p(Y_s | λ_s) be represented by a Gaussian Mixture Model (GMM), comprised of N Gaussian components, with the set of weights, means and diagonal covariances given as {ω_i^s, μ_i^s, Σ_i^s} for all i. If Y_s includes T_s independent and identically distributed observations represented by {y_1^s, y_2^s, . . . , y_{T_s}^s}, then the joint likelihood of the D-dimensional observations may be calculated.
  • GMM Gaussian Mixture Model
  • the notation (′) represents the transpose operator. Now the problem of specifying the distribution of the speaker model parameters is addressed. Let all speaker models be the MAP adaptation representation (known in the art) of a Universal Background Model which is trained on a large quantity of speech. For one embodiment only the mixture component means ( ⁇ ) are adapted (indicating a minimal degradation attributed to such constraints).
  • the speaker model component mean parameters are assumed to be independent and are governed by a Gaussian distribution with parameters {m_i, C_i}.
  • the denominator may now be evaluated. Let the joint likelihood of the observations be approximated by considering only the most significant Gaussian component contribution for each frame. This approximation is most appropriate for sparse mixture components.
  • {y_{t,d}^s, m_{i,d}} represent the d-th elements within their respective vectors {y_t^s, m_i}.
  • {Σ_{i,dd}, C_{i,dd}, Φ_{i,dd}^s} represent the element in the d-th row and d-th column of the appropriate diagonal covariance matrices {Σ_i, C_i, Φ_i^s}.
  • the maximization problem may be simplified further if it is considered that the derivative of this function with respect to transformation variables is calculated. It is assumed that the Gaussian mixture models are calculated through Bayesian adaptation of the mixture component means from a Universal Background GMM. All model parameters are coupled to the Universal Background Model; which includes the S target speaker models and the denominator model representation. The most significant mixture components are determined by using Equation (9) and extracting the Gaussian indexes by scoring on the Universal Background GMM. These indexes are used to score the corresponding Gaussian components in all other models.
  • the algorithm was represented such that for a single speaker model created from a single enrollment utterance, there was a single test utterance to score the utterance. Further richness can be achieved in the optimization process if multiple models and/or test utterances are trained for each speaker.
  • one benefit of the optimization function is that the unique one-to-one mapping needed when a Jacobian matrix is factored in is not required here. This also permits for the situation where two modes present under one channel condition may manifest themselves as a single mode under another channel. Given this flexibility, an appropriate transform for Y_s is selected.
  • a final transform may be represented as a combination of affine transforms according to posterior probability.
  • y = Ψ(x) = Σ_{j=1..J} Pr(j | x) Ψ_j(x)   (19), with Pr(j | x) = ω̌_j g(x | μ̌_j, Σ̌_j) / Σ_{z=1..J} ω̌_z g(x | μ̌_z, Σ̌_z)   (20)
  • {ω̌_j, μ̌_j, Σ̌_j} is the set of mixture component weights, means and covariances, respectively, for a J-component Gaussian Mixture Model.
  • the purpose of this GMM is to provide a smooth weighting function of Gaussian kernels to weight the corresponding combination of affine transforms.
  • Ψ(·) is selected to be of a form with a controllable complexity similar to SPAM models, which are known in the art.
  • Ψ_j(x) = A_j x + b_j   (21), with the set of A_j and b_j being controllable in complexity as follows:
  • R j is a mixture component specific transform matrix
  • ⁇ j k is the weighting factor applied to the k th transform matrix V k .
  • the resulting transform matrix, A j for mixture component j is a linear combination of a small set of transforms.
  • the matrix R j is typically a zero matrix, or a constrained matrix to enable some simplified transforms that would not typically be available using the remaining transformation matrices. It may also be a preset mixture-component-specific matrix that is known to be a reasonable solution to the problem.
  • the offset vector for mixture component j may be determined in a similar manner, with r_j being the mixture component specific offset and v_k being the k-th offset vector.
  • the vector r_j is typically a zero vector or a pre-selected, mixture component specific, vector constant. In the case when the vector is a preset constant, the remainder of the equation is designed to maximize the target function by optimizing for the residual.
  • Equation (18) may be maximized with the transformation function used from Equation 19 using a number of techniques. Between iterations, if no transformed observations change which significant Gaussian class they belong to in the original acoustic UBM, the problem is a matrix-quadratic optimization problem. In one embodiment, due to transformed vectors changing their Gaussian class between iterations, a gradient ascent approach is taken. Consequently, the functional derivative needs to be determined.
  • the variable ∂ȳ_{i,d}^s/∂Ω may be substituted by any one of the partial derivative results in the equations that follow.
  • γ_{j,t}^s = Pr(j | x_t^s)   (26)
  • the derivative may be calculated for the subspace matrices and vectors.
  • be the vector of variables that is to be optimized in terms of maximizing Q.
  • the gradient ascent algorithm can now be used accordingly, given the parametric estimates of the slopes.
  • Ω_new = Ω_old + η (∂ log Q / ∂Ω |_{Ω_old})   (36), where η is the learning rate.
  • the methods were employed to determine an optimal mapping between a first channel state and a second channel state.
  • the mapping that is calculated may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required.
  • the mapping system can then map arbitrarily from any channel state to any other channel state.
  • the optimization functions of equations (3), (4) or (5) may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform).
  • the applied mapping is formed from several transforms that are selected (or weighted) according to their relevance.
  • an estimate of equation (3) was presented in equation (5) and its optimization procedure was derived.
  • equation (4) although more computationally expensive, may also be optimized using a set of speaker or audio class models.
  • the core function to optimize is given as:
  • ∂ȳ_{i,d}^s/∂Ω may be substituted by any of the corresponding equations from equation (27) to equation (35).
  • the steepest ascent algorithm may be performed, as before, to determine the transform or transforms to apply.
  • the single top mixture component for each audio frame is scored. This may also be extended to the feature transformation mapping algorithm such that only the mapping corresponding to the top scoring feature partitioning GMM is applied rather than summing over the contributions of the mappings corresponding to all mixture components.
  • the target and background model representations can be forced to be a function of the Universal Background Model (UBM); a coupled model system. Consequently, the UBM can be used to determine which mixture components are the largest contributors to the frame based likelihood.
  • UBM Universal Background Model
  • the Gaussian indexes may be applied to the coupled target adapted model and the background model representations. It is noted that the top 5 mixture components were considered for the transformation function GMM. This approximation introduced significant speedups. Each of these previously mentioned items was included in the current system, but it should be noted that additional speed optimizations are available.
  • One technique is to test if a particular Gaussian is the most significant Gaussian for the current feature vector. Given that this algorithm is iterative and applies small adjustments to the mapping parameters, a Gaussian component that was dominant on a previous iteration may also be relevant for the current iteration. If the probability density of the vector for the Gaussian is larger than the predetermined threshold for the appropriate Gaussian component, then it is the most significant Gaussian for the GMM. This technique operates more effectively for sparse Gaussian mixture components.
  • Another method is to construct, for each Gaussian component, a table of close Gaussians or Gaussian competitors. Given that the mapping parameters are adjusted in an incremental manner, the Gaussian lookup table for the most significant Gaussian of the previous iteration may be evaluated to rapidly locate the most significant Gaussian for the next iteration.
  • the table length may be configured to trade off the search speed against the accuracy of locating the most likely Gaussian component.
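  • By way of illustration only, a minimal sketch of such a competitor table is shown next; the function names, the distance measure and the table length are assumptions made for this example rather than part of the original disclosure:

```python
import numpy as np

def build_competitor_table(means, variances, table_length=8):
    """For each Gaussian component, precompute the indices of its closest
    competitors, here ranked by a variance-normalised distance between means."""
    N = len(means)
    table = np.zeros((N, table_length), dtype=int)
    for i in range(N):
        dist = np.sum((means - means[i]) ** 2 / variances[i], axis=1)
        table[i] = np.argsort(dist)[:table_length]       # includes component i itself
    return table

def most_significant_gaussian(x, prev_top, weights, means, variances, table):
    """Search only the competitors of the previously dominant Gaussian to locate
    the most significant component for the current, slightly perturbed, vector."""
    cand = table[prev_top]
    log_comp = (np.log(weights[cand])
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances[cand]), axis=1)
                - 0.5 * np.sum((x - means[cand]) ** 2 / variances[cand], axis=1))
    return cand[np.argmax(log_comp)]
```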
  • speaker models trained using a first condition state or input type are provided.
  • speaker models trained using a landline telephone or a microphone to collect speaker utterances are stored in a database or provided as a model. These models may be created from audio from a single channel type.
  • feature sets from a set of speaker utterances in a second condition state are generated as input for a discriminative training criterion.
  • the discriminative criterion (from blocks 102 and 104 ) is maximized over a plurality of speakers by applying, e.g., a steepest ascent algorithm (or similar optimization) to determine a best transform function or set of transform functions. This includes maximizing Q_1 in equation (3) (or Q_A in equation (39)).
  • a discriminative criterion objective function is specified using the existing speaker models and the non-transformed utterances. This discriminative criterion is applied to generate the transformed utterance to obtain a computed result, which may be determined either arbitrarily or empirically. An optimization metric of a speaker based on discrimination between speaker classes is preferably performed.
  • the discriminative criterion may include equation (3) (or equation (39)).
  • the result that maximizes the objective function gives the transform or transforms.
  • This transform may then be used to convert/map or neutralize the inputs received over a different input type.
  • the transforms may be adjusted to provide accommodation for the currently used input type.
  • the best transform may be used for recognizing speech and/or identifying a speaker under the condition state of the received utterance to reduce channel mismatch.
  • the system may undergo many input conditions and a best transform may be applied for each input condition.
  • posterior probabilities are maximized in the maximizing step.
  • the present embodiments use speaker classes to determine the transform. The result is that the most likely speaker is determined instead of the most likely acoustic match.
  • the transform is calculated once to maximize Q such that the maximum Q gives the transform.
  • the speaker space is broken down based on subsets or classes of speakers. The maximum likelihood (or a related metric) of seeing a particular speaker is used to determine the transform (as opposed to simply matching the acoustic input).
  • At least one speaker model may be transformed using the best transform to create a new model for decoding speech or identifying a speaker.
  • the speaker model may be transformed from a first input type to a second input type by directly mapping features from the first input type to a second input type using the transform.
  • the mapping done by the transform may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required.
  • the mapping system can then map arbitrarily from any channel state to any other channel state.
  • once the mapping block has learned the set of transforms, the speaker recognition system may be evaluated.
  • the multi-transform mapping block is then used to map all utterances. The benefit is that no explicit handset or channel labels are required.
  • referring to FIG. 2 , a system/method 200 for speaker recognition and identification in accordance with an illustrative embodiment is shown. A similar method may be employed for other audio analysis as well.
  • a plurality of transforms is provided for decoding utterances, wherein the transforms correspond to a plurality of input types or conditions. This may include a single transform or a plurality of transforms to handle multiple conditions.
  • a transform(s) is applied to features from a speaker based upon the input type or all input types.
  • Block 206 indicates that the transforms are precomputed by the method shown in FIG. 1 . The precomputation of the transform or transforms may be performed at the time of manufacture of the system or may be recomputed intermittently to account for new input types or other system changes.
  • the best transform or series of transforms are determined for each or all input types and applied by determining conditions under which a speaker is providing input.
  • the input types may include, e.g., telephone handset types, channel types and/or microphone types.
  • the best transform may include transforming the input to a neutralized channel condition which counters effects of the input state or any other transform that reduces mismatch between input types.
  • the speaker is identified or the utterance is decoded in accordance with the input type correction provided herein.
  • channel mismatch is reduced or eliminated.
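  • As an illustration of the runtime flow of FIG. 2 , a precomputed transform could be selected and applied per input type roughly as follows. This is a hypothetical sketch: the affine form of the stored transform and the input-type labels are assumptions for this example only.

```python
import numpy as np

def apply_precomputed_transform(Y, input_type, transforms):
    """Apply the precomputed mapping matching the detected input type: Y is a
    T x D matrix of utterance features, and `transforms` maps an input-type label
    to an (A, b) affine pair learned offline as in FIG. 1."""
    A, b = transforms[input_type]
    return Y @ A.T + b                      # channel-matched features for scoring

# Hypothetical usage with two precomputed mappings (placeholder parameters):
transforms = {
    "carbon_handset": (np.eye(19), np.zeros(19)),
    "cell_phone": (np.eye(19), np.zeros(19)),
}
Y_matched = apply_precomputed_transform(np.random.randn(300, 19), "carbon_handset", transforms)
```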
  • the same procedure as described above may be applied to other audio applications related to audio scene analysis, music and song detection and audio enhancement of corrupted audio channels.
  • One such example would be to recognize artists or songs being played over the radio or media. This greatly reduces the effect of the channel differences on the system attempting to detect the song type or artist.
  • the technique effectively removes the differences between different audio channels and simplifies the matching process for the audio classifier.
  • Other applications are also contemplated.
  • the audio training utterance durations were approximately two minutes with test utterance durations of 15-45 seconds.
  • the NIST 1999 speaker recognition database was included as development data to train the UBM and the corresponding carbon handset to electret handset transformation function. This database was selected because of the significant quantity of carbon and electret handset data available. The same principle may be applied to the more recent speaker recognition evaluations by providing several channel mapping functions dependent upon the channel type.
  • a speaker recognition system in accordance with embodiments of the present invention includes two main components, a feature extraction module and a speaker modeling module, as is known in the art.
  • MFCCs Mel-Frequency Cepstral Coefficients
  • 19 Mel-Frequency Cepstral Coefficients were extracted from 24 filter banks.
  • the cepstral features may be extracted, e.g., by using 32 ms frames at a 10 ms frame shift. The corresponding delta features were calculated.
  • Feature Warping may be applied to all features to mitigate the effects of linear channels and slowly varying additive noise.
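  • A comparable front end can be sketched with librosa as follows. This is illustrative only: the exact windowing, pre-emphasis and warping details of the evaluated system are not given in the text, and the warping shown is a simple rank-based approximation of feature warping.

```python
import numpy as np
import librosa
from scipy.stats import norm

def extract_features(audio, sr=8000, n_mfcc=19, n_mels=24):
    """19 MFCCs from 24 mel filter banks, ~32 ms frames at a 10 ms shift,
    with delta features appended."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=int(0.032 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc)])      # (2 * 19) x T
    return feats.T                                              # T x 38

def feature_warp(feats):
    """Map each feature dimension to a standard normal target distribution by
    rank (a simple stand-in for short-term Gaussianisation / feature warping)."""
    T = feats.shape[0]
    ranks = np.argsort(np.argsort(feats, axis=0), axis=0) + 1
    return norm.ppf((ranks - 0.5) / T)
```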
  • the speaker modeling module generated speaker models through MAP adaptation of a Universal Background Model. This implementation of the MAP adaptation approach adjusted the Gaussian components toward the target speaker speech features.
  • the mixture component mean parameters were also adapted. In this work, a single iteration of the EM-MAP algorithm was performed. In testing, only the top mixture component from the UBM was scored and used to reference the corresponding components in other models. Other mixtures and numbers of mixture components may also be employed.
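  • A minimal sketch of mean-only MAP adaptation of a UBM toward a target speaker follows, in the relevance-factor form commonly used with GMM-UBM systems; the relevance factor value and the names are assumptions, as the text does not specify them.

```python
import numpy as np
from scipy.special import logsumexp

def map_adapt_means(Y, ubm_weights, ubm_means, ubm_vars, relevance=16.0):
    """Single EM-MAP iteration adapting only the mixture component means of a
    diagonal-covariance UBM toward the speaker features Y (T x D)."""
    diff = Y[:, None, :] - ubm_means[None, :, :]
    log_comp = (np.log(ubm_weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / ubm_vars[None, :, :], axis=2))
    post = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))   # T x N posteriors
    n = post.sum(axis=0)                                                   # soft counts per component
    first = post.T @ Y                                                     # N x D first-order stats
    alpha = (n / (n + relevance))[:, None]                                 # data-dependent adaptation weight
    safe_n = np.maximum(n, 1e-10)[:, None]
    return alpha * (first / safe_n) + (1.0 - alpha) * ubm_means            # adapted means
```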
  • a version of the system was evaluated on a challenging subset of the NIST 2000 Evaluation.
  • the subset of trials was selected purposely to identify the effect of the channel mapping from the carbon test utterance type to the electret model type. Thus, only carbon tests against electret models were evaluated in this subset.
  • DCF minimum detection cost function
  • EER equal error rate
  • the improvements were realized using a single transformation determined from 100 unique speakers. Additional error reductions are expected by using more speaker data to calculate the transform.
  • the results described herein are for illustrative purposes only.
  • An audio classification system or device 300 may include a personal computer, a telephone system, an answering system, a security system or any other device or system where multiple users and/or multiple input types or devices may be present.
  • Device 300 is capable of supporting a software application or module 302 which provides audio classification, which can map the input types as described above to enable identification of a speaker, the decoding of utterances or other audio classification processes.
  • Application 302 may include a speech recognition system, speech to speech system, text to speech system or other audio classification processing module 304 capable of audio processing (e.g., for audio scene analysis, etc.).
  • input utterances may be received from a plurality of different input types and/or channels (telephones, microphones, etc.).
  • Inputs 301 may include microphones of different types, telephones of different types, prerecorded audio sent via different channels or methods or any other input device.
  • a module 306 may include a speech synthesizer, a printer, recording media, a computer or other data port or any other suitable device that uses the output of application 302 .
  • Application 302 stores precomputed transforms 310 which are best adapted to account for channel mismatch.
  • the transforms 310 include a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not needed.
  • the system can then map arbitrarily from any channel state to any other channel state.
  • the optimization functions may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform) to provide this functionality at training.
  • the applied mapping may be formed from several transforms that are selected (or weighted) according to their relevance. Once the set of transforms are learned, the speaker recognition system may be evaluated to ensure proper operation on any available channel types for that application.
  • the multi-transform mapping block is then used to map all utterances.
  • Module 306 may include a security device that permits a user access to information, such as a database, an account, applications, etc. based on an authorization or confirmed identity as determined by application 302 .
  • Speech recognition system 304 may also recognize or decode the speech of a user despite the input type 301 that the user employs to communicate with device 300 .

Abstract

A method and system for speaker recognition and identification includes transforming features of a speaker utterance in a first condition state to match a second condition state and provide a transformed utterance. A discriminative criterion is used to generate a transform that maps an utterance to obtain a computed result. The discriminative criterion is maximized over a plurality of speakers to obtain a best transform for recognizing speech and/or identifying a speaker under the second condition state. Speech recognition and speaker identity may be determined by employing the best transform for decoding speech to reduce channel mismatch.

Description

    GOVERNMENT RIGHTS
  • This invention was made with Government support under Contract No.: NBCH050097 awarded by the U.S. Department of Interior. The Government has certain rights in this invention.
  • BACKGROUND
  • 1. Technical Field
  • The present invention relates to audio classification and more particularly to systems and methods for addressing mismatch in utterances due to equipment or transmission media differences.
  • 2. Description of the Related Art
  • Speaker recognition and verification is an important part of many current systems for security or other applications. However, under mismatched channel conditions, for example, when a person enrolls for a service or attempts to access their account using an electret handset but wishes to be verified when using a cell phone, there is significant mismatch between these audio environments. This results in severe performance degradation.
  • Some of the solutions to date include Speaker Model Synthesis (SMS), Feature Mapping (FM), Intersession Variation Modeling (ISV) and channel specific score normalization. A drawback of these methods is that SMS and FM perform a model/feature transformation based on a criterion that is unrelated to the core likelihood ratio criterion that is being used to score the result. ISV does not assume discrete channel classes, and score normalization does not directly account for channel mismatch.
  • Previous work in addressing the channel mismatch problem is similar in that either the features or model parameters are transformed according to some criterion. For example, the SMS technique was a model transformation technique. The SMS technique performed speaker model transformations according to the parameter differences between MAP adapted speaker background models of different handset types.
  • Some work in the area of speech recognition, although not directly addressing the channel mismatch problem, is also worthy of mention. It examined constrained discriminative model training and transformations to robustly estimate model parameters. Using such constraints, speaker models could be adapted to new environments. Another approach, termed factor analysis, models the speaker and channel variability in a model parameter subspace. Follow up work showed that modeling intersession variation alone provided significant gains in speaker verification performance.
  • There are several schemes that address channel mismatch from the perspective of feature transformation. One study utilized a neural network to perform feature mapping on an incoming acoustic feature stream to minimize the effect of channel influences; no explicit channel specific mappings were applied on this occasion. Another technique performed feature mapping based on detecting the channel type and mapping the features to a neutral channel domain; this technique maps features in a manner similar to the way SMS transforms model parameters. For speech recognition, a piecewise Feature space Maximum Likelihood Linear Regression (fMLLR) transformation is applied to adapt to channel conditions. No explicit channel information is exploited.
  • SUMMARY
  • Embodiments of the present systems and methods address the problem of speaker verification under mismatched channel conditions, and further address the shortfalls of the prior art by directly optimizing a target function for the various discrete handsets and channels.
  • A method and system for speaker recognition and identification includes transforming features of a speaker utterance in a first condition state to match a second condition state and provide a transformed utterance. A discriminative criterion is used to determine the transformation that is applied to the utterance to obtain a computed result. The discriminative criterion is maximized over a plurality of speakers to obtain a best transform function for one of recognizing speech and identifying a speaker under the second condition state. Speech recognition and speaker identity may be determined by employing the best transform for decoding speech to reduce channel mismatch.
  • A system/method for audio classification includes transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance. A discriminative criterion is maximized over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
  • Another system/method for audio classification includes providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types and applying one of the transforms to a speaker based upon the input type. The transforms are precomputed by transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance, and maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
  • In other systems and methods, the audio class modeling may include speaker recognition and/or speaker identification. A condition state may include a neutralized channel condition which counters effects of a first condition state. The system may undergo many input conditions and apply a best transform for each input condition. Maximizing a discriminative criterion may include determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker. Speech decoding may be based on a selected transform.
  • These and other objects, features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
  • FIG. 1 is a block/flow diagram showing a system/method for adjusting models and determining transforms to reduce channel mismatch in accordance with one illustrative embodiment;
  • FIG. 2 is a block/flow diagram showing a system/method for identifying a speaker or recognizing speech in accordance with another illustrative embodiment;
  • FIG. 3 is a block diagram showing a device which implements features in accordance with the present embodiments.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • Embodiments of the present disclosure provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device and transmission media mismatches. The criterion is naturally optimized and is preferably suited to a Log-Likelihood-Ratio (LLR) scoring approach commonly used for speaker recognition. The LLR algorithm combined with the transformation approach attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using a transform. The transform attempts to directly maximize posterior probabilities and is targeted to reduce mismatch between handsets, microphones, input equipment and/or transmission media accordingly.
  • Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
  • Preferred embodiments provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device mismatch. The criterion is naturally optimized and is suited to the Log-Likelihood-Ratio (LLR) scoring approach commonly used in GMMs for speaker recognition. The algorithm attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using the transform (preferably class specific transforms). The transform attempts to maximize the posterior probability of the speech observations across a set of speaker models.
  • One of the largest challenges in telephony based speaker recognition is effectively mitigating the degradation attributed to handset and channel mismatch. There are a number of techniques described above which address this issue. These approaches reduce mismatch through the modification of the features or adjustment of the models themselves to suit the new condition.
  • The present approach addresses the channel mismatch issue through the direct transformation of features using a discriminative criterion. The present disclosure performs a transformation dependent upon the channel type of a test recording and a desired target channel type that the features are to be mapped to. It may also be optimized in a manner that does not require explicit knowledge of the channel itself.
  • In contrast to previous work, a mapping optimization function is trained by maximizing a simplified version of a joint likelihood ratio scoring metric using held out data from many speakers. The goal of the mapping function is to obtain a transformation that maximizes the joint log-likelihood-ratio of observing the utterances from many speakers against their corresponding target speaker and background speaker models.
  • A discriminative design framework is formulated by optimizing joint model probabilities. The discriminative design framework includes a system useful in non-speech differences and distortion that may be present on a microphone or other input device between training and/or different use sessions. An illustrative example will be employed to demonstrate principles of the present embodiments.
  • A speaker recognition system is given a transformed utterance $\vec{Y}$ for a speaker, and performs the following evaluation to determine if the test utterance belongs to the target speaker model, $\lambda_s$. If the speaker score $\Lambda_s$ is above a specified threshold, the speaker claim is accepted, otherwise the claim is rejected. It is also the same criterion used for optimizing for the mismatch between audio sessions, giving a natural optimization result.

$$\Lambda_s = \Pr(\lambda_s \mid \vec{Y}) \quad (1) = \frac{P(\lambda_s)\, p(\vec{Y} \mid \lambda_s)}{\sum_h P(\lambda_h)\, p(\vec{Y} \mid \lambda_h)} \quad (2)$$

  • Here, $P(\lambda_s)$ and $P(\lambda_h)$ are the prior probabilities of an utterance being from speaker s and h correspondingly, where s and h are speaker indexes. The posterior probability of speaker model $\lambda_s$, given the speaker's utterance $\vec{Y}$, is indicated by $\Pr(\lambda_s \mid \vec{Y})$. The likelihood of the observations $\vec{Y}$ given the model $\lambda_h$ is given by $p(\vec{Y} \mid \lambda_h)$.
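  • As a concrete illustration only, the following minimal sketch (hypothetical function and variable names, not part of the original disclosure) shows how the decision of equations (1) and (2) could be computed from per-model log-likelihoods:

```python
import numpy as np
from scipy.special import logsumexp

def speaker_posterior(log_priors, log_likelihoods, target_index=0):
    """Posterior Pr(lambda_s | Y) of eq. (2), computed in the log domain from the
    per-model log-likelihoods log p(Y | lambda_h) and log-priors log P(lambda_h)."""
    log_joint = np.asarray(log_priors) + np.asarray(log_likelihoods)
    return float(np.exp(log_joint[target_index] - logsumexp(log_joint)))

def accept_claim(log_priors, log_likelihoods, threshold):
    """Accept the identity claim when the target-speaker posterior of eq. (1)
    exceeds a specified threshold."""
    return speaker_posterior(log_priors, log_likelihoods) > threshold
```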
  • Given that the model was trained using audio data from one channel (say, e.g., an electret type landline handset) while the test utterance was recorded under different channel conditions (say, e.g., a carbon button type landline handset), it may prove useful to transform the features of the test utterance to match the channel conditions of the model training component.
  • In one embodiment, a feature transformation function is employed to maximize equation (1) but across many speakers (S). Hence, a joint probability of the speakers (s=1 to S) given their corresponding observations is maximized. The calculation of a Jacobian matrix is not required as the optimization function is a ratio of densities.

$$Q_1 = \prod_{s=1}^{S} \Pr(\lambda_s \mid \vec{Y}_s) \quad (3) = \prod_{s=1}^{S} \frac{P(\lambda_s)\, p(\vec{Y}_s \mid \lambda_s)}{\sum_h P(\lambda_h)\, p(\vec{Y}_s \mid \lambda_h)} \quad (4)$$
  • Here the denominator of equation (4) may be represented by a single Universal Background Model (UBM) or a model representative of all speaker classes. Note that in a similar manner the most competitive impostor model for each speaker utterance could be substituted in place for the UBM in the denominator of (4). One important point to consider is that depending on the functional form of the numerator and denominator pair of equation (4), the final optimization function may become too complex or may not deliver an optimization problem with a stationary point.
  • If it is assumed that the denominator of (4) will be represented as a collection of speaker models (e.g., class specific models) then the optimization function will become more complex. An alternative to using many speaker models in the denominator of (4) is to consider that these speaker models have parameters that follow a particular distribution, $p(\lambda)$. In this case, a Bayesian predictive estimate (known in the art) may be given for the denominator of (4). With speaker class prior probabilities being equal, this gives the following result.

$$Q_2 = \prod_{s=1}^{S} \frac{p(\vec{Y}_s \mid \lambda_s)}{\int_\lambda p(\lambda)\, p(\vec{Y}_s \mid \lambda)\, d\lambda} \quad (5)$$
  • Let $p(\vec{Y}_s \mid \lambda_s)$ be represented by a Gaussian Mixture Model (GMM), comprised of N Gaussian components, with the set of weights, means and diagonal covariances given as $\{\omega_i^s, \mu_i^s, \Sigma_i^s\}\ \forall i$. If $\vec{Y}_s$ includes $T_s$ independent and identically distributed observations represented by $\{y_1^s, y_2^s, \ldots, y_{T_s}^s\}$ then the joint likelihood of the D-dimensional observations may be calculated.

$$p(\vec{Y}_s \mid \lambda_s) = \prod_{t=1}^{T_s} \sum_{i=1}^{N} \omega_i^s\, g(y_t^s \mid \mu_i^s, \Sigma_i^s) \quad (6)$$

$$\text{where} \quad g(y_t^s \mid \mu_i^s, \Sigma_i^s) = \frac{1}{(2\pi)^{D/2}\, \lvert \Sigma_i^s \rvert^{1/2}} \exp\left\{ -\tfrac{1}{2} (y_t^s - \mu_i^s)'\, (\Sigma_i^s)^{-1}\, (y_t^s - \mu_i^s) \right\} \quad (7)$$
  • The notation (′) represents the transpose operator. Now the problem of specifying the distribution of the speaker model parameters is addressed. Let all speaker models be the MAP adaptation representation (known in the art) of a Universal Background Model which is trained on a large quantity of speech. For one embodiment only the mixture component means (μ) are adapted (indicating a minimal degradation attributed to such constraints).
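  • To make the diagonal-covariance GMM evaluation of equations (6) and (7) concrete, a minimal numpy sketch follows; the array shapes and names are assumptions for illustration, not part of the original disclosure.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(Y, weights, means, variances):
    """Joint log-likelihood log p(Y | lambda) of i.i.d. frames Y (T x D) under a
    diagonal-covariance GMM with weights (N,), means (N x D) and variances (N x D),
    i.e. the log of eqs. (6)-(7)."""
    diff = Y[:, None, :] - means[None, :, :]                       # T x N x D
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    return float(np.sum(logsumexp(log_comp, axis=1)))              # sum over frames
```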
  • The speaker model component mean parameters are assumed to be independent and are governed by a Gaussian distribution with parameters $\{m_i, C_i\}$. Thus, in a similar vein, the representation for $p(\lambda_s)$ is established. For example:

$$p(\lambda_s) = \prod_{i=1}^{N} g(\mu_i^s \mid m_i, C_i) \quad (8)$$
  • The denominator may now be evaluated. Let the joint likelihood of the observations be approximated by considering only the most significant Gaussian component contribution for each frame. This approximation is most appropriate for sparse mixture components.

$$p(\vec{Y}_s \mid \lambda_s) \approx \prod_{t=1}^{T_s} \max_{i=1,\ldots,N} \left\{ \omega_i\, g(y_t^s \mid \mu_i^s, \Sigma_i^s) \right\} \quad (9)$$
  • Given this assumption, the predictive likelihood may be calculated. The result is given by equation (10). A Viterbi approach for estimating the Bayesian predictive density may be referenced.

$$\int_\lambda p(\lambda)\, p(\vec{Y}_s \mid \lambda)\, d\lambda \approx \prod_{i=1}^{N} \left( \frac{\omega_i}{(2\pi)^{D/2}} \right)^{n_i^s} \prod_{d=1}^{D} \sqrt{\frac{\Phi_{i,dd}^s}{(\Sigma_{i,dd})^{n_i^s}}} \times \exp\left\{ -\frac{n_i^s}{2\,\Sigma_{i,dd}} \left( \Phi_{i,dd}^s\, \overline{(y_{i,d}^s - m_{i,d})^2} + \left( 1 - \Phi_{i,dd}^s \right) \left( \overline{(y_{i,d}^s)^2} - \overline{y_{i,d}^s}^{\,2} \right) \right) \right\} \quad (10)$$

where

$$n_i^s = \sum_{t\,:\,y_t^s \in i} 1 \quad (11), \qquad \overline{y_{i,d}^s} = \frac{1}{n_i^s} \sum_{t\,:\,y_t^s \in i} y_{t,d}^s \quad (12), \qquad \overline{(y_{i,d}^s)^2} = \frac{1}{n_i^s} \sum_{t\,:\,y_t^s \in i} (y_{t,d}^s)^2 \quad (13),$$

$$\Phi_{i,dd}^s = \frac{1}{n_i^s\, C_{i,dd}\, \Sigma_{i,dd}^{-1} + 1} \quad (14), \qquad \text{and} \qquad \overline{(y_{i,d}^s - m_{i,d})^2} = \overline{(y_{i,d}^s)^2} - 2\, m_{i,d}\, \overline{y_{i,d}^s} + m_{i,d}^2 \quad (15)$$
  • Here, $\{y_{t,d}^s, m_{i,d}\}$ represent the d-th elements within their respective vectors $\{y_t^s, m_i\}$. Correspondingly, $\{\Sigma_{i,dd}, C_{i,dd}, \Phi_{i,dd}^s\}$ represent the element in the d-th row and d-th column of the appropriate diagonal covariance matrices $\{\Sigma_i, C_i, \Phi_i^s\}$.
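  • The per-component counts and moments of equations (11)-(13), together with the dominant-component assignment of equation (9), can be accumulated as in the following illustrative numpy sketch (function and variable names are hypothetical):

```python
import numpy as np

def top_component_stats(Y, weights, means, variances):
    """Assign each frame of Y (T x D) to its single most likely Gaussian (the
    approximation of eq. (9)) and accumulate the per-component counts n_i and
    first/second moments of eqs. (11)-(13)."""
    diff = Y[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    top = np.argmax(log_comp, axis=1)                    # dominant component per frame
    N, D = means.shape
    n = np.bincount(top, minlength=N).astype(float)      # n_i of eq. (11)
    y_bar = np.zeros((N, D))
    y2_bar = np.zeros((N, D))
    for i in range(N):
        frames = Y[top == i]
        if len(frames):
            y_bar[i] = frames.mean(axis=0)               # eq. (12)
            y2_bar[i] = (frames ** 2).mean(axis=0)       # eq. (13)
    return n, y_bar, y2_bar
```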
  • Now in the case where there is a single speaker being scored, as in the numerator condition, the model distribution becomes a point observation. This is achieved by setting $C_{i,dd}$ to 0 and gives the following result, which is equivalent (depending on the optimal mixture component selection criterion) to the standard GMM likelihood scoring when only the top Gaussian is scored.

$$p(\vec{Y}_s \mid \lambda_s) = \prod_{i=1}^{N} \left( \frac{\omega_i}{(2\pi)^{D/2}} \right)^{n_i^s} \prod_{d=1}^{D} \frac{1}{\sqrt{(\Sigma_{i,dd})^{n_i^s}}} \times \exp\left\{ -\frac{n_i^s}{2\,\Sigma_{i,dd}}\, \overline{(y_{i,d}^s - \mu_{i,d}^s)^2} \right\} \quad (16)$$
  • Given this derivation, let us calculate the log of the ratio of the target speaker joint likelihood and the likelihood of all other speakers for a set of S utterances and corresponding models. For example:

$$\log Q_2 = \sum_{s=1}^{S} \left( \log p(\vec{Y}_s \mid \lambda_s) - \log \int_\lambda p(\lambda)\, p(\vec{Y}_s \mid \lambda)\, d\lambda \right) \quad (17)$$
  • The maximization problem may be simplified further if it is considered that the derivative of this function with respect to transformation variables is calculated. It is assumed that the Gaussian mixture models are calculated through Bayesian adaptation of the mixture component means from a Universal Background GMM. All model parameters are coupled to the Universal Background Model; which includes the S target speaker models and the denominator model representation. The most significant mixture components are determined by using Equation (9) and extracting the Gaussian indexes by scoring on the Universal Background GMM. These indexes are used to score the corresponding Gaussian components in all other models. With these constraints, the function to maximize is the following:

$$Q = \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{d=1}^{D} \left\{ \frac{n_i^s}{2\,\Sigma_{i,dd}} \times \left( 2\left( \mu_{i,d}^s - \Phi_{i,dd}^s\, m_{i,d} \right) \overline{y_{i,d}^s} - \left( 1 - \Phi_{i,dd}^s \right) \overline{y_{i,d}^s}^{\,2} \right) \right\} \quad (18)$$
  • For simplicity, the algorithm was represented such that for a single speaker model created from a single enrollment utterance, there was a single test utterance to score the utterance. Further richness can be achieved in the optimization process if multiple models and/or test utterances are trained for each speaker. Depending on the viewpoint, one benefit of the optimization function is that the unique one-to-one mapping needed when a Jacobian matrix is factored in is not required here. This also permits for the situation where two modes present under one channel condition may manifest themselves as a single mode under another channel. Given this flexibility, an appropriate transform for $\vec{Y}_s$ is selected.
  • Transform Selection
  • A final transform may be represented as a combination of affine transforms according to posterior probability. For example:

$$y = \Psi(x) = \sum_{j=1}^{J} \Pr(j \mid x)\, \Psi_j(x) \quad (19) \qquad \text{with} \qquad \Pr(j \mid x) = \frac{\check{\omega}_j\, g(x \mid \check{\mu}_j, \check{\Sigma}_j)}{\sum_{z=1}^{J} \check{\omega}_z\, g(x \mid \check{\mu}_z, \check{\Sigma}_z)} \quad (20)$$
  • where $\{\check{\omega}_j, \check{\mu}_j, \check{\Sigma}_j\}$ is the set of mixture component weights, means and covariances, respectively, for a J-component Gaussian Mixture Model. The purpose of this GMM is to provide a smooth weighting function of Gaussian kernels to weight the corresponding combination of affine transforms.
  • Note also that throughout the optimization problem the posterior probabilities need only to be calculated once. This GMM, used to determine the mixture component probabilities, could be the same as the Universal Background Speaker model for adapting speakers or a separate model altogether.
  • Here Ψ(•) is selected to be of a form with a controllable complexity similar to SPAM models, which are known in the art.
    $$\Psi_j(x) = A_j x + b_j \quad (21)$$
    with the set of $A_j$ and $b_j$ being controllable in complexity as follows:

$$A_j = \theta_j^A R_j + \sum_{k=1}^{K} \theta_j^k V_k \quad (22), \qquad b_j = \theta_j^b r_j + \sum_{k=1}^{K} \theta_j^k v_k \quad (23)$$

    where $R_j$ is a mixture component specific transform matrix, and $\theta_j^k$ is the weighting factor applied to the k-th transform matrix $V_k$. In summary, the resulting transform matrix $A_j$ for mixture component j is a linear combination of a small set of transforms. The matrix $R_j$ is typically a zero matrix, or a constrained matrix to enable some simplified transforms that would not typically be available using the remaining transformation matrices. It may also be a preset mixture-component-specific matrix that is known to be a reasonable solution to the problem.
  • Conversely, the offset vector $b_j$ for mixture component j may be determined in a similar manner, with $r_j$ being the mixture component specific offset and $v_k$ being the k-th offset vector. The vector $r_j$ is typically a zero vector or a pre-selected, mixture component specific, vector constant. In the case when the vector is a preset constant, the remainder of the equation is designed to maximize the target function by optimizing for the residual.
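  • A minimal sketch of this posterior-weighted combination of affine transforms (equations (19)-(23)) is shown below; the array shapes and argument names are assumptions made for illustration, not part of the original disclosure:

```python
import numpy as np
from scipy.special import logsumexp

def map_frame(x, part_gmm, R, r, V, v, theta_A, theta_b, theta):
    """Map one feature vector x (D,) with the posterior-weighted combination of
    affine transforms of eqs. (19)-(23).  part_gmm = (weights, means, variances)
    of the J-component partitioning GMM (diagonal covariances); R is J x D x D,
    r is J x D, V is K x D x D, v is K x D, theta is J x K."""
    w, mu, var = part_gmm
    log_post = (np.log(w)
                - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
                - 0.5 * np.sum((x - mu) ** 2 / var, axis=1))
    gamma = np.exp(log_post - logsumexp(log_post))        # Pr(j | x), eq. (20)
    y = np.zeros_like(x)
    for j, g in enumerate(gamma):
        A_j = theta_A[j] * R[j] + np.tensordot(theta[j], V, axes=1)   # eq. (22)
        b_j = theta_b[j] * r[j] + theta[j] @ v                        # eq. (23)
        y += g * (A_j @ x + b_j)                          # eqs. (19) and (21)
    return y
```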
  • An alternative weighting function is also proposed that considers only the top scoring mixture component:

$$\Psi(x) = \Psi_{j_{\max}}(x), \qquad \text{where} \quad j_{\max} = \arg\max_{j=1,\ldots,J} \Pr(j \mid x) \quad (24)$$
  • Optimization
  • Equation (18) may be maximized, with the transformation function of Equation (19), using a number of techniques. Between iterations, if no transformed observations change which significant Gaussian class they belong to in the original acoustic UBM, the problem is a matrix-quadratic optimization problem. In one embodiment, because transformed vectors do change their Gaussian class between iterations, a gradient ascent approach is taken. Consequently, the functional derivative needs to be determined.
  • For the derivative calculation, it is assumed that no or very few mapped feature vector observations lie on or near the decision boundary between two Gaussians (in which case the slope estimate is only an approximation). Depending on the configuration of the system this assumption may be significant and would then need an additional derivative approximation for the mixture component counts. If A and b are to be optimized directly, the partial derivative approximation with respect to one of the optimization variables, Ω, is presented.

$$\frac{\partial Q}{\partial \Omega} = \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{d=1}^{D} \left\{ \frac{n_i^s}{\Sigma_{i,dd}} \left( \frac{\partial \overline{y_{i,d}^s}}{\partial \Omega} \right) \times \left( \left( \mu_{i,d}^s - \Phi_{i,dd}^s\, m_{i,d} \right) - \left( 1 - \Phi_{i,dd}^s \right) \overline{y_{i,d}^s} \right) \right\} \quad (25)$$
  • The variable \partial \bar{y}_i^{d,s} / \partial \Omega may be substituted by any one of the following partial derivative results:

\gamma_{j,t}^{s} = \Pr(j \mid x_t^s)   (26)

\frac{\partial \bar{y}_i^{d,s}}{\partial b_j^{d}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \gamma_{j,t}^{s}   (27)

\frac{\partial \bar{y}_i^{d,s}}{\partial A_j^{de}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \gamma_{j,t}^{s} \, x_t^{e,s}   (28)

\frac{\partial \bar{y}_i^{d,s}}{\partial A_j^{fe}} = 0 \quad \text{if } d \neq f   (29)
  • This results in a series of equations to solve. Note that the assumption here is that the transformation variations between iterations are small and that the number of observations changing from iteration to iteration is negligible.
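A compact sketch of how the statistics in equations (27)-(28) might be accumulated for one utterance is given below. It assumes hard assignment of each mapped frame to a single UBM Gaussian and uses illustrative array conventions; it is not the patent's implementation.

```python
import numpy as np

def mean_stat_derivatives(x, gamma, assign, J, N):
    """Accumulate the per-Gaussian derivative statistics of equations (27)-(28)
    for one utterance s.
    x:      (T, D) source features x_t^s
    gamma:  (T, J) transform-GMM posteriors gamma_{j,t}^s
    assign: (T,)   index i of the UBM Gaussian each mapped frame y_t^s falls in
    Returns d ybar_i^d / d b_j^d of shape (N, J) and
            d ybar_i^d / d A_j^{de} of shape (N, J, D), indexed by column e."""
    D = x.shape[1]
    n = np.zeros(N)
    d_b = np.zeros((N, J))
    d_A = np.zeros((N, J, D))
    for t in range(x.shape[0]):
        i = assign[t]
        n[i] += 1
        d_b[i] += gamma[t]                             # sum_t gamma_{j,t}
        d_A[i] += gamma[t][:, None] * x[t][None, :]    # sum_t gamma_{j,t} x_t^e
    n = np.maximum(n, 1e-10)                           # guard empty Gaussians
    return d_b / n[:, None], d_A / n[:, None, None]
```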
  • Correspondingly, if a SPAM model equivalent is substituted to reduce the number of parameters to optimize, the slope functions become the following.
  • The mixture component specific transformation weightings are as follows. These weighting factors are established to efficiently manage the search space for the transformation.

\frac{\partial \bar{y}_i^{d,s}}{\partial \theta_j^{A}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \gamma_{j,t}^{s} \sum_{q=1}^{D} R_j^{dq} x_t^{q,s}   (30)

\frac{\partial \bar{y}_i^{d,s}}{\partial \theta_j^{b}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \gamma_{j,t}^{s} \, r_j^{d}   (31)

\frac{\partial \bar{y}_i^{d,s}}{\partial \theta_j^{k}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \gamma_{j,t}^{s} \left( v_k^{d} + \sum_{q=1}^{D} V_k^{dq} x_t^{q,s} \right)   (32)
  • The derivative may be calculated for the subspace matrices and vectors.

\frac{\partial \bar{y}_i^{d,s}}{\partial v_k^{d}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \sum_{j=1}^{J} \gamma_{j,t}^{s} \theta_j^{k}   (33)

\frac{\partial \bar{y}_i^{d,s}}{\partial V_k^{de}} = \frac{1}{n_i^s} \sum_{t :\, y_t^s \in i} \left( \sum_{j=1}^{J} \gamma_{j,t}^{s} \theta_j^{k} \right) x_t^{e,s}   (34)

\frac{\partial \bar{y}_i^{d,s}}{\partial V_k^{fe}} = 0 \quad \text{if } d \neq f   (35)
  • Let \Omega be the vector of variables that is to be optimized in terms of maximizing Q. The gradient ascent algorithm can now be used accordingly, given the parametric estimates of the slopes.

\Omega_{\text{new}} = \Omega_{\text{old}} + \eta \left( \left. \frac{\partial \log Q}{\partial \Omega} \right|_{\Omega_{\text{old}}} \right)   (36)
    where η is the learning rate.
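The update of equation (36) is an ordinary steepest-ascent step; a minimal sketch is shown below, with a hypothetical gradient routine name and an assumed learning rate.

```python
import numpy as np

def gradient_ascent_step(omega, grad, eta=1e-3):
    """Equation (36): one steepest-ascent update of the transform parameters."""
    return omega + eta * grad

# Illustrative loop; compute_dQ_dOmega is a hypothetical routine that evaluates
# equation (25) with the substitutions of equations (27)-(35).
# for _ in range(num_iterations):
#     omega = gradient_ascent_step(omega, compute_dQ_dOmega(omega, data), eta=1e-3)
```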
  • Generalized Mapping: In the embodiments described above, the methods were employed to determine an optimal mapping between a first channel state and a second channel state. In another form, the mapping that is calculated may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required. In addition, the mapping system can then map arbitrarily from any channel state to any other channel state. In this sense, the optimization functions of equations (3), (4) or (5) may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform). The applied mapping is formed from several transforms that are selected (or weighted) according to their relevance. Once the mapping block has learned the set of transforms, the speaker recognition system may be evaluated. The multi-transform mapping block is then used to map all utterances. The benefit is that no explicit handset or channel labels are required.
  • Alternative Optimization Function
  • There are many different transformation optimization functions that can be determined and derived using a similar process. For the purpose of illustration, another example derivation follows. An estimate of equation (3) was presented in equation (5) and its optimization procedure was derived. In a similar manner, equation (4), although more computationally expensive, may also be optimized using a set of speaker or audio class models. In the log domain, equation (4) may be represented as an alternative (\log Q_{A2}) to equation (17) as follows:

\log Q_{A2} = \sum_{s=1}^{S} \left( \log p(\bar{Y}^s \mid \lambda_s) - \log \sum_{h} p(\bar{Y}^s \mid \lambda_h) \right)   (37)
    Under similar constraints to the previously suggested optimization function, the core function to optimize is given as:

Q_A = \sum_{s=1}^{S} \left\{ \left( \sum_{i=1}^{N} \sum_{d=1}^{D} \frac{n_i^s}{2} \Sigma_i^{dd} \left[ 2 \bar{y}_i^{d,s} \mu_i^{d,s} - \left( \mu_i^{d,s} \right)^2 \right] \right) - \log \left( \sum_{h} \exp \left\{ \sum_{i=1}^{N} \sum_{d=1}^{D} \frac{n_i^s}{2} \Sigma_i^{dd} \left[ 2 \bar{y}_i^{d,s} \mu_i^{d,h} - \left( \mu_i^{d,h} \right)^2 \right] \right\} \right) \right\}   (38)
    To optimize this function, the same steepest ascent procedure is preferably selected. To perform the optimization, an approximation to the slope is needed. In this slope approximation, it is assumed that n_i^s remains relatively constant. In many instances this assumption may not be appropriate. Accordingly, an approximation to the derivative is presented:

\frac{\partial Q}{\partial \Omega} \approx \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{d=1}^{D} \left\{ n_i^s \, \Sigma_i^{dd} \left( \frac{\partial \bar{y}_i^{d,s}}{\partial \Omega} \right) \times \left( \mu_i^{d,s} - \sum_{h} \Pr(\lambda_h \mid \bar{Y}^s) \, \mu_i^{d,h} \right) \right\}   (39)
    Here \Pr(\lambda_h \mid \bar{Y}^s) is the probability of the speaker model \lambda_h given the transformed utterance \bar{Y}^s. The term \partial \bar{y}_i^{d,s} / \partial \Omega may be substituted by any of the corresponding equations from equation (27) to equation (35). The steepest ascent algorithm may be performed, as before, to determine the transform or transforms to apply.
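In practice, \Pr(\lambda_h \mid \bar{Y}^s) in equation (39) can be obtained from the per-model log-likelihoods of the transformed utterance. The sketch below assumes equal speaker priors, which is an assumption of this illustration rather than a statement from the patent.

```python
import numpy as np

def speaker_posteriors(loglikes):
    """Pr(lambda_h | Y) from per-speaker log-likelihoods, assuming equal priors.
    Shifting by the maximum is the usual log-sum-exp stabilization."""
    shifted = np.asarray(loglikes) - np.max(loglikes)
    p = np.exp(shifted)
    return p / p.sum()
```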
  • Efficient Algorithmic Implementation
  • Due to the nature of the optimization process, a number of techniques can be introduced to speed up the procedure. As already identified and derived above, only the single top mixture component for each audio frame is scored. This may also be extended to the feature transformation mapping algorithm such that only the mapping corresponding to the top scoring component of the feature partitioning GMM is applied, rather than summing over the contributions of the mappings corresponding to all mixture components. In addition to using only the top mixture component throughout the system, the target and background model representations can be forced to be a function of the Universal Background Model (UBM); a coupled model system. Consequently, the UBM can be used to determine which mixture components are the largest contributors to the frame based likelihood. Thus, once the Gaussian indexes are obtained from the UBM, they may be applied to the coupled target adapted model and the background model representations. It is noted that the top 5 mixture components were considered for the transformation function GMM. This approximation introduced significant speedups. Each of these previously mentioned items was included in the current system, but it should be noted that additional speed optimizations are available.
  • One technique is to test whether a particular Gaussian is the most significant Gaussian for the current feature vector. Given that this algorithm is iterative and applies small adjustments to the mapping parameters, a Gaussian component that was dominant on a previous iteration may also be relevant for the current iteration. If the probability density of the vector for that Gaussian is larger than a predetermined threshold for the corresponding Gaussian component, then it is taken to be the most significant Gaussian for the GMM. This technique operates more effectively for sparse Gaussian mixture components.
  • Another method is to construct, for each Gaussian component, a table of close Gaussians or Gaussian competitors. Given that the mapping parameters are adjusted in an incremental manner, the Gaussian lookup table for the most significant Gaussian of the previous iteration may be evaluated to rapidly locate the most significant Gaussian for the next iteration. The table length may be configured to trade off the search speed against the accuracy of locating the most likely Gaussian component.
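The competitor-table idea can be sketched as follows; the nearest-neighbour criterion (Euclidean distance between means), the diagonal-covariance scoring, and the table length are all assumptions of the sketch, not details fixed by the patent.

```python
import numpy as np

def build_competitor_table(means, table_len=5):
    """For each Gaussian, store the indices of its nearest neighbours (by mean
    distance here; a divergence-based measure could be substituted)."""
    d2 = ((means[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, 1:table_len + 1]

def refreshed_top_gaussian(x, prev_top, table, weights, means, variances):
    """Re-evaluate only the previous winner and its listed competitors."""
    candidates = np.concatenate(([prev_top], table[prev_top]))
    # Diagonal-covariance log-densities plus log mixture weights.
    ll = (np.log(weights[candidates])
          - 0.5 * (np.log(2 * np.pi * variances[candidates]).sum(-1)
                   + (((x - means[candidates]) ** 2) / variances[candidates]).sum(-1)))
    return candidates[int(np.argmax(ll))]
```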
  • Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, a system/method 100 is illustratively shown which provides a speaker recognition and identification system in accordance with one embodiment. In block 102, speaker models trained using a first condition state or input type are provided. For example, speaker models trained using a landline telephone or a microphone to collect speaker utterances are stored in a database or provided as a model. These models may be created from audio from a single channel type. In block 104, feature sets from a set of speaker utterances in a second condition state are generated as input for a discriminative training criterion.
  • In block 110, the discriminative criterion (from blocks 102 and 104) is maximized over a plurality of speakers by applying, e.g., a steepest ascent algorithm (or similar optimization) to determine a best transform function or set of transform functions. This includes maximizing Q1 in equation (3) (or QA in equation (39)).
  • In block 110, a discriminative criterion objective function is specified using the existing speaker models and the non-transformed utterances. This discriminative criterion is applied to generate the transformed utterance to obtain a computed result, which may be determined either arbitrarily or empirically. An optimization metric of a speaker based on discrimination between speaker classes is preferably performed. The discriminative criterion may include equation (3) (or equation (39)).
  • The result giving the objective function maximum gives the transform or transforms. This transform may then be used to convert/map or neutralize the inputs received over a different input type. The transforms may be adjusted to provide accommodation for the currently used input type.
  • The best transform may be used for recognizing speech and/or identifying a speaker under the condition state of the received utterance to reduce channel mismatch. The system may undergo many input conditions and a best transform may be applied for each input condition. In one embodiment posterior probabilities are maximized in the maximizing step.
  • The present embodiments use speaker classes to determine the transform. The result is that the most likely speaker is determined instead of the most likely acoustic match. In addition, the transform is calculated once to maximize Q such that the maximum Q gives the transform. The speaker space is broken down based on subsets or classes of speakers. The maximum likelihood (or a related metric) of seeing a particular speaker is used to determine the transform (as opposed to simply matching the acoustic input).
  • In block 112, at least one speaker model may be transformed using the best transform to create a new model for decoding speech or identifying a speaker. The speaker model may be transformed from a first input type to a second input type by directly mapping features from the first input type to a second input type using the transform.
  • The mapping done by the transform may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required. In addition, the mapping system can then map arbitrarily from any channel state to any other channel state. Once the mapping block has learned the set of transforms, the speaker recognition system may be evaluated. The multi-transform mapping block is then used to map all utterances. The benefit is that no explicit handset or channel labels are required.
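One way the model transformation of block 112 could be realized is to push the learned affine maps through the MAP-adapted model means. The sketch below is a hypothetical illustration; in particular, weighting each mean by a per-mean transform weight (for example, the transform-GMM posterior evaluated at that mean) is an assumption, not a detail given in the patent.

```python
import numpy as np

def transform_speaker_model(means, A, b, gamma_of_mean):
    """Map a speaker model's mixture means from the first input type toward the
    second by applying the learned affine transforms (cf. block 112).
    gamma_of_mean[i, j] weights transform j when moving model mean i."""
    J = len(A)
    new_means = np.empty_like(means)
    for i, mu in enumerate(means):
        new_means[i] = sum(gamma_of_mean[i, j] * (A[j] @ mu + b[j]) for j in range(J))
    return new_means
```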
  • Referring to FIG. 2, a system/method 200 for speaker recognition and identification in accordance with an illustrative embodiment is shown. A similar method may be employed for other audio analysis as well. A plurality of transforms is provided for decoding utterances, wherein the transforms correspond to a plurality of input types or conditions. This may include a single transform or a plurality of transforms to handle multiple conditions. In block 210, a transform(s) is applied to features from a speaker based upon the input type or all input types. Block 206 indicates that the transforms are precomputed by the method shown in FIG. 1. The precomputation of the transform or transforms may be performed at the time of manufacture of the system or may be recomputed intermittently to account for new input types or other system changes.
  • In block 208, the best transform or series of transforms is determined for each or all input types and applied by determining conditions under which a speaker is providing input. The input types may include, e.g., telephone handset types, channel types and/or microphone types. The best transform may include transforming the input to a neutralized channel condition which counters effects of the input state, or any other transform that reduces mismatch between input types.
  • In block 212, the speaker is identified or the utterance is decoded in accordance with the input type correction provided herein. Advantageously, channel mismatch is reduced or eliminated.
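At run time this amounts to a simple lookup of the precomputed mapping for the detected input condition. The sketch below is illustrative only; the channel labels and the callable-based interface are assumptions.

```python
def apply_channel_compensation(features, input_type, transform_bank, default=None):
    """Select the precomputed mapping for the detected input type (e.g. 'carbon',
    'electret', 'cellular') and apply it frame by frame; pass features through
    unchanged if no mapping is available."""
    mapping = transform_bank.get(input_type, default)
    if mapping is None:
        return features
    return [mapping(x) for x in features]
```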
  • In the same way that the discriminative technique was designed to perform a channel mapping that was optimized to differentiate between speakers, the same procedure as described above may be applied to other audio applications related to audio scene analysis, music and song detection and audio enhancement of corrupted audio channels. One such example would be to recognize artists or songs being played over the radio or media. This greatly reduces the effect of the channel differences on the system attempting to detect the song type or artist. The technique effectively removes the differences between different audio channels and simplifies the matching process for the audio classifier. Other applications are also contemplated.
  • Experiments:
  • Evaluation and development data: To demonstrate the present invention, a speaker recognition system was evaluated on the NIST 2000 dataset. This particular dataset consists mostly of landline telephone calls from carbon-button or electret based telephone handsets. Given that there are two classes of audio data, the feature transformation mechanism was designed to map from carbon-button features to electret based features. In this database, for the primary condition, there are 4991 audio segments (including 2470 male test and 2521 female test audio segments) tested against 1003 speakers, of which 457 are male and 546 are female.
  • The audio training utterance durations were approximately two minutes with test utterance durations of 15-45 seconds. The NIST 1999 speaker recognition database was included as development data to train the UBM and the corresponding carbon handset to electret handset transformation function. This database was selected because of the significant quantity of carbon and electret handset data available. The same principle may be applied to the more recent speaker recognition evaluations by providing several channel mapping functions dependent upon the channel type.
  • System Description Used in Experiments
  • A speaker recognition system in accordance with embodiments of the present invention includes two main components, a feature extraction module and a speaker modeling module, as is known in the art.
  • For the feature extraction module in accordance with one embodiment, Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from filter banks. In an illustrative embodiment, 19 Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from 24 filter banks. The cepstral features may be extracted, e.g., by using 32 ms frames at a 10 ms frame shift. The corresponding delta features were calculated.
  • Feature Warping may be applied to all features to mitigate the effects of linear channels and slowly varying additive noise. The speaker modeling module generated speaker models through MAP adaptation of a Universal Background Model. This implementation of the MAP adaptation approach adjusted the Gaussian components toward the target speaker speech features. The mixture component mean parameters were also adapted. In this work, a single iteration of the EM-MAP algorithm was performed. In testing, only the top mixture component from the UBM was scored and used to reference the corresponding components in other models. Other mixtures and numbers of mixture components may also be employed.
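A front end along these lines could be sketched with off-the-shelf tooling as below. The use of librosa, the 8 kHz sampling rate, and the sliding-window realization of feature warping are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np
import librosa
from scipy.stats import norm

def front_end(signal, sr=8000, n_mfcc=19, n_mels=24):
    """19 MFCCs from 24 filter banks, 32 ms frames at a 10 ms shift, plus deltas."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=int(0.032 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc)])
    return feats.T                                    # (frames, 2 * n_mfcc)

def feature_warp(feats, window=300):
    """Map each coefficient's short-term rank statistics to a standard normal
    (one common realization of feature warping; details here are illustrative)."""
    warped = np.empty_like(feats)
    half = window // 2
    T = feats.shape[0]
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        block = feats[lo:hi]
        rank = (block < feats[t]).sum(axis=0) + 0.5   # mid-rank of the centre frame
        warped[t] = norm.ppf(rank / block.shape[0])
    return warped
```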
  • Results
  • A version of the system was evaluated on a challenging subset of the NIST 2000 Evaluation. The subset of trials was selected purposely to isolate the effect of the channel mapping from the carbon test utterance type to the electret model type. Thus, only carbon tests against electret models were evaluated in this subset. The results indicated a reduction in minimum detection cost function (DCF) from 0.057 to 0.054 and a decrease in equal error rate (EER) from 14.9% to 13.0%. The improvements were realized using a single transformation determined from 100 unique speakers. Additional error reductions are expected by using more speaker data to calculate the transform. The results described herein are for illustrative purposes only.
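For reference, EER and minimum DCF can be estimated from trial scores by sweeping a decision threshold, as in the sketch below. The cost weights shown are assumed NIST-style values, not parameters stated in the patent.

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Estimate the equal error rate and minimum detection cost over all thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < th).mean() for th in thresholds])
    p_fa = np.array([(nontarget_scores >= th).mean() for th in thresholds])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2.0
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()
```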
  • Referring to FIG. 3, a system 300 for providing class specific transformations based on input type or conditions is illustratively depicted. An audio classification system or device 300 may include a personal computer, a telephone system, an answering system, a security system or any other device or system where multiple users and/or multiple input types or devices may be present. Device 300 is capable of supporting a software application or module 302 which provides audio classification, which can map the input types as described above to enable identification of a speaker, the decoding of utterances or other audio classification processes. Application 302 may include a speech recognition system, speech to speech system, text to speech system or other audio classification processing module 304 capable of audio processing (e.g., for audio scene analysis, etc.).
  • In one embodiment, input utterances may be received from a plurality of different input types and/or channels (telephones, microphones, etc.). Inputs 301 may include microphones of different types, telephones of different types, prerecorded audio sent via different channels or methods or any other input device. A module 306 may include a speech synthesizer, a printer, recording media, a computer or other data port or any other suitable device that uses the output of application 302.
  • Application 302 stores precomputed transforms 310 which are best adapted to account for channel mismatch. In one embodiment, the transforms 310 include a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not needed. The system can then map arbitrarily from any channel state to any other channel state. The optimization functions may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform) to provide this functionality at training. The applied mapping may be formed from several transforms that are selected (or weighted) according to their relevance. Once the set of transforms is learned, the speaker recognition system may be evaluated to ensure proper operation on any available channel types for that application. The multi-transform mapping block is then used to map all utterances.
  • Module 306 may include a security device that permits a user access to information, such as a database, an account, applications, etc. based on an authorization or confirmed identity as determined by application 302. Speech recognition system 304 may also recognize or decode the speech of a user despite the input type 301 that the user employs to communicate with device 300.
  • Having described preferred embodiments of a system and method for addressing channel mismatch through class specific (e.g., speaker discrimination) transforms (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Claims (20)

1. A method for audio classification, comprising:
transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance; and
maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
2. The method as recited in claim 1, further comprising employing a speaker model trained using a first channel condition provided by a first hardware type.
3. The method as recited in claim 2, wherein the second condition state includes a second channel condition provided by a second hardware type.
4. The method as recited in claim 2, wherein the second condition state includes a neutralized channel condition which counters effects of the first condition state.
5. The method as recited in claim 1, wherein the system undergoes many input conditions and further comprises applying a best transform for each input condition.
6. The method as recited in claim 1, wherein maximizing a discriminative criterion includes determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker.
7. The method as recited in claim 1, wherein the discriminative criterion includes:
Q_1 = \sum_{s=1}^{S} \Pr(\lambda_s \mid \bar{Y}^s)   (3)
where Q_1 is a function to be optimized, and \Pr(\lambda_s \mid \bar{Y}^s) is a posterior probability of speaker model \lambda_s, given the speaker's channel matched transformed utterance, \bar{Y}^s.
8. The method as recited in claim 1, further comprising decoding speech based on a selected transform.
9. The method as recited in claim 1, further comprising transforming at least one speaker model from a first input type corresponding to the first condition state to a second input type corresponding to the second condition state by directly mapping features from the first input type to a second input type using a transform.
10. The method as recited in claim 1, wherein maximizing includes maximizing posterior probabilities.
11. A computer program product for audio classification comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:
transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance; and
maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
12. A method for audio classification, comprising:
providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types; and
applying one of the transforms to a speaker based upon the input type;
wherein the transforms are precomputed by:
transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance; and
maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
13. The method as recited in claim 12, wherein the best transform is determined for each input type and applied by determining conditions under which a speaker is providing input.
14. The method as recited in claim 12, wherein the input types include one or more of telephone handsets, channel types and microphones.
15. The method as recited in claim 12, wherein the second condition state includes a neutralized channel condition which counters effects of the first condition state.
16. The method as recited in claim 12, wherein maximizing a discriminative criterion includes determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker.
17. The method as recited in claim 12, wherein the discriminative criterion includes:
Q_1 = \sum_{s=1}^{S} \Pr(\lambda_s \mid \bar{Y}^s)   (3)
where Q_1 is a function to be optimized, and \Pr(\lambda_s \mid \bar{Y}^s) is the posterior probability of speaker model \lambda_s, given the speaker's channel matched transformed utterance, \bar{Y}^s.
18. The method as recited in claim 17, further comprising decoding speech based on a selected transform.
19. The method as recited in claim 12, wherein the transform reduces mismatch between input types.
20. A computer program product for audio classification comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:
providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types; and
applying one of the transforms to a speaker based upon the input type;
wherein the transforms are precomputed by:
transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance; and
maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
US11/391,891 2006-03-29 2006-03-29 System and method for addressing channel mismatch through class specific transforms Abandoned US20070239441A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/391,891 US20070239441A1 (en) 2006-03-29 2006-03-29 System and method for addressing channel mismatch through class specific transforms
US12/132,079 US8024183B2 (en) 2006-03-29 2008-06-03 System and method for addressing channel mismatch through class specific transforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/391,891 US20070239441A1 (en) 2006-03-29 2006-03-29 System and method for addressing channel mismatch through class specific transforms

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/132,079 Continuation US8024183B2 (en) 2006-03-29 2008-06-03 System and method for addressing channel mismatch through class specific transforms

Publications (1)

Publication Number Publication Date
US20070239441A1 true US20070239441A1 (en) 2007-10-11

Family

ID=38576534

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/391,891 Abandoned US20070239441A1 (en) 2006-03-29 2006-03-29 System and method for addressing channel mismatch through class specific transforms
US12/132,079 Expired - Fee Related US8024183B2 (en) 2006-03-29 2008-06-03 System and method for addressing channel mismatch through class specific transforms

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/132,079 Expired - Fee Related US8024183B2 (en) 2006-03-29 2008-06-03 System and method for addressing channel mismatch through class specific transforms

Country Status (1)

Country Link
US (2) US20070239441A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070233483A1 (en) * 2006-04-03 2007-10-04 Voice. Trust Ag Speaker authentication in digital communication networks
US20080167868A1 (en) * 2007-01-04 2008-07-10 Dimitri Kanevsky Systems and methods for intelligent control of microphones for speech recognition applications
US20080235016A1 (en) * 2007-01-23 2008-09-25 Infoture, Inc. System and method for detection and analysis of speech
US20090228272A1 (en) * 2007-11-12 2009-09-10 Tobias Herbig System for distinguishing desired audio signals from noise
US20120166195A1 (en) * 2010-12-27 2012-06-28 Fujitsu Limited State detection device and state detecting method
CN103226948A (en) * 2013-04-22 2013-07-31 山东师范大学 Audio scene recognition method based on acoustic events
US8744847B2 (en) 2007-01-23 2014-06-03 Lena Foundation System and method for expressive language assessment
US20140337026A1 (en) * 2013-05-09 2014-11-13 International Business Machines Corporation Method, apparatus, and program for generating training speech data for target domain
CN104167211A (en) * 2014-08-08 2014-11-26 南京大学 Multi-source scene sound abstracting method based on hierarchical event detection and context model
US8938390B2 (en) 2007-01-23 2015-01-20 Lena Foundation System and method for expressive language and developmental disorder assessment
US9240188B2 (en) 2004-09-16 2016-01-19 Lena Foundation System and method for expressive language, developmental disorder, and emotion assessment
US9355651B2 (en) 2004-09-16 2016-05-31 Lena Foundation System and method for expressive language, developmental disorder, and emotion assessment
US10223934B2 (en) 2004-09-16 2019-03-05 Lena Foundation Systems and methods for expressive language, developmental disorder, and emotion assessment, and contextual feedback
US10529357B2 (en) 2017-12-07 2020-01-07 Lena Foundation Systems and methods for automatic determination of infant cry and discrimination of cry from fussiness
CN111210809A (en) * 2018-11-22 2020-05-29 阿里巴巴集团控股有限公司 Voice training data adaptation method and device, voice data conversion method and electronic equipment
US11449372B1 (en) * 2019-06-28 2022-09-20 Amazon Technologies, Inc. System for enforcing use of schemas and interfaces

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143326B (en) * 2013-12-03 2016-11-02 腾讯科技(深圳)有限公司 A kind of voice command identification method and device
US9373324B2 (en) 2013-12-06 2016-06-21 International Business Machines Corporation Applying speaker adaption techniques to correlated features
KR102421027B1 (en) * 2020-08-28 2022-07-15 국방과학연구소 Apparatus, method, computer-readable storage medium and computer program for speaker voice analysis

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6487530B1 (en) * 1999-03-30 2002-11-26 Nortel Networks Limited Method for recognizing non-standard and standard speech by speaker independent and speaker dependent word models
US6760601B1 (en) * 1999-11-29 2004-07-06 Nokia Corporation Apparatus for providing information services to a telecommunication device user
US6941264B2 (en) * 2001-08-16 2005-09-06 Sony Electronics Inc. Retraining and updating speech models for speech recognition
US7240007B2 (en) * 2001-12-13 2007-07-03 Matsushita Electric Industrial Co., Ltd. Speaker authentication by fusion of voiceprint match attempt results with additional information
US7181393B2 (en) * 2002-11-29 2007-02-20 Microsoft Corporation Method of real-time speaker change point detection, speaker tracking and speaker model construction

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5528731A (en) * 1993-11-19 1996-06-18 At&T Corp. Method of accommodating for carbon/electret telephone set variability in automatic speaker verification
US6032115A (en) * 1996-09-30 2000-02-29 Kabushiki Kaisha Toshiba Apparatus and method for correcting the difference in frequency characteristics between microphones for analyzing speech and for creating a recognition dictionary
US6760701B2 (en) * 1996-11-22 2004-07-06 T-Netix, Inc. Subword-based speaker verification using multiple-classifier fusion, with channel, fusion, model and threshold adaptation
US5950157A (en) * 1997-02-28 1999-09-07 Sri International Method for establishing handset-dependent normalizing models for speaker recognition
US7043427B1 (en) * 1998-03-18 2006-05-09 Siemens Aktiengesellschaft Apparatus and method for speech recognition
US6980952B1 (en) * 1998-08-15 2005-12-27 Texas Instruments Incorporated Source normalization training for HMM modeling of speech
US6233556B1 (en) * 1998-12-16 2001-05-15 Nuance Communications Voice processing and verification system
US6751588B1 (en) * 1999-11-23 2004-06-15 Sony Corporation Method for performing microphone conversions in a speech recognition system
US7451085B2 (en) * 2000-10-13 2008-11-11 At&T Intellectual Property Ii, L.P. System and method for providing a compensated speech recognition model for speech recognition
US6804647B1 (en) * 2001-03-13 2004-10-12 Nuance Communications Method and system for on-line unsupervised adaptation in speaker verification
US6912497B2 (en) * 2001-03-28 2005-06-28 Texas Instruments Incorporated Calibration of speech data acquisition path
US6778957B2 (en) * 2001-08-21 2004-08-17 International Business Machines Corporation Method and apparatus for handset detection
US6934364B1 (en) * 2002-02-28 2005-08-23 Hewlett-Packard Development Company, L.P. Handset identifier using support vector machines

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9899037B2 (en) 2004-09-16 2018-02-20 Lena Foundation System and method for emotion assessment
US9240188B2 (en) 2004-09-16 2016-01-19 Lena Foundation System and method for expressive language, developmental disorder, and emotion assessment
US9355651B2 (en) 2004-09-16 2016-05-31 Lena Foundation System and method for expressive language, developmental disorder, and emotion assessment
US10573336B2 (en) 2004-09-16 2020-02-25 Lena Foundation System and method for assessing expressive language development of a key child
US9799348B2 (en) 2004-09-16 2017-10-24 Lena Foundation Systems and methods for an automatic language characteristic recognition system
US10223934B2 (en) 2004-09-16 2019-03-05 Lena Foundation Systems and methods for expressive language, developmental disorder, and emotion assessment, and contextual feedback
US7970611B2 (en) * 2006-04-03 2011-06-28 Voice.Trust Ag Speaker authentication in digital communication networks
US20070233483A1 (en) * 2006-04-03 2007-10-04 Voice. Trust Ag Speaker authentication in digital communication networks
US20080167868A1 (en) * 2007-01-04 2008-07-10 Dimitri Kanevsky Systems and methods for intelligent control of microphones for speech recognition applications
US8140325B2 (en) * 2007-01-04 2012-03-20 International Business Machines Corporation Systems and methods for intelligent control of microphones for speech recognition applications
US8744847B2 (en) 2007-01-23 2014-06-03 Lena Foundation System and method for expressive language assessment
US8078465B2 (en) * 2007-01-23 2011-12-13 Lena Foundation System and method for detection and analysis of speech
US20080235016A1 (en) * 2007-01-23 2008-09-25 Infoture, Inc. System and method for detection and analysis of speech
US8938390B2 (en) 2007-01-23 2015-01-20 Lena Foundation System and method for expressive language and developmental disorder assessment
US8131544B2 (en) * 2007-11-12 2012-03-06 Nuance Communications, Inc. System for distinguishing desired audio signals from noise
US20090228272A1 (en) * 2007-11-12 2009-09-10 Tobias Herbig System for distinguishing desired audio signals from noise
US8996373B2 (en) * 2010-12-27 2015-03-31 Fujitsu Limited State detection device and state detecting method
US20120166195A1 (en) * 2010-12-27 2012-06-28 Fujitsu Limited State detection device and state detecting method
CN103226948A (en) * 2013-04-22 2013-07-31 山东师范大学 Audio scene recognition method based on acoustic events
US20140337026A1 (en) * 2013-05-09 2014-11-13 International Business Machines Corporation Method, apparatus, and program for generating training speech data for target domain
US10217456B2 (en) * 2013-05-09 2019-02-26 International Business Machines Corporation Method, apparatus, and program for generating training speech data for target domain
CN104167211A (en) * 2014-08-08 2014-11-26 南京大学 Multi-source scene sound abstracting method based on hierarchical event detection and context model
US10529357B2 (en) 2017-12-07 2020-01-07 Lena Foundation Systems and methods for automatic determination of infant cry and discrimination of cry from fussiness
US11328738B2 (en) 2017-12-07 2022-05-10 Lena Foundation Systems and methods for automatic determination of infant cry and discrimination of cry from fussiness
CN111210809A (en) * 2018-11-22 2020-05-29 阿里巴巴集团控股有限公司 Voice training data adaptation method and device, voice data conversion method and electronic equipment
US11449372B1 (en) * 2019-06-28 2022-09-20 Amazon Technologies, Inc. System for enforcing use of schemas and interfaces

Also Published As

Publication number Publication date
US8024183B2 (en) 2011-09-20
US20080235007A1 (en) 2008-09-25

Similar Documents

Publication Publication Date Title
US8024183B2 (en) System and method for addressing channel mismatch through class specific transforms
Žmolíková et al. Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures
Li et al. An overview of noise-robust automatic speech recognition
Huo et al. A Bayesian predictive classification approach to robust speech recognition
Hanilci et al. Recognition of brand and models of cell-phones from recorded speech signals
US7664643B2 (en) System and method for speech separation and multi-talker speech recognition
JP5459680B2 (en) Speech processing system and method
Hanilçi et al. Source cell-phone recognition from recorded speech using non-speech segments
US20080208581A1 (en) Model Adaptation System and Method for Speaker Recognition
Besacier et al. Localization and selection of speaker-specific information with statistical modeling
US6931374B2 (en) Method of speech recognition using variational inference with switching state space models
JP2002140089A (en) Method and apparatus for pattern recognition training wherein noise reduction is performed after inserted noise is used
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
Tsao et al. An ensemble speaker and speaking environment modeling approach to robust speech recognition
Pujol et al. On real-time mean-and-variance normalization of speech recognition features
Herbig et al. Self-learning speaker identification for enhanced speech recognition
Scheffer et al. Recent developments in voice biometrics: Robustness and high accuracy
Ozerov et al. GMM-based classification from noisy features
Pullella et al. Robust speaker identification using combined feature selection and missing data recognition
Nandwana et al. Analysis and mitigation of vocal effort variations in speaker recognition
JP2009116278A (en) Method and device for register and evaluation of speaker authentication
Herbig et al. Simultaneous speech recognition and speaker identification
Yamamoto et al. Genetic algorithm-based improvement of robot hearing capabilities in separating and recognizing simultaneous speech signals
Herbig et al. Detection of unknown speakers in an unsupervised speech controlled system
Fujimoto et al. A Robust Estimation Method of Noise Mixture Model for Noise Suppression.

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAVRATIL, JIRI;PELECANOS, JASON;RAMASWAMY, GANESH N.;REEL/FRAME:017470/0504

Effective date: 20060328

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION