US20070239441A1

US20070239441A1 - System and method for addressing channel mismatch through class specific transforms

Info

Publication number: US20070239441A1
Application number: US11/391,891
Authority: US
Inventors: Jiri Navratil; Jason Pelecanos; Ganesh Ramaswamy
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-03-29
Filing date: 2006-03-29
Publication date: 2007-10-11
Also published as: US8024183B2; US20080235007A1

Abstract

A method and system for speaker recognition and identification includes transforming features of a speaker utterance in a first condition state to match a second condition state and provide a transformed utterance. A discriminative criterion is used to generate a transform that maps an utterance to obtain a computed result. The discriminative criterion is maximized over a plurality of speakers to obtain a best transform for recognizing speech and/or identifying a speaker under the second condition state. Speech recognition and speaker identity may be determined by employing the best transform for decoding speech to reduce channel mismatch.

Description

GOVERNMENT RIGHTS

This invention was made with Government support under Contract No.: NBCH050097 awarded by the U.S. Department of Interior. The Government has certain rights in this invention.

BACKGROUND

1. Technical Field
The present invention relates to audio classification and more particularly to systems and methods for addressing mismatch in utterances due to equipment or transmission media differences.
2. Description of the Related Art
Speaker recognition and verification is an important part of many current systems for security or other applications. However, under mismatched channel conditions, for example, when a person enrolls for a service or attempts to access their account using an electret handset but wishes to be verified when using a cell phone, there is significant mismatch between these audio environments. This results in severe performance degradation.
Some of the solutions to date include Speaker Model Synthesis (SMS), Feature Mapping (FM) and Intersession Variation Modeling (ISV) and channel specific score normalization. A drawback of these methods includes that SMS and FM perform a model/feature transformation based on a criterion that is unrelated to the core likelihood ratio criterion that is being used to score the result. ISV does not assume discrete channel classes, and score normalization does not directly account for channel mismatch.
Previous work in addressing the channel mismatch problem is similar in that either the features or model parameters are transformed according to some criterion. For example, the SMS technique was a model transformation technique. The SMS technique performed speaker model transformations according to the parameter differences between MAP adapted speaker background models of different handset types.
Some work in the area of speech recognition, although not directly addressing the channel mismatch problem, is also worthy of mention. It examined constrained discriminative model training and transformations to robustly estimate model parameters. Using such constraints, speaker models could be adapted to new environments. Another approach, termed factor analysis, models the speaker and channel variability in a model parameter subspace. Follow up work showed that modeling intersession variation alone provided significant gains in speaker verification performance.
There are several schemes that address channel mismatch from the perspective of feature transformation schemes. One study utilized a neural network to perform feature mapping on an incoming acoustic feature stream to minimize the effect of channel influences. There were no explicit channel specific mappings applied on this occasion. Another technique involved performed feature mapping based on detecting the channel type and mapping the features to a neutral channel domain. This technique mapped features in a similar manner that SMS transforms model parameters. For speech recognition, a piecewise Feature space Maximum Likelihood Linear Regression (fMLLR) transformation is applied to adapt to channel conditions. No explicit channel information is exploited.

SUMMARY

Embodiments of the present systems and methods address the problem of speaker verification under mismatched channel conditions, and further address the shortfalls of the prior art by directly optimizing a target function for the various discrete handsets and channels.
A method and system for speaker recognition and identification includes transforming features of a speaker utterance in a first condition state to match a second condition state and provide a transformed utterance. A discriminative criterion is used to determine the transformation that is applied to the utterance to obtain a computed result. The discriminative criterion is maximized over a plurality of speakers to obtain a best transform function for one of recognizing speech and identifying a speaker under the second condition state. Speech recognition and speaker identity may be determined by employing the best transform for decoding speech to reduce channel mismatch.
A system/method for audio classification includes transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance. A discriminative criterion is maximized over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
Another system/method for audio classification includes providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types and applying one of the transforms to a speaker based upon the input type. The transforms are precomputed by transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance, and maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.
In other systems and methods, the audio class modeling may include speaker recognition and/or speaker identification. A condition state may include a neutralized channel condition which counters effects of a first condition state. The system may undergo many input conditions and apply a best transform for each input condition. Maximizing a discriminative criterion may include determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker. Speech decoding may be based on a selected transform.
These and other objects, features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
FIG. 1 is a block/flow diagram showing a system/method for adjusting models and determining transforms to reduce channel mismatch in accordance with one illustrative embodiment;
FIG. 2 is a block/flow diagram showing a system/method for identifying a speaker or recognizing speech in accordance with another illustrative embodiment;
FIG. 3 is a block diagram showing a device which implements features in accordance with the present embodiments.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Embodiments of the present disclosure provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device and transmission media mismatches. The criterion is naturally optimized and is preferably suited to a Log-Likelihood-Ratio (LLR) scoring approach commonly used for speaker recognition. The LLR algorithm combined with the transformation approach attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using a transform. The transform attempts to directly maximize posterior probabilities and is targeted to reduce mismatch between handsets, microphones, input equipment and/or transmission media accordingly.
Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
Preferred embodiments provide a discriminative criterion applied to Gaussian Mixture Models (GMMs) to reduce input device mismatch. The criterion is naturally optimized and is suited to the Log-Likelihood-Ratio (LLR) scoring approach commonly used in GMMs for speaker recognition. The algorithm attempts to perform a direct mapping of features from one channel type to an assumed undistorted target channel but with the goal of maximizing speaker discrimination using the transform (preferably class specific transforms). The transform attempts to maximize the posterior probability of the speech observations across a set of speaker models.
One of the largest challenges in telephony based speaker recognition is effectively mitigating the degradation attributed to handset and channel mismatch. There are a number of techniques described above which address this issue. These approaches reduce mismatch through the modification of the features or adjustment of the models themselves to suit the new condition.
The present approach addresses the channel mismatch issue through the direct transformation of features using a discriminative criterion. The present disclosure performs a transformation dependent upon the channel type of a test recording and a desired target channel type that the features are to be mapped to. It may also be optimized in a manner that does not require explicit knowledge of the channel itself.
In contrast to previous work, a mapping optimization function is trained by maximizing a simplified version of a joint likelihood ratio scoring metric using held out data from many speakers. The goal of the mapping function is to obtain a transformation that maximizes the joint log-likelihood-ratio of observing the utterances from many speakers against their corresponding target speaker and background speaker models.
A discriminative design framework is formulated by optimizing joint model probabilities. The discriminative design framework includes a system useful in non-speech differences and distortion that may be present on a microphone or other input device between training and/or different use sessions. An illustrative example will be employed to demonstrate principles of the present embodiments.
A speaker recognition system is given a transformed utterance {right arrow over (Y)} for a speaker, and performs the following evaluation to determine if the test utterance belongs to the target speaker model, λ^s. If the speaker score Λ_sis above a specified threshold, the speaker claim is accepted, otherwise the claim is rejected. It is also the same criterion used for optimizing for the mismatch between audio sessions, giving a natural optimization result. $\begin{matrix} Λ^{s} = \Pr (λ^{s} | \overline{Y}) (1) \\ = \frac{P (λ^{s}) p (\overline{Y} | λ^{s})}{\sum_{\forall h} P (λ^{h}) p (\overline{Y} | λ^{h})} (2) \end{matrix}$
Here, P(λ^s) and P(λ^h) are the prior probabilities of an utterance being from speaker s and h correspondingly, where s and h are speaker indexes. The posterior probability of speaker model λ^s, given the speaker's utterance {right arrow over (Y)}, is indicated by Pr(λ^s|{right arrow over (Y)}). The likelihood of the observations, {right arrow over (Y)} given the model λ^his given by p({right arrow over (Y)}|λ^h).
Given that the model was trained using audio data from one channel (say, e.g., an electret type landline handset) while the test utterance was recorded under different channel conditions (say, e.g., a carbon button type landline handset), it may prove useful to transform the features of the test utterance to match the channel conditions of the model training component.
In one embodiment, a feature transformation function is employed to maximize equation (1) but across many speakers (S). Hence, a joint probability of the speakers (s=1 to S) given their corresponding observations is maximized. The calculation of a Jacobian matrix is not required as the optimization function is a ratio of densities. $\begin{matrix} Q_{1} = \prod_{s = 1}^{S} \Pr (λ^{s} | {\overline{Y}}^{s}) (3) \\ = \prod_{s = 1}^{S} \frac{P (λ^{s}) p (\overline{Y} | λ^{s})}{\sum_{\forall h} P (λ^{h}) p (\overline{Y} | λ^{h})} (4) \end{matrix}$
Here the denominator of equation (4) may be represented by a single Universal Background Model (UBM) or a model representative of all speaker classes. Note that in a similar manner the most competitive impostor model for each speaker utterance could be substituted in place for the UBM in the denominator of (4). One important point to consider is that depending on the functional form of the numerator and denominator pair of equation (4), the final optimization function may become too complex or may not deliver an optimization problem with a stationary point.
If it is assumed that the denominator of (4) will be represented as a collection of speaker models (e.g., class specific models) then the optimization function will become more complex. An alternative to using many speaker models in the denominator of (4) is to consider that these speaker models have parameters that follow a particular distribution, p(λ). In this case, a Bayesian predictive estimate (known in the art) may be given for the denominator of (4). With speaker class prior probabilities being equal, this gives the following result. $\begin{matrix} Q_{2} = \prod_{s = 1}^{S} \frac{p ({\overline{Y}}^{s} | λ^{s})}{\int p (λ) p ({\overline{Y}}^{s} | λ) ⅆ λ} & (5) \end{matrix}$
Let p({right arrow over (Y)}^s|λ^s) be represented by a Gaussian Mixture Model (GMM), comprised of N Gaussian components, with the set of weights, means and diagonal covariances given as {ω_i ^s,μ_i ^s,Σ_i ^s}_∀ _i. If {right arrow over (Y)} includes T^sindependent and identically distributed observations represented by {y₁ ^s,y₂ ^s, . . . ,y_T ^s} then the joint likelihood of the D-dimensional observations may be calculated. $\begin{matrix} p ({\overline{Y}}^{s} | λ^{s}) = \prod_{t = 1}^{T^{s}} \sum_{i = 1}^{N} ω_{i}^{s} g (y_{t}^{s} | μ_{i}^{s}, \sum_{i}^{s}) & (6) \\ where g (y_{t}^{s} μ_{i}^{s}, \sum_{i}^{s}) = \frac{1}{\sqrt{{(2 π)}^{d} \langle \sum_{I}^{s} \rangle}} \times \exp {{(y_{t}^{s} - μ_{i}^{s})}^{'} {(\sum_{i}^{s})}^{- 1} (y_{t}^{s} - μ_{i}^{s})} & (7) \end{matrix}$
The notation (′) represents the transpose operator. Now the problem of specifying the distribution of the speaker model parameters is addressed. Let all speaker models be the MAP adaptation representation (known in the art) of a Universal Background Model which is trained on a large quantity of speech. For one embodiment only the mixture component means (μ) are adapted (indicating a minimal degradation attributed to such constraints).
The speaker model component mean parameters are assumed to be independent and are governed by a Gaussian distribution with {m_i,C_i}. Thus, in a similar vain, the representation for p(λ^s) is established. For example: $\begin{matrix} p (λ^{s}) = \prod_{i = 1}^{N} g (μ_{i}^{s} | m_{i}, C_{i}) & (8) \end{matrix}$
The denominator may now be evaluated. Let the joint likelihood of the observations be approximated by considering only the most significant Gaussian component contribution for each frame. This approximation is most appropriate for sparse mixture components. $\begin{matrix} p (\overline{Y} | λ^{s}) \approx \prod_{i = 1}^{T^{s}} \max_{i = 1}^{N} {ω_{i}^{2} g (y_{t}^{s} | μ_{i}^{s}, \sum_{i}^{s})} & (9) \end{matrix}$
Given this assumption, the predictive likelihood may be calculated. The result is given by equation 10. A Viterbi approach for estimating the Bayesian predictive density may be referenced. $\begin{matrix} \int_{λ} p (λ) p (\overline{Y} | λ) ⅆ λ \approx \prod_{t = 1}^{N} {(\frac{ω_{i}}{{(2 π)}^{D / 2}})}^{n_{i}^{s}} \prod_{d = 1}^{D} \sqrt{\frac{Φ_{i_{dd}}^{s}}{{(Σ_{i_{dd}})}^{n_{i}^{s}}}} \times \exp {- \frac{n_{i}^{s}}{2 Σ_{i_{dd}}} (Φ_{i_{dd}}^{s} \overline{{(y_{i_{d}}^{s} - m_{i_{d}})}^{2}} + (1 - Φ_{i_{dd}}^{s}) (\overline{y_{i_{d}}^{s^{2}}} - {\overline{y_{i_{d}}^{s}}}^{2}))} & (10) \\ where n_{i}^{s} = \sum_{\forall t 3 y_{i}^{'} ε i} l & (11) \\ \overline{y_{i_{d}}^{s}} = \frac{1}{n_{i}^{s}} \sum_{\forall t 3 y_{i}^{s} ε i} y_{i_{d}}^{s} & (12) \\ \overline{y_{i_{d}}^{s^{2}}} = \frac{1}{n_{i}^{s}} \sum_{\forall t 3 y_{i}^{s} ε i} y_{i_{d}}^{s^{2}} & (13) \\ Φ_{i_{dd}}^{s} = \frac{1}{n_{i}^{s} C_{i_{dd}} \sum_{i_{ii}}^{- 1} + 1} & (14) \\ and \overline{{(y_{i_{d}}^{s} - m_{i_{d}})}^{2}} = \overline{y_{i_{d}}^{s^{2}}} - 2 m_{i_{d}} \overline{y_{i_{d}}^{s}} + m_{i_{d}} m_{i_{d}} & (15) \end{matrix}$
Here, {y_t _d ^s,m_i _d} represents the d^thelement within their respective vectors {y_i ^s,m_i}. Correspondingly, {Σ_i _dd,C_i _dd,Φ_i _dd ^s} represent the element in the d^throw and d^thcolumn of the appropriate diagonal covariance matrices {Σ_i,C_i,Φ_i ^s}.
Now in the case where there is a single speaker being scored, as in the numerator condition, the model distribution becomes a point observation. This is achieved by setting C_i _ddto 0 and gives the following result which is equivalent (depending on the optimal mixture component selection criterion) to the standard GMM likelihood scoring when only the top Gaussian is scored. $\begin{matrix} p (\overline{Y} | λ^{s}) = \prod_{t = 1}^{N} {(\frac{ω_{i}}{{(2 π)}^{D / 2}})}^{n_{i}^{'}} \prod_{d = 1}^{D} \frac{1}{\sqrt{{(Σ_{i_{dd}})}^{n_{i}^{'}}}} \times \exp {- \frac{n_{i}^{s}}{2 Σ_{i_{dd}}} (\overline{{(y_{i_{dd}}^{s} - μ_{i_{d}})}^{2}})} & (16) \end{matrix}$
Given this derivation, let us calculate the log of the ratio of the target speaker joint likelihood and the likelihood of all other speakers for a set of S utterances and corresponding models. For example: $\begin{matrix} \log Q_{2} = \sum_{s = 1}^{S} (\log p ({\overline{Y}}^{s} | λ^{s}) - \log \int_{λ} p (λ) p ({\overline{Y}}^{s} | λ) ⅆ λ) & (17) \end{matrix}$
The maximization problem may be simplified further if it is considered that the derivative of this function with respect to transformation variables is calculated. It is assumed that the Gaussian mixture models are calculated through Bayesian adaptation of the mixture component means from a Universal Background GMM. All model parameters are coupled to the Universal Background Model; which includes the S target speaker models and the denominator model representation. The most significant mixture components are determined by using Equation (9) and extracting the Gaussian indexes by scoring on the Universal Background GMM. These indexes are used to score the corresponding Gaussian components in all other models. With these constraints, the function to maximize is the following: $\begin{matrix} Q = \sum_{s = 1}^{S} \sum_{i = 1}^{N} \sum_{d = 1}^{D} {\frac{n_{i}^{s}}{2 Σ_{i_{dd}}} \times (2 (μ_{i_{dd}}^{s} {–Φ}_{i_{dd}}^{s} m_{i_{d}}) \overline{y_{i_{d}}^{s}} - (1 - Φ_{i_{dd}}^{s}) {\overline{y_{i_{d}}^{s}}}^{2})} & (18) \end{matrix}$
For simplicity, the algorithm was represented such that for a single speaker model created from a single enrollment utterance, there was a single test utterance to score the utterance. Further richness can be achieved in the optimization process if multiple models and/or test utterances are trained for each speaker. Depending on the viewpoint, one benefit of the optimization function is that the unique one-to-one mapping needed when a Jacobian matrix is factored in is not required here. This also permits for the situation where two modes present under one channel condition may manifest themselves as a single mode under another channel. Given this flexibility, an appropriate transform for {right arrow over (Y)}^sis selected.
Transform Selection
A final transform may be represented as a combination of affine transforms according to posterior probability. For example: $\begin{matrix} y = Ψ (x) = \sum_{j = 1}^{J} \Pr (j | x) Ψ_{j} (x) & (19) \\ with \Pr (j | x) = \frac{\overset{⋓}{ω_{j}} g (x | \overset{⋓}{u_{j}}, \overset{⋓}{\sum_{j}})}{\sum_{z = 1}^{J} \overset{⋓}{ω_{z}} g (x | \overset{⋓}{u_{z}}, \overset{⋓}{\sum_{z}})} & (20) \end{matrix}$
where {{hacek over (ω)}_j,{hacek over (μ)}_j,{hacek over (Σ)}_j} is the set of mixture component weights, means and covariances, respectively for a J component Gaussian Mixture Model. The purpose of this GMM is to provide a smooth weighting function of Gaussian kernels to weight the corresponding combination of affine transforms.
Note also that throughout the optimization problem the posterior probabilities need only to be calculated once. This GMM, used to determine the mixture component probabilities, could be the same as the Universal Background Speaker model for adapting speakers or a separate model altogether.
Here Ψ(•) is selected to be of a form with a controllable complexity similar to SPAM models, which are known in the art.
Ψ_j(x)=A _j x+b _j (21)
with the set of A_jand b_jbeing controllable in complexity as follows: $\begin{matrix} A_{j} = θ_{j}^{A} R_{j} + \sum_{k = 1}^{K} θ_{j}^{k} V_{k} & (22) \\ b_{j} = θ_{j}^{b} r_{j} + \sum_{k = 1}^{K} θ_{j}^{k} v_{k} & (23) \end{matrix}$
where R_jis a mixture component specific transform matrix, and θ_j ^kis the weighting factor applied to the k^thtransform matrix V_k. In summary, the resulting transform matrix, A_j, for mixture component j is a linear combination of a small set of transforms. The matrix R_jis typically a zero matrix, or a constrained matrix to enable some simplified transforms that would not typically be available using the remaining transformation matrices. It may also be a preset mixture-component-specific matrix that is known to be a reasonable solution to the problem.
Conversely, the offset vector for mixture component j may be determined in a similar manner with r_ibeing the mixture component specific offset, and v_kbeing the k^thoffset vector. The vector r_iis typically a zero vector or a pre-selected, mixture component specific, vector constant. In the case when the vector is a preset constant, the remainder of the equation is designed to maximize the target function by optimizing for the residual.
An alternative weighting function is also proposed as an alternative that considers only the top scoring mixture component. $\begin{matrix} Ψ_{j} (x) = Ψ_{j \max} (x) where j \max = {\arg \max}_{j = 1}^{J} \Pr (j | x) & (24) \end{matrix}$
Optimization
Equation (18) may be maximized with the transformation function used from Equation 19 using a number of techniques. Between iterations, if no transformed observations change which significant Gaussian class they belong to in the original acoustic UBM, the problem is a matrix-quadratic optimization problem. In one embodiment, due to transformed vectors changing their Gaussian class between iterations, a gradient ascent approach is taken. Consequently, the functional derivative needs to be determined.
For the derivative calculation, it is assumed that no or very few mapped feature vector observations lie on or near the decision boundary between two Gaussians (in which case the slope estimate is only an approximation). Depending on the configuration of the system this assumption may be significant and would then need an additional derivative approximation for the mixture component counts. If A and b are to be optimized directly, the partial derivative approximation with respect to one of the optimization variables, Ω, is presented. $\begin{matrix} \frac{\partial Q}{\partial Ω} = \sum_{s = 1}^{S} \sum_{i = 1}^{N} \sum_{d = 1}^{D} {\begin{matrix} \frac{n_{i}^{s}}{\sum_{i_{dd}}} (\frac{\partial \overline{y_{i_{d}}^{s}}}{\partial Ω}) \times \\ (\begin{matrix} (μ_{i_{d}}^{s} - Φ_{i_{dd}}^{s} m_{i_{d}}) - \\ (1 - Φ_{i_{dd}}^{s}) \overline{y_{i_{d}}^{s}} \end{matrix}) \end{matrix}} & (25) \end{matrix}$
The variable $\frac{\partial \overline{y_{i_{d}}^{s}}}{\partial Ω}$
may be substituted by any one of the following partial derivative results in the equations following. $\begin{matrix} γ_{j_{t}}^{s} = \Pr (j | x_{t}^{s}) & (26) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial b_{j_{d}}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} γ_{j_{t}}^{s} & (27) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial A_{j_{d e}}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} γ_{j_{t}}^{s} x_{t_{d}}^{s} & (28) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial A_{j_{f e}}} = 0 if d \neq f & (29) \end{matrix}$
This results in a series of equations to solve. Note that the assumption here is that the transformation variations between iterations are small and that the number of observations changing from iteration to iteration is negligible.
Correspondingly, if a SPAM model equivalent is substituted to reduce the number of parameters to optimize, the slope functions become the following.
The mixture component specific transformation weightings are as follows. These weighting factors are established for the reason efficiently managing the search space for the transformation. $\begin{matrix} \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial θ_{j}^{A}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} γ_{j_{t}}^{s} \sum_{q = 1}^{D} R_{j_{dq}} x_{t_{q}}^{s} & (30) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial θ_{j}^{b}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} γ_{j_{t}}^{s} r_{j_{d}} & (31) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial θ_{j}^{k}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} γ_{j_{t}}^{s} (v_{k_{d}} + \sum_{q = 1}^{D} V_{k_{dq}} x_{t_{q}}^{s}) & (32) \end{matrix}$
The derivative may be calculated for the subspace matrices and vectors. $\begin{matrix} \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial v_{k_{d}}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} \sum_{j = 1}^{J} γ_{j_{t}}^{s} θ_{j}^{k} & (33) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial V_{k_{d e}}} = \frac{1}{n_{i}^{s}} \sum_{\forall t ∋ y_{t}^{s} \in i} (\sum_{j = 1}^{J} γ_{j_{t}}^{s} θ_{j}^{k}) x_{t_{d}}^{s} & (34) \\ \frac{\partial \overline{y_{i_{d}}^{s}}}{\partial V_{k_{f e}}} = 0 if d \neq f & (35) \end{matrix}$
Let Ω be the vector of variables that is to be optimized in terms of maximizing Q. The gradient ascent algorithm can now used accordingly, given the parametric estimates of the slopes. $\begin{matrix} Ω_{new} = Ω_{old} + {η (\frac{\partial \log Q}{\partial Ω} \rangle}_{Ω_{old}}) & (36) \end{matrix}$
where η is the learning rate.
Generalized Mapping: In the embodiments described above, the methods were employed to determine an optimal mapping between a first channel state and a second channel state. In another form, the mapping that is calculated may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required. In addition, the mapping system can then map arbitrarily from any channel state to any other channel state. In this sense, the optimization functions of equations (3), (4) or (5) may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform). The applied mapping is formed from several transforms that are selected (or weighted) according to their relevance. Once the mapping block has learned the set of transforms, the speaker recognition system may be evaluated. The multi-transform mapping block is then used to map all utterances. The benefit is that no explicit handset or channel labels are required.
Alternative Optimization Function
There are many different transformation optimization functions that can be determined and derived using a similar process. For the purpose of illustration, another example derivation follows. An estimate of equation (3) was presented in equation (5) and its optimization procedure was derived. In a similar manner, equation (4), although more computationally expensive, may also be optimized using a set of speaker or audio class models. In the log domain, equation (4) may be represented as an alternative (logQ_A2) to equation (17) as follows: $\begin{matrix} \log Q_{A 2} = \sum_{s = 1}^{S} (\log p ({\overset{⇀}{Y}}^{s} | λ^{s}) - \log \sum_{\forall h} p ({\overset{⇀}{Y}}^{s} | λ^{h})) & (37) \end{matrix}$
Under similar constraints to the previously suggested optimization function, the core function to optimize is given as: $\begin{matrix} Q_{A} = \sum_{s = 1}^{S} {\begin{matrix} (\sum_{i = 1}^{N} \sum_{d = 1}^{D} \frac{n_{i}^{s}}{2 \sum_{i_{dd}}} [2 \overline{y_{i_{d}}^{s}} μ_{i_{d}}^{s} - {(μ_{i_{d}}^{s})}^{2}]) - \\ \log (\sum_{\forall h} \exp {\sum_{i = 1}^{N} \sum_{d = 1}^{D} \frac{n_{i}^{s}}{2 \sum_{i_{dd}}} [2 \overline{y_{i_{d}}^{s}} μ_{i_{d}}^{h} - {(μ_{i_{d}}^{h})}^{2}]}) \end{matrix}} & (38) \end{matrix}$
To optimize this function, the same steepest ascent procedure is preferably selected. To perform the optimization an approximation to the slope is needed. In this slope approximation, it is assumed that n_i ^sremains relatively constant. In many instances this assumption may not be appropriate. Accordingly, an approximation to the derivative is presented. $\begin{matrix} \frac{\partial Q}{\partial Ω} \approx \sum_{s = 1}^{S} \sum_{i = 1}^{N} \sum_{d = 1}^{D} {\frac{n_{i}^{s}}{\sum_{i_{dd}}} (\frac{\partial \overline{y_{i_{d}}^{s}}}{\partial Ω}) \times (μ_{i_{d}}^{s} - \sum_{\forall h} \Pr (λ^{h} | {\vec{Y}}^{s}) μ_{i_{d}}^{h})} & (39) \end{matrix}$
Here Pr(λ^h|{hacek over (Y)}^s) is the probability of the speaker model λ^hgiven the transformed utterance, {right arrow over (Y)}^s. The term, $\frac{\partial \overline{y_{i_{d}}^{s}}}{\partial Ω},$
may be substituted by any of the corresponding equations from equation (27) to equation (35). The steepest ascent algorithm may be performed, as before, to determine the transform or transforms to apply.
Efficient Algorithmic Implementation
Due to the nature of the optimization process, a number of techniques can be introduced to speed up the procedure. As already identified and derived above, the single top mixture component for each audio frame is scored. This may also be extended to the feature transformation mapping algorithm such that only the mapping corresponding to the top scoring feature partitioning GMM is applied rather than summing over the contributions of the mappings corresponding to all mixture components. In addition to using only the top mixture component throughout the system, the target and background model representations can be forced to be a function of the Universal Background Model (UBM); a coupled model system. Consequently, the UBM can be used to determine which mixture components are the largest contributors to the frame based likelihood. Thus, once the Gaussian indexes are obtained from the UBM, they may be applied to the coupled target adapted model and the background model representations. It is noted that the top 5 mixture components were considered for the transformation function GMM. This approximation introduced significant speedups. Each of these previously mentioned items were included in the current system but it should be noted that additional speed optimizations are available.
One technique is to test if a particular Gaussian is the most significant Gaussian for the current feature vector. Given that this algorithm is iterative and applies small adjustments to the mapping parameters, a Gaussian component that was dominant on a previous iteration may also be relevant for the current iteration. If the probability density of the vector for the Gaussian is larger than the predetermined threshold for the appropriate Gaussian component, then it is the most significant Gaussian for the GMM. This technique operates more effectively for sparse Gaussian mixture components.
Another method is to construct, for each Gaussian component, a table of close Gaussians or Gaussian competitors. Given that the mapping parameters are adjusted in an incremental manner, the Gaussian lookup table for the most significant Gaussian of the previous iteration may be evaluated to rapidly locate the most significant Gaussian for the next iteration. The table length may be configured to trade off the search speed against the accuracy of locating the most likely Gaussian component.
Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, a system/method 100 is illustratively shown which provides a speaker recognition and identification system in accordance with one embodiment. In block 102, speaker models trained using a first condition state or input type are provided. For example, speaker models trained using a landline telephone or a microphone to collect speaker utterances are stored in a database or provided as a model. These models may be created from audio from a single channel type. In block 104, feature sets from a set of speaker utterances in a second condition state are generated as input for a discriminative training criterion.
In block 110, the discriminative criterion (from blocks 102 and 104) is maximized over a plurality of speakers by applying, e.g., a steepest ascent algorithm (or similar optimization) to determine a best transform function or set of transform functions. This includes maximizing Q₁in equation (3) (or Q_Ain equation (39).
In block 110, a discriminative criterion objective function is specified using the existing speaker models and the non-transformed utterances. This discriminative criterion is applied to generate the transformed utterance to obtain a computed result, which may be determined either arbitrarily or empirically. An optimization metric of a speaker based on discrimination between speaker classes is preferably performed. The discriminative criterion may include equation (3) (or equation (39).
The result giving the objective function maximum gives the transform or transforms. This transform may then be used to convert/map or neutralize the inputs received over a different input type. The transforms may be adjusted to provide accommodation for the currently used input type.
The best transform may be used for recognizing speech and/or identifying a speaker under the condition state of the received utterance to reduce channel mismatch. The system may undergo many input conditions and a best transform may be applied for each input condition. In one embodiment posterior probabilities are maximized in the maximizing step.
The present embodiments use speaker classes to determine the transform. The result is that the most likely speaker is determined instead of the most likely acoustic match. In addition, the transform is calculated once to maximize Q such that the maximum Q gives the transform. The speaker space is broken down based on subsets or classes of speakers. The maximum likelihood (or a related metric) of seeing a particular speaker is used to determine the transform (as opposed to simply matching the acoustic input).
In block 112, at least one speaker model may be transformed using the best transform to create a new model for decoding speech or identifying a speaker. The speaker model may be transformed from a first input type to a second input type by directly mapping features from the first input type to a second input type using the transform.
The mapping done by the transform may include learning, and optimizing for, a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not required. In addition, the mapping system can then map arbitrarily from any channel state to any other channel state. Once the mapping block has learned the set of transforms, the speaker recognition system may be evaluated. The multi-transform mapping block is then used to map all utterances. The benefit is that no explicit handset or channel labels are required.
Referring to FIG. 2, a system/method 200 for speaker recognition and identification in accordance with an illustrative embodiment is shown. A similar method may be employed for other audio analysis as well. A plurality of transforms is provided for decoding utterances, wherein the transforms correspond to a plurality of input types or conditions. This may include a single transform or a plurality of transforms to handle multiple conditions. In block 210, a transform(s) is applied to features from a speaker based upon the input type or all input types. Block 206 indicates that the transforms are precomputed by the method shown in FIG. 1. The precomputation of the transform or transforms may be performed at the time of manufacture of the system or may be recomputed intermittently to account for new input types or other system changes.
In block 208, the best transform or series of transforms are determined for each or all input types and applied by determining conditions under which a speaker is providing input. The input types may include, e.g., telephone handset types, channel types and/or microphone types. The best transform may include transforming the input to a neutralized channel condition which counters effects of the input state or any other transform that reduces mismatch between input types.
In block 212, the speaker is identified or the utterance is decoded in accordance with the input type correction provided herein. Advantageously, channel mismatch is reduced or eliminated.
In the same way that the discriminative technique was designed to perform a channel mapping that was optimized to differentiate between speakers, the same procedure as described above may be applied to other audio applications related to audio scene analysis, music and song detection and audio enhancement of corrupted audio channels. One such example would be to recognize artists or songs being played over the radio or media. This greatly reduces the effect of the channel differences on the system attempting to detect the song type or artist. The technique effectively removes the differences between different audio channels and simplifies the matching process for the audio classifier. Other applications are also contemplated.
Experiments:
Evaluation and development data: To demonstrate the present invention, a speaker recognition system was evaluated on the NIST 2000 dataset. This particular dataset is comprised mostly of landline telephone calls from carbon-button or electret based telephone handsets. Given that there are two classes of audio data, the feature transformation mechanism was designed to map from carbon-button features to electret based features. In this database, for the primary condition, there are 4991 audio segments (including 2470 male test and 2521 female test audio segments) tested against over 1003 speakers, of which 457 speakers are male and 546 speakers are female.
The audio training utterance durations were approximately two minutes with test utterance durations of 15-45 seconds. The NIST 1999 speaker recognition database was included as development data to train the UBM and the corresponding carbon handset to electret handset transformation function. This database was selected because of the significant quantity of carbon and electret handset data available. The same principle may be applied to the more recent speaker recognition evaluations by providing several channel mapping functions dependent upon the channel type.
System Description Used in Experiments
A speaker recognition system in accordance with embodiments of the present invention includes two main components, a feature extraction module and a speaker modeling module, as is known in the art.
For the feature extraction module in accordance with one embodiment, Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from filter banks. In an illustrative embodiment, 19 Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from 24 filter banks. The cepstral features may be extracted, e.g., by using 32 ms frames at a 10 ms frame shift. The corresponding delta features were calculated.
Feature Warping may be applied to all features to mitigate the effects of linear channels and slowly varying additive noise. The speaker modeling module generated speaker modeling through MAP adaptation of a Universal Background Model. This implementation of the MAP adaptation approach adjusted the Gaussian components toward the target speaker speech features. The mixture component mean parameters were also adapted. In this work, a single iteration of the EM-MAP algorithm was performed. In testing, only the top mixture component from the UBM was scored and used to reference the corresponding components in other models. Other mixtures and numbers of mixture components may also be employed.
Results
A version of the system was evaluation for a challenging subset of the NIST 2000 Evaluation. The subset of trials was selected purposely to identify the effect of the channel mapping from the carbon test utterance type to the electret model type. Thus, only carbon tests against electret models were evaluated in this subset. The results indicated a reduction in minimum detection cost function (DCF) from 0.057 to 0.054 and a decrease in equal error rate (EER) from 14.9% to 13.0%. The improvements were realized using a single transformation determined from 100 unique speakers. Additional error reductions are expected by using more speaker data to calculate the transform. The results described herein are for illustrative purposes only.
Referring to FIG. 3, a system 300 for providing class specific transformations based on input type or conditions is illustratively depicted. An audio classification system or device 300 may include a personal computer, a telephone system, an answering system, a security system or any other device or system where multiple users or multiple users and/or multiple input types or devices may be present. Device 300 is capable of supporting a software application or module 302 which provides audio classification, which can map the input types as described above to enable identification of a speaker, the decoding of utterances or other audio classification processes. Application 302 may include a speech recognition system, speech to speech system, text to speech system or other audio classification processing module 304 capable of audio processing (e.g., for audio scene analysis, etc.).
In one embodiment, input utterances may be received from a plurality of different input types and/or channels (telephones, microphones, etc.). Inputs 301 may include microphones of different types, telephones of different types, prerecorded audio sent via different channels or methods or any other input device. A module 306 may include a speech synthesizer, a printer, recording media, a computer or other data port or any other suitable device that uses the output of application 302.
Application 302 stores precomputed transforms 310 which are best adapted to account for channel mismatch. In one embodiment, the transforms 310 include a series of possible transformations (rather than just one) such that explicit knowledge of the channel is not needed. The system can then map arbitrarily from any channel state to any other channel state. The optimization functions of may include a feature mapping function that is internally comprised of multiple transforms (as opposed to a single transform) to provide this functionality at training. The applied mapping may be formed from several transforms that are selected (or weighted) according to their relevance. Once the set of transforms are learned, the speaker recognition system may be evaluated to ensure proper operation on any available channel types for that application. The multi-transform mapping block is then used to map all utterances.
Module 306 may include a security device that permits a user access to information, such as a database, an account, applications, etc. based on an authorization or confirmed identity as determined by application 302. Speech recognition system 304 may also recognize or decode the speech of a user despite the input type 301 that the user employs to communicate with device 300.
Having described preferred embodiments of a system and method for addressing channel mismatch through class specific (e.g., speaker discrimination) transforms (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Claims

1. A method for audio classification, comprising:

transforming features of a speaker utterance in a first condition state to match a second condition state and as a result provide a channel matched transformed utterance; and

maximizing a discriminative criterion over a plurality of speakers to obtain a best transform for audio class modeling under the second condition state.

2. The method as recited in claim 1, further comprising employing a speaker model trained using a first channel condition provided by a first hardware type.

3. The method as recited in claim 2, wherein the second condition state includes a second channel condition provided by a second hardware type.

4. The method as recited in claim 2, wherein the second condition state includes a neutralized channel condition which counters effects of the first condition state.

5. The method as recited in claim 1, wherein the system undergoes many input conditions and further comprises applying a best transform for each input condition.

6. The method as recited in claim 1, wherein maximizing a discriminative criterion includes determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker.

7. The method as recited in claim 1, wherein the discriminative criterion includes:

\begin{matrix} Q_{1} = \prod_{s = 1}^{S} \Pr (λ^{s} | {\vec{Y}}^{s}) & (3) \end{matrix}

where Q₁is a function to be optimized, and Pr(λ^s|{hacek over (Y)}^s) is a posterior probability of speaker model λ^s, given the speaker's channel matched transformed utterance, {right arrow over (Y)}.

8. The method as recited in claim 1, further comprising decoding speech based on a selected transform.

9. The method as recited in claim 1, further comprising transforming at least one speaker model from a first input type corresponding to the first condition state to a second input type corresponding to the second condition state by directly mapping features from the first input type to a second input type using a transform.

10. The method as recited in claim 1, wherein maximizing includes maximizing posterior probabilities.

11. A computer program product for audio classification comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:

12. A method for audio classification, comprising:

providing a plurality of transforms for decoding utterances, wherein the transforms correspond to a plurality of input types; and

applying one of the transforms to a speaker based upon the input type;

wherein the transforms are precomputed by:

13. The method as recited in claim 12, wherein the best transform is determined for each input type and applied by determining conditions under which a speaker is providing input.

14. The method as recited in claim 12, wherein the input types include one or more of telephone handsets, channel types and microphones.

15. The method as recited in claim 12, wherein the different condition state includes a neutralized channel condition which counters effects of the first condition state.

16. The method as recited in claim 12, wherein maximizing a discriminative criterion includes determining a likelihood of a speaker based on discrimination between speaker classes to identify the speaker.

17. The method as recited in claim 12, wherein the discriminative criterion includes:

\begin{matrix} Q_{1} = \prod_{s = 1}^{S} \Pr (λ^{s} | {\vec{Y}}^{s}) & (3) \end{matrix}

where Q₁is a function to be optimized, and Pr(λ^s|{right arrow over (Y)}^s) is the posterior probability of speaker model λ^s, given the speaker's channel matched transformed utterance, {right arrow over (Y)}.

18. The method as recited in claim 17, further comprising decoding speech based on a selected transform.

19. The method as recited in claim 12, wherein the transform reduces mismatch between input types.

20. A computer program product for audio classification comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:

applying one of the transforms to a speaker based upon the input type;

wherein the transforms are precomputed by: