WO1999014739A1

WO1999014739A1 - Method for suppressing noise in a digital speech signal

Info

Publication number: WO1999014739A1
Application number: PCT/FR1998/001981
Authority: WO
Inventors: Philip Lockwood; Stéphane LUBIARZ
Original assignee: Matra Nortel Communications
Priority date: 1997-09-18
Filing date: 1998-09-16
Publication date: 1999-03-25
Also published as: EP1016073B1; DE69804329D1; DE69804329T2; CA2304015A1; FR2768546A1; JP2001516902A; AU9169098A; CN1276896A; BR9812655A; EP1016073A1; ES2174484T3; FR2768546B1

Abstract

The invention concerns a method for suppressing noise in a digital speech signal processed by successive frames which consists in: computing the signal spectral components (Sn,f, Sn,i) on each frame; computing the maximised estimations (B'n,i) of spectral components of the noise included in the speech signal; carrying out a harmonic analysis of the signal to estimate a pitch; carrying out a spectral subtraction comprising at least a step consisting in subtracting respectively, from each spectral component of the speech signal on the frame (Sn,f), a quantity depending on parameters including the maximised estimation of the noise corresponding spectral component and the estimated pitch; and applying to the subtraction result a transform towards the time domain to construct an enhanced speech signal (s3).

Description

METHOD FOR NOISE REDUCTION OF A DIGITAL SPOKEN SIGNAL

The present invention relates to digital techniques for denoising speech signals. It relates more particularly to denoising by nonlinear spectral subtraction.

With the spread of new forms of communication, in particular mobile telephones, communications are increasingly carried out in highly noisy environments. Noise added to speech then tends to disrupt communications by preventing optimal compression of the speech signal and by creating unnatural background noise. Moreover, noise makes the spoken message difficult and tiring to understand. Many algorithms have been studied to try to reduce the effects of noise in a communication. S. F. Boll ("Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, April 1979) proposed an algorithm based on spectral subtraction. This technique consists in estimating the spectrum of the noise during the phases of silence and in subtracting it from the received signal. It allows a reduction in the noise level received. Its main fault is that it creates a musical noise which is particularly annoying because it is unnatural.

This work, taken up and improved by D. B. Paul ("The spectral envelope estimation vocoder", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-29, No. 4, August 1981) and by P. Lockwood and J. Boudy ("Experiments with a nonlinear spectral subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars", Speech Communication, Vol. 11, June 1992, pages 215-228, and EP-A-0 534 837), has made it possible to significantly reduce the noise level while retaining a natural character. In addition, this contribution had the merit of incorporating for the first time the masking principle in the calculation of the denoising filter. From this idea, a first attempt was made by S. Nandkumar and J. H. L. Hansen ("Speech enhancement on a new set of auditory constrained parameters", Proc. ICASSP 94, pages 1.1-1.4) to use explicitly calculated masking curves in the spectral subtraction. Despite the disappointing results of this technique, this contribution had the merit of emphasizing the importance of not distorting the speech signal during denoising.

Other methods based on the decomposition of the speech signal into singular values, and therefore on a projection of the speech signal into a smaller space, have been studied by Bart De Moor ("The singular value decomposition and long and short spaces of noisy matrices", IEEE Trans. on Signal Processing, Vol. 41, No. 9, September 1993, pages 2826-2838) and by S. H. Jensen et al. ("Reduction of broad-band noise in speech by truncated QSVD", IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 6, November 1995). The principle of this technique is to consider the speech signal and the noise signal as completely decorrelated, and to consider that the speech signal is sufficiently predictable to be predicted from a restricted set of parameters. This technique makes it possible to obtain acceptable denoising for strongly voiced signals, but totally distorts the speech signal. Faced with a relatively coherent noise, such as that caused by the contact of car tires or the rattling of an engine, the noise can prove to be more easily predictable than the unvoiced speech signal. There is then a tendency to project the speech signal into a part of the vector space of the noise. The method does not take the speech signal into account, especially the unvoiced speech areas where the predictability is reduced. In addition, predicting the speech signal from a reduced set of parameters does not capture all the intrinsic richness of speech. This shows the limits of techniques based solely on mathematical considerations that overlook the particular character of speech.

Finally, other techniques are based on coherence criteria. The coherence function is particularly well developed by J. A. Cadzow and O. M. Solomon ("Linear modeling and the coherence function", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-35, No. 1, January 1987, pages 19-28), and its application to denoising has been studied by R. Le Bouquin ("Enhancement of noisy speech signals: application to mobile radio communications", Speech Communication, Vol. 18, pages 3-19). This method is based on the fact that the speech signal has a significantly greater coherence than noise provided that several independent channels are used. The results seem quite encouraging. Unfortunately, this technique requires having several sound pickups available, which is not always possible.

US Patent 5,228,088 describes a denoising system operating in the frequency domain, provided with a tone frequency detector. The result of this detection is used on the one hand to adjust the noise suppression coefficients, and on the other hand to locate a "voice band". The noise suppression coefficients are used by a spectral subtraction module to weight the noise estimate before subtracting it from the signal. The module which adjusts the suppression coefficients only uses the information of whether or not a tone frequency has been detected; the value taken by the tone frequency has no influence on the suppression coefficients used. The "voice band" determined using the detected tone frequency is subject to an overall signal enhancement. As a variant, it can be used to determine, conversely, a "noise band" in which an overall attenuation is then applied. Such an enhancement or attenuation of a portion of the signal spectrum is a very different denoising method from spectral subtraction.

A main aim of the present invention is to propose a new denoising technique which takes into account the characteristics of speech production, thus allowing effective denoising without degrading the perception of speech.

The invention thus proposes a method for denoising a digital speech signal processed by successive frames, in which:

- spectral components of the speech signal are calculated on each frame;

- increased estimates (overestimates) of spectral components of the noise included in the speech signal are calculated for each frame;

- a spectral subtraction is carried out comprising at least one step consisting in subtracting respectively, from each spectral component of the speech signal on the frame, a quantity depending on parameters including the increased estimate of the corresponding spectral component of the noise for said frame; and

- a transformation to the time domain is applied to the result of the spectral subtraction to construct a denoised speech signal.

A harmonic analysis of the speech signal is performed to estimate a tonal frequency of the speech signal on each frame where it exhibits speech activity. The parameters on which the quantities subtracted depend include the tone frequency thus estimated.

Overestimating the spectral envelope of the noise is generally desirable so that the increased estimate thus obtained is robust to sudden variations in the noise. However, this overestimation usually has the disadvantage of distorting the speech signal when it becomes too large. It affects the voiced character of the speech signal by suppressing part of its predictability. This drawback is very troublesome in telephony conditions, because it is in the voiced areas that the speech signal is most energetic. Taking the tonal frequency of the speech signal into account in the denoising makes it possible to protect the harmonicity of this signal in these voiced areas.

In general, the quantity subtracted from a given spectral component of the speech signal is smaller if said spectral component corresponds to a protected frequency, that is to say the frequency closest to an integer multiple of the estimated tonal frequency, than if it does not correspond to such a protected frequency. This smaller quantity can in particular be zero. In the latter case, spectral subtraction does not affect the signal at the estimated tonal frequency and/or its harmonics. This removes part of the non-linearities introduced by the noise overestimation, to which the voiced areas are particularly sensitive. Unvoiced areas, owing to the more random nature of their excitation signal, are less sensitive to them.

In an advantageous embodiment, after having estimated the tonal frequency of the speech signal over a frame, the speech signal of the frame is conditioned by oversampling it at an oversampling frequency that is a multiple of the estimated tonal frequency, and the spectral components of the speech signal on the frame are calculated on the basis of the conditioned signal in order to subtract said quantities from them. This arrangement makes it possible to favor the frequencies closest to the estimated tonal frequency over the other frequencies. This avoids protecting harmonics relatively far from those of the tonal frequency. The harmonic nature of the speech signal is therefore best preserved. To calculate the spectral components of the speech signal, the conditioned signal is distributed in blocks of N samples subjected to a transformation to the frequency domain, and the ratio between the oversampling frequency and the estimated tonal frequency is chosen as a divisor of the number N.

The previous technique can be further refined by estimating the tonal frequency of the speech signal on a frame as follows:

- Time intervals are estimated between two consecutive breaks of the signal attributable to closures of the speaker's glottis occurring during the frame, the estimated tonal frequency being inversely proportional to said time intervals;

- The speech signal is interpolated in said time intervals, so that the conditioned signal resulting from this interpolation has a constant time interval between two consecutive breaks.

This procedure artificially constructs a signal frame on which the speech signal breaks at constant intervals. We thus take into account possible variations in the tonal frequency over the duration of a frame.
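The choice of the oversampling ratio described above can be illustrated with a small sketch. It assumes, which the text does not state, that one picks the smallest divisor p of N for which p times the estimated tonal frequency is at least the original sampling frequency; the function name `oversampling_frequency` is hypothetical.

```python
def oversampling_frequency(f_p, sample_rate, n):
    """Pick a ratio p dividing n and return (p, p * f_p).

    f_p         : estimated tonal frequency in Hz
    sample_rate : original sampling frequency in Hz
    n           : block length N of the frequency transform
    Assumption: the smallest divisor that does not reduce the rate is used.
    """
    divisors = [p for p in range(1, n + 1) if n % p == 0]
    p = next(d for d in divisors if d * f_p >= sample_rate)
    return p, p * f_p
```

With N = 256, a sampling frequency of 8 kHz and an estimated tonal frequency of 210 Hz, this gives p = 64 and an oversampling frequency of 13440 Hz, so that the harmonics of the tonal frequency fall exactly on every fourth component of the N-point transform.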

An additional improvement consists in retaining, after the processing of each frame, from among the samples of the denoised speech signal provided by this processing, a number of samples equal to an integer multiple of the ratio between the sampling frequency and the estimated tonal frequency. This avoids the problems of distortion caused by phase discontinuities between frames, which are generally not completely corrected by conventional overlap-add techniques.

Having conditioned the signal by the oversampling technique makes it possible to obtain a good measure of the degree of voicing of the speech signal on the frame, from a calculation of the entropy of the autocorrelation of the spectral components calculated on the basis of the conditioned signal. The more disturbed the spectrum is, that is to say the more voiced it is, the lower the values of the entropy. The conditioning of the speech signal accentuates the irregular aspect of the spectrum and therefore the variations of the entropy, so that the latter constitutes a measure of good sensitivity. The autocorrelations will usually be calculated on the basis of the denoised signal to obtain the best performance. It would however be possible to calculate them on the basis of the conditioned signal before denoising.

The spectral components of the denoised signal, obtained by subtracting said quantities from the spectral components of the speech signal, can be used to calculate a masking curve by applying a model of auditory perception. Preferably, the parameters on which the quantity subtracted from a spectral component of the speech signal on a frame depends then include a difference between the increased estimate of the corresponding spectral component of the noise and the calculated masking curve. This subtracted quantity can in particular be limited to the fraction of the increased estimate of the corresponding spectral component of the noise which exceeds the masking curve. This procedure is based on the observation that it is sufficient to denoise the audible noise frequencies. Conversely, there is no point in eliminating noise which is masked by speech.
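As a minimal sketch of the last point, the quantity to subtract can be limited, component by component, to the part of the increased noise estimate that exceeds the masking curve. The function name `audible_noise` and the list-based representation are illustrative assumptions, not taken from the text.

```python
def audible_noise(increased_noise, masking_curve):
    """Part of the increased noise estimate that exceeds the masking curve.

    Both arguments are sequences indexed by spectral component.
    Components masked by speech contribute a zero quantity.
    """
    return [max(b - m, 0.0) for b, m in zip(increased_noise, masking_curve)]
```

A component whose increased noise estimate lies below the masking curve is thus left untouched by the subtraction.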

In an advantageous embodiment, each increased estimate of a spectral component of the noise included in the speech signal is obtained by combining a long-term estimate of said spectral component of the noise and a measure of the variability of said spectral component of the noise around its long-term estimate. This gives a noise estimator that is particularly robust to variations in the noise, because it combines two separate estimators, one accounting for long-term fluctuations in the noise and the other for its short-term variability.

Other features and advantages of the present invention will appear in the description below of non-limiting exemplary embodiments, with reference to the accompanying drawings, in which:

- Figure 1 is a block diagram of a denoising system implementing the present invention;

- Figures 2 and 3 are flowcharts of procedures used by a voice activity detector of the system of Figure 1;

- Figure 4 is a diagram representing the states of a voice activity detection automaton;

- Figure 5 is a graph illustrating the variations of a degree of vocal activity;

- Figure 6 is a block diagram of a noise overestimation module of the system of Figure 1;

- Figure 7 is a graph illustrating the calculation of a masking curve;

- Figure 8 is a graph illustrating the use of the masking curves in the system of Figure 1;

- Figure 9 is a block diagram of another denoising system implementing the present invention;

- Figure 10 is a graph illustrating a harmonic analysis method usable in a method according to the invention; and

- Figure 11 partially shows a variant of the block diagram of Figure 9.

The denoising system shown in FIG. 1 processes a digital speech signal s. A windowing module 10 puts this signal s in the form of successive windows or frames, each consisting of a number N of digital signal samples. Conventionally, these frames can overlap one another. In the remainder of this description, it will be considered, without this being limiting, that the frames consist of N = 256 samples at a sampling frequency F of 8 kHz, with Hamming weighting in each window and 50% overlap between consecutive windows.

The signal frame is transformed to the frequency domain by a module 11 applying a conventional fast Fourier transform (FFT) algorithm to calculate the modulus of the signal spectrum. The module 11 then delivers a set of N = 256 frequency components of the speech signal, denoted $S_{n,f}$, where n denotes the number of the current frame and f a frequency of the discrete spectrum. Due to the properties of digital signals in the frequency domain, only the first N/2 = 128 samples are used.
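The windowing and transform stage can be sketched as follows. This is only an illustration under stated assumptions: a naive discrete Fourier transform is used instead of an FFT so that the example needs nothing beyond the standard library, and the names `hamming`, `frames` and `magnitude_spectrum` are hypothetical.

```python
import cmath
import math

N = 256          # frame length in samples (value given in the description)
HOP = N // 2     # 50% overlap between consecutive frames


def hamming(n_samples):
    """Classic Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_samples - 1))
            for k in range(n_samples)]


def frames(signal, n=N, hop=HOP):
    """Cut the signal into Hamming-weighted frames of n samples, hop apart."""
    window = hamming(n)
    return [[s * w for s, w in zip(signal[start:start + n], window)]
            for start in range(0, len(signal) - n + 1, hop)]


def magnitude_spectrum(frame):
    """Modulus of the DFT of a frame, keeping the first len(frame) // 2 components."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * f * k / n)
                    for k, x in enumerate(frame)))
            for f in range(n // 2)]
```

For a 1 kHz sinusoid sampled at 8 kHz, the spectral resolution is 8000/256 = 31.25 Hz, so the largest component is expected at index 32.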

To calculate the estimates of the noise contained in the signal s, one does not use the frequency resolution available at the output of the fast Fourier transform, but a lower resolution, determined by a number I of frequency bands covering the band [0, F/2] of the signal. Each band i (1 ≤ i ≤ I) extends between a lower frequency f(i-1) and an upper frequency f(i), with f(0) = 0 and f(I) = F/2. This division into frequency bands can be uniform (f(i) - f(i-1) = F/2I). It can also be non-uniform (for example according to a Bark scale). A module 12 calculates the respective averages $S_{n,i}$ of the spectral components $S_{n,f}$ of the speech signal per band, for example by a uniform weighting such that:

$$S_{n,i} = \frac{1}{f(i) - f(i-1)} \sum_{f \in [f(i-1),\, f(i)[} S_{n,f} \qquad (1)$$

This averaging decreases the fluctuations between the bands by averaging the noise contributions in these bands, which will decrease the variance of the noise estimator. In addition, this averaging allows a significant reduction in the complexity of the system.
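The band averaging of formula (1) can be sketched as below. As an assumption made for the example, the band limits are given as indices of spectral components rather than as frequencies in Hz, so that the width f(i) - f(i-1) becomes a number of components; the name `band_averages` is hypothetical.

```python
def band_averages(spectrum, edges):
    """Average the spectral components over the bands [edges[i-1], edges[i]).

    spectrum : components S[n][f] of one frame
    edges    : increasing component indices, edges[0] = 0, edges[-1] = len(spectrum)
    """
    return [sum(spectrum[lo:hi]) / (hi - lo)
            for lo, hi in zip(edges[:-1], edges[1:])]
```

The band limits need not be uniform: `band_averages([1, 1, 3, 3, 5, 5, 5, 5], [0, 2, 4, 8])` averages two narrow bands and one wider band.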

The averaged spectral components $S_{n,i}$ are sent to a voice activity detection module 15 and to a noise estimation module 16. These two modules 15, 16 work jointly, in the sense that the degrees of vocal activity $\gamma_{n,i}$ measured for the different bands by module 15 are used by module 16 to estimate the long-term energy of the noise in the different bands, while these long-term estimates $\hat{B}_{n,i}$ are used by module 15 to carry out an a priori denoising of the speech signal in the different bands in order to determine the degrees of vocal activity $\gamma_{n,i}$.

The operation of the modules 15 and 16 can correspond to the flowcharts shown in FIGS. 2 and 3. In steps 17 to 20, the module 15 carries out the a priori denoising of the speech signal in the different bands i for the signal frame n. This a priori denoising is carried out according to a conventional process of non-linear spectral subtraction, from noise estimates obtained during one or more previous frames. In step 17, the module 15 calculates, with the resolution of the bands i, the frequency response $Hp_{n,i}$ of the a priori denoising filter, according to the formula:

$$Hp_{n,i} = \frac{S_{n,i} - \alpha'_{n-\tau_1,i} \cdot \hat{B}_{n-\tau_1,i}}{S_{n-\tau_2,i}} \qquad (2)$$

where τ1 and τ2 are delays expressed in numbers of frames (τ1 ≥ 1, τ2 ≥ 0), and $\alpha'_{n,i}$ is a noise overestimation coefficient, the determination of which will be explained below. The delay τ1 can be fixed (for example τ1 = 1) or variable. It is all the smaller as one is more confident in the detection of voice activity.

In steps 18 to 20, the spectral components $\hat{E}p_{n,i}$ are calculated according to:

$$\hat{E}p_{n,i} = \max\{Hp_{n,i} \cdot S_{n,i},\ \beta p_i \cdot \hat{B}_{n-\tau_1,i}\} \qquad (3)$$

where $\beta p_i$ is a floor coefficient close to 0, conventionally used to prevent the spectrum of the denoised signal from taking negative or excessively low values, which would cause musical noise.

Steps 17 to 20 therefore essentially consist in subtracting from the signal spectrum an estimate of the noise spectrum estimated a priori, increased by the coefficient $\alpha'_{n-\tau_1,i}$.
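Steps 17 to 20 can be sketched per band as follows. The sketch assumes that formula (2) is read as the ratio of the noise-reduced component to the delayed component $S_{n-\tau_2,i}$, and that formula (3) floors the filtered component at a small fraction of the noise estimate; the function and parameter names are hypothetical, and the guard against a zero denominator is an addition of the example.

```python
def a_priori_denoise(s_now, s_delayed, noise_delayed, alpha, beta_p):
    """A priori denoising of one frame, band by band.

    s_now         : averaged components S[n][i] of the current frame
    s_delayed     : components S[n - tau2][i]
    noise_delayed : long-term noise estimates B[n - tau1][i]
    alpha         : overestimation coefficients alpha'[n - tau1][i]
    beta_p        : floor coefficients, close to 0
    """
    out = []
    for s, s_d, b, a, beta in zip(s_now, s_delayed, noise_delayed, alpha, beta_p):
        # Filter response of formula (2); zero if the denominator vanishes.
        hp = (s - a * b) / s_d if s_d > 0 else 0.0
        # Floored component of formula (3).
        out.append(max(hp * s, beta * b))
    return out
```

A band dominated by noise ends up at the floor value, while a band well above the overestimated noise keeps most of its amplitude.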

In step 21, the module 15 calculates the energy of the a priori denoised signal in the different bands i for the frame n: $E_{n,i} = \hat{E}p_{n,i}^2$. It also calculates an overall average $E_{n,0}$ of the energy of the a priori denoised signal, by a sum of the energies per band $E_{n,i}$, weighted by the widths of these bands. In the notations below, the index i = 0 will be used to designate the global band of the signal.

In steps 22 and 23, module 15 calculates, for each band i (0 ≤ i ≤ I), a quantity $\Delta E_{n,i}$ representing the short-term variation of the energy of the denoised signal in band i, as well as a long-term value $\bar{E}_{n,i}$ of the energy of the denoised signal in band i. The quantity $\Delta E_{n,i}$ can be calculated by a simplified derivation formula:

$$\Delta E_{n,i} = \frac{\left| E_{n-4,i} + E_{n-3,i} - E_{n-1,i} - E_{n,i} \right|}{10}$$

As for the long-term energy $\bar{E}_{n,i}$, it can be calculated using a forgetting factor B1 such that 0 < B1 < 1, namely $\bar{E}_{n,i} = B1 \cdot \bar{E}_{n-1,i} + (1 - B1) \cdot E_{n,i}$.
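The two quantities of steps 22 and 23 can be sketched as below. The reading of the simplified derivation formula (difference between the two oldest and the two most recent energies of a five-frame history, divided by 10, in absolute value) is an assumption drawn from the partly legible formula; the function names are hypothetical.

```python
def short_term_variation(e_hist):
    """Short-term energy variation from the last five band energies.

    e_hist : energies ordered from oldest to newest, at least five values,
             the last one being E[n][i].
    """
    e4, e3, _, e1, e0 = e_hist[-5:]
    return abs(e4 + e3 - e1 - e0) / 10.0


def long_term_energy(prev_smoothed, e_now, b1):
    """Recursive smoothing of the band energy with forgetting factor 0 < b1 < 1."""
    return b1 * prev_smoothed + (1.0 - b1) * e_now
```

A constant energy history gives a zero variation, and a forgetting factor close to 1 makes the long-term value follow the instantaneous energy only slowly.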

After having calculated the energies $E_{n,i}$ of the denoised signal, its short-term variations $\Delta E_{n,i}$ and its long-term values $\bar{E}_{n,i}$ as shown in FIG. 2, the module 15 calculates, for each band i (0 ≤ i ≤ I), a value $\rho_i$ representative of the evolution of the energy of the denoised signal. This calculation is performed in steps 25 to 36 of FIG. 3, executed for each band i between i = 0 and i = I. It uses a long-term estimator of the noise envelope $ba_i$, an internal estimator $bi_i$ and a noisy-frame counter $b_i$.

In step 25, the quantity $\Delta E_{n,i}$ is compared to a threshold ε1. If the threshold ε1 is not reached, the counter $b_i$ is incremented by one unit in step 26. In step 27, the long-term estimator $ba_i$ is compared to the value of the smoothed energy $\bar{E}_{n,i}$. If $ba_i \ge \bar{E}_{n,i}$, the estimator $ba_i$ is taken equal to the smoothed value $\bar{E}_{n,i}$ in step 28, and the counter $b_i$ is reset to zero. The quantity $\rho_i$, which is taken equal to the ratio $ba_i / \bar{E}_{n,i}$ (step 36), is then equal to 1.

If step 27 shows that $ba_i < \bar{E}_{n,i}$, the counter $b_i$ is compared with a limit value bmax in step 29. If $b_i$ > bmax, the signal is considered too stationary to contain vocal activity. The aforementioned step 28, which amounts to considering that the frame contains only noise, is then executed. If $b_i$ ≤ bmax in step 29, the internal estimator $bi_i$ is calculated in step 33 according to:

$$bi_i = (1 - Bm) \cdot \bar{E}_{n,i} + Bm \cdot ba_i \qquad (4)$$

In this formula, Bm represents an update coefficient between 0.90 and 1. Its value differs depending on the state of a voice activity detection automaton (steps 30 to 32). This state $\delta_{n-1}$ is that determined during the processing of the previous frame. If the automaton is in a speech detection state ($\delta_{n-1} = 2$ in step 30), the coefficient Bm takes a value Bmp very close to 1 so that the noise estimator is only very slightly updated in the presence of speech. Otherwise, the coefficient Bm takes a lower value Bms, to allow a more significant update of the noise estimator in the silence phase. In step 34, the difference $ba_i - bi_i$ between the long-term estimator and the internal noise estimator is compared to a threshold ε2.

If the threshold ε2 is not reached, the long-term estimator $ba_i$ is updated with the value of the internal estimator $bi_i$ in step 35. Otherwise, the long-term estimator $ba_i$ remains unchanged. This prevents sudden variations due to a speech signal from leading to an update of the noise estimator.

After having obtained the quantities $\rho_i$, the module 15 makes the voice activity decisions in step 37. The module 15 first updates the state of the detection automaton according to the quantity $\rho_0$ calculated for the whole band of the signal. The new state $\delta_n$ of the automaton depends on the previous state $\delta_{n-1}$ and on $\rho_0$, as shown in Figure 4.
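The per-band tracking of steps 25 to 36 described above can be sketched as follows. Several points are assumptions of the example rather than statements of the text: the counter is left unchanged when the threshold ε1 is reached, the comparison of step 34 uses the absolute difference between the two estimators, and the smoothed energy is assumed strictly positive. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BandTracker:
    ba: float            # long-term estimator of the noise envelope
    bi: float = 0.0      # internal estimator
    b: int = 0           # noisy-frame counter


def track(t, delta_e, e_smooth, speech_state, eps1, eps2, bmax, bmp, bms):
    """One pass of steps 25 to 36 for a band; returns rho = ba / e_smooth."""
    if delta_e < eps1:                       # steps 25-26: threshold not reached
        t.b += 1
    if t.ba >= e_smooth or t.b > bmax:       # tests of steps 27 and 29
        t.ba = e_smooth                      # step 28: frame treated as noise only
        t.b = 0
    else:
        bm = bmp if speech_state else bms    # steps 30-32: choice of Bm
        t.bi = (1.0 - bm) * e_smooth + bm * t.ba    # step 33, formula (4)
        if abs(t.ba - t.bi) < eps2:          # step 34 (absolute value assumed)
            t.ba = t.bi                      # step 35
    return t.ba / e_smooth                   # step 36
```

When the smoothed energy drops to or below the tracked envelope, the ratio returned is 1; a sudden large rise of the energy leaves the envelope unchanged and the ratio becomes small.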

Four states are possible: δ = 0 corresponds to silence, or absence of speech; δ = 2 corresponds to the presence of voice activity; and the states δ = 1 and δ = 3 are intermediate rising and falling states. When the automaton is in the silence state ($\delta_{n-1} = 0$), it remains there if $\rho_0$ does not exceed a first threshold SE1, and it goes into the rising state otherwise. In the rising state ($\delta_{n-1} = 1$), it returns to the silence state if $\rho_0$ is smaller than the threshold SE1, it goes into the speech state if $\rho_0$ is greater than a second threshold SE2 greater than the threshold SE1, and it remains in the rising state if SE1 ≤ $\rho_0$ ≤ SE2. When the automaton is in the speech state ($\delta_{n-1} = 2$), it remains there if $\rho_0$ exceeds a third threshold SE3 smaller than the threshold SE2, and it goes into the falling state otherwise. In the falling state ($\delta_{n-1} = 3$), the automaton returns to the speech state if $\rho_0$ is greater than the threshold SE2, it returns to the silence state if $\rho_0$ is below a fourth threshold SE4 smaller than the threshold SE2, and it remains in the falling state if SE4 ≤ $\rho_0$ ≤ SE2.
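The transitions just described can be written as a small state function. The state numbering follows the text; the constant and function names are hypothetical, and the threshold values used in the usage note are arbitrary illustrations that merely respect SE1 < SE2, SE3 < SE2 and SE4 < SE2.

```python
SILENCE, RISING, SPEECH, FALLING = 0, 1, 2, 3


def next_state(state, rho0, se1, se2, se3, se4):
    """State transition of the voice activity detection automaton."""
    if state == SILENCE:
        return SILENCE if rho0 <= se1 else RISING
    if state == RISING:
        if rho0 < se1:
            return SILENCE
        return SPEECH if rho0 > se2 else RISING
    if state == SPEECH:
        return SPEECH if rho0 > se3 else FALLING
    # Falling state
    if rho0 > se2:
        return SPEECH
    return SILENCE if rho0 < se4 else FALLING
```

For instance, with SE1 = 1, SE2 = 3, SE3 = 2 and SE4 = 1.5, a value of 2 moves the automaton from silence to the rising state, where it stays until the value exceeds 3 or falls below 1.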

In step 37, the module 15 also calculates the degrees of vocal activity $\gamma_{n,i}$ in each band i ≥ 1. This degree $\gamma_{n,i}$ is preferably a non-binary parameter, that is to say that the function $\gamma_{n,i} = g(\rho_i)$ is a function varying continuously between 0 and 1 according to the values taken by the quantity $\rho_i$. This function has for example the shape shown in FIG. 5.

Module 16 calculates the noise estimates per band, which will be used in the denoising process, using the successive values of the components $S_{n,i}$ and of the degrees of vocal activity $\gamma_{n,i}$. This corresponds to steps 40 to 42 of FIG. 3. In step 40, it is determined whether the voice activity detection automaton has just gone from the rising state to the speech state. If so, the last two estimates $\hat{B}_{n-1,i}$ and $\hat{B}_{n-2,i}$ previously calculated for each band i ≥ 1 are corrected according to the value of the preceding estimate $\hat{B}_{n-3,i}$. This correction is made to take into account the fact that, in the rising phase (δ = 1), the long-term noise energy estimates in the voice activity detection process (steps 30 to 33) may have been calculated as if the signal contained only noise (Bm = Bms), so that they risk being erroneous. In step 42, the module 16 updates the noise estimates per band according to the formulas:

$$\tilde{B}_{n,i} = \lambda_B \cdot \hat{B}_{n-1,i} + (1 - \lambda_B) \cdot S_{n,i} \qquad (5)$$

$$\hat{B}_{n,i} = \gamma_{n,i} \cdot \hat{B}_{n-1,i} + (1 - \gamma_{n,i}) \cdot \tilde{B}_{n,i} \qquad (6)$$

where $\lambda_B$ denotes a forgetting factor such that 0 < $\lambda_B$ < 1. Formula (6) shows how the non-binary degree of vocal activity $\gamma_{n,i}$ is taken into account.
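The update of step 42 can be sketched for one band as follows, assuming that formula (5) smooths the previous estimate towards the current component and that formula (6) blends the previous estimate and this smoothed value according to the degree of vocal activity. The function and parameter names are hypothetical.

```python
def update_noise_estimate(b_prev, s_now, gamma, lambda_b):
    """Update of the long-term noise estimate of one band.

    b_prev   : previous estimate B[n-1][i]
    s_now    : averaged component S[n][i]
    gamma    : degree of vocal activity in [0, 1]
    lambda_b : forgetting factor, 0 < lambda_b < 1
    """
    b_tilde = lambda_b * b_prev + (1.0 - lambda_b) * s_now    # formula (5)
    return gamma * b_prev + (1.0 - gamma) * b_tilde           # formula (6)
```

With a degree of vocal activity of 1 the estimate is frozen, with 0 it is fully updated, and intermediate values give a partial update.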

As indicated previously, the long-term noise estimates $\hat{B}_{n,i}$ are overestimated, by a module 45 (FIG. 1), before proceeding to denoising by nonlinear spectral subtraction. Module 45 calculates the overestimation coefficient $\alpha'_{n,i}$ mentioned previously, as well as an increased estimate $\hat{B}'_{n,i}$ which essentially corresponds to $\alpha'_{n,i} \cdot \hat{B}_{n,i}$.

The organization of the overestimation module 45 is shown in FIG. 6. The increased estimate $\hat{B}'_{n,i}$ is obtained by combining the long-term estimate $\hat{B}_{n,i}$ and a measure $\Delta B^{max}_{n,i}$ of the variability of the noise component in band i around its long-term estimate. In the example considered, this combination is essentially a simple sum made by an adder 46. It could also be a weighted sum.

The overestimation coefficient $\alpha'_{n,i}$ is equal to the ratio between the sum $\hat{B}_{n,i} + \Delta B^{max}_{n,i}$ delivered by the adder 46 and the delayed long-term estimate $\hat{B}_{n-\tau_3,i}$ (divider 47), capped at a limit value $\alpha_{max}$, for example $\alpha_{max} = 4$ (block 48). The delay τ3 is used to correct, if necessary, in the rising phases (δ = 1), the value of the overestimation coefficient $\alpha'_{n,i}$, before the long-term estimates have been corrected by steps 40 and 41 of FIG. 3 (for example [...]).

The increased estimate $\hat{B}'_{n,i}$ is finally taken equal to $\alpha'_{n,i} \cdot \hat{B}_{n-\tau_3,i}$ (multiplier 49).

The measure $\Delta B^{max}_{n,i}$ of the noise variability reflects the variance of the noise estimator. It is obtained as a function of the values of $S_{n,i}$ and of $\hat{B}_{n,i}$ calculated for a certain number of previous frames on which the speech signal does not present any vocal activity in the band i. It is a function of the deviations $S_{n-k,i} - \hat{B}_{n-k,i}$ calculated for a number K of frames of silence (n-k ≤ n). In the example shown, this function is simply the maximum (block 50). For each frame n, the degree of voice activity $\gamma_{n,i}$ is compared to a threshold (block 51) to decide whether or not the deviation $S_{n,i} - \hat{B}_{n,i}$, calculated at 52-53, must be loaded into a queue 54 of K locations organized in first-in-first-out (FIFO) mode. If $\gamma_{n,i}$ does not exceed the threshold (which can be equal to 0 if the function g() has the form of FIG. 5), the FIFO 54 is not fed, whereas it is fed in the opposite case. The maximum value contained in the FIFO 54 is then provided as the measure of variability $\Delta B^{max}_{n,i}$.
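The overestimation module of FIG. 6 can be sketched for one band as below. This is an illustration under assumptions: the decision of block 51 is passed in as a boolean rather than recomputed, an empty queue is taken to give a zero variability, and the class and parameter names are hypothetical.

```python
from collections import deque


class Overestimator:
    """Sketch of the overestimation module 45 for a single band i."""

    def __init__(self, k, alpha_max=4.0):
        self.fifo = deque(maxlen=k)      # queue 54 with K locations
        self.alpha_max = alpha_max       # limit value applied by block 48

    def step(self, s, b_hat, b_hat_delayed, load):
        """Return (alpha', increased estimate) for the current frame.

        s             : averaged component S[n][i]
        b_hat         : long-term estimate B[n][i]
        b_hat_delayed : delayed estimate B[n - tau3][i], assumed > 0
        load          : decision of block 51 (load the deviation or not)
        """
        if load:
            self.fifo.append(s - b_hat)                       # blocks 52-53
        # Block 50: maximum of the stored deviations (0 if none, by assumption).
        delta_b_max = max(self.fifo) if self.fifo else 0.0
        # Adder 46 and divider 47, then cap of block 48.
        alpha = min((b_hat + delta_b_max) / b_hat_delayed, self.alpha_max)
        return alpha, alpha * b_hat_delayed                   # multiplier 49
```

Because the queue holds only the K most recent loaded deviations, a single large deviation raises the overestimation coefficient for at most K loaded frames before it is forgotten.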

The measure of variability $\Delta B^{max}_{n,i}$ can, as a variant, be obtained as a function of the values $S_{n,f}$ (and not $S_{n,i}$) and $\hat{B}_{n,i}$. One then proceeds in the same way, except that the FIFO 54 does not contain $S_{n-k,i} - \hat{B}_{n-k,i}$ for each of the bands i, but rather:

$$\max_{f \in [f(i-1),\, f(i)[} \left( S_{n-k,f} - \hat{B}_{n-k,i} \right)$$

Thanks to the independent estimates of the long-term fluctuations of the noise $\hat{B}_{n,i}$ and of its short-term variability $\Delta B^{max}_{n,i}$, the increased estimator $\hat{B}'_{n,i}$ gives the denoising process excellent robustness to musical noise.

A first phase of the spectral subtraction is carried out by the module 55 shown in FIG. 1. This phase provides, with the resolution of the bands i (1 ≤ i ≤ I), the frequency response $H^1_{n,i}$ of a first denoising filter, as a function of the components $S_{n,i}$ and $\hat{B}_{n,i}$ and of the overestimation coefficients $\alpha'_{n,i}$. This calculation can be carried out for each band i according to formula (7), where τ4 is a determined integer delay such that τ4 ≥ 0 (for example τ4 = 0). In expression (7), the coefficient $\beta^1_i$ represents, like the coefficient $\beta p_i$ of formula (3), a floor conventionally used to avoid negative or excessively low values of the denoised signal.

In known manner (EP-A-0 534 837), the overestimation coefficient $\alpha'_{n,i}$ could be replaced in formula (7) by another coefficient equal to a function of $\alpha'_{n,i}$ and of an estimate of the signal-to-noise ratio (for example $S_{n,i} / \hat{B}_{n,i}$), this function decreasing with the estimated value of the signal-to-noise ratio. This function is then equal to $\alpha'_{n,i}$ for the lowest values of the signal-to-noise ratio. Indeed, when the signal is very noisy, it is not a priori useful to reduce the overestimation factor. Advantageously, this function decreases towards zero for the highest values of the signal-to-noise ratio. This makes it possible to protect the most energetic areas of the spectrum, where the speech signal is the most significant, the quantity subtracted from the signal then tending towards zero.

This strategy can be refined by applying it selectively to the harmonics of the pitch frequency of the speech signal when it has vocal activity.

Thus, in the embodiment shown in FIG. 1, a second denoising phase is carried out by a harmonic protection module 56. This module calculates, with the resolution of the Fourier transform, the frequency response $H^2_{n,f}$ of a second denoising filter as a function of parameters including $H^1_{n,i}$, $\alpha'_{n,i}$ and $\hat{B}_{n,i}$, and of the tonal frequency $f_p = F/T_p$ calculated outside the phases of silence by a harmonic analysis module 57. In the silence phase (δ = 0), module 56 is not in service, that is to say that $H^2_{n,f} = H^1_{n,i}$ for each frequency f of a band i. The module 57 can apply any known method of analysis of the speech signal of the frame to determine the period $T_p$, expressed as an integer or fractional number of samples, for example a linear prediction method.

The protection provided by the module 56 may consist in carrying out, for each frequency f belonging to a band i:

Δf = F/N represents the spectral resolution of the Fourier transform. When $H^2_{n,f} = 1$, the quantity subtracted from the component $S_{n,f}$ will be zero. In this calculation, the floor coefficients $\beta^2_i$ (for example $\beta^2_i = \beta^1_i$) express the fact that certain harmonics of the tonal frequency $f_p$ can be masked by noise, so that it is not useful to protect them.

This protection strategy is preferably applied for each of the frequencies closest to the harmonics of $f_p$, that is to say for any integer η.

If $\delta f_p$ designates the frequency resolution with which the analysis module 57 produces the estimated tonal frequency $f_p$, that is to say that the real tonal frequency lies between $f_p - \delta f_p/2$ and $f_p + \delta f_p/2$, then the difference between the η-th harmonic of the real tonal frequency and its estimate $\eta \cdot f_p$ (condition (9)) can go up to $\pm \eta \cdot \delta f_p / 2$. For high values of η, this difference can be greater than the spectral half-resolution Δf/2 of the Fourier transform. To take account of this uncertainty and to guarantee good protection of the harmonics of the real tonal frequency, each of the frequencies of the interval $[\eta \cdot f_p - \eta \cdot \delta f_p/2,\ \eta \cdot f_p + \eta \cdot \delta f_p/2]$ can be protected, i.e. condition (9) above can be replaced with:

$$\exists\, \eta \text{ integer such that } |f - \eta \cdot f_p| \le (\eta \cdot \delta f_p + \Delta f)/2 \qquad (9')$$

This procedure (condition (9 ')) is of particular interest when the values of η can be large, in particular in the case where the method is used in a wide band system.
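Condition (9') can be illustrated with a short Python sketch that lists the discrete frequencies it would protect. This is only a sketch under stated assumptions: the function name `protected_bins` and its parameters are hypothetical, bins are indexed as multiples of Δf = F_e/N up to the Nyquist frequency, and the noise-level condition (8) is not checked here.

```python
import math

def protected_bins(fp, delta_fp, fe, n_fft):
    """Indices k of the bins f = k * (fe / n_fft) satisfying condition (9').

    fp       : estimated tonal frequency in Hz (must be > 0)
    delta_fp : resolution of the tonal-frequency estimate in Hz
    fe       : sampling frequency in Hz
    n_fft    : transform size N
    Hypothetical helper: condition (8) is deliberately left out.
    """
    delta_f = fe / n_fft                 # spectral resolution of the transform
    nyquist_bin = n_fft // 2
    bins = set()
    # Visit every harmonic rank eta whose interval may still reach the band.
    for eta in range(1, int((fe / 2.0) / fp) + 2):
        half_width = (eta * delta_fp + delta_f) / 2.0
        lo = eta * fp - half_width
        hi = eta * fp + half_width
        k_lo = max(0, math.ceil(lo / delta_f))
        k_hi = min(nyquist_bin, math.floor(hi / delta_f))
        bins.update(range(k_lo, k_hi + 1))   # empty when k_lo > k_hi
    return sorted(bins)

exact = protected_bins(200.0, 0.0, 8000.0, 256)      # delta_fp = 0: condition (9)
widened = protected_bins(200.0, 10.0, 8000.0, 256)   # condition (9')
```

With $\delta f_p = 0$ the test reduces to condition (9), so each harmonic protects the single nearest bin. A non-zero $\delta f_p$ widens the protected interval in proportion to the harmonic rank η.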

For each protected frequency, the corrected frequency response $H^2_{n,f}$ can be equal to 1 as indicated above, which corresponds to the subtraction of a zero quantity within the framework of spectral subtraction, i.e. to full protection of the frequency in question. More generally, this corrected frequency response $H^2_{n,f}$ could be taken equal to a value between 1 and $H^1_{n,f}$ depending on the degree of protection desired, which corresponds to the subtraction of a quantity less than that which would be subtracted if the frequency in question was not protected.

The spectral components $S^2_{n,f}$ of a denoised signal are calculated by a multiplier 58:

$$S^2_{n,f} = H^2_{n,f} \cdot S_{n,f}$$

This signal $S^2_{n,f}$ is supplied to a module 60 which calculates, for each frame n, a masking curve by applying a psychoacoustic model of auditory perception by the human ear.

The masking phenomenon is a known principle of the functioning of the human ear. When two frequencies are heard simultaneously, one of them may no longer be heard. We then say that it is masked.

There are different methods for calculating masking curves. One can for example use the one developed by J. D. Johnston ("Transform Coding of Audio Signals Using Perceptual Noise Criteria", IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, February 1988). In this method, we work on the bark frequency scale. The masking curve is seen as the convolution of the spectral spreading function of the basilar membrane in the bark domain with the excitatory signal, constituted in the present application by the signal $S^2_{n,f}$. The spectral spreading function can be modeled as shown in Figure 7. For each bark band q, we calculate the contribution $C_{n,q}$ of the lower and upper bands convolved with the spreading function of the basilar membrane, where the indices q and q' denote the bark bands ($0 \le q, q' \le Q$), and $S^2_{n,q'}$ represents the mean of the components $S^2_{n,f}$ of the denoised excitation signal for the discrete frequencies f belonging to the bark band q'.

The masking threshold $M_{n,q}$ is obtained by the module 60 for each bark band q, according to the formula:

$$M_{n,q} = C_{n,q} / R_q$$

where $R_q$ depends on the more or less voiced character of the signal. As is known, a possible form of $R_q$ is:

$$10 \cdot \log_{10}(R_q) = (A + q) \cdot \chi + B \cdot (1 - \chi) \qquad (13)$$

with A = 14.5 and B = 5.5. χ denotes a degree of voicing of the speech signal, varying between zero (no voicing) and 1 (strongly voiced signal). The parameter χ can be of the known form:

$$\chi = \min\left(\frac{SFM}{SFM_{max}},\ 1\right)$$

where SFM represents, in decibels, the ratio between the arithmetic mean and the geometric mean of the energy of the bark bands, and $SFM_{max} = -60$ dB.

The denoising system also includes a module 62 which corrects the frequency response of the denoising filter, as a function of the masking curve $M_{n,q}$ calculated by module 60 and of the increased estimates $\hat{B}'_{n,i}$ calculated by module 45. Module 62 decides the level of denoising which must really be achieved.

By comparing the envelope of the increased noise estimate with the envelope formed by the masking thresholds $M_{n,q}$, it is decided to denoise the signal only insofar as the increased estimate $\hat{B}'_{n,i}$ exceeds the masking curve. This avoids unnecessarily removing noise that is masked by speech.
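The offset of formula (13) and the SFM-based voicing degree described earlier for module 60 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the SFM is computed here as the geometric mean over the arithmetic mean (as in Johnston's formulation) so that it is negative in decibels and compatible with the limit of -60 dB; that sign convention is an assumption.

```python
import math

A, B = 14.5, 5.5          # constants of formula (13)
SFM_MAX_DB = -60.0        # limit value of the spectral flatness measure

def voicing_from_sfm(band_energies):
    """Degree of voicing chi in [0, 1] from strictly positive bark-band energies."""
    n = len(band_energies)
    arith = sum(band_energies) / n
    geom = math.exp(sum(math.log(e) for e in band_energies) / n)
    # Assumed convention: geometric / arithmetic, hence sfm_db <= 0.
    sfm_db = 10.0 * math.log10(geom / arith)
    return max(0.0, min(sfm_db / SFM_MAX_DB, 1.0))

def offset_ratio(q, chi):
    """R_q as a linear ratio, from 10*log10(R_q) = (A + q)*chi + B*(1 - chi)."""
    return 10.0 ** (((A + q) * chi + B * (1.0 - chi)) / 10.0)

chi_flat = voicing_from_sfm([2.0, 2.0, 2.0, 2.0])        # noise-like spectrum
chi_peaky = voicing_from_sfm([1e6, 1e-6, 1e-6, 1e-6])    # strongly tonal spectrum
```

A flat distribution of band energies gives χ close to 0 and therefore the small offset B, while a very peaked distribution saturates at χ = 1 and gives the larger offset A + q.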

The new response $H^3_{n,f}$, for a frequency f belonging to the band i defined by the module 12 and to the bark band q, thus depends on the relative difference between the increased estimate $\hat{B}'_{n,i}$ of the corresponding spectral component of the noise and the masking curve $M_{n,q}$, as follows:

In other words, the quantity subtracted from a spectral component $S_{n,f}$, in the process of spectral subtraction having the frequency response $H^3_{n,f}$, is substantially equal to the minimum between, on the one hand, the quantity subtracted from this spectral component in the process of spectral subtraction having the frequency response $H^2_{n,f}$ and, on the other hand, the fraction of the increased estimate $\hat{B}'_{n,i}$ of the corresponding spectral component of the noise which, if any, exceeds the masking curve $M_{n,q}$.

FIG. 8 illustrates the principle of the correction applied by the module 62. It schematically shows an example of a masking curve $M_{n,q}$ calculated on the basis of the spectral components $S^2_{n,f}$ of the denoised signal, as well as the increased estimate $\hat{B}'_{n,i}$ of the noise spectrum. The quantity finally subtracted from the components $S_{n,f}$ will be that represented by the hatched areas, that is to say limited to the fraction of the increased estimate $\hat{B}'_{n,i}$ of the spectral components of the noise that exceeds the masking curve.

This subtraction is done by multiplying the frequency response $H^3_{n,f}$ of the denoising filter by the spectral components $S_{n,f}$ of the speech signal (multiplier 64). A module 65 then reconstructs the denoised signal in the time domain, by applying the inverse fast Fourier transform (IFFT) to the frequency samples $S^3_{n,f}$ delivered by the multiplier 64. For each frame, only the N/2 = 128 first samples of the signal produced by the module 65 are delivered as the final denoised signal $s^3$, after reconstruction by overlap-add with the N/2 = 128 last samples of the previous frame (module 66).
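The "minimum" rule stated above for module 62 can be illustrated for a single frequency with the Python sketch below. It is written directly from that verbal description and is not the exact formula of the specification; the function name `corrected_subtraction` and the equivalent gain it returns are assumptions made for the example.

```python
def corrected_subtraction(s, h2, b_inc, m):
    """Quantity subtracted after the masking correction, and equivalent gain.

    s     : magnitude of the spectral component of the speech signal
    h2    : response of the second denoising filter at this frequency
    b_inc : increased estimate of the corresponding noise component
    m     : masking threshold of the bark band containing this frequency
    Hypothetical helper illustrating the 'minimum' rule only.
    """
    first_quantity = (1.0 - h2) * s          # what the filter H2 alone would remove
    audible_noise = max(b_inc - m, 0.0)      # part of the noise above the mask
    quantity = min(first_quantity, audible_noise)
    h3 = 1.0 - quantity / s if s > 0 else 1.0
    return quantity, h3

q_partial, h3_partial = corrected_subtraction(10.0, 0.6, 5.0, 2.0)
q_masked, h3_masked = corrected_subtraction(10.0, 0.6, 1.0, 2.0)
q_full, h3_full = corrected_subtraction(10.0, 0.6, 9.0, 2.0)
```

When the noise estimate lies below the masking threshold nothing is subtracted, and when it lies far above the threshold the subtraction is bounded by what the second filter would have removed.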

FIG. 9 shows a preferred embodiment of a denoising system implementing the invention. This system comprises a certain number of elements similar to corresponding elements of the system of FIG. 1, for which the same reference numbers have been used. Thus, modules 10, 11, 12, 15, 16, 45 and 55 provide in particular the quantities $S_{n,i}$, $\hat{B}_{n,i}$, $\alpha'_{n,i}$, $\hat{B}'_{n,i}$ and $H^1_{n,f}$ to perform the selective denoising.

The frequency resolution of the fast Fourier transform 11 is a limitation of the system of FIG. 1. In fact, the frequency subject to protection by the module 56 is not necessarily the precise tonal frequency $f_p$, but the frequency closest to it in the discrete spectrum. In some cases, it is then possible to protect harmonics relatively far from those of the tonal frequency. The system of FIG. 9 overcomes this drawback thanks to an appropriate conditioning of the speech signal.

In this conditioning, the sampling frequency of the signal is modified so that the period $1/f_p$ covers exactly an integer number of sample times of the conditioned signal. Many methods of harmonic analysis that can be implemented by the module 57 are capable of providing a fractional value of the delay $T_p$, expressed as a number of samples at the initial sampling frequency $F_e$. A new sampling frequency $f_e$ is then chosen so that it is equal to an integer multiple of the estimated tonal frequency, i.e. $f_e = p \cdot f_p = p \cdot F_e/T_p = K \cdot F_e$, with p an integer. In order not to lose signal samples, $f_e$ should be greater than $F_e$. One can in particular impose that it is between $F_e$ and $2F_e$ ($1 \le K \le 2$), to facilitate the implementation of the conditioning.

Of course, if no voice activity is detected on the current frame ($\delta_n = 0$), or if the delay $T_p$ estimated by the module 57 is an integer, it is not necessary to condition the signal.

So that each of the harmonics of the tonal frequency also corresponds to an integer number of samples of the conditioned signal, the integer p must be a divisor of the size N of the signal window produced by the module 10: N = αp, with α an integer. This size N is usually a power of 2 for the implementation of the FFT. It is 256 in the example considered. The spectral resolution Δf of the discrete Fourier transform of the conditioned signal is given by $\Delta f = p \cdot f_p/N = f_p/\alpha$. It is therefore advantageous to choose p small so as to maximize α, but large enough to oversample. In the example considered, where $F_e$ = 8 kHz and N = 256, the values chosen for the parameters p and α are indicated in Table I.

Tonal frequency f_p          Delay T_p (samples)    p      α
500 Hz < f_p < 1000 Hz       8 < T_p < 16           16     16
250 Hz < f_p < 500 Hz        16 < T_p < 32          32     8
125 Hz < f_p < 250 Hz        32 < T_p < 64          64     4
62.5 Hz < f_p < 125 Hz       64 < T_p < 128         128    2
31.25 Hz < f_p < 62.5 Hz     128 < T_p < 256        256    1

Table I
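The selection summarized in Table I can be sketched in Python as below, for the example values $F_e$ = 8 kHz and N = 256. The function name `conditioning_parameters` is hypothetical, and assigning a delay that falls exactly on a row boundary to the row it closes is an assumption, since the table uses strict inequalities.

```python
def conditioning_parameters(t_p, fe=8000.0, n=256):
    """Return (p, alpha, K, new sampling frequency, spectral resolution).

    t_p : estimated pitch delay in samples at the sampling frequency fe,
          possibly fractional, within the range covered by Table I.
    """
    if not 8 <= t_p <= n:
        raise ValueError("delay outside the range covered by Table I")
    p = 16
    while p < t_p:            # smallest power of two not below t_p
        p *= 2
    alpha = n // p            # N = alpha * p
    k = p / t_p               # oversampling ratio, between 1 and 2
    new_fe = k * fe           # equals p times the tonal frequency fe / t_p
    delta_f = new_fe / n      # equals (fe / t_p) / alpha
    return p, alpha, k, new_fe, delta_f

p, alpha, k, new_fe, delta_f = conditioning_parameters(20.0)
```

For a delay of 20 samples (tonal frequency 400 Hz), the sketch selects p = 32 and α = 8, so the signal is oversampled by K = 1.6 to 12.8 kHz and the transform resolution becomes 50 Hz, which places every harmonic of 400 Hz exactly on a bin.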

This choice is made by a module 70 according to the value of the delay $T_p$ provided by the harmonic analysis module 57. The module 70 provides the ratio K between the sampling frequencies to three frequency change modules 71, 72, 73.

The module 71 is used to transform the values $S_{n,i}$, $\hat{B}_{n,i}$, $\alpha'_{n,i}$, $\hat{B}'_{n,i}$ and $H^1_{n,f}$ relating to the bands i defined by the module 12, into the scale of the modified frequencies (sampling frequency $f_e$). This transformation consists simply in dilating the bands i by the factor K. The values thus transformed are supplied to module 56 for protection of harmonics.

This then operates in the same way as above to provide the frequency response $H^2_{n,f}$ of the denoising filter. This response $H^2_{n,f}$ is obtained in the same way as in the case of FIG. 1 (conditions (8) and (9)), except that in condition (9), the tonal frequency $f_p = f_e/p$ is defined according to the value of the integer delay p supplied by the module 70, the frequency resolution Δf also being supplied by this module 70.

The module 72 oversamples the frame of N samples provided by the windowing module 10. Oversampling by a rational factor K (K = K1/K2) consists in first carrying out an oversampling by the integer factor K1, then a subsampling by the integer factor K2. These oversampling and subsampling operations by integer factors can be carried out conventionally by means of polyphase filter banks.
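As an illustration of the rational factor K = K1/K2, the following Python sketch derives the two integer factors from p and the estimated delay. The function name `rational_factor` and the bound `max_den` on K2 are assumptions introduced for the example, and the polyphase filtering itself is not shown.

```python
from fractions import Fraction

def rational_factor(p, t_p, max_den=64):
    """Approximate K = p / t_p by K1 / K2 with K2 not exceeding max_den.

    p       : integer number of samples per pitch period after conditioning
    t_p     : pitch delay in samples before conditioning (may be fractional)
    max_den : hypothetical cap on K2 keeping the filter bank small
    """
    k = (Fraction(p) / Fraction(t_p)).limit_denominator(max_den)
    return k.numerator, k.denominator

k1, k2 = rational_factor(32, 20.5)     # K = 64/41, about 1.561
k1_int, k2_int = rational_factor(32, 20.0)
```

A delay of 20.5 samples with p = 32 leads to an oversampling by 64 followed by a subsampling by 41.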

The conditioned signal frame supplied by the module 72 includes KN samples at the frequency $f_e$. These samples are sent to a module 75 which calculates their Fourier transform. The transformation can be carried out from two blocks of N = 256 samples: one consisting of the first N samples of the frame of length KN of the conditioned signal s', and the other of the last N samples of this frame. The two blocks therefore have an overlap of (2-K)×100%. For each of the two blocks, we obtain a set of Fourier components $S_{n,f}$. These components $S_{n,f}$ are supplied to the multiplier 58, which multiplies them by the spectral response $H^2_{n,f}$ to deliver the spectral components $S^2_{n,f}$ of the first denoised signal.

These components $S^2_{n,f}$ are addressed to the module 60 which calculates the masking curves in the manner previously indicated. Preferably, in this calculation of the masking curves, the quantity χ designating the degree of voicing of the speech signal (formula (13)) is taken of the form χ = 1 - H, where H is an entropy of the autocorrelation of the spectral components $S^2_{n,f}$ of the denoised conditioned signal. The autocorrelations A(k) are calculated by a module 76, for example according to the formula:

$$A(k) = \frac{\sum_{f=0}^{N/2-1} S^2_{n,f} \cdot S^2_{n,f+k}}{\sum_{l=0}^{N/2-1} \sum_{f=0}^{N/2-1} S^2_{n,f} \cdot S^2_{n,f+l}} \qquad (15)$$

A module 77 then calculates the normalized entropy H, and provides it to module 60 for the calculation of the masking curve (see S.A. McClellan et al., "Spectral Entropy: an Alternative Indicator for Rate Allocation?", Proc. ICASSP'94, pages 201-204):

$$H = -\frac{\sum_{k=0}^{N/2-1} A(k) \cdot \log[A(k)]}{\log(N/2)} \qquad (16)$$

Thanks to the conditioning of the signal, as well as to its denoising by the filter $H^2_{n,f}$, the normalized entropy H constitutes a measurement of voicing very robust to noise and variations in the tonal frequency.
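Formulas (15) and (16) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of N non-negative spectral magnitudes (so that every index f + k used in the sums exists), terms with A(k) = 0 are counted as zero in the entropy, and the function names are hypothetical.

```python
import math

def normalized_entropy(spec):
    """Normalized entropy H of the spectral autocorrelation, formulas (15)-(16).

    spec : N non-negative spectral magnitudes of the denoised conditioned signal.
    """
    half = len(spec) // 2
    # Numerators of A(k) for k = 0 .. N/2 - 1.
    raw = [sum(spec[f] * spec[f + k] for f in range(half)) for k in range(half)]
    total = sum(raw)                       # double sum in the denominator of (15)
    if total <= 0:
        raise ValueError("spectrum carries no energy")
    entropy = 0.0
    for r in raw:
        a = r / total
        if a > 0:                          # convention: 0 * log(0) = 0
            entropy -= a * math.log(a)
    return entropy / math.log(half)

def voicing_degree(spec):
    """Degree of voicing chi = 1 - H."""
    return 1.0 - normalized_entropy(spec)

h_flat = normalized_entropy([1.0] * 16)            # flat, noise-like spectrum
h_peak = normalized_entropy([1.0] + [0.0] * 15)    # single spectral line
```

A flat spectrum gives the maximum entropy H = 1 (χ = 0), whereas a spectrum whose autocorrelation is concentrated on one lag gives H = 0 (χ = 1).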

The correction module 62 operates in the same way as that of the system in FIG. 1, taking into account the increased noise estimate $\hat{B}'_{n,i}$ rescaled by the frequency change module 71. It provides the frequency response $H^3_{n,f}$ of the final denoising filter, which is multiplied by the spectral components $S_{n,f}$ of the conditioned signal by the multiplier 64. The resulting components $S^3_{n,f}$ are brought back into the time domain by the IFFT module 65. At the output of this IFFT 65, a module 80 combines, for each frame, the two signal blocks resulting from the processing of the two overlapping blocks delivered by the FFT 75. This combination can consist of a sum with Hamming weighting of the samples, to form a denoised conditioned signal frame of KN samples.

The denoised conditioned signal supplied by the module 80 is subject to a change in sampling frequency by the module 73. Its sampling frequency is brought back to $F_e = f_e/K$ by the operations inverse to those carried out by the module 75. The module 73 delivers N = 256 samples per frame. After the reconstruction by overlap-add with the N/2 = 128 last samples of the previous frame, only the N/2 = 128 first samples of the current frame are finally kept to form the final denoised signal $s^3$ (module 66).

In a preferred embodiment, a module 82 manages the windows formed by the module 10 and saved by the module 66, so that a number M of samples is saved equal to an integer multiple of $T_p = F_e/f_p$. This avoids the problems of phase discontinuity between the frames. Correspondingly, the management module 82 controls the windowing module 10 so that the overlap between the current frame and the next one corresponds to N - M. This overlap of N - M samples will be used in the overlap-add sum carried out by the module 66 during the processing of the next frame. From the value of $T_p$ provided by the harmonic analysis module 57, the module 82 calculates the number of samples to be saved $M = T_p \times E[N/(2T_p)]$, E[·] designating the integer part, and correspondingly controls the modules 10 and 66.
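The window management rule $M = T_p \times E[N/(2T_p)]$ can be checked numerically with the short Python sketch below. The function name `samples_to_keep` is hypothetical, and no rounding of M is applied for fractional delays, since the text does not specify one.

```python
import math

def samples_to_keep(t_p, n=256):
    """Return (M, N - M): samples saved per frame and overlap with the next frame."""
    m = t_p * math.floor(n / (2 * t_p))   # integer number of pitch periods
    return m, n - m

m_40, overlap_40 = samples_to_keep(40)
m_50, overlap_50 = samples_to_keep(50)
```

For a delay of 40 samples, three whole pitch periods fit in N/2 = 128 samples, so M = 120 samples are saved and the next frame overlaps the current one by 136 samples.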

In the embodiment which has just been described, the tonal frequency is estimated on an average basis on the frame. However, the tonal frequency may vary somewhat over this period. It is possible to take these variations into account in the context of the present invention, by conditioning the signal so as to artificially obtain a constant tone frequency in the frame.

For this, the harmonic analysis module 57 must provide the time intervals between the consecutive breaks of the speech signal attributable to closures of the glottis of the speaker occurring during the frame. Methods usable for detecting such micro-breaks are well known in the field of harmonic analysis of speech signals. In this regard, we can consult the following articles: M. BASSEVILLE et al., "Sequential detection of abrupt changes in spectral characteristics of digital signals", IEEE Trans. on Information Theory, 1983, Vol. IT-29, No. 5, pages 708-723; R. ANDRE-OBRECHT, "A new statistical approach for the automatic segmentation of continuous speech signals", IEEE Trans. on Acous., Speech and Sig. Proc., Vol. 36, No. 1, January 1988; and C. MURGIA et al., "An algorithm for the estimation of glottal closure instants using the sequential detection of abrupt changes in speech signals", Signal Processing VII, 1994, pages 1685-1688.

The principle of these methods is to perform a statistical test between two models, one in the short term and the other in the long term. Both models are adaptive linear prediction models. The value $w_m$ of this statistical test is the cumulative sum of the posterior likelihood ratio of two distributions, corrected by the Kullback divergence. For a distribution of residuals having Gaussian statistics, this value $w_m$ is given by:

where $e^0_m$ and $\sigma_0^2$ represent the residual calculated at the time of sample m of the frame and the variance of the long-term model, $e^1_m$ and $\sigma_1^2$ similarly representing the residual and the variance of the short-term model. The closer the two models are, the closer the value $w_m$ of the statistical test is to 0. On the other hand, when the two models are distant from each other, this value $w_m$ becomes negative, which indicates a break R of the signal.

FIG. 10 thus shows a possible example of evolution of the value $w_m$, showing the breaks R of the speech signal. The time intervals $t_r$ (r = 1, 2, ...) between two consecutive breaks R are calculated, and expressed in number of samples of the speech signal. Each of these intervals $t_r$ is inversely proportional to the tonal frequency $f_p$, which is thus estimated locally: $f_p = F_e/t_r$ over the r-th interval.

We can then correct the temporal variations of the tonal frequency (that is to say the fact that the intervals $t_r$ are not all equal on a given frame), in order to have a constant tonal frequency in each of the analysis frames. This correction is carried out by modifying the sampling frequency over each interval $t_r$, so as to obtain, after oversampling, constant intervals between two glottal breaks. The duration between two breaks is therefore modified by oversampling in a variable ratio, so as to lock onto the largest interval. In addition, care is taken to comply with the conditioning constraint that the oversampling frequency is a multiple of the estimated tonal frequency.

FIG. 11 shows the means used to calculate the conditioning of the signal in the latter case. The harmonic analysis module 57 is designed so as to implement the above analysis method, and to provide the intervals $t_r$ relative to the signal frame produced by the module 10. For each of these intervals, the module 70 (block 90 in FIG. 11) calculates the oversampling ratio $K_r = p_r/t_r$, where the integer $p_r$ is given by the third column of Table I when $t_r$ takes the values indicated in the second column. These oversampling ratios $K_r$ are supplied to the frequency change modules 72 and 73, so that the interpolations are carried out with the sampling ratio $K_r$ over the corresponding time interval $t_r$.
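The per-interval ratios $K_r = p_r/t_r$ computed by block 90 can be sketched in Python as follows. The break positions are assumed to be already available as sample indices (the statistical detection of the breaks is not shown), the function name `interval_ratios` is hypothetical, and $p_r$ is taken as the smallest power of two from Table I that is not below $t_r$.

```python
def interval_ratios(break_positions, fe=8000.0):
    """For each interval between consecutive glottal breaks, return
    (t_r, local tonal frequency fe / t_r, p_r, K_r = p_r / t_r)."""
    result = []
    for start, end in zip(break_positions, break_positions[1:]):
        t_r = end - start                 # interval length in samples
        p_r = 16
        while p_r < t_r:                  # third column of Table I
            p_r *= 2
        result.append((t_r, fe / t_r, p_r, p_r / t_r))
    return result

ratios = interval_ratios([0, 40, 82, 120])
```

In this example the three intervals of 40, 42 and 38 samples are all mapped to 64 samples of the conditioned signal, which is what makes the tonal frequency constant over the frame after interpolation.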

The largest $T_p$ of the time intervals $t_r$ supplied by the module 57 for a frame is selected by the module 70 (block 91 in FIG. 11) to obtain a pair p, α as indicated in Table I. The modified sampling frequency is then $f_e = p \cdot F_e/T_p$ as before, the spectral resolution Δf of the discrete Fourier transform of the conditioned signal still being given by $\Delta f = F_e/(\alpha \cdot T_p)$. For the frequency change module 71, the oversampling ratio K is given by $K = p/T_p$ (block 92). The module 56 for protecting the harmonics of the tonal frequency operates in the same manner as above, using for condition (9) the spectral resolution Δf provided by the block 91 and the tonal frequency $f_p = f_e/p$ defined according to the value of the integer delay p supplied by block 91.

This embodiment of the invention also involves an adaptation of the window management module 82. The number M of samples of the denoised signal to be saved on the current frame here corresponds to an integer number of consecutive time intervals $t_r$ between two glottal breaks (see FIG. 10). This arrangement avoids the problems of phase discontinuity between frames, while taking into account the possible variations of the time intervals $t_r$ on a frame.

Claims

1. A method of denoising a digital speech signal (s) processed by successive frames, in which:

- a harmonic analysis of the speech signal is carried out to estimate a tonal frequency ($f_p$) of the speech signal on each frame where it exhibits vocal activity;

- spectral components ($S_{n,f}$, $S_{n,i}$) of the speech signal are calculated on each frame;

- for each frame, estimates of the spectral components of the noise included in the speech signal are calculated;

- a spectral subtraction is carried out comprising at least one step consisting in respectively subtracting, from each spectral component of the speech signal on the frame ($S_{n,f}$), a quantity depending on parameters including at least the estimate of the corresponding spectral component of the noise for said frame and the value of the estimated tonal frequency; and

- a transformation to the time domain is applied to the result of the spectral subtraction to construct a denoised speech signal ($s^3$).

2. Method according to claim 1, in which the value of the estimated tonal frequency ($f_p$) is used to select protected frequencies from the set of frequencies for which spectral components of the speech signal are calculated, and in which a smaller quantity is subtracted from a given spectral component ($S_{n,f}$) of the speech signal if said spectral component corresponds to a protected frequency than if said spectral component does not correspond to a protected frequency.

3. Method according to claim 2, in which the protected frequencies are selected so that the spectral component of the speech signal corresponding to each protected frequency exceeds a noise level determined from the corresponding estimate of the spectral component of the noise.

4. Method according to claim 2 or 3, in which each protected frequency is, among the set of frequencies for which spectral components of the speech signal are calculated, the closest to an integer multiple of the estimated tonal frequency ($f_p$).

5. Method according to claim 2 or 3, in which each protected frequency is, among the set of frequencies for which spectral components of the speech signal are calculated, the closest to a frequency of an interval of the form $[\eta \times f_p - \eta \times \delta f_p/2,\ \eta \times f_p + \eta \times \delta f_p/2]$, $f_p$ denoting the estimated tonal frequency, $\delta f_p$ denoting the frequency resolution of the estimation of the tonal frequency, and η denoting an integer.

6. Method according to any one of claims 2 to 5, wherein the quantity subtracted from the spectral component ($S_{n,f}$) of the speech signal at a protected frequency is substantially zero.

7. Method according to any one of claims 1 to 6, in which, after having estimated the tonal frequency ($f_p$) of the speech signal on a frame, the speech signal of the frame is conditioned by oversampling it at an oversampling frequency ($f_e$) that is a multiple of the estimated tonal frequency, and the spectral components ($S_{n,f}$) of the speech signal on the frame are calculated on the basis of the conditioned signal (s') to subtract said quantities from them.

8. Method according to claim 7, in which spectral components ($S_{n,f}$) of the speech signal are calculated by dividing the conditioned signal (s') into blocks of N samples subjected to a transformation to the frequency domain, and in which the ratio (p) between the oversampling frequency ($f_e$) and the estimated tonal frequency is a divisor of the number N.

9. Method according to claim 7 or 8, in which a degree of voicing (χ) of the speech signal on the frame is estimated from a calculation of the entropy (H) of the autocorrelation of the spectral components calculated on the basis of the conditioned signal.

10. The method of claim 9, wherein said spectral components ($S^2_{n,f}$) for which the entropy (H) of the autocorrelation is calculated are those calculated on the basis of the conditioned signal (s') after subtracting said quantities.

11. Method according to claim 9 or 10, in which the degree of voicing (χ) is measured from a normalized entropy H of the form:

$$H = -\frac{\sum_{k=0}^{N/2-1} A(k) \cdot \log[A(k)]}{\log(N/2)}$$

where N is the number of samples used to calculate the spectral components ($S_{n,f}$) on the basis of the conditioned signal (s'), and A(k) is the normalized autocorrelation defined by:

$$A(k) = \frac{\sum_{f=0}^{N/2-1} S^2_{n,f} \cdot S^2_{n,f+k}}{\sum_{l=0}^{N/2-1} \sum_{f=0}^{N/2-1} S^2_{n,f} \cdot S^2_{n,f+l}}$$

$S^2_{n,f}$ designating the spectral component of rank f calculated on the basis of the conditioned signal.

12. Method according to any one of the preceding claims, in which, after the processing of each frame, a number of samples (M) equal to an integer multiple of the ratio ($T_p$) between the sampling frequency ($F_e$) and the estimated tonal frequency ($f_p$) is preserved among the samples of the denoised speech signal provided by this processing.

13. Method according to any one of claims 1 to 11, in which the estimation of the tonal frequency of the speech signal on a frame comprises the following steps:

- time intervals ($t_r$) are estimated between two consecutive breaks (R) of the signal attributable to closures of the glottis of the speaker occurring during the frame, the estimated tonal frequency being inversely proportional to said time intervals;

- the speech signal is interpolated in said time intervals, so that the conditioned signal (s') resulting from this interpolation has a constant time interval between two consecutive breaks.

14. The method as claimed in claim 13, in which, after the processing of each frame, a number of samples (M) corresponding to an integer number of estimated time intervals ($t_r$) is preserved among the samples of the denoised speech signal provided by this processing.

15. Method according to any one of the preceding claims, in which values of a signal-to-noise ratio that the speech signal presents on each frame are estimated in the spectral domain, and in which the parameters on which the subtracted quantities depend include the estimated values of the signal-to-noise ratio, the quantity subtracted from each spectral component of the speech signal on the frame being a decreasing function of the corresponding estimated value of the signal-to-noise ratio.

16. The method of claim 15, wherein said function decreases towards zero for the highest values of the signal-to-noise ratio.

17. Method according to any one of the preceding claims, in which spectral components ($S^2_{n,f}$) of a denoised signal, obtained by subtracting said quantities from the spectral components ($S_{n,f}$) of the speech signal, are used to calculate a masking curve ($M_{n,q}$) by applying an auditory perception model.

18. The method of claims 11 and 17, wherein the calculation of the masking curve ($M_{n,q}$) involves the degree of voicing (χ) measured by the normalized entropy H.

19. The method of claim 17 or 18, wherein the parameters on which the quantity subtracted from a spectral component ($S_{n,f}$) of the speech signal on a frame depends include a difference between an increased estimate ($\hat{B}'_{n,i}$) of the corresponding spectral component of the noise and the calculated masking curve ($M_{n,q}$).

20. The method of claim 19, wherein the increased estimates ($\hat{B}'_{n,i}$) of the spectral components of the noise for a frame are compared to the calculated masking curve ($M_{n,q}$), and wherein the quantity subtracted from a spectral component ($S_{n,f}$) of the speech signal, for obtaining the components ($S^3_{n,f}$) subject to the transformation to the time domain, is limited to the fraction of the increased estimate of the corresponding spectral component of the noise which exceeds the masking curve.

21. Method according to any one of the preceding claims, in which the spectral subtraction comprises:

- a first subtraction step in which a first quantity is subtracted respectively from each spectral component ($S_{n,f}$) of the speech signal on the frame, said first quantity depending on parameters including an increased estimate ($\hat{B}'_{n,i}$) of the corresponding spectral component of the noise for said frame and the estimated tonal frequency ($f_p$), so as to obtain spectral components ($S^2_{n,f}$) of a first denoised signal;

- the calculation of a masking curve ($M_{n,q}$) by applying a model of auditory perception from the spectral components ($S^2_{n,f}$) of the first denoised signal;

- the comparison of the increased estimates ($\hat{B}'_{n,i}$) of the spectral components of the noise for the frame with the calculated masking curve ($M_{n,q}$); and

- a second subtraction step in which a second quantity is subtracted respectively from each spectral component ($S_{n,f}$) of the speech signal on the frame, said second quantity being equal to the minimum between said corresponding first quantity and the fraction of the increased estimate of the corresponding spectral component of the noise which exceeds the masking curve, so as to obtain spectral components ($S^3_{n,f}$) of a second denoised signal subjected to the transformation to the time domain.

22. Method according to any one of the preceding claims, in which the estimates of spectral components of the noise taken into account in the spectral subtraction are increased estimates, each increased estimate ($\hat{B}'_{n,i}$) of a spectral component of the noise included in the speech signal being obtained by combining a long-term estimate ($\hat{B}_{n,i}$) of said spectral component of the noise and a measure ($\Delta B^{max}_{n,i}$) of the variability of said spectral component of the noise around its long-term estimate.

23. The method as claimed in claim 22, in which the long-term estimate $\hat{B}_{n,i}$ of a spectral component of the noise over a frame n, corresponding to a frequency included in a band i, is calculated in the form:

$$\hat{B}_{n,i} = \gamma_{n,i} \cdot \hat{B}_{n-1,i} + (1 - \gamma_{n,i}) \cdot \tilde{B}_{n,i}$$

where $\tilde{B}_{n,i} = \lambda_B \cdot \hat{B}_{n-1,i} + (1 - \lambda_B) \cdot S_{n,i}$,

$\gamma_{n,i}$ denotes a non-binary degree of vocal activity of the speech signal, determined for the frame n relative to the frequency band i, $S_{n,i}$ denotes an average of the amplitude of the spectrum of the speech signal of the frame n on the band i, and $\lambda_B$ denotes a forgetting factor.

24. The method of claim 23, wherein the degrees of vocal activity ($\gamma_{n,i}$) for the frame n are determined by carrying out an a priori denoising of the speech signal of the frame n on the basis of noise estimates ($\alpha'_{n-\tau_1,i}$, $\hat{B}_{n-\tau_1,i}$) obtained during at least one previous frame, and by analyzing the energy variations of the a priori denoised signal.

25. The method of claim 24, wherein the degree of vocal activity ($\gamma_{n,i}$) relative to a frequency band i is a function varying continuously between 0 and 1.

26. The method as claimed in claim 24 or 25, in which a long-term estimate ($\bar{E}_{n,i}$) of the energy of the a priori denoised signal is calculated in the frequency band i, and this long-term estimate is compared to an instantaneous estimate ($E_{n,i}$) of this energy, calculated over frame n, to obtain the degree of vocal activity ($\gamma_{n,i}$) of the speech signal for frame n in the frequency band i.

27. Method according to any one of claims 23 to 26, in which the measure ($\Delta B^{max}_{n,i}$) of the variability of a spectral component of the noise around its long-term estimate ($\hat{B}_{n,i}$) for a frame n, said spectral component corresponding to a frequency included in a band i, is a function of the deviations $S_{n-k,i} - \hat{B}_{n-k,i}$ calculated for a given number of frames n-k < n on which the speech signal does not present any vocal activity in the band i.

28. Method according to any one of claims 23 to 26, in which the measure ($\Delta B^{max}_{n,i}$) of the variability of a spectral component of the noise around its long-term estimate ($\hat{B}_{n,i}$) for a frame n, said spectral component corresponding to a frequency included in a band i, is a function of the maximum deviations $\max_{f \in [f(i-1),\, f(i)[} \left( S_{n-k,f} - \hat{B}_{n-k,i} \right)$ calculated for a given number of frames n-k ≤ n on which the speech signal has no vocal activity in the band i, $S_{n-k,f}$ denoting the spectral component corresponding to a frequency f for the frame n-k, and the frequency interval [f(i-1), f(i)[ corresponding to band i.