WO1995002239A1 - Voice-activated automatic gain control - Google Patents

Voice-activated automatic gain control

Info

Publication number
WO1995002239A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
input audio
gain
background noise
component
Prior art date
Application number
PCT/US1994/006281
Other languages
French (fr)
Inventor
Peter L. Chu
Original Assignee
Picturetel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Picturetel Corporation
Publication of WO1995002239A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Control Of Amplification And Gain Control (AREA)

Abstract

An apparatus (10) and method (30) for automatically controlling the sound level of human speech in an output audio signal (26) generated from an input audio signal (16). A gain controller (12) only increases the gain of an amplifier (14) used to produce the output audio signal (26) from the input audio signal (16) when a detector (30) signals that the input audio signal (16) includes human speech.

Description

VOICE-ACTIVATED AUTOMATIC GAIN CONTROL

Background of the Invention

The invention relates to automatically controlling the gain of an audio amplifier.
The gain of an audio amplifier is the ratio of the volume of the output of the audio amplifier relative to the volume of the input of the audio amplifier. When an audio amplifier is used as a component of an audio channel in voice transmission applications such as audio or video teleconferencing, it is desirable to control the gain of the audio amplifier so that the volume of speech at the output of the audio channel remains at a relatively constant level. Thus, for example, when a soft-spoken person, or one who is located some distance away from a microphone at the input of the audio channel, is speaking, and this speech is being transmitted through the audio channel, it is desirable to increase the gain of the audio amplifier so that the person's voice can be heard more easily by people listening at the output of the audio channel.
One approach used in maintaining the volume of speech at the output of the audio channel at a relatively constant level is to control the gain of the audio amplifier so that the energy output of the audio channel remains at a constant volume. This approach is problematic because it increases the gain when a person stops talking, and thereby increases the volume of the background noise. When the gain is decreased when a person starts talking again, the volume of the background noise is also decreased. The constant change in background noise level is called "pumping" and tends to be quite distracting.

Summary of the Invention

The invention offers an improved approach to maintaining the volume of speech at the output of an audio channel at a relatively constant level. The improved approach offers an advantage over prior approaches that tried to equalize the volume of the audio output signal based upon all sounds (i.e., speech, chairs moving, background noise, walking sounds, etc.), when what is really desired in a telecommunications environment is constant volume on speech only. The invention also avoids the problem of pumping.
In one aspect, generally, the invention features a device for automatically controlling the volume of an output audio signal that is generated from an input audio signal by an amplifier. The device includes a detector that signals when the input audio signal includes a desired component such as speech and a gain controller that only increases the gain of the amplifier when the detector signals that the input audio signal includes speech. Thus, rather than always maintaining the volume of the audio output signal at a relatively constant level, the device only maintains speech components of the audio output signal at such a level.
In preferred embodiments, the device also includes an estimator that generates an estimate of a background noise component of the input audio signal. The detector compares the input audio signal to the estimate of the background noise component and, when the input audio signal is substantially equal to the estimate of the background noise component, signals that the input audio signal does not include speech. When the input audio signal differs from the estimate of the background noise component, the detector subtracts the estimated background noise component from a representation of the input audio signal and examines the results of the subtraction to determine whether the input audio signal includes speech. This representation of the input audio signal is generated by performing a fast Fourier transform ("FFT") on the input audio signal. Use of an FFT is computationally efficient because optimized assembly code for performing FFTs is commonly available. For example, the invention has been efficiently implemented using less than 15 percent of the instruction cycles of a Texas Instruments TMS320C31 Floating Point Digital Signal Processor, clocked at 40 MHz, using 10,000 32-bit words of memory.
In addition to only increasing the gain when speech is present, the gain controller controls the gain in other ways. First, it decreases the gain of the amplifier when the product of the loudness of the input audio signal and the gain is above a predetermined level. The gain controller determines the loudness of the input audio signal by measuring the peak energy of the input audio signal. Second, the gain controller only increases the gain of the amplifier when the product of the loudness of the input audio signal and the gain is below a predetermined level and the detector signals that the input audio signal includes speech. Third, the gain controller decreases the gain of the amplifier when the product of the loudness of a background noise component of the input audio signal and the gain is above a predetermined level. Finally, the gain controller decreases the gain of the amplifier when, within a predetermined period, the detector does not signal that the input audio signal includes the desired speech component. Though this last method of gain control would seem to produce pumping, it has the opposite effect because it serves to decrease the volume of background noise when speech is not present, which is the period during which the background noise would be most noticeable.
Brief Description of the Drawing

Fig. 1 is a block diagram of a system using a voice-activated automatic gain control according to the invention.
Fig. 2 is a block diagram of the voice-activated automatic gain control used in the system of Fig. 1.
Fig. 3 is a block diagram of a voiced segment detector of the voice-activated automatic gain control of Fig. 2.
Fig. 4 is a block diagram of a background noise estimator of the voiced segment detector of Fig. 3.
Fig. 5 is a flowchart of the procedure implemented by a stationary estimator of the background noise estimator of Fig. 4.
Fig. 6 is a flowchart of the procedure implemented by speech detection logic of the voiced segment detector of Fig. 3.
Fig. 7 is a block diagram of the gain control logic of the voice-activated automatic gain control of Fig. 2.
Description of the Preferred Embodiments

Referring to Fig. 1, a voice transmission system 10 includes a voice-activated automatic gain control 12, an analog-to-digital converter 18, an amplifier 14, and a digital-to-analog converter 24. Voice-activated automatic gain control 12 automatically adjusts the gain of amplifier 14 so that the volume of a speech component of an analog audio output signal 26 remains at a relatively constant level.
In voice transmission system 10, analog-to-digital converter 18 converts an analog audio input signal 16 into a digital signal on a line 20 that is divided into frames, with each frame having a 20 ms duration. Because analog-to-digital converter 18 samples input signal 16 at a sampling rate of 16 kHz, each frame of the digital signal on line 20 includes 320 samples. Next, the digital signal on line 20 is input to both voice-activated automatic gain control 12, which uses the digital signal on line 20 to produce a gain control signal on line 28, and amplifier 14, which amplifies the digital signal on line 20 in response to the gain control signal on line 28 to produce an amplified digital signal on line 22. Finally, digital-to-analog converter 24 converts the amplified digital signal on line 22 into analog audio output signal 26.
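For concreteness, the framing stage reduces to a few lines of Python. This is a minimal sketch; the helper names and the use of NumPy are assumptions, since the text specifies only the sampling rate and the frame duration:

```python
import numpy as np

SAMPLE_RATE = 16_000                      # 16 kHz sampling rate
FRAME_LEN = SAMPLE_RATE * 20 // 1000      # 20 ms frames -> 320 samples each

def frames(signal: np.ndarray):
    """Yield consecutive 320-sample frames of the digitized input signal."""
    for i in range(len(signal) // FRAME_LEN):
        yield signal[i * FRAME_LEN:(i + 1) * FRAME_LEN]
```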
Referring now to Fig. 2, voice-activated automatic gain control 12 includes a voiced segment detector 30 and a gain control logic 32. Voiced segment detector 30 looks for the vowel sounds of human voiced speech (as opposed to unvoiced consonant sounds, such as the "sh" in "she", which have no periodicity) and discriminates against non-human sounds such as doors closing, footsteps, finger snaps, and paper shuffling. When voiced segment detector 30 detects human speech, it produces a speech detection signal on line 44. Gain control logic 32 relies on the speech detection signal on line 44 in producing the gain control signal on line 28.

A windowing function 34 reduces the effects of discontinuities introduced at the beginning and end of the frames of the digital signal on line 20 by converting each frame of the digital signal on line 20 into a windowed frame on line 38. To eliminate the effect of these discontinuities on speech detection, windowing function 34 combines each frame of the digital signal on line 20 with a portion from the end of the immediately preceding frame of the digital signal on line 20 to produce the windowed frame on line 38. The duration of this portion is chosen such that the windowed frame on line 38 encompasses two speech pitch periods, which ensures that the entire contents of a particular pitch period will always appear in at least one windowed frame on line 38. In preferred embodiments, each frame of the digital signal on line 20 is combined with the last 12 ms of the preceding frame to produce windowed frames on line 38 having durations of 32 ms. Put another way, each windowed frame on line 38 includes the 320 samples from a frame of the digital signal on line 20 in combination with the last 192 samples of the immediately preceding frame of the digital signal on line 20. The 32 ms duration of each windowed frame ensures that detection of speech having a pitch period of 16 ms or less (corresponding to a pitch frequency of 62.5 Hz or more) will not be affected by frame discontinuities. Most males have a pitch frequency somewhat higher than 80 Hz, with the mean at about 100 Hz, and most females have a pitch frequency even higher than that.

Next, each 512-sample windowed frame on line 38 is transformed using a fast Fourier transform ("FFT") 36 to produce a 257-component frequency spectrum 40. The frequency components of each frequency spectrum 40 are equally spaced in a range from 0 Hz to 8 kHz (half of the 16 kHz sampling frequency). Frequency spectra 40 are then input to both voiced segment detector 30 and gain control logic 32.
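The windowing and transform steps can be sketched the same way, reusing the frames helper above. Plain concatenation is assumed here; the text does not specify a taper:

```python
import numpy as np

OVERLAP = 192   # the last 12 ms (192 samples) of the preceding frame

def windowed_spectra(signal: np.ndarray):
    """Yield the 257-component spectrum of each 512-sample windowed frame."""
    prev_tail = np.zeros(OVERLAP)
    for frame in frames(signal):                        # 320-sample frames
        windowed = np.concatenate([prev_tail, frame])   # 192 + 320 = 512
        prev_tail = frame[-OVERLAP:]
        # The real FFT of 512 samples yields 257 components, 0 Hz to 8 kHz.
        yield np.fft.rfft(windowed)
```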
Referring also to Fig. 3, voiced segment detector 30 works by determining whether sequences of windowed frames on line 38 resulting in frequency spectra 40 contain periodic signals and whether the periodicity of the periodic signals remains relatively constant over the sequence of windowed frames on line 38. When windowed frames on line 38 meet these criteria, voiced segment detector 30 signals that speech is present through the speech detection signal on line 44. While some non-speech sounds, such as musical instruments, meet these criteria, most non-speech sounds that occur in applications such as teleconferencing will not. Non-speech sounds occurring in teleconferencing applications can be broadly categorized as either constant background noise having relatively constant spectra (e.g., noise produced by fans, computer drives, or electronic circuits) or intermittent noise having spectra that change in nature over time (e.g., finger snaps, footsteps, paper shuffling, or doors opening).
Referring also to Fig. 4, voiced segment detector 30 includes a background noise estimator 46 that analyzes each frequency spectrum 40 and produces a background noise estimate 42 that is an estimate of the average magnitude of each component of frequency spectrum 40 attributable to constant background noise. Background noise estimator 46 continually monitors the frequency spectra 40 and automatically updates background noise estimate 42 in response to changed conditions such as, for example, air conditioning fans turning on and off.
Background noise estimator 46 develops background noise estimate 42 using two approaches. In the first approach, referring to Fig. 4, a stationary estimator 92 generates a stationary estimate 98 by examining one-second intervals of frequency spectra 40 that include only constant background noise, if such intervals exist. In the second approach, a running minimum estimator 94 develops a running estimate 100 by examining ten-second intervals of frequency spectra 40 having unrestricted contents. An estimate selector 96 selects between stationary estimate 98 and running estimate 100 to produce background noise estimate 42.

Stationary estimator 92 looks for long sequences of frequency spectra 40 in which the spectral shape of each frequency spectrum 40 is substantially similar to that of the other frequency spectra 40, which indicates that the frequency spectra 40 contain only background noise. When stationary estimator 92 detects a sequence of frequency spectra 40 that meet this condition, stationary estimator 92 takes the average magnitude of each frequency component of the frequency spectra 40 in the central part of the sequence. Stationary estimator 92 excludes the frequency spectra 40 at the beginning and end of the sequence because those frequency spectra 40 potentially contain low-level speech components.
Stationary estimator 92 uses the procedure illustrated in Fig. 5 to generate stationary estimate 98. For each frequency spectrum 40, stationary estimator 92 first generates the average spectral shape of previous frequency spectra 40 (step 102). The average spectral shape, a simplified summary of the frequency spectrum 40, includes a numerical value for each of the eight 1000 Hz frequency bands of frequency spectrum 40 and is generated according to equation 1:

$$N_i(F_c) = 0.25 \sum_{F=F_c-4}^{F_c-1} \; \sum_{k=k_i}^{k_i+31} \left( R^2(k,F) + I^2(k,F) \right) \tag{1}$$

where F designates a frequency spectrum 40, F_c designates the current frequency spectrum 40, i denotes a 1000 Hz frequency band, k_i = 32i, k indexes the frequency components of a frequency spectrum 40, and R(k,F) and I(k,F) are the real and imaginary components of the kth frequency component of a frequency spectrum 40.
Next, stationary estimator 92 generates the spectral shape of frequency spectrum 40 (step 104) according to equation 2:

$$S_i(F_c) = \sum_{k=k_i}^{k_i+31} \left( R^2(k,F_c) + I^2(k,F_c) \right) \tag{2}$$

where F_c designates the current frequency spectrum 40, i denotes a 1000 Hz frequency band, k_i = 32i, k indexes the frequency components of frequency spectrum 40, and R(k,F_c) and I(k,F_c) are the real and imaginary components of the kth frequency component of frequency spectrum 40.
Next, stationary estimator 92 compares the spectral shape of frequency spectrum 40 to the average spectral shape of previous frequency spectra 40 using a lower threshold (step 106). This comparison determines whether the frequency spectrum 40 differs from the average of the previous frequency spectra 40 by more than the lower threshold and is made according to equations 3 and 4:

$$N_i(F_c) > t_l \, S_i(F_c) \tag{3}$$

$$S_i(F_c) > t_l \, N_i(F_c) \tag{4}$$

where F_c designates the current frequency spectrum 40, i = 0, 1, ..., 7, and t_l is the lower threshold. If equation 3 or equation 4 is true for more than four values of i (step 108), then stationary estimator 92 classifies frequency spectrum 40 as being sufficiently different from previous frequency spectra 40 that it includes a signal other than background noise. From this, stationary estimator 92 determines that no stationary estimate can be developed (step 120). Otherwise, stationary estimator 92 compares the spectral shape of frequency spectrum 40 to the average spectral shape of previous frequency spectra 40 using an upper threshold (step 110) according to equations 5 and 6:
$$N_i(F_c) > t_u \, S_i(F_c) \tag{5}$$

$$S_i(F_c) > t_u \, N_i(F_c) \tag{6}$$

where F_c designates the current frequency spectrum 40, i = 0, 1, ..., 7, and t_u is the upper threshold. If either equation 5 or equation 6 is true for one or more values of i (step 112), then stationary estimator 92 classifies frequency spectrum 40 as having a signal and determines that no stationary estimate can be developed (step 120). Otherwise, stationary estimator 92 classifies frequency spectrum 40 as being a noise spectrum (step 114).
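The band-by-band bookkeeping of equations 1 through 6 is compact enough to sketch directly in code. A minimal NumPy rendering follows; the threshold values are illustrative assumptions, since the patent does not state them, and the helper names are mine:

```python
import numpy as np

N_BANDS, BAND_BINS = 8, 32      # eight 1000 Hz bands of 32 frequency bins

def band_energies(spectrum: np.ndarray) -> np.ndarray:
    """S_i of equation 2: per-band sums of R^2 + I^2 over bins 0..255."""
    power = np.abs(spectrum[:N_BANDS * BAND_BINS]) ** 2
    return power.reshape(N_BANDS, BAND_BINS).sum(axis=1)

def is_noise_spectrum(prev_energies: list, current: np.ndarray,
                      t_lower: float = 1.6, t_upper: float = 4.0) -> bool:
    """Steps 102-114: True when `current` looks like pure background noise.

    prev_energies holds the band energies of the preceding spectra, which
    equation 1 averages; t_lower and t_upper are illustrative values only.
    """
    eps = 1e-12                                      # guard against zero bands
    n_i = np.mean(prev_energies, axis=0)             # N_i, equation 1
    s_i = band_energies(current)                     # S_i, equation 2
    ratio = np.maximum((n_i + eps) / (s_i + eps),    # direction-free band
                       (s_i + eps) / (n_i + eps))    # ratio, equations 3-6
    if np.count_nonzero(ratio > t_lower) > 4:        # eqs 3-4 (step 108)
        return False                                 # differs in most bands
    if np.any(ratio > t_upper):                      # eqs 5-6 (step 112)
        return False                                 # differs strongly in one
    return True                                      # noise spectrum (step 114)
```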
If frequency spectrum 40 is a noise spectrum, and the forty-nine or more previous frequency spectra 40 were classified as noise spectra (where fifty consecutive noise spectra correspond to one second of noise) (step 116), stationary estimator 92 develops and outputs stationary estimate 98 (step 118). Stationary estimator 92 does so by averaging the tenth through the forty-first spectra of the fifty frequency spectra 40 according to equation 7:

$$B_k = \frac{1}{32} \sum_{F=F_s}^{F_s+31} \left( R^2(k,F) + I^2(k,F) \right) \tag{7}$$

where k = 0, 1, 2, ..., 255, F designates a frequency spectrum 40, F_s indicates the tenth frequency spectrum 40, and R(k,F) and I(k,F) are the real and imaginary components of the kth frequency component of a frequency spectrum 40. Each B_k designates a component of stationary estimate 98.

Referring again to Fig. 4, running minimum estimator 94 generates running estimate 100 by finding, for each frequency component of frequency spectra 40, the average value of the frequency components of the eight consecutive frequency spectra 40 that produce the minimum average value over the selected time duration. Put another way, for each frequency component k of the 500 frequency spectra included in a ten-second interval, running minimum estimator 94 finds the F_k that minimizes M_k(F_k) of equation 8:
$$M_k(F_k) = \frac{1}{8} \sum_{F=F_k}^{F_k+7} \left( R^2(k,F) + I^2(k,F) \right) \tag{8}$$

where F_k is any frame number occurring within the ten-second interval. Note that, because sounds other than background noise generally will not occur across the entire frequency spectrum at the same time, the F_k that minimizes equation 8 will take on different values for different values of k.
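A direct, unoptimized rendering of equation 8 is shown below, assuming the squared magnitudes of a ten-second interval have been collected into one array; the array name and shape handling are illustrative:

```python
import numpy as np

def running_minimum_estimate(power_history: np.ndarray) -> np.ndarray:
    """Equation 8: per-bin minimum over all eight-spectrum averages.

    power_history holds squared magnitudes, one row per frequency spectrum
    (e.g. shape (500, 257) for a ten-second interval at 50 spectra/second).
    """
    n_frames = power_history.shape[0]
    # Average each run of eight consecutive spectra, separately per bin.
    window_means = np.stack([power_history[f:f + 8].mean(axis=0)
                             for f in range(n_frames - 7)])
    # The minimizing start frame F_k is chosen independently for each bin k.
    return window_means.min(axis=0)
```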
Because stationary estimate 98 is more accurate than running estimate 100, estimate selector 96 sets background noise estimate 42 equal to stationary estimate 98, if a recent stationary estimate 98 is available.
When a recent stationary estimate 98 is unavailable, such as where speech or intermittent noise is never absent for more than a second or the background noise itself is never constant in spectral shape, estimate selector 96 sets background noise estimate 42 equal to running estimate 100 if two conditions are met. First, the time elapsed since estimate selector 96 last set background noise estimate 42 equal to stationary estimate 98 must be more than ten seconds. Second, the difference, D, between the background noise estimate 42 and the new running estimate 100 must exceed a predefined threshold. The difference, D, a sum of the squares of the relative difference between each frequency component of the background noise estimate 42 and its corresponding frequency component in the running estimate 100, is defined according to equation 9:
$$D = \sum_{k=0}^{255} \left( \frac{N_k - M_k}{\max(N_k, M_k)} \right)^2 \tag{9}$$

where the max function returns the maximum of its two arguments, N_k are the frequency components of background noise estimate 42, and M_k are the frequency components of running estimate 100.
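In code, the test for adopting a new running estimate reduces to a few lines. This is a sketch: the normalization by the larger of the two components follows the prose above, and the threshold value is an illustrative assumption:

```python
import numpy as np

D_THRESHOLD = 2.0   # illustrative; the patent does not give the value

def should_replace_estimate(noise_est: np.ndarray,
                            running_est: np.ndarray) -> bool:
    """Equation 9: sum of squared relative differences between estimates."""
    rel = (noise_est - running_est) / np.maximum(noise_est, running_est)
    return float(np.sum(rel ** 2)) > D_THRESHOLD
```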
Referring again to Fig. 3, a signal versus noise detector 48 compares each frequency spectrum 40 with a corresponding background noise estimate 42. If frequency spectrum 40 is sufficiently greater than the corresponding background noise estimate 42, detector 48 determines that a signal other than constant background noise is present and transmits frequency spectrum 40 to a magnitude squaring and noise subtraction unit 50 for further processing. Otherwise, detector 48 determines that only constant background noise is present and transmits a signal on line 58 that causes voiced segment detector 30 to signal that speech is not present through the speech detection signal on line 44. By transmitting the signal on line 58, detector 48 eliminates the need to further evaluate frequency spectrum 40.
If signal versus noise detector 48 determines that a frequency spectrum 40 resulting from a windowed frame on line 38 includes a signal other than constant background noise, voiced segment detector 30 must determine whether the signal is speech or intermittent noise. To do so, voiced segment detector 30 determines the periodicity of the signal and whether this periodicity is similar to the periodicities of previous windowed frames on line 38. Because intermittent noise generally lacks similar periodicity over time, voiced segment detector 30 designates a windowed frame on line 38 as containing speech upon detection of such similar periodicity.

Voiced segment detector 30 uses a technique known as autocorrelation to detect and estimate the periodicity of a windowed frame on line 38. A central theorem of signal processing is that convolution in the time domain is equivalent to multiplication in the frequency domain. Thus, the autocorrelation of a windowed frame on line 38 (which is equivalent to the convolution of the windowed frame on line 38 with a time-reversed version of itself) is equivalent to multiplying the frequency spectrum 40 corresponding to the windowed frame on line 38 by the complex conjugate of the same frequency spectrum 40, and then taking the inverse fast Fourier transform ("IFFT") of the result of the multiplication.
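This equivalence is easy to verify numerically. A self-contained check, zero-padding the FFT here to sidestep the circular-convolution artifacts discussed below:

```python
import numpy as np

# A 100 Hz test tone sampled at 16 kHz: one 512-sample windowed frame.
x = np.sin(2 * np.pi * 100 * np.arange(512) / 16_000)

spec = np.fft.rfft(x, n=1024)                         # zero-padded FFT
acorr_fft = np.fft.irfft(spec * np.conj(spec))[:512]  # IFFT of |FFT|^2
acorr_direct = np.correlate(x, x, mode="full")[511:]  # time-domain definition

assert np.allclose(acorr_fft, acorr_direct)           # the two methods agree
```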
Magnitude squaring and noise subtraction unit 50 performs the first portion of the autocorrelation by squaring the magnitude of each component of frequency spectrum 40, which is equivalent to multiplying frequency spectrum 40 by the complex conjugate of itself. Thus, magnitude squaring and noise subtraction unit 50 generates the squared magnitudes S of frequency spectrum 40 using equation 10:
$$S_k = R_k^2 + I_k^2 \tag{10}$$
for k = 0, 1, 2, ..., 256, where each frequency component of frequency spectrum 40 includes a real component R_k and an imaginary component I_k. To reduce the effect of constant background noise on analysis of the periodicity of windowed frames on line 38, magnitude squaring and noise subtraction unit 50 subtracts the magnitude-squared frequency components N of background noise estimate 42 from the magnitude-squared frequency components S of frequency spectra 40 using equation 11:

$$M_k = S_k - c\,N_k \tag{11}$$

for k = 0, 1, 2, ..., 256, where all M_k are non-negative, with those having negative values being set to zero, and c is a fixed constant equal to 1.56 in preferred embodiments (if c is too large, non-noise components of S will be eliminated; if c is too small, constant background noise will cause errors in periodicity detection and estimation). Magnitude squaring and noise subtraction unit 50 outputs the results of the subtraction as a magnitude-squared, noise-reduced frequency spectrum 60.
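As a sketch (the function name is mine), equations 10 and 11 amount to:

```python
import numpy as np

C = 1.56   # the fixed constant c from the text

def noise_reduced_power(spectrum: np.ndarray,
                        noise_power: np.ndarray) -> np.ndarray:
    """Equations 10 and 11: squared magnitudes minus scaled noise power."""
    s = np.abs(spectrum) ** 2          # S_k = R_k^2 + I_k^2   (equation 10)
    m = s - C * noise_power            # M_k = S_k - c N_k     (equation 11)
    return np.maximum(m, 0.0)          # negative M_k are set to zero
```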
Periodic components of the signal result in magnitude peaks in certain components of frequency spectrum 60. A high pass filter 52 further emphasizes these peaks by eliminating slowly changing components of frequency spectrum 60. (The use of a high pass filter 52 is analogous to "whitening" frequency spectrum 60, a technique that has been found to be a useful preprocessing step before performing autocorrelation in pitch estimation.) High pass filter 52 operates according to equation 12:
$$H_k = M_k - 0.5\,(M_{k-2} + M_{k+2}) \tag{12}$$

for k = 2, 3, ..., 254, where H_0, H_1, H_255, and H_256 are set equal to zero, and all H_k are non-negative, with those having negative values being set to zero. Using an inverse fast Fourier transform ("IFFT") 54, the output 62 of high pass filter 52 is transformed into a time domain signal 64 having 512 samples. IFFT 54 is taken on output 62 with each H_k component from equation 12 representing the real part of the kth frequency component of output 62, and the imaginary part of the kth frequency component being zero. The output 64 of IFFT 54 approximates an autocorrelation of the periodic component of a windowed frame on line 38. Output 64 is only an approximation because zeroes were not appended to the windowed frame on line 38, before taking FFT 36, to correct for circular convolution artifacts. However, windowing unit 34 substantially reduces the circular convolution artifacts by combining each frame of the digital signal on line 20 with a portion from the end of the immediately preceding frame of the digital signal on line 20, and thereby eliminates the need for appending zeroes. This, in turn, eliminates a significant computational burden that would have resulted from appending the zeroes.
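The filter and inverse transform are likewise short. A sketch, where m is the 257-component, noise-reduced spectrum 60 from the previous step:

```python
import numpy as np

def autocorrelation_estimate(m: np.ndarray) -> np.ndarray:
    """Equation 12 followed by IFFT 54: a 512-sample autocorrelation."""
    h = np.zeros_like(m)                 # H_0, H_1, H_255, H_256 stay zero
    # H_k = M_k - 0.5 (M_{k-2} + M_{k+2}) for k = 2..254
    h[2:255] = m[2:255] - 0.5 * (m[0:253] + m[4:257])
    h = np.maximum(h, 0.0)               # negative H_k are set to zero
    # H is treated as a real, zero-phase spectrum; invert to 512 samples.
    return np.fft.irfft(h, n=512)
```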
Referring also to Fig. 6, speech detection logic 56 generates the speech detection signal on line 44.
First, speech detection logic 56 examines the signal on line 58 from detector 48 to determine whether frequency spectrum 40 contains just constant background noise or possibly contains speech (step 66). If frequency spectrum 40 contains only background noise, speech detection logic 56 declares no voiced segment and sets the speech detection signal on line 44 accordingly (step 80).
If it is possible that frequency spectrum 40 contains speech, speech detection logic 56 finds the maximum average peak of output 64 for lags from 70 to 220 samples, which corresponds to the range of human pitch (step 68). To find the maximum average peak for the appropriate lags, speech detection logic 56 generates the average magnitude, for each lag, of all pairs of samples spaced by between 70 and 220 samples (the lag). Speech detection logic 56 then selects the maximum of these average magnitudes.
Next, speech detection logic 56 divides the selected maximum average magnitude by the maximum magnitude of all of the samples of output 64 and examines the result (step 70). If the ratio of these two magnitudes is less than or equal to a predetermined value, 0.7 in the illustrated embodiment, speech detection logic 56 declares no voiced segment and sets the speech detection signal on line 44 accordingly (step 80). If the ratio of the two magnitudes is greater than 0.7, speech detection logic 56 determines that output 64 is a periodic frame having a pitch period equal to the number of samples between the pair of samples having the maximum average amplitude.
Having declared output 64 to have a pitch period, speech detection logic 56 determines whether more than two of the previous ten outputs of IFFT 54 have had pitch (step 72). If so, speech detection logic 56 generates the standard deviation of the pitch periods of the previous ten outputs that have had pitch (step 74) and examines the generated standard deviation (step 76). If the standard deviation is less than a predetermined value, fifteen samples in the illustrated embodiment, an extended sound having consistent pitch in the range of human speech is present. In this case, speech detection logic 56 declares a voiced segment and sets the speech detection signal on line 44 accordingly (step 78). If the standard deviation is greater than or equal to the predetermined value, or if two or fewer of the previous ten outputs of IFFT 54 have had pitch, speech detection logic 56 declares no voiced segment and sets the speech detection signal on line 44 accordingly (step 80).
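The decision rules of steps 68 through 80 can be sketched as follows. This is a simplification, not the patent's exact procedure: the "maximum average peak" search reads the autocorrelation values directly, and the ten-frame history is kept in a module-level deque:

```python
import numpy as np
from collections import deque

MIN_LAG, MAX_LAG = 70, 220          # human pitch range at 16 kHz sampling
PEAK_RATIO = 0.7                    # ratio threshold from the text
PITCH_STD_LIMIT = 15.0              # fifteen samples
recent_pitches = deque(maxlen=10)   # pitch (or None) of the last ten frames

def detect_voiced(acorr: np.ndarray) -> bool:
    """Steps 68-80: True when consistent pitch spans recent frames."""
    lags = np.arange(MIN_LAG, MAX_LAG + 1)
    peak_lag = int(lags[np.argmax(acorr[lags])])
    ratio = acorr[peak_lag] / np.max(np.abs(acorr))
    recent_pitches.append(peak_lag if ratio > PEAK_RATIO else None)
    pitched = [p for p in recent_pitches if p is not None]
    # Voiced only if more than two recent frames had pitch (step 72) and
    # that pitch was consistent to within fifteen samples (steps 74-76).
    return len(pitched) > 2 and float(np.std(pitched)) < PITCH_STD_LIMIT
```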
Referring now to Figs. 1, 2, and 7, voice-activated automatic gain control 12, through gain control logic 32, controls the gain of amplifier 14. First, gain control logic 32 determines the peak energy of the digital signal on line 20: the energy of the frame, from the previous two seconds of the digital signal on line 20, that has the maximum energy of any of those frames (step 82). If the peak energy is greater than that necessary for analog audio output signal 26 to be at a suitable volume, gain control logic 32 sets the gain control signal on line 28 to reduce the gain of amplifier 14 without regard to the speech detection signal on line 44 (step 84). If the peak energy is much less than that necessary for analog audio output signal 26 to be at a suitable volume, and the speech detection signal on line 44 indicates that speech is present, gain control logic 32 sets the gain control signal on line 28 to increase the gain of amplifier 14 so that the speech component of analog audio output signal 26 will be at a suitable volume (step 86).
Gain control logic 32 further limits the gain of amplifier 14 to prevent the constant background noise component of analog audio output signal 26 from exceeding a suitable volume (step 88). Gain control logic 32 examines background noise estimate 42 and sets the gain control signal on line 28 to limit the gain of amplifier 14 accordingly. For example, if background noise estimate 42 indicates a high level of background noise, gain control logic 32 sets the gain control signal on line 28 so that amplifier 14 does not amplify the background noise above an acceptable volume. This limitation overrides any increase in gain indicated in step 86. Thus, even if audio output signal 26 includes a speech component and step 86 indicates that the gain of amplifier 14 should be increased to bring the speech component of analog audio output signal 26 to a suitable volume, the gain will not be increased if the increase would cause the volume of the background noise component of analog audio output signal 26 to exceed a suitable level.

Finally, if there is no speech for several seconds, as indicated by a failure of voiced segment detector 30 to indicate the presence of speech through the speech detection signal on line 44 for that period, gain control logic 32 sets the gain control signal on line 28 to reduce the gain of amplifier 14 so that the volume of any constant background noise component of analog audio output signal 26 is at a fairly low, unobtrusive level (step 90). The system of Figs. 1-7 therefore offers an improved approach to maintaining the volume of speech at the output of an audio channel at a relatively constant level and avoids the problem of pumping.
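Taken together, steps 82 through 90 form a small decision procedure. A condensed sketch, in which all numeric constants are illustrative assumptions (the patent gives only the qualitative rules):

```python
TARGET_LEVEL = 1.0e6     # desired peak energy at the output (illustrative)
NOISE_CEILING = 1.0e4    # maximum acceptable amplified noise (illustrative)
STEP = 1.05              # per-update gain change factor (illustrative)

def update_gain(gain: float, peak_energy: float, noise_energy: float,
                speech: bool, frames_since_speech: int) -> float:
    """One gain-control update following steps 82-90."""
    if peak_energy * gain > TARGET_LEVEL:              # step 84: too loud,
        return gain / STEP                             # back off regardless
    if speech and peak_energy * gain < TARGET_LEVEL:   # step 86: raise gain,
        raised = gain * STEP                           # but only on speech
        if noise_energy * raised <= NOISE_CEILING:     # step 88: noise limit
            return raised                              # overrides the raise
        return gain
    if frames_since_speech > 250:    # step 90: ~5 s of silence at 20 ms/frame
        return gain / STEP           # let background noise fade down
    return gain
```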
Other embodiments are within the following claims.

Claims

1. An apparatus for automatically controlling the volume of an output audio signal generated from an input audio signal that includes a background noise component, comprising:
an estimator that generates an estimated background noise component,
a detector that, using said estimated background noise component, signals when the input audio signal includes a desired component,
an amplifier that amplifies the input audio signal to produce the output audio signal and has a gain that is a ratio of the output audio signal to the input audio signal, and
a gain controller that only increases the gain of the amplifier when the detector signals that the input audio signal includes the desired component.
2. The apparatus of claim 1, wherein the desired component is human speech.
3. The apparatus of claim 1, wherein the detector compares the input audio signal to the estimated background noise component and does not signal that the input audio signal includes the desired component when the input audio signal is substantially equal to the estimated background noise component.
4. The apparatus of claim 1, wherein the detector subtracts the estimated background noise component from a representation of the input audio signal and examines characteristics of the result of the subtraction to determine whether the input audio signal includes the desired component.
5. The apparatus of claim 1, wherein the gain controller decreases the gain of the amplifier when the product of the loudness of the input audio signal and the gain is above a predetermined level.
6. The apparatus of claim 5, wherein the gain controller determines the loudness of the input audio signal by measuring the peak energy of the input audio signal.
7. The apparatus of claim 5, wherein the gain controller only increases the gain of the amplifier when the product of the loudness of the input audio signal and the gain is below a predetermined level and the detector signals that the input audio signal includes the desired component.
8. The apparatus of claim 1, wherein the gain controller decreases the gain of the amplifier when the product of the loudness of the background noise component and the gain is above a predetermined level.
9. The apparatus of claim 1, wherein the gain controller decreases the gain of the amplifier when, within a predetermined period, the detector does not signal that the input audio signal includes the desired component.
10. A method of automatically controlling the volume of an output audio signal generated from an input audio signal that includes a background noise component, comprising the steps of:
generating an estimated background noise component,
detecting, based in part on the estimated background noise component, whether the input audio signal includes a desired component, and
amplifying the input audio signal to produce the output audio signal, wherein gain is defined as a ratio of the output audio signal to the input audio signal, and the gain is only increased when the detecting step determines that the input audio signal includes the desired component.
11. The method of claim 10, wherein the desired component is human speech.
12. The method of claim 10, wherein the detecting step includes determining that the input audio signal does not include the desired component when the input audio signal is substantially equal to the estimated background noise component.
13. The method of claim 10, wherein the detecting step includes determining whether the input audio signal includes the desired component by examining characteristics of the result of subtracting the estimated background noise component from a representation of the input audio signal.
14. The method of claim 13, wherein the method includes the step of generating the representation of the input audio signal by taking an FFT of the input audio signal.
15. The method of claim 10, wherein the gain is decreased when the product of the loudness of the input audio signal and the gain is above a predetermined level.
16. The method of claim 15, wherein the loudness of the input audio signal is determined by measuring the peak energy of the input audio signal.
17. The method of claim 15, wherein the gain is only increased when the product of the loudness of the input audio signal and the gain is below a predetermined level and the input audio signal includes the desired component.
18. The method of claim 10, wherein the gain is decreased when the product of the loudness of the background noise component and the gain is above a predetermined level.
19. The method of claim 10, wherein the gain is decreased if the desired component is not present in the input audio signal for a predetermined period.
20. A method of detecting the presence of human speech in an input audio signal comprising the steps of:
windowing the input audio signal to divide the input audio signal into a plurality of frames;
for each of said frames:
performing an FFT to produce a representation of the frame, said representation including a fixed number of magnitude components,
squaring the magnitude components of the representation of the frame,
generating an estimate of a background noise component from the representation of the frame, said estimate including a fixed number of magnitude components,
squaring the magnitude components of the estimate,
subtracting the squared magnitude components of the estimate from the squared magnitude components of the representation of the frame,
filtering the results of the subtraction using a high pass filter,
performing an IFFT on the filtered results to produce an autocorrelation sequence, said autocorrelation sequence including a fixed number of sample points,
generating the maximum average amplitude of all pairs of sample points that are spaced by a predetermined range of sample points, and
dividing the generated maximum average amplitude by the maximum amplitude of all of the sample points in the autocorrelation sequence, and
when the result of the division is above a predetermined value, signalling that the frame is a periodic frame having a pitch equal to the number of sample points spacing the pair of sample points having the generated maximum average amplitude; and
when a frame is a periodic frame having pitch and at least a predetermined number of a predetermined set of previous frames have pitch, generating the standard deviation of the pitches of all of the frames of the predetermined set that have pitch, and, if the standard deviation is below a predetermined number of sample points, signalling that human speech is present.
PCT/US1994/006281 1993-07-07 1994-06-03 Voice-activated automatic gain control WO1995002239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US8805293A 1993-07-07 1993-07-07
US08/088,052 1993-07-07

Publications (1)

Publication Number Publication Date
WO1995002239A1 (en) 1995-01-19

Family

ID=22209115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1994/006281 WO1995002239A1 (en) 1993-07-07 1994-06-03 Voice-activated automatic gain control

Country Status (1)

Country Link
WO (1) WO1995002239A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4637402A (en) * 1980-04-28 1987-01-20 Adelman Roger A Method for quantitatively measuring a hearing defect
US4696040A (en) * 1983-10-13 1987-09-22 Texas Instruments Incorporated Speech analysis/synthesis system with energy normalization and silence suppression
US5014318A (en) * 1988-02-25 1991-05-07 Fraunhofer Gesellschaft Zur Forderung Der Angewandten Forschung E. V. Apparatus for checking audio signal processing systems
US5293588A (en) * 1990-04-09 1994-03-08 Kabushiki Kaisha Toshiba Speech detection apparatus not affected by input energy or background noise levels
US5157760A (en) * 1990-04-20 1992-10-20 Sony Corporation Digital signal encoding with quantizing based on masking from multiple frequency bands
US5293450A (en) * 1990-05-28 1994-03-08 Matsushita Electric Industrial Co., Ltd. Voice signal coding system
US5146504A (en) * 1990-12-07 1992-09-08 Motorola, Inc. Speech selective automatic gain control

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0786920A1 (en) * 1996-01-23 1997-07-30 Koninklijke Philips Electronics N.V. Transmission system of correlated signals
CN102654420A (en) * 2012-05-04 2012-09-05 惠州市德赛汽车电子有限公司 Volume curve automatic test method and system thereof
WO2014043024A1 (en) * 2012-09-17 2014-03-20 Dolby Laboratories Licensing Corporation Long term monitoring of transmission and voice activity patterns for regulating gain control
US9521263B2 (en) 2012-09-17 2016-12-13 Dolby Laboratories Licensing Corporation Long term monitoring of transmission and voice activity patterns for regulating gain control
CN109817237A (en) * 2019-03-06 2019-05-28 小雅智能平台(深圳)有限公司 A kind of audio automatic processing method, terminal and computer readable storage medium

Similar Documents

Publication Publication Date Title
US11962279B2 (en) Audio control using auditory event detection
JP4279357B2 (en) Apparatus and method for reducing noise, particularly in hearing aids
JP3626492B2 (en) Reduce background noise to improve conversation quality
KR100860805B1 (en) Voice enhancement system
US8165875B2 (en) System for suppressing wind noise
US8015002B2 (en) Dynamic noise reduction using linear model fitting
US20090254340A1 (en) Noise Reduction
US20120321095A1 (en) Signature Noise Removal
US20080208572A1 (en) High-frequency bandwidth extension in the time domain
CN109102823B (en) Speech enhancement method based on subband spectral entropy
US11183172B2 (en) Detection of fricatives in speech signals
JPH06208395A (en) Formant detecting device and sound processing device
WO1995002239A1 (en) Voice-activated automatic gain control
JP2003510665A (en) Apparatus and method for de-esser using adaptive filtering algorithm
JPH08221097A (en) Detection method of audio component
Chu Voice-activated AGC for teleconferencing
Dai et al. An improved model of masking effects for robust speech recognition system
JP3355473B2 (en) Voice detection method
JPH0424692A (en) Voice section detection system
WO2000072305A2 (en) A method and apparatus for noise reduction in speech signals
CN113963699A (en) Intelligent voice interaction method for financial equipment
Loizou et al. A MODIFIED SPECTRAL SUBTRACTION METHOD COMBINED WITH PERCEPTUAL WEIGHTING FOR SPEECH ENHANCEMENT

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WD Withdrawal of designations after international publication

Free format text: US

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA