WO2014010175A1

WO2014010175A1 - Encoding device and encoding method

Info

Publication number: WO2014010175A1
Application number: PCT/JP2013/003814
Authority: WO
Inventors: 江原　宏幸; 貴子堀; 押切　正浩
Original assignee: パナソニック株式会社
Priority date: 2012-07-09
Filing date: 2013-06-19
Publication date: 2014-01-16

Abstract

Provided are an encoding device and encoding method capable of improving the accuracy of determining whether a BGM signal is in a voice signal mode or a music signal mode and minimizing degradation in sound quality. A mode determination unit (201) determines whether an input signal is in a voice signal mode or music signal mode, and an energy calculation unit (202) calculates the average energy of the input signal included in a frame determined to be a music signal. A hangover length determination unit (203) increases the hangover length when the amount of calculated energy information is large, and decreases the hangover length when the amount of calculated energy information is small. A hangover unit (204) uses the mode information of the previous frame, the mode information of the present frame, and the determined hangover length and corrects the mode information for the present frame when a prescribed condition is satisfied.

Description

Encoding apparatus and encoding method

The present invention relates to an encoding device and an encoding method for encoding a voice signal and a music signal.

Currently, 3GPP (3rd Generation Partnership Project) is standardizing EVS (Enhanced Voice Service), a speech codec suitable for EPS (Evolved Packet System). In Non-Patent Document 1, the requirements for EVS are determined in consideration of recent telephone services using portable terminals. For example, it is assumed that the hold tone of a mobile terminal is voice accompanied by music, or that voice guidance with BGM from a call center is listened to on a mobile terminal, so music is also required to be reproduced with good quality.

As a technique for reproducing music with good quality, Patent Document 1 discloses a method in which VOX (Voice Operated Transmission) control is turned off when a hold is instructed in a portable terminal using VSELP (Vector Sum Excited Linear Prediction). Patent Document 2 discloses a method in which, when music is provided to a mobile terminal, the mobile terminal stores in advance a plurality of sound source files that differ in the number of chords (the number of frequencies transmitted simultaneously) and selects a sound source file according to the codec in use.

In addition, a technology that switches, in units of frames and according to the input signal, between a speech encoding unit designed for encoding speech signals and a music encoding unit designed for encoding music signals is known as a promising technique capable of encoding both speech and music signals with high sound quality. It is represented by the unified speech / music coding system (USAC: "Unified Speech and Audio Coding") disclosed in Non-Patent Document 2. Because the coding method is switched in units of frames, the performance of the switching method has a great influence on the sound quality. The frame length varies depending on the encoding method, but 20 msec is used in many cases.

On the other hand, a method for performing voice / music determination in units of frames is standardized by ITU-T as G.720.1 (or GSAD: "Generic Sound Activity Detector") (see Non-Patent Document 3). In GSAD, voice / music determination is performed using feature parameters for each frame, but the determination result can become unstable, and frequent switching between voice and music may occur. To avoid such frequent switching, a technique called hangover is applied. Hangover forcibly reuses the determination result selected in the previous frame a specified number of times, so that frequent switching is avoided.

Patent Document 1: Japanese Patent No. 2983829
Patent Document 2: Japanese Patent No. 4507822

Hangover is an effective technique for avoiding frequent voice / music switching, but there is a problem when making a voice call in an environment where music is flowing in the background. Hereinafter, such a signal is referred to as a BGM signal. In the BGM signal, music flows in the non-voice section, and music is superimposed on the voice in the voice section.

FIG. 1 shows a BGM signal, the signal component of each frame (voice or music), and the determination result when the hangover value is 2 (that is, the previous determination result is forcibly used for 2 frames). As shown in FIG. 1, when the signal component switches, the determination result does not switch immediately because of the hangover. As a result, erroneous determinations occur and the sound quality deteriorates.

This phenomenon is not a problem when the background is silent (clean signal) or when background noise is present (background noise signal), because such signals are usually determined to be voice. In contrast, music playing in the background is determined to be music, so the hangover problem described above is peculiar to the BGM signal.

An object of the present invention is to provide an encoding device and an encoding method that improve the accuracy of voice / music determination for a BGM signal and suppress deterioration in sound quality.

The encoding apparatus according to the present invention adopts a configuration including: processing delay determination means for determining a delay time allowed for encoding processing of an input signal; mode determination means for determining, for each predetermined section, whether the input signal is in a voice signal mode or a music signal mode; hangover length determination means for determining a hangover length according to the delay time; hangover means for comparing the mode of the previous section determined by the mode determination means with the mode of the current section, and determining the mode of the current section using the result of the comparison and the hangover length; and encoding means for encoding the input signal by an encoding method according to the determined mode.

The encoding method of the present invention includes: a processing delay determination step of determining a delay time allowed for encoding processing of an input signal; a mode determination step of determining, for each predetermined section, whether the input signal is in a voice signal mode or a music signal mode; a hangover length determination step of determining a hangover length according to the delay time; a hangover step of comparing the mode of the previous section determined in the mode determination step with the mode of the current section, and determining the mode of the current section using the result of the comparison and the hangover length; and an encoding step of encoding the input signal by an encoding method according to the determined mode.

According to the present invention, it is possible to improve the accuracy of voice / music determination for a BGM signal and suppress deterioration in sound quality.

FIG. 1 is a diagram showing misjudgment in voice / music determination caused by hangover.
FIG. 2 is a block diagram showing the configuration of an encoding apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the internal configuration of the input signal determination unit shown in FIG. 2.
FIG. 4 is a diagram showing the correspondence between energy information and hangover length.
FIG. 5 is a block diagram showing the internal configuration of the hangover unit shown in FIG. 3.
FIG. 6 is a diagram showing the effect of the encoding apparatus according to Embodiment 1 of the present invention.
FIG. 7 is a block diagram showing the configuration of an encoding apparatus according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing the processing procedure of the processing delay determination unit shown in FIG. 7.
FIG. 9 is a block diagram showing the internal configuration of the encoding unit shown in FIG. 7.
FIG. 10 is a diagram showing the correspondence between processing delay and hangover length.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. However, in the embodiment, components having the same function are denoted by the same reference numerals, and redundant description is omitted.

(Embodiment 1)
FIG. 2 is a block diagram showing the configuration of encoding apparatus 100 according to Embodiment 1 of the present invention. Hereinafter, the configuration of the encoding apparatus 100 will be described with reference to FIG. 2.

The input buffer 101 outputs the input signal to the input signal determination unit 102, temporarily stores the input signal, and outputs it to the mode switching unit 103.

The input signal determination unit 102 determines whether the input signal output from the input buffer 101 is a voice signal or a music signal, and outputs the determination result to the mode switching unit 103 and the output selection unit 105 as mode information. Details of the input signal determination unit 102 will be described later.

Based on the mode information output from the input signal determination unit 102, the mode switching unit 103 connects its changeover switch to encoding mode 1 or encoding mode 2 of the encoding unit core 104, and outputs the input signal received from the input buffer 101 to the connected encoding mode. Specifically, the mode switching unit 103 connects the changeover switch to encoding mode 1 when the mode information indicates a voice signal, and to encoding mode 2 when the mode information indicates a music signal.

The encoding unit core 104 includes encoding modes 1 and 2. Encoding mode 1 is an encoding method suitable for a voice signal, for example G.729, and encoding mode 2 is an encoding method suitable for a music signal, such as MP3 (MPEG Audio Layer-3) or AAC (Advanced Audio Coding). The encoding unit core 104 encodes the input signal output from the input buffer 101 in encoding mode 1 or encoding mode 2, and outputs the encoded information to the output selection unit 105. Information indicating which of encoding mode 1 and encoding mode 2 was used may be output as part of the encoded information. In that case, decoding can be performed from the encoded information alone.

Based on the mode information output from the input signal determination unit 102, the output selection unit 105 connects to encoding mode 1 or encoding mode 2 of the encoding unit core 104, and takes the encoded information output from the connected encoding mode as the output of the encoding apparatus 100.
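The per-frame switching performed by the mode switching unit 103, the encoding unit core 104, and the output selection unit 105 can be sketched as follows. This is only an illustrative sketch: the function and constant names are hypothetical, the two encoders are placeholder stubs rather than real G.729, MP3, or AAC implementations, and the one-byte mode flag is just one assumed way of carrying the optional mode indication as part of the encoded information.

```python
from typing import Callable, Dict, Sequence

VOICE, MUSIC = "voice", "music"

# Assumed numbering: encoding mode 1 for voice, encoding mode 2 for music.
MODE_ID: Dict[str, int] = {VOICE: 1, MUSIC: 2}

Encoder = Callable[[Sequence[float]], bytes]


def stub_voice_encoder(frame: Sequence[float]) -> bytes:
    """Placeholder for encoding mode 1 (not a real speech codec)."""
    return b"V" + bytes([len(frame)])


def stub_music_encoder(frame: Sequence[float]) -> bytes:
    """Placeholder for encoding mode 2 (not a real music codec)."""
    return b"M" + bytes([len(frame)])


ENCODERS: Dict[str, Encoder] = {VOICE: stub_voice_encoder, MUSIC: stub_music_encoder}


def encode_frame(frame: Sequence[float], mode: str,
                 encoders: Dict[str, Encoder] = ENCODERS) -> bytes:
    """Route one frame to the encoder selected by the mode information.

    The leading byte records which encoding mode was used, so that a
    decoder could pick the matching decoder from the encoded data alone.
    """
    payload = encoders[mode](frame)          # switch to mode 1 or mode 2
    return bytes([MODE_ID[mode]]) + payload  # selected output plus mode flag
```

For example, `encode_frame([0.0] * 4, MUSIC)` routes the frame to the music stub and prefixes the result with the mode flag 2.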

FIG. 3 is a block diagram showing the internal configuration of the input signal determination unit 102 shown in FIG. 2. Hereinafter, the internal configuration of the input signal determination unit 102 will be described with reference to FIG. 3. In FIG. 3, an input signal divided into frames of a predetermined time length (framed input signal) is input to the mode determination unit 201 and the energy calculation unit 202.

The mode determination unit 201 analyzes an input signal using an existing method to calculate a feature parameter, and determines whether the input signal is in a voice signal or music signal mode using the feature parameter. The mode determination unit 201 outputs a determination result (mode information) to the energy calculation unit 202 and the hangover unit 204.

The energy calculation unit 202 calculates the average energy (or energy information) of the input signal included in the frame determined by the mode determination unit 201 as the music signal, and outputs the energy information to the hangover length determination unit 203. As a method for calculating energy information, for example, the following processing is performed.

The energy calculation unit 202 has a buffer for storing the average energy of past frames determined to be music signals. When the current frame is determined to be a music signal, the energy calculation unit 202 updates the value stored in the buffer using the energy of the current frame. The update is performed according to the following equation (1).

Here, E_avg represents the average energy of past frames determined to be music signals, as stored in the buffer. E_n represents the energy of the signal contained in the current frame when the current frame is determined to be a music signal. Furthermore, α is a coefficient greater than or equal to 0 and less than 1 that controls the update rate; a value such as α = 0.95 is used, for example. If the current frame is determined to be a voice signal, the buffer is not updated. In this way, the average energy of the music signal can be calculated.

The energy calculation unit 202 outputs the average energy E_avg calculated in this way to the hangover length determination unit 203 as energy information.
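The buffer update described above can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch assumes a first-order recursive average of the form E_avg = α·E_avg + (1 − α)·E_n, which is consistent with α being an update-rate coefficient in the range 0 ≤ α < 1. The class name `MusicEnergyTracker`, the mean-square definition of frame energy, and the initial value 0.0 are also assumptions made for illustration.

```python
from typing import Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Mean-square energy of one frame (assumed definition of E_n)."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)


class MusicEnergyTracker:
    """Running average energy of frames judged to be music."""

    def __init__(self, alpha: float = 0.95, initial: float = 0.0) -> None:
        if not 0.0 <= alpha < 1.0:
            raise ValueError("alpha must satisfy 0 <= alpha < 1")
        self.alpha = alpha
        self.e_avg = initial  # buffer holding E_avg (assumed to start at 0)

    def update(self, samples: Sequence[float], is_music: bool) -> float:
        """Update E_avg only for music frames and return the current value."""
        if is_music:
            e_n = frame_energy(samples)
            # Assumed form of equation (1): recursive (exponential) average.
            self.e_avg = self.alpha * self.e_avg + (1.0 - self.alpha) * e_n
        # Voice frames leave the buffer untouched.
        return self.e_avg
```

With α close to 1, as in the α = 0.95 example, the average reacts slowly, so a single loud music frame changes the energy information only slightly.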

The hangover length determination unit 203 compares the energy information output from the energy calculation unit 202 with a predetermined threshold. If the energy information is larger than the threshold, it lengthens the hangover length and outputs it to the hangover unit 204. If the energy information is smaller than the threshold, it shortens the hangover length and outputs it to the hangover unit 204. As specific values, for example, as shown in FIG. 4, the hangover length is 2 frames when the energy information is large and 1 frame when the energy information is small.

As described above, when the energy information of the music signal in the non-voice section of the BGM signal is small, erroneous voice / music determination is less likely to occur and the determination accuracy improves. Therefore, by shortening the hangover length, erroneous voice / music determination caused by hangover can be reduced. On the other hand, when the energy information of the voice section (voice signal) in the BGM signal is large, erroneous voice / music determination is likely to occur and the determination accuracy deteriorates. Therefore, by lengthening the hangover length, erroneous voice / music determination can be reduced.
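The threshold comparison performed by the hangover length determination unit 203 can be sketched as follows. The function name and parameter names are hypothetical, the default lengths of 2 and 1 frames follow the FIG. 4 example, and the threshold value is left to the caller because the text does not give one. The text also does not say what happens when the energy equals the threshold; this sketch assumes the short length in that case.

```python
def hangover_length_from_energy(e_avg: float, threshold: float,
                                long_frames: int = 2,
                                short_frames: int = 1) -> int:
    """Return the hangover length (in frames) for the given energy information.

    Large music energy -> long hangover; small music energy -> short hangover.
    Equality with the threshold is treated as "small" (an assumption).
    """
    if e_avg > threshold:
        return long_frames
    return short_frames
```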

The hangover unit 204 stores the mode information determined in the previous frame. It determines and outputs the mode information of the current frame using the mode information of the previous frame, the mode information of the current frame output from the mode determination unit 201, and the hangover length output from the hangover length determination unit 203.

FIG. 5 is a block diagram showing the internal configuration of the hangover unit 204 shown in FIG. 3. Hereinafter, the internal configuration of the hangover unit 204 will be described with reference to FIG. 5. The storage unit 301 stores the mode information output from the hangover unit 204 in the previous frame, and outputs the mode information of the previous frame to the determination unit 302.

The determination unit 302 compares the mode information of the previous frame output from the storage unit 301 with the mode information of the current frame output from the mode determination unit 201. When the mode information of the previous frame matches the mode information of the current frame, the counter built into the determination unit 302 is reset to zero, and the switches 303 and 304 are set so that the path (B) becomes valid. The path (B) outputs the mode information from the mode determination unit 201 as it is. The mode information output from the mode determination unit 201 is therefore output from the hangover unit 204 without any processing.

On the other hand, if the mode information of the previous frame does not match the mode information of the current frame, the counter built into the determination unit 302 is incremented. The determination unit 302 then compares the counter value with the hangover length output from the hangover length determination unit 203. When the counter value is equal to or smaller than the hangover length, the determination unit 302 sets the switches 303 and 304 so that the hangover process is valid, that is, so that the path (A) is valid. The path (A) is a path in which the mode information output from the mode determination unit 201 is corrected by the mode information correction unit 305 and then output from the hangover unit 204. If the counter value exceeds the hangover length, the determination unit 302 resets the counter to zero and sets the switches 303 and 304 so that the path (B) is valid.

The mode information correction unit 305 operates only when the path (A) is valid. It replaces the mode information output from the mode determination unit 201 with the mode information of the previous frame stored in the storage unit 301, and outputs the result.

The mode information of the current frame output from the hangover unit 204 through the path (A) or the path (B) is stored in the storage unit 301, replacing the mode information stored so far, in preparation for the processing of the next frame.
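The behaviour of the hangover unit 204 described above can be sketched as a small state machine. The class and function names are hypothetical, and the handling of the very first frame (no stored mode yet, so the determined mode is passed through) is an assumption, since the text does not describe the initial state.

```python
from typing import List, Optional, Sequence

VOICE, MUSIC = "voice", "music"


class HangoverUnit:
    """Sketch of hangover unit 204: storage 301, counter of 302, paths (A)/(B)."""

    def __init__(self) -> None:
        self.prev_mode: Optional[str] = None  # storage unit 301
        self.counter = 0                      # counter in determination unit 302

    def process(self, current_mode: str, hangover_length: int) -> str:
        if self.prev_mode is None or current_mode == self.prev_mode:
            # Modes match (or first frame, an assumption): path (B), pass through.
            self.counter = 0
            output = current_mode
        else:
            self.counter += 1
            if self.counter <= hangover_length:
                # Path (A): correction unit 305 substitutes the previous mode.
                output = self.prev_mode
            else:
                # Hangover expired: reset the counter and use path (B).
                self.counter = 0
                output = current_mode
        self.prev_mode = output  # stored for the next frame
        return output


def apply_hangover(modes: Sequence[str], hangover_length: int) -> List[str]:
    """Run a whole sequence of per-frame decisions through one HangoverUnit."""
    unit = HangoverUnit()
    return [unit.process(m, hangover_length) for m in modes]


def count_mismatches(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of frames on which two mode sequences disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

As a usage example, take ten frames whose determined modes are three voice, four music, then three voice. With a hangover length of 2, `apply_hangover` delays each switch by two frames; with a hangover length of 1, each switch is delayed by one frame, so the number of frames that differ from the determined modes is halved.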

FIG. 6 is a diagram illustrating the effect of the encoding device 100 described above. In FIG. 6, the same BGM signal as in FIG. 1 is input. In FIG. 1 the hangover length is fixed to 2, whereas FIG. 6 shows the case where the energy information calculated by the energy calculation unit 202 is smaller than the threshold value, so that the hangover length is 1.

As can be seen from FIG. 6, the length of the misjudgment caused by hangover is half that of FIG. 1. For this reason, the encoding apparatus 100 described above can improve the sound quality.

As described above, according to the first embodiment, the energy of the input signal is calculated and, when the calculated energy is small, the hangover length is shortened so that voice / music determination can follow the signal in a short section. This improves the accuracy of voice / music determination for the BGM signal and suppresses deterioration in sound quality.

(Embodiment 2)
A voice signal is usually used in conversation with another party, that is, in bidirectional communication. Because conversation breaks down when the delay becomes long, a voice signal must be encoded with a short delay (hereinafter referred to as "low delay"). In addition, the characteristics of a voice signal change greatly in a relatively short time, for example across silent, unvoiced, and voiced intervals. Therefore, even if a long stretch of signal is stored in the encoding buffer and analyzed (that is, with a long delay, hereinafter referred to as "high delay"), the encoding efficiency is unlikely to improve. For these reasons, low delay is suitable for encoding voice signals.

On the other hand, unlike a voice signal, the characteristics of a music signal rarely change significantly in a short time. Encoding efficiency is therefore greatly improved by storing a long stretch of signal in the encoding analysis buffer and analyzing it. In addition, the main application of music signals is streaming, which transmits data in one direction from a server to a terminal, and one-way communication is less demanding on delay than two-way communication. For these reasons, high delay is suitable for encoding music signals.

Hereinafter, the second embodiment of the present invention will be described based on the characteristics of the voice signal and the music signal described above. FIG. 7 is a block diagram showing the configuration of encoding apparatus 400 according to Embodiment 2 of the present invention. Hereinafter, the configuration of encoding apparatus 400 will be described with reference to FIG. 7.

The user interface 401 is, for example, a keyboard or a touch panel. It outputs an input source switching signal, which selects the input source by switching on either the microphone 402 or the data storage unit 403, to the microphone 402, the data storage unit 403, and the input source specifying unit 404.

The microphone 402 inputs sound according to the input source switching signal output from the user interface 401, converts the input sound into a sound signal, and outputs the sound signal to the encoding unit 406. In addition, the data storage unit 403 stores data such as a holding tone or a message, and outputs the stored data to the encoding unit 406 in accordance with the input source switching signal output from the user interface 401.

The input source specifying unit 404 specifies an input source based on the input source switching signal output from the user interface 401, and outputs input source information indicating the specified input source to the processing delay determination unit 405.

The processing delay determination unit 405 determines the delay time allowed for the encoding process according to the input source information output from the input source specifying unit 404. Specifically, as shown in FIG. 8, when the input source is the microphone (ST501: YES), the processing delay determination unit 405 determines that the data to be encoded is data that requires bidirectional real-time processing, such as a voice call, and sets the processing delay to low delay (ST502). On the other hand, when the input source is the data storage unit 403 (ST501: NO), the processing delay determination unit 405 determines that the data to be encoded is data that does not require bidirectional real-time processing, such as a hold tone or a message, and sets the processing delay to high delay (ST503). The processing delay determination unit 405 outputs the determination result (delay information) to the encoding unit 406.
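The decision flow of FIG. 8 as described above can be sketched as follows. The enum and function names are hypothetical; only the mapping (microphone to low delay, data storage unit to high delay) comes from the text.

```python
from enum import Enum


class InputSource(Enum):
    MICROPHONE = "microphone"    # microphone 402
    STORED_DATA = "stored_data"  # data storage unit 403


class ProcessingDelay(Enum):
    LOW = "low"
    HIGH = "high"


def determine_processing_delay(source: InputSource) -> ProcessingDelay:
    """Return the delay information for the given input source."""
    if source is InputSource.MICROPHONE:  # ST501: YES
        # Bidirectional, real-time data such as a voice call.
        return ProcessingDelay.LOW        # ST502
    # ST501: NO, for example a hold tone or a stored message.
    return ProcessingDelay.HIGH           # ST503
```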

Based on the delay information output from the processing delay determination unit 405, the encoding unit 406 encodes the voice signal output from the microphone 402 or the data output from the data storage unit 403 by an encoding method suited to the characteristics of each, and outputs the encoded information.

FIG. 9 is a block diagram showing the internal configuration of the encoding unit 406 shown in FIG. 7. Hereinafter, the internal configuration of the encoding unit 406 will be described with reference to FIG. 9. The encoding unit 406 in FIG. 9 differs from the encoding device 100 in FIG. 2 in that the input signal determination unit 102 is removed, the mode switching unit 103 is changed to the mode switching unit 601, and the output selection unit 105 is changed to the output selection unit 602.

Based on the delay information output from the processing delay determination unit 405, the mode switching unit 601 connects its changeover switch to encoding mode 1 or encoding mode 2 of the encoding unit core 104, and outputs the input signal received from the input buffer 101 to the connected encoding mode. Specifically, the mode switching unit 601 connects to encoding mode 1 when the delay information indicates low delay, and to encoding mode 2 when the delay information indicates high delay. Encoding mode 1 is an encoding method suitable for a voice signal, for example G.729, and encoding mode 2 is an encoding method suitable for a music signal, such as MP3 or AAC.

Based on the delay information output from the processing delay determination unit 405, the output selection unit 602 connects to encoding mode 1 or encoding mode 2 of the encoding unit core 104, and takes the encoded information output from the connected encoding mode as the output of the encoding apparatus 400.

As described above, according to the second embodiment, the processing delay allowed for the encoding process is determined according to the input source, and the input signal is encoded by switching the encoding method according to the delay information indicating the determination result. This makes it possible to encode with high accuracy and to suppress deterioration in sound quality.

In the first embodiment, the case where the hangover length is controlled according to the energy information of the input signal has been described. However, the hangover length may also be controlled according to the processing delay allowed for the encoding process.

With high delay, the input signal can be stored in the buffer for a long time, so future data can be referenced. The voice / music determination performance itself is therefore improved. In this case a long hangover is unnecessary, and the hangover length is shortened. As a result, deterioration in sound quality due to hangover is avoided and the overall sound quality is improved.

On the other hand, with low delay, the input signal can be stored in the buffer only for a short time, so future data cannot be referenced and only current data can be used for analysis. In this case, because the voice / music determination performance is lower, the hangover length is lengthened to prevent frequent switching caused by the low-performance determination and to avoid deterioration of sound quality. FIG. 10 shows an example in which the hangover length is long (hangover length = 2) in the case of low delay and short (hangover length = 1) in the case of high delay.
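The correspondence between processing delay and hangover length in the FIG. 10 example can be sketched as a simple lookup. The table name, function name, and the string labels "low" and "high" are hypothetical; the lengths 2 and 1 are the example values given above and could be replaced by other values.

```python
from typing import Dict

# Example values from the FIG. 10 description: low delay -> 2, high delay -> 1.
HANGOVER_BY_DELAY: Dict[str, int] = {"low": 2, "high": 1}


def hangover_length_from_delay(delay: str) -> int:
    """Return the hangover length (in frames) for the given delay information."""
    try:
        return HANGOVER_BY_DELAY[delay]
    except KeyError:
        raise ValueError(f"unknown delay information: {delay!r}") from None
```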

FIGS. 4 and 10 describe the case where the hangover length is 1 to 2 frames, but the present invention is not limited to this. The hangover length may be increased when the frame length of the encoding method is short (for example, 10 msec or less), or when voice / music determination performance is poor because noise is superimposed on the input signal. Conversely, when the frame length of the encoding method is long (for example, 40 msec or more), the hangover length may be shortened.

Note that although cases have been described with the above embodiments as examples where the present invention is configured by hardware, the present invention can also be realized by software in cooperation with hardware.

Further, each functional block used in the description of each of the above embodiments is typically realized as an LSI which is an integrated circuit. These may be individually made into one chip, or may be made into one chip so as to include a part or all of them. The name used here is LSI, but it may also be called IC, system LSI, super LSI, or ultra LSI depending on the degree of integration.

Also, the method of circuit integration is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after manufacturing the LSI or a reconfigurable processor that can reconfigure the connection and setting of circuit cells inside the LSI may be used.

Furthermore, if integrated circuit technology that replaces LSI emerges as a result of advances in semiconductor technology or other derived technology, the functional blocks may naturally be integrated using that technology. Application of biotechnology is one such possibility.

The disclosure of the specification, drawings and abstract contained in the Japanese application of Japanese Patent Application No. 2012-153563 filed on July 9, 2012 is incorporated herein by reference.

The encoding device and the encoding method according to the present invention can be applied to a communication terminal such as a mobile phone having a call function, for example.

DESCRIPTION OF SYMBOLS
101 Input buffer
102 Input signal determination unit
103, 601 Mode switching unit
104 Encoding unit core
105, 602 Output selection unit
201 Mode determination unit
202 Energy calculation unit
203 Hangover length determination unit
204 Hangover unit
301 Storage unit
302 Determination unit
303, 304 Switch
305 Mode information correction unit
401 User interface
402 Microphone
403 Data storage unit
404 Input source specifying unit
405 Processing delay determination unit
406 Encoding unit

Claims

1. An encoding device comprising:
processing delay determination means for determining a delay time allowed for encoding processing of an input signal;
mode determination means for determining, for each predetermined section, whether the input signal is in a voice signal mode or a music signal mode;
hangover length determination means for determining a hangover length according to the delay time;
hangover means for comparing the mode of the previous section determined by the mode determination means with the mode of the current section, and determining the mode of the current section using the result of the comparison and the hangover length; and
encoding means for encoding the input signal by an encoding method according to the determined mode.

2. The encoding device according to claim 1, wherein the hangover length determination means lengthens the hangover length when the delay time indicates low delay, and shortens the hangover length when the delay time indicates high delay.

3. The encoding device according to claim 1, further comprising energy calculation means for calculating the energy of the input signal included in a section determined to be a music signal, wherein the hangover length determination means determines the hangover length according to the energy.

4. The encoding device according to claim 3, wherein the hangover length determination means lengthens the hangover length when the energy is large, and shortens the hangover length when the energy is small.

5. The encoding device according to claim 1, further comprising input source specifying means for specifying an input source of the input signal, wherein the processing delay determination means determines the delay time according to the input source.

6. The encoding device according to claim 5, wherein the input source is a microphone or stored data.

7. The encoding device according to claim 6, wherein the processing delay determination means determines that the delay time is low delay when the input source is a microphone, and determines that the delay time is high delay when the input source is stored data.

8. An encoding method comprising:
a processing delay determination step of determining a delay time allowed for encoding processing of an input signal;
a mode determination step of determining, for each predetermined section, whether the input signal is in a voice signal mode or a music signal mode;
a hangover length determination step of determining a hangover length according to the delay time;
a hangover step of comparing the mode of the previous section determined in the mode determination step with the mode of the current section, and determining the mode of the current section using the result of the comparison and the hangover length; and
an encoding step of encoding the input signal by an encoding method according to the determined mode.