EP2787503A1 - Method and system of audio signal watermarking


Info

Publication number
EP2787503A1
Authority
EP
European Patent Office
Prior art keywords
watermark
audio signal
profile
class
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP13162596.4A
Other languages
German (de)
French (fr)
Inventor
Andrea Abrardo
Mauro Barni
GianLuigi Ferrari
Davide Arconte
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Movym Srl
Original Assignee
Movym Srl
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Movym Srl filed Critical Movym Srl
Priority to EP13162596.4A
Publication of EP2787503A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018Audio watermarking, i.e. embedding inaudible data in the audio signal

Definitions

  • the present invention relates to a method and system of audio signal watermarking.
  • Audio signal watermarking is a process for embedding information data (watermark) into an audio signal without affecting the perceptual quality of the host signal itself.
  • the watermark should be imperceptible or nearly imperceptible to the Human Auditory System (HAS). However, the watermark should be detectable through an automated detection process.
  • EP 1 594 122 discloses a watermarking method and apparatus employing spread spectrum technology and a psycho-acoustic model.
  • with spread spectrum technology, a small baseband signal bandwidth is spread over a larger bandwidth by injecting or adding a higher-frequency signal, or spreading function. Thereby the energy used for transmitting the signal is spread over a wider bandwidth and appears as noise.
  • with the psycho-acoustic model, based on psycho-acoustical properties of the HAS, the watermark signal is shaped to reduce its magnitude so that its level is below a masking threshold of the host audio/video signal.
  • a spreading function is modulated by watermark data bits for providing a watermark signal; the current masking level of the audio/video signal is determined and a corresponding psycho-acoustic shaping of the watermark signal is performed; the psycho-acoustically shaped watermark signal is additionally shaped in order to reduce on average the magnitude of the watermark signal, whereby for each spectral line the phase of the values of the audio/video signal into which the psycho-acoustically and additionally shaped watermark signal is embedded is kept unchanged by the additional shaping; the psycho-acoustically and additionally shaped watermark signal is embedded into the audio/video signal.
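The sequence of steps above (spreading, modulation by the watermark bits, shaping, embedding, correlation-based detection) can be sketched in simplified form. The spreading factor, the fixed scaling constant `alpha` (a stand-in for the psycho-acoustic shaping step) and the PN seed are illustrative assumptions, not values from the patent:

```python
import random

def embed_ss(host, bits, spreading_factor=8, alpha=0.01, seed=42):
    """Spread each watermark bit over `spreading_factor` host samples with a
    pseudo-noise (PN) sequence of +/-1 chips; `alpha` stands in for the
    psycho-acoustic shaping of the watermark magnitude."""
    rng = random.Random(seed)
    pn = [rng.choice((-1, 1)) for _ in range(len(bits) * spreading_factor)]
    out = list(host)
    for i, bit in enumerate(bits):
        symbol = 1 if bit else -1
        for j in range(spreading_factor):
            k = i * spreading_factor + j
            out[k] += alpha * symbol * pn[k]
    return out

def extract_ss(watermarked, n_bits, spreading_factor=8, seed=42):
    """Detect each bit by correlating with the same PN sequence:
    a positive correlation decodes as 1, a negative one as 0."""
    rng = random.Random(seed)
    pn = [rng.choice((-1, 1)) for _ in range(n_bits * spreading_factor)]
    bits = []
    for i in range(n_bits):
        corr = sum(watermarked[i * spreading_factor + j]
                   * pn[i * spreading_factor + j]
                   for j in range(spreading_factor))
        bits.append(1 if corr > 0 else 0)
    return bits
```

In practice the host signal itself acts as noise in the correlation, which is why the spreading factor and the shaped watermark energy jointly determine robustness.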
  • a watermarking technique should achieve a trade-off between three basic features: imperceptibility, robustness and payload, which are strictly linked to each other by inverse relationships.
  • a watermarking technique should find a correct balance between the need to keep the watermark imperceptible; to make the watermark robust against attacks and manipulations of the host signal (e.g., noise distortion, A/D or D/A conversion, lossy coding, resizing, filtering, lossy compression); and to achieve the highest possible payload.
  • a psycho-acoustic model makes it possible to determine the maximum distortion, i.e. the maximum watermark signal energy that can be introduced into a host signal without being perceptible to human senses.
  • however, this model provides no information about robustness or payload, nor about optimizing the trade-off among imperceptibility, robustness and payload.
  • the Applicant found that the above objects are achieved by a method and system of audio signal watermarking wherein audio signals are classified based on their semantic content and watermarks are embedded into the audio signals by using watermark profiles selected on the basis of the classes assigned to the audio signals.
  • the Applicant found that, given an audio signal, the trade-off among watermark imperceptibility, robustness and payload can be optimized by fitting the watermark profile depending on the semantic content of the audio signal.
  • semantic content in relation to an audio signal refers to the audio type contained in the audio signal.
  • the semantic content of an audio signal can be, for example, speech (e.g. talks from movies, from TV or radio programs, from TV or radio advertisements, from TV or radio talk shows, and similar) or music.
  • the semantic content can be, for example, a musical genre (e.g., rock, classic, jazz, blues, instrumental, singing and similar).
  • the semantic content can be, for example, a tone of voice, a conversation, a single person speaking, a whisper, a loud quarrel, and similar.
  • the expression "watermark profile" is used to indicate a set of parameters used for embedding the watermark into the audio signal according to a predetermined watermarking technique.
  • the present disclosure relates to a method of watermarking an audio signal comprising:
  • the present disclosure relates to a system of watermarking an audio signal comprising an encoding device comprising:
  • the method and system of the present disclosure may have at least one of the following preferred features.
  • each watermark profile is associated with a corresponding class so that trade-off among watermark imperceptibility, robustness and payload is optimized for said class, depending on the watermark application.
  • one, two or all the features among imperceptibility, robustness and payload could be optimized for each class, by keeping unchanged the other feature(s), if any, among the classes.
  • robustness could be maximized for each class, by keeping the same payload and imperceptibility level among the classes.
  • payload could be maximized for each class, by keeping the same robustness and imperceptibility level among the classes.
  • imperceptibility could be maximized for each class, by keeping the same payload and robustness level among the classes.
  • the plurality of watermark profiles can relate to a single watermarking technique or to a plurality of watermarking techniques.
  • the watermark profiles differ from each other in the value taken by at least one parameter of said set of parameters.
  • the watermark profiles differ from each other in at least one of the parameters and/or in the values taken by at least one of the common parameters.
  • the watermarking technique(s) can be selected, for example, from the group comprising: spread spectrum watermarking technique, echo hiding watermarking technique, phase coding technique, informed watermarking schemes like QIM (Quantization Index Modulation) and Spread Transform Dither Modulation (STDM).
  • the method is a computer-implemented method.
  • the embedding unit can comprise a plurality of embedding sub-units.
  • Each embedding sub-unit can be configured to embed the watermark into the audio signal by using one watermark profile of said plurality of watermarking profiles or one watermarking technique of said plurality of watermarking techniques.
  • At least one parameter of said set of parameters defining the watermark profile may be selected, for example from the group comprising: watermark bit rate; frequency range hosting the watermark; Document to Watermark Ratio (DWR); watermark frame length; masking threshold modulation factor F, intended as a quantity by which the masking threshold of the audio signal -computed according to a psychoacoustic model- is multiplied to vary its amplitude with respect to the computed value; channel coding scheme (which may also include error detection techniques such as, for example, Cyclic Redundancy Check); number, amplitude, offset and decay rate of echo pulses (in case of echo hiding watermarking technique); spreading factor, intended as number of audio signal frequency or phase samples needed to insert one bit of watermark (in case of spread spectrum watermarking technique with, respectively, frequency or phase modulation).
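One way to make such a parameter set concrete is a plain record type. The field names, types and the example values in the comments are illustrative assumptions, not terminology fixed by the patent:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class WatermarkProfile:
    """A watermark profile: the parameter set used for embedding a watermark
    according to one watermarking technique (illustrative field names)."""
    technique: str                   # e.g. "spread_spectrum", "echo_hiding"
    bit_rate_bps: float              # watermark bit rate
    freq_range_hz: Tuple[int, int]   # frequency range hosting the watermark
    dwr_db: float                    # Document to Watermark Ratio
    frame_length: int                # watermark frame length, in samples
    masking_factor_f: float          # masking threshold modulation factor F
    channel_coding: str              # e.g. "crc16", "repetition-3"
    spreading_factor: int = 1        # samples per bit (spread spectrum only)
```

A frozen dataclass keeps profiles immutable, so the same profile object can safely be shared between the encoding and decoding sides.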
  • the plurality of classes is associated with the corresponding plurality of watermark profiles in a database.
  • the database can be internal or external to the encoding device.
  • the expression watermarking refers to digital watermarking.
  • Digital watermarking relates to a computer-implemented watermarking process.
  • a masking threshold of the audio signal according to a psychoacoustic model is computed.
  • the audio signal is split into time windows and a masking threshold is computed for each time window of the audio signal.
  • the masking threshold can be computed after, before or at the same time of the audio signal classification.
  • the psychoacoustic model can be a psychoacoustic model known in the art.
  • the psychoacoustic model is adapted to calculate the masking threshold in time and/or frequency domain and is based on one of the following analyses: block based FFT (Fast Fourier Transform), block based DCT (Discrete Cosine Transform), block based MDCT (Modified Discrete Cosine Transform), block based MCLT (Modified Complex Lapped Transform), block based STFT (Short-Time Fourier Transform), sub-band or wavelet packet analysis.
  • the encoding device comprises a masking unit configured to perform the masking threshold computation and, optionally, the time windows splitting.
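A minimal sketch of the masking unit's orchestration, with the psychoacoustic model reduced to a pluggable placeholder: a flat threshold at a fixed fraction of each window's RMS level. The window length and the 0.1 fraction are arbitrary assumptions; a real model would return a frequency-dependent curve from a block-based FFT/MDCT analysis:

```python
def split_windows(signal, window_len):
    """Split the signal into consecutive time windows
    (a trailing partial window is dropped)."""
    return [signal[i:i + window_len]
            for i in range(0, len(signal) - window_len + 1, window_len)]

def flat_threshold(window):
    """Placeholder psychoacoustic model: a flat threshold at a fixed
    fraction of the window's RMS level."""
    rms = (sum(x * x for x in window) / len(window)) ** 0.5
    return 0.1 * rms

def masking_thresholds(signal, window_len=512, model=flat_threshold):
    """One masking threshold per time window, as described above."""
    return [model(w) for w in split_windows(signal, window_len)]
```

Because the model is passed in as a callable, the same windowing code works for any of the analyses listed above.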
  • embedding the watermark into the audio signal comprises the step of shaping the energy of the watermark according to the computed masking threshold.
  • the watermark is shaped to reduce its energy below the computed masking threshold of the audio signal.
  • the set of parameters defining the obtained watermark profile comprises the masking threshold modulation factor F
  • the watermark is preferably shaped to reduce its energy below the computed masking threshold, multiplied by the masking threshold modulation factor F.
  • the watermark is shaped by the embedding unit.
  • the masking threshold modulation factor F is at least equal to 0.5. More preferably, the masking threshold modulation factor F is at least equal to 0.7, even more preferably at least equal to 0.8, even more preferably at least equal to 0.9.
  • the masking threshold modulation factor F is not higher than 1.5. More preferably, the masking threshold modulation factor F is not higher than 1.3, even more preferably not higher than 1.2, even more preferably not higher than 1.1.
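The shaping step with the modulation factor F can be sketched as a simple rescaling. Limiting the peak magnitude of the watermark window, rather than comparing per frequency bin, is an illustrative simplification:

```python
def shape_watermark(wm_window, threshold, factor_f=1.0):
    """Scale a watermark window so its peak magnitude stays below the
    computed masking threshold multiplied by the modulation factor F.
    F < 1 makes the watermark more imperceptible; F > 1 trades some
    audibility for robustness."""
    limit = threshold * factor_f
    peak = max(abs(x) for x in wm_window)
    if peak == 0.0 or peak <= limit:
        return list(wm_window)          # already below the limit
    scale = limit / peak
    return [x * scale for x in wm_window]
```

With the preferred ranges above (F between roughly 0.5 and 1.5), the shaped watermark never strays far from the psycho-acoustically computed threshold.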
  • Assigning the audio signal to a class, according to the semantic content of the audio signal, is preferably performed based upon analysis of at least one audio signal feature.
  • the audio signal feature is related to the semantic content of the audio signal.
  • the audio signal feature is preferably related to time, frequency, energy or cepstrum domain.
  • the at least one audio signal feature can be selected from the group comprising: loudness, brightness, beats per minute (BPM) bandwidth, pitch, odd-to-even harmonic energy ratio, spectral energy bands (e.g. spectrum sparsity), spectral and tonal complexities, spectral roll-off point (intended as any percentile of the power spectral distribution), spectral centroid (defined as the center of gravity of the magnitude spectrum), spectral "flux” (intended as squared difference between the normalized magnitudes of successive spectral distributions), time domain Zero-Crossing Rate, Cepstrum Resynthesis Residual Magnitude, Mel-Frequency Cepstral Coefficients, band periodicity (defined as the periodicity of a sub-band and derived by sub-band correlation analysis).
  • the analysis of at least one audio signal feature preferably comprises checking if the at least one audio signal feature meets one or more predetermined constraints. For example, the value of the at least one audio signal feature can be compared to one or more predetermined thresholds or one or more predetermined ranges of values. Each class is advantageously defined by predetermined constraints (e.g. a set of values) to be met by the at least one audio signal feature.
  • the audio signal is split into sub-signals of shorter duration and a class is assigned to each sub-signal independently from the other sub-signals.
  • the duration of the sub-signals is longer than the duration of the time windows in which the audio signal is split for performing masking threshold computation.
  • the method of audio signal watermarking comprises a decoding process comprising extraction of the watermark from the watermarked audio signal.
  • the watermark is extracted by using the same watermark profile used for embedding the watermark into the audio signal.
  • the system also comprises a decoding device configured to extract the watermark from the watermarked audio signal.
  • the watermarked audio signal is assigned to a class, among the plurality of classes, depending on the semantic content of the watermarked audio signal, the plurality of classes being associated with the corresponding plurality of watermark profiles.
  • the class is assigned by a classification unit of the decoding device.
  • the watermark profile associated with the class assigned to the audio signal is obtained and used for extracting the watermark from the watermarked audio signal.
  • the watermark is extracted from the watermarked audio signal by an extraction unit of the decoding device.
  • said plurality of watermark profiles are tried in sequence for extracting the watermark until the watermark is successfully extracted from the watermarked audio signal.
  • the decoding device can comprise a single extraction unit for trying in sequence the plurality of watermark profiles.
  • the extraction unit can comprise a plurality of sub-extraction units, one for each watermark profile or for each watermarking technique, for trying the plurality of watermark profiles at least partly in parallel.
  • audio signal classification is not necessary at the decoding side.
  • a second watermark is embedded into the audio signal, comprising the class assigned to the audio signal, by using a predefined watermark profile, common to all audio signals independently from their class.
  • the second watermark can be embedded into the watermarked audio signal, already watermarked with the first watermark.
  • the first and the second watermarks can be embedded into different sub-bands of the audio signal. Watermark extraction can then be performed by first extracting the second watermark from the watermarked audio signal (by using the common watermark profile) so as to retrieve the class of the audio signal, and then by obtaining the watermark profile associated with the retrieved class and extracting the watermark from the watermarked audio signal with the obtained watermark profile.
  • audio signal classification is not necessary at the decoding side.
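The two-watermark arrangement can be sketched at the protocol level, abstracting the signal processing away. The profile tables and class names below are hypothetical placeholders:

```python
# Hypothetical profile tables; names and values are illustrative.
COMMON_PROFILE = {"technique": "spread_spectrum", "spreading_factor": 16}
CLASS_PROFILES = {
    "low_sparse":    {"technique": "spread_spectrum", "bit_rate_bps": 40},
    "medium_sparse": {"technique": "spread_spectrum", "bit_rate_bps": 20},
    "high_sparse":   {"technique": "spread_spectrum", "bit_rate_bps": 10},
}

def encode(audio_class, payload):
    """Return the two (profile, data) embedding jobs for one audio item:
    the payload watermark with the class-specific profile, plus a second
    watermark carrying the class itself, embedded with the profile common
    to all classes (e.g. in a different sub-band)."""
    return [(CLASS_PROFILES[audio_class], payload),
            (COMMON_PROFILE, audio_class)]

def decode(jobs):
    """Decoder side: first recover the class via the common profile,
    then select the matching class profile to extract the payload."""
    audio_class = next(data for prof, data in jobs if prof == COMMON_PROFILE)
    profile = CLASS_PROFILES[audio_class]
    payload = next(data for prof, data in jobs if prof == profile)
    return audio_class, payload
```

The decoder never runs a classifier: the class travels inside the second watermark, which is exactly why classification becomes unnecessary at the decoding side.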
  • Figure 1 discloses a system 1 of audio signal watermarking according to an embodiment of the invention.
  • the system 1 comprises an encoding device 10 comprising an input 11 for an audio signal, an input 13 for a watermark and an output 15 for a watermarked audio signal.
  • the encoding device 10 comprises a classification unit 12, a watermark profile unit 14, a masking unit 18 and an embedding unit 16.
  • the classification unit 12, watermark profile unit 14, masking unit 18 and embedding unit 16 comprise hardware and/or software and/or firmware configured to implement the method of the present disclosure.
  • the classification unit 12 is configured to assign the audio signal to a class depending on the semantic content of the audio signal.
  • the classification unit 12 is configured to analyse at least one audio signal feature related to the semantic content of the audio signal, to compare the at least one audio signal feature with one or more constraints and to assign the audio signal a class, selected among a predetermined plurality of classes, depending on the result of the comparison.
  • the at least one audio signal feature can be selected from the group comprising: loudness, brightness, beats per minute (BPM) bandwidth, pitch, odd-to-even harmonic energy ratio, spectral energy bands (e.g. spectrum sparsity), spectral and tonal complexities, spectral roll-off point (intended as any percentile of the power spectral distribution), spectral centroid (defined as the center of gravity of the magnitude spectrum), spectral "flux” (intended as squared difference between the normalized magnitudes of successive spectral distributions), time domain Zero-Crossing Rate, Cepstrum Resynthesis Residual Magnitude, Mel-Frequency Cepstral Coefficients, band periodicity (defined as the periodicity of a sub-band and derived by sub-band correlation analysis).
  • the at least one audio signal feature can be compared with one or more predetermined thresholds or one or more predetermined ranges of values and each class can be defined by a predetermined set of values that can be taken by the at least one audio signal feature.
  • the plurality of classes can be stored in a suitable class database 17 internal (as shown in figure 1 ) or external to the encoding device 10.
  • the encoding device 10, before assigning the audio signal to a class, is configured to split the audio signal into sub-signals of shorter duration (e.g. from a few tenths of a second to a few tens of seconds) and the classification unit 12 is configured to classify each sub-signal independently from the other sub-signals.
  • the classification unit 12 is configured to classify the audio signals (or sub-signals) by analysing the spectrum sparsity of their energy spectrum.
  • the spectrum sparsity is an audio signal feature indicative of the energy concentration in a sub-band compared to the energy in the whole audio signal (or sub-signal) band.
  • the energy spectrum of the audio signal (or sub-signal) is considered sparse (or colored) if most of its energy is concentrated in a small spectrum sub-band, otherwise it is considered non-sparse (or noise-like).
  • figure 2 shows the energy spectrum of four audio signals having a different semantic content: speech, rock, jazz, piano solo.
  • three different classes can be defined by analyzing the spectrum sparsity feature and, in the example, by comparing the fraction of signal energy (normalized to the total energy) contained in the 0-1000 Hz sub-band with two threshold levels S_L and S_H. If said fraction of energy in the 0-1000 Hz sub-band is lower than S_L, the audio signal (or sub-signal) can be classified into a "low sparse" class; if said fraction of energy in the 0-1000 Hz sub-band is between S_L and S_H, the audio signal (or sub-signal) can be classified into a "medium sparse" class; if it is higher than S_H, the audio signal can be classified into a "high sparse" class.
  • the comparison with the threshold levels S_L and S_H can thus be significant in such a sub-band.
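The three-class sparsity decision can be sketched directly on a pre-computed energy spectrum. The threshold values S_L = 0.5 and S_H = 0.8 are arbitrary assumptions, since the document leaves them open:

```python
def sparsity_class(bin_energies, bin_width_hz, s_low=0.5, s_high=0.8,
                   cutoff_hz=1000.0):
    """Classify a signal by the fraction of spectral energy below
    `cutoff_hz`, compared against thresholds S_L and S_H.
    `bin_energies` is the magnitude-squared spectrum, one value per
    frequency bin of width `bin_width_hz`."""
    total = sum(bin_energies)
    if total == 0.0:
        return "low_sparse"          # degenerate silent input
    n_low = int(cutoff_hz / bin_width_hz)
    frac = sum(bin_energies[:n_low]) / total
    if frac < s_low:
        return "low_sparse"          # e.g. speech, rock: energy spread out
    if frac <= s_high:
        return "medium_sparse"       # e.g. jazz
    return "high_sparse"             # e.g. piano solo: energy concentrated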
  • the plurality of classes used by the classification unit 12 are associated with a corresponding plurality of watermark profiles in a suitable watermark profile database 19 internal (as shown in figure 1 ) or external to the encoding device 10.
  • database 17 and database 19 are shown in the figures as two distinct entities, they can also be implemented into a single database.
  • Each watermark profile is defined by a set of parameters used for embedding the watermark into the audio signal according to a predetermined watermarking technique.
  • the watermarking technique can be a technique known in the art as, for example, a spread spectrum watermarking technique (e.g. wherein the watermark is spread over many frequency bins so that the energy in one bin is very small and undetectable), an echo hiding watermarking technique (e.g. wherein the watermark is embedded into an audio signal by introducing one or more echoes that are offset in time from the audio signal by an offset value associated with the data value of the bit), a phase coding technique (e.g. wherein phase differences between selected frequency component portions of an audio signal are modified to embed the watermark in the audio signal) or any other watermarking technique known in the art.
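An echo-hiding round trip can be sketched as follows. The delays, the echo amplitude and the autocorrelation-based detector (a simplified stand-in for the usual cepstral analysis) are illustrative assumptions:

```python
import random

def embed_echo(window, bit, delay0=50, delay1=100, amp=0.3):
    """Echo-hiding sketch: add a single attenuated echo whose delay
    (in samples) encodes one bit."""
    delay = delay1 if bit else delay0
    out = list(window)
    for i in range(delay, len(window)):
        out[i] += amp * window[i - delay]
    return out

def extract_echo(window, delay0=50, delay1=100):
    """Decide the bit from the autocorrelation peak at each candidate
    delay: the echo boosts the correlation at its own delay."""
    def corr(d):
        return sum(window[i] * window[i - d] for i in range(d, len(window)))
    return 1 if corr(delay1) > corr(delay0) else 0
```

Real systems also vary the echo amplitude and decay rate per profile, which is why those quantities appear in the parameter set listed later in this document.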
  • the set of parameters can comprise at least one parameter selected from the group comprising: watermark bit rate; frequency range hosting the watermark; Document to Watermark Ratio (DWR); watermark frame length; masking threshold modulation factor F; channel coding scheme; number, amplitude, offset and decay rate of echo pulses; spreading factor.
  • the plurality of watermark profiles associated with the plurality of classes can relate to a single watermarking technique or to a plurality of watermarking techniques.
  • the watermark profiles are all defined by the same set of parameters and differ from each other in the values taken by at least one of the parameters.
  • the watermark profiles relating to different watermarking techniques are defined by different sets of parameters.
  • the watermark profiles can thus differ from each other in at least one parameter and/or in at least one value taken by a common parameter.
  • each class is associated with a corresponding watermark profile that makes it possible to optimize the trade-off among watermark imperceptibility, robustness and payload for each class, depending on the watermark application.
  • classifying speech and rock signals as "low sparse" signals allows the introduction of a higher level of distortion than for jazz signals, classified as "medium sparse" signals, and for piano solo signals, classified as "high sparse" signals.
  • the level of distortion which is actually "available” for each audio signal class is advantageously exploited in order to optimize trade-off among watermark imperceptibility, robustness and payload, depending on the watermark application.
  • the higher level of distortion available for the "low sparse” and “medium sparse” signals compared with the "high sparse” signals, could be exploited to maximize, for each class, one or two features among imperceptibility, robustness and payload, by keeping unchanged the other feature(s) among the classes.
  • when the audio signal is intended to be transmitted over a low-noise channel and/or played in a low-noise environment (e.g. a domestic environment), payload could be maximized for each class, by keeping the same robustness and imperceptibility levels among the classes.
  • when the audio signal is intended to be transmitted over a high-noise channel and/or played in a high-noise environment (e.g. a public environment such as a train station or an airport), robustness could be maximized for each class, by keeping the same payload and imperceptibility level among the classes.
  • when the audio signal is intended to be played in a low-noise environment, imperceptibility could be maximized for each class, by keeping the same payload and robustness level among the classes.
  • payload could be optimized by acting on the watermark bit rate; imperceptibility could be optimized by acting on the masking threshold modulation factor F; and robustness could be optimized by acting on at least one of: frequency range hosting the watermark; Document to Watermark Ratio; watermark frame length; channel coding scheme; number, amplitude, offset and decay rate of echo pulses; and spreading factor.
  • the level of distortion actually "available" for each audio signal class could be exploited in order to maximize the payload feature for each class, by associating a watermark profile with a higher bit rate with the low sparse class, a watermark profile with an intermediate bit rate with the medium sparse class and a watermark profile with a lower bit rate with the high sparse class.
  • the robustness feature could be maximized for each class, by associating a more robust watermark profile with the low sparse class, a watermark profile of intermediate robustness with the medium sparse class and a less robust watermark profile with the high sparse class.
  • as to the masking threshold modulation factor F, it is also observed that it could be set to different values, depending on the frequency ranges of the audio signal.
  • the watermark is shaped by the embedding unit 16 according to a masking threshold as computed by the masking unit 18, on the basis of a psycho-acoustic model.
  • when the factor F is higher than 1, the watermark is shaped according to a higher masking threshold, whereby the imperceptibility level of the watermark is decreased with respect to the level set according to the psycho-acoustic model.
  • when the factor F is lower than 1, the watermark is shaped according to a lower masking threshold, whereby the imperceptibility level of the watermark is increased with respect to the level set according to the psycho-acoustic model.
  • the psychoacoustic model produces a representation of sound perception based on the average human auditory system, without taking into account high- or low-level psychoacoustic effects.
  • the masking threshold computed according to the psychoacoustic model can be too strict in some situations (e.g. in case of rock music, a noisy signal, conversation or a loud quarrel) or too light in other situations (e.g. in case of classical or instrumental music and an expert listener).
  • the masking threshold modulation factor F makes it possible to vary the amplitude of the masking threshold, as computed according to the psycho-acoustic model, depending on the semantic content of the audio signal.
  • the imperceptibility level of the watermark can be finely tuned, with respect to the level set according to the psycho-acoustic model, depending on the semantic content of the audio signal and on the watermark application.
  • the watermark profile unit 14 is configured to retrieve from the watermark profile database 19 the watermark profile associated with said class and to provide it to the embedding unit 16.
  • the masking unit 18 is configured to compute a masking threshold of the audio signal according to a psycho-acoustic model and to provide it to the embedding unit 16.
  • the psychoacoustic model can be any psychoacoustic model known in the art.
  • the psychoacoustic model calculates the masking threshold in time and/or frequency domain and is based on one of the following analyses: block based FFT (Fast Fourier Transform), block based DCT (Discrete Cosine Transform), block based MDCT (Modified Discrete Cosine Transform), block based MCLT (Modified Complex Lapped Transform), block based STFT (Short-Time Fourier Transform), sub-band or wavelet packet analysis.
  • the masking unit 18 is configured to split the audio signal into suitable time windows (e.g. of a few milliseconds) and to compute a masking threshold for each time window.
  • the masking threshold computation is performed in parallel to audio signal classification.
  • the embedding unit 16 is configured to embed the watermark into the audio signal by using the watermark profile obtained by the watermark profile unit 14 so as to provide a watermarked audio signal.
  • the embedding unit 16 is also configured to shape the energy of the watermark according to the masking threshold computed by the masking unit 18.
  • the embedding unit 16 is also preferably configured to shape the watermark so as to reduce its energy below the masking threshold computed by the masking unit 18, multiplied by the masking threshold modulation factor F.
  • the embedding unit 16 can comprise a plurality of embedding sub-units, one for each different watermark profile of said plurality of watermarking profiles or one for each of the watermarking techniques to which the plurality of watermarking profiles relates.
  • Figure 3 shows an embodiment wherein the system 1 comprises the encoding device 10, a communication network 30 and a decoding device 20.
  • the communication network 30 and the decoding device 20 comprise hardware and/or software and/or firmware configured to implement the method of the present disclosure.
  • the communication network 30 can be any type of communication network adapted to transmit the watermarked audio signal.
  • the decoding device 20 is configured to receive the watermarked audio signal and to extract the watermark from it.
  • the watermark is extracted by using the same watermark profile used for embedding the watermark into the audio signal.
  • the decoding device 20 needs to know the watermark profile used for embedding the watermark.
  • Figure 4 shows a first embodiment of the decoding device 20 comprising a classification unit 22, a watermark profile unit 24, an extraction unit 26, a class database 27 and a watermark profile database 29.
  • the classification unit 22 is configured to assign the watermarked audio signal a class depending upon the semantic content of the audio signal, in the same way as disclosed above with reference to the classification unit 12 of the encoding device 10.
  • the class assigned to the watermarked audio signal will thus be the same as that assigned in the encoding device 10.
  • the class database 27 (which could also be external to the decoding device 20) stores the plurality of classes in which the audio signal can be classified.
  • the watermark profile database 29 (which could also be external to the decoding device 20) stores an association between the plurality of classes and the corresponding plurality of watermark profiles.
  • although database 27 and database 29 are shown in the figures as two distinct entities, they can also be implemented as a single database.
  • the watermark profile unit 24 is configured to retrieve from the watermark profile database 29 the watermark profile associated with said class and to provide it to the extraction unit 26.
  • the association in the watermark profile database 29 is the same as that in the watermark profile database 19 of the encoding device 10.
  • the watermark profile retrieved by the watermark profile unit 24 will thus be the same as that used in the encoding device 10.
  • the extraction unit 26 is configured to use the watermark profile retrieved by watermark profile unit 24 for extracting the watermark from the watermarked audio signal.
  • Figure 5 shows a second embodiment of the decoding device 20 comprising an extraction unit 26.
  • watermarked audio signal classification is not performed in decoding device 20.
  • extraction unit 26 is configured to try the plurality of watermark profiles in sequence for extracting the watermark, until the watermark is successfully extracted from the watermarked audio signal.
  • the extraction unit 26 can comprise a plurality of extraction sub-units (not shown), one for each watermark profile or for each watermarking technique, for trying the plurality of watermark profiles at least partly (or wholly) in parallel.
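The sequential-trial decoding described above can be sketched as follows. `extract_watermark` and `extract_with_profile` are hypothetical names, not terms of the disclosure; the inner extractor is assumed to return `None` when extraction with a given profile fails (e.g. because an error detection code in the channel coding scheme does not validate the payload):

```python
# Sketch of the sequential-trial decoding embodiment: each profile is
# attempted in turn until extraction succeeds. extract_with_profile is
# a hypothetical stand-in for an extraction sub-unit; it is assumed to
# return None when extraction with a given profile fails.
def extract_watermark(watermarked_signal, profiles, extract_with_profile):
    for profile in profiles:
        watermark = extract_with_profile(watermarked_signal, profile)
        if watermark is not None:
            return watermark, profile  # success: report payload and profile
    return None, None                  # no profile yielded a valid watermark
```

The parallel variant mentioned above would simply dispatch the same per-profile call to several extraction sub-units concurrently and keep the first successful result.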
  • the class of the audio signal can be inserted into the audio signal by the embedding unit 16 of the encoding device 10 by embedding a second watermark (containing said class) into the audio signal with a predefined watermark profile, common to all audio signals independently from their class.
  • the second watermark can be embedded into the already watermarked audio signal.
  • the two watermarks can be embedded into different sub-bands of the audio signal.
  • the extraction unit 26 is preferably configured to first use the predefined common watermark profile to extract the second watermark from the watermarked audio signal, thereby retrieving the class of the audio signal. Then, the extraction unit 26 is configured to obtain (e.g., from a watermark profile database, not shown in figure 5, similar to watermark profile database 29) the watermark profile associated with the retrieved class and to extract the watermark from the watermarked audio signal with the obtained watermark profile.
  • Figure 6 shows an exemplary implementation of the system 1 of audio signal watermarking.
  • the encoding device 10 is deployed on an entity 50 for embedding watermarks into audio signals.
  • the entity 50 can be, for example, a recording company, a music producer or a service supply company providing services to the users.
  • the watermark can comprise data relating to signature information, copyright information, serial numbers of broadcast audio signals, product identification in audio signal broadcasting and similar.
  • the audio signal can comprise music or speech as, for example, talks from movies, from TV or radio programs, from TV or radio advertisements, from TV or radio talk shows, and similar.
  • the decoding device 20 is deployed on a user device 60.
  • the user device 60 can be, for example, a PC, a smart phone, a tablet, a portable media player (e.g. an iPod®), or other similar device.
  • the user device 60 can be adapted to download or stream a video from the internet or to detect audio signals of a TV program broadcast on the TV or to detect audio signals of a movie played by means of a DVD player, a VHS player, a decoder or similar.
  • a media provider 40 can obtain watermarked audio signals from entity 50 and supply them to user device 60, through communication network 30.
  • the watermarked audio signals can be, for example, supplied to the user device 60 by means of broadcasting (e.g. from a TV or radio station), streaming (e.g. from the internet) or downloading (e.g. from a PC).
  • the media provider 40 can be, for example, a TV station, a radio station, a PC or other similar device.
  • The user device 60, equipped with the decoding device 20, will be configured to extract the watermark from the watermarked audio signals.
  • the watermark can comprise information enabling the user device 60 to connect, through communication network 30, to a service provider 70 that supplies predetermined services to users.
  • the audio signals can be, for example, audio signals of a TV talk show or movie (or similar) and the user service can involve the provision of information to users about the TV images the users are watching (e.g. information about the actors, about items of clothing and/or furnishing, about the movies or talk shows, about the set and similar).

Abstract

Method and system of watermarking an audio signal wherein:
- the audio signal is assigned to a class, among a plurality of classes, depending on the semantic content of the audio signal, the plurality of classes being associated with a corresponding plurality of watermark profiles;
- the watermark profile associated with the class assigned to the audio signal is obtained;
- a watermark is embedded into the audio signal by using the obtained watermark profile so as to provide a watermarked audio signal.

Description

  • The present invention relates to a method and system of audio signal watermarking.
  • Audio signal watermarking is a process for embedding information data (watermark) into an audio signal without affecting the perceptual quality of the host signal itself.
  • The watermark should be imperceptible or nearly imperceptible to the Human Auditory System (HAS). However, the watermark should be detectable through an automated detection process.
  • Watermarking techniques are known in the art.
  • For example, EP 1 594 122 discloses a watermarking method and apparatus employing spread spectrum technology and a psycho-acoustic model. According to spread spectrum technology, a small baseband signal bandwidth is spread over a larger bandwidth by injecting or adding a higher frequency signal, or spreading function. Thereby, the energy used for transmitting the signal is spread over a wider bandwidth and appears as noise. According to the psycho-acoustic model, based on the psycho-acoustical properties of the HAS, the watermark signal is shaped to reduce its magnitude so that it has a level that is below a masking threshold of the host audio/video signal. In the method and apparatus disclosed by EP 1 594 122, at the encoder side a spreading function is modulated by watermark data bits for providing a watermark signal; the current masking level of the audio/video signal is determined and a corresponding psycho-acoustic shaping of the watermark signal is performed; the psycho-acoustically shaped watermark signal is additionally shaped in order to reduce on average the magnitude of the watermark signal, whereby for each spectral line the phase of the values of the audio/video signal into which the psycho-acoustically and additionally shaped watermark signal is embedded is kept unchanged by the additional shaping; the psycho-acoustically and additionally shaped watermark signal is embedded into the audio/video signal.
  • The applicant observes that a watermarking technique should achieve a trade-off among three basic features: imperceptibility, robustness and payload, which are strictly linked to each other by inverse relationships. Depending upon the purpose of the watermarking, a watermarking technique should find a correct balance among the need to keep the watermark imperceptible, the need to make the watermark robust against attacks and manipulations of the host signal (e.g., noise distortion, A/D or D/A conversion, lossy coding, resizing, filtering, lossy compression), and the need to achieve the highest possible payload.
  • The applicant notes that the psycho-acoustic model makes it possible to determine the maximum distortion, i.e. the maximum watermark signal energy that can be introduced into a host signal without being perceptible to human senses. However, this model does not provide any information about robustness and payload, or about optimization of the trade-off among imperceptibility, robustness and payload.
  • It is thus an object of the invention to provide an alternative method and system of audio signal watermarking.
  • It is a further object of the invention to provide an improved method and system of audio signal watermarking with high performance in terms of the trade-off among imperceptibility, robustness and payload.
  • The Applicant found that the above objects are achieved by a method and system of audio signal watermarking wherein audio signals are classified based on their semantic content and watermarks are embedded into the audio signals by using watermark profiles selected on the basis of the classes assigned to the audio signals.
  • Indeed, as described in more detail below, the Applicant found that, given an audio signal, the trade-off among watermark imperceptibility, robustness and payload can be optimized by fitting the watermark profile depending on the semantic content of the audio signal.
  • In the present disclosure, the expression "semantic content" in relation to an audio signal refers to the audio type contained in the audio signal. The semantic content of an audio signal can be, for example, speech (e.g. talks from movies, from TV or radio programs, from TV or radio advertisements, from TV or radio talk shows, and similar) or music. In case of music, the semantic content can be, for example, a musical genre (e.g., rock, classic, jazz, blues, instrumental, singing and similar). In case of speech, the semantic content can be, for example, a tone of voice, conversation, single person speaking, whisper, aloud quarrel, and similar.
  • In the present disclosure, the expression "watermark profile" is used to indicate a set of parameters used for embedding the watermark into the audio signal according to a predetermined watermarking technique.
  • In a first aspect, the present disclosure relates to a method of watermarking an audio signal comprising:
    • assigning the audio signal to a class, among a plurality of classes, depending on the semantic content of the audio signal, the plurality of classes being associated with a corresponding plurality of watermark profiles;
    • obtaining the watermark profile associated with the class assigned to the audio signal;
    • embedding a watermark into the audio signal by using the obtained watermark profile so as to provide a watermarked audio signal.
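For illustration only, the three claimed steps can be sketched as below. `classify_signal`, `PROFILE_FOR_CLASS` and `embed_with_profile` are placeholder names standing in for the classification unit, the profile database and the embedding unit; the classification rule and the profile values are invented for the toy example:

```python
# Minimal sketch of the claimed method: classify, look up the profile,
# embed. All names, the classification rule and the profile values are
# illustrative assumptions, not disclosed implementations.

def classify_signal(samples):
    """Toy classification rule standing in for semantic analysis."""
    return "speech" if max(abs(s) for s in samples) < 0.5 else "music"

# Hypothetical association between classes and watermark profiles.
PROFILE_FOR_CLASS = {
    "speech": {"bit_rate_bps": 40, "factor_F": 1.0},
    "music":  {"bit_rate_bps": 20, "factor_F": 0.9},
}

def embed_with_profile(samples, watermark_bits, profile):
    """Placeholder embedding: a real embedder would shape the watermark
    spectrally; here the signal is only tagged with the chosen profile."""
    return {"samples": samples, "bits": watermark_bits, "profile": profile}

def watermark_signal(samples, watermark_bits):
    cls = classify_signal(samples)              # step 1: assign a class
    profile = PROFILE_FOR_CLASS[cls]            # step 2: obtain the profile
    return embed_with_profile(samples, watermark_bits, profile)  # step 3
```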
  • In a second aspect, the present disclosure relates to a system of watermarking an audio signal comprising an encoding device comprising:
    • a classification unit configured to assign the audio signal to a class, among a plurality of classes, depending on the semantic content of the audio signal, the plurality of classes being associated with a corresponding plurality of watermark profiles;
    • a watermark profile unit configured to obtain the watermark profile associated with the class assigned to the audio signal;
    • an embedding unit configured to embed a watermark into the audio signal by using the watermark profile obtained by watermark profile unit so as to provide a watermarked audio signal.
  • The method and system of the present disclosure may have at least one of the following preferred features.
  • Advantageously, each watermark profile is associated with a corresponding class so that the trade-off among watermark imperceptibility, robustness and payload is optimized for said class, depending on the watermark application. For example, depending on the watermark application, one, two or all of the features among imperceptibility, robustness and payload could be optimized for each class, while keeping the other feature(s), if any, unchanged among the classes. For example, in noisy applications, robustness could be maximized for each class, while keeping the same payload and imperceptibility level among the classes. On the other hand, in low-noise applications and/or when a large amount of data needs to be contained in the watermark, payload could be maximized for each class, while keeping the same robustness and imperceptibility level among the classes. Otherwise, in low-noise applications, imperceptibility could be maximized for each class, while keeping the same payload and robustness level among the classes.
  • The plurality of watermark profiles can relate to a single watermarking technique or to a plurality of watermarking techniques.
  • In case of a single watermarking technique, the watermark profiles differ from each other in the value taken by at least one parameter of said set of parameters.
  • In case of a plurality of watermarking techniques, the watermark profiles differ from each other in at least one of the parameters and/or in the values taken by at least one of the common parameters.
  • The watermarking technique(s) can be selected, for example, from the group comprising: spread spectrum watermarking technique, echo hiding watermarking technique, phase coding technique, informed watermarking schemes like QIM (Quantization Index Modulation) and Spread Transform Dither Modulation (STDM).
  • In a preferred embodiment, the method is a computer-implemented method.
  • In an embodiment, the embedding unit can comprise a plurality of embedding sub-units. Each embedding sub-unit can be configured to embed the watermark into the audio signal by using one watermark profile of said plurality of watermarking profiles or one watermarking technique of said plurality of watermarking techniques.
  • At least one parameter of said set of parameters defining the watermark profile may be selected, for example, from the group comprising: watermark bit rate; frequency range hosting the watermark; Document to Watermark Ratio (DWR); watermark frame length; masking threshold modulation factor F, intended as a quantity by which the masking threshold of the audio signal (computed according to a psychoacoustic model) is multiplied to vary its amplitude with respect to the computed value; channel coding scheme (which may also include error detection techniques such as, for example, Cyclic Redundancy Check); number, amplitude, offset and decay rate of echo pulses (in case of echo hiding watermarking technique); spreading factor, intended as the number of audio signal frequency or phase samples needed to insert one bit of watermark (in case of spread spectrum watermarking technique with, respectively, frequency or phase modulation).
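For illustration only, the parameters listed above could be grouped into a profile record such as the following; the field names and types are assumptions, not terms defined by the disclosure:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WatermarkProfile:
    """Illustrative container for the profile parameters listed above;
    field names and units are assumptions for the sketch."""
    bit_rate_bps: float                 # watermark bit rate
    freq_range_hz: Tuple[float, float]  # frequency range hosting the watermark
    dwr_db: float                       # Document to Watermark Ratio
    frame_length: int                   # watermark frame length, in samples
    factor_F: float                     # masking threshold modulation factor
    coding_scheme: str                  # channel coding scheme, e.g. "crc"
    spreading_factor: Optional[int] = None  # spread spectrum techniques only
```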
  • Preferably, the plurality of classes is associated with the corresponding plurality of watermark profiles in a database. The database can be internal or external to the encoding device.
  • In a preferred embodiment, the expression watermarking refers to digital watermarking.
  • Digital watermarking relates to a computer-implemented watermarking process.
  • In a preferred embodiment, a masking threshold of the audio signal according to a psychoacoustic model is computed. Preferably, before computing the masking threshold, the audio signal is split into time windows and a masking threshold is computed for each time window of the audio signal. The masking threshold can be computed after, before or at the same time of the audio signal classification.
  • The psychoacoustic model can be a psychoacoustic model known in the art.
  • Preferably, the psychoacoustic model is adapted to calculate the masking threshold in time and/or frequency domain and is based on one of the following analysis: block based FFT (Fast Fourier Transform), block based DCT (Discrete Cosine Transform), block based MDCT (Modified Discrete Cosine Transform), block based MCLT (Modified Complex Lapped Transform), block based STFT (Short-Time Fourier Transform), sub-band or wavelet packet analysis.
  • Preferably, the encoding device comprises a masking unit configured to perform the masking threshold computation and, optionally, the time windows splitting.
  • Preferably, embedding the watermark into the audio signal comprises the step of shaping the energy of the watermark according to the computed masking threshold.
  • This advantageously makes it possible to guarantee watermark imperceptibility to the human auditory system.
  • Preferably, the watermark is shaped to reduce its energy below the computed masking threshold of the audio signal. When the set of parameters defining the obtained watermark profile comprises the masking threshold modulation factor F, the watermark is preferably shaped to reduce its energy below the computed masking threshold, multiplied by the masking threshold modulation factor F.
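A sketch of this shaping step, assuming the watermark and the masking threshold are available as per-bin spectral energies (how the threshold itself is computed is left to the psychoacoustic model, and the function name is an assumption):

```python
# Sketch of shaping the watermark below the masking threshold scaled by
# the modulation factor F; inputs are per-bin spectral energies, and the
# threshold is assumed to come from the psychoacoustic model.
def shape_watermark(wm_energy, mask_threshold, factor_F=1.0):
    """Clip each watermark bin to the threshold multiplied by F."""
    return [min(e, m * factor_F) for e, m in zip(wm_energy, mask_threshold)]
```

With F = 1 the watermark is held at or below the computed threshold; F > 1 relaxes the limit, F < 1 tightens it, consistent with the role of the modulation factor described above.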
  • Preferably, the watermark is shaped by the embedding unit.
  • Preferably, the masking threshold modulation factor F is at least equal to 0.5. More preferably, the masking threshold modulation factor F is at least equal to 0.7, even more preferably at least equal to 0.8, even more preferably at least equal to 0.9.
  • Preferably, the masking threshold modulation factor F is not higher than 1.5. More preferably, the masking threshold modulation factor F is not higher than 1.3, even more preferably not higher than 1.2, even more preferably not higher than 1.1.
  • Assigning the audio signal to a class, according to the semantic content of the audio signal, is preferably performed based upon analysis of at least one audio signal feature.
  • In the present disclosure, the audio signal feature is related to the semantic content of the audio signal.
  • In the present disclosure, the audio signal feature is preferably related to time, frequency, energy or cepstrum domain.
  • For example, the at least one audio signal feature can be selected from the group comprising: loudness, brightness, beats per minute (BPM), bandwidth, pitch, odd-to-even harmonic energy ratio, spectral energy bands (e.g. spectrum sparsity), spectral and tonal complexities, spectral roll-off point (intended as any percentile of the power spectral distribution), spectral centroid (defined as the center of gravity of the magnitude spectrum), spectral "flux" (intended as the squared difference between the normalized magnitudes of successive spectral distributions), time domain Zero-Crossing Rate, Cepstrum Resynthesis Residual Magnitude, Mel-Frequency Cepstral Coefficients, band periodicity (defined as the periodicity of a sub-band and derived by sub-band correlation analysis).
  • The analysis of at least one audio signal feature preferably comprises checking if the at least one audio signal feature meets one or more predetermined constraints. For example, the value of the at least one audio signal feature can be compared to one or more predetermined thresholds or one or more predetermined ranges of values. Each class is advantageously defined by predetermined constraints (e.g. sets of values) to be met by the at least one audio signal feature.
  • Preferably, before assigning the audio signal to a class, the audio signal is split into sub-signals of shorter duration and a class is assigned to each sub-signal independently from the other sub-signals.
  • Suitably, the duration of the sub-signals is longer than the duration of the time windows in which the audio signal is split for performing masking threshold computation.
  • Preferably, the method of audio signal watermarking comprises a decoding process comprising extraction of the watermark from the watermarked audio signal. Preferably, the watermark is extracted by using the same watermark profile used for embedding the watermark into the audio signal.
  • Preferably, the system also comprises a decoding device configured to extract the watermark from the watermarked audio signal.
  • In an embodiment of the decoding process, the watermarked audio signal is assigned to a class, among the plurality of classes, depending on the semantic content of the watermarked audio signal, the plurality of classes being associated with the corresponding plurality of watermark profiles. Preferably, the class is assigned by a classification unit of the decoding device. According to this embodiment, the watermark profile associated with the class assigned to the audio signal is obtained and used for extracting the watermark from the watermarked audio signal. Preferably, the watermark is extracted from the watermarked audio signal by an extraction unit of the decoding device.
  • According to another embodiment of the decoding process, said plurality of watermark profiles are tried in sequence for extracting the watermark till the watermark is successfully extracted from the watermarked audio signal. In this case, the decoding device can comprise a single extraction unit for trying in sequence the plurality of watermark profiles. In alternative, the extraction unit can comprise a plurality of sub-extraction units, one for each watermark profile or for each watermarking technique, for trying the plurality of watermark profiles at least partly in parallel. In this embodiment, audio signal classification is not necessary at the decoding side.
  • According to a further embodiment, a second watermark is embedded into the audio signal, comprising the class assigned to the audio signal, by using a predefined watermark profile, common to all audio signals independently from their class. The second watermark can be embedded into the watermarked audio signal, already watermarked with the first watermark. In alternative, the first and the second watermarks can be embedded into different sub-bands of the audio signal. Watermark extraction can then be performed by first extracting the second watermark from the watermarked audio signal (by using the common watermark profile) so as to retrieve the class of the audio signal, and then by obtaining the watermark profile associated with the retrieved class and extracting the watermark from the watermarked audio signal with the obtained watermark profile. In this embodiment, audio signal classification is not necessary at the decoding side.
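The two-stage extraction of this embodiment can be sketched as follows; all names are illustrative assumptions, and `extract` stands for an extraction routine parameterized by a watermark profile:

```python
# Sketch of the two-stage extraction: the class label is recovered with
# the common profile first, then the class-specific profile is used for
# the payload watermark. extract and profile_db are illustrative names.
def decode_with_class_watermark(signal, common_profile, profile_db, extract):
    cls = extract(signal, common_profile)   # second watermark: class label
    profile = profile_db[cls]               # profile associated with class
    return extract(signal, profile)         # first watermark: the payload
```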
  • Further characteristics and advantages of the present invention will become clearer from the following detailed description of some preferred embodiments thereof, made as an example and not for limiting purposes with reference to the attached drawings. In such drawings,
    • figure 1 schematically shows a system of audio signal watermarking according to an embodiment of the invention;
    • figure 2 schematically shows the energy spectrum of four audio signals having a different semantic content;
    • figure 3 schematically shows a system of audio signal watermarking according to another embodiment of the invention;
    • figure 4 schematically shows a first embodiment of a decoding device of the system of figure 3;
    • figure 5 schematically shows a second embodiment of the decoding device of the system of figure 3;
    • figure 6 schematically shows an exemplary implementation of the system of figure 3.
  • Figure 1 discloses a system 1 of audio signal watermarking according to an embodiment of the invention.
  • The system 1 comprises an encoding device 10 comprising an input 11 for an audio signal, an input 13 for a watermark and an output 15 for a watermarked audio signal.
  • The encoding device 10 comprises a classification unit 12, a watermark profile unit 14, a masking unit 18 and an embedding unit 16.
  • The classification unit 12, watermark profile unit 14, masking unit 18 and embedding unit 16 comprise hardware and/or software and/or firmware configured to implement the method of the present disclosure.
  • The classification unit 12 is configured to assign the audio signal to a class depending on the semantic content of the audio signal.
  • The classification unit 12 is configured to analyse at least one audio signal feature related to the semantic content of the audio signal, to compare the at least one audio signal feature with one or more constraints and to assign the audio signal a class, selected among a predetermined plurality of classes, depending on the result of the comparison.
  • For example, the at least one audio signal feature can be selected from the group comprising: loudness, brightness, beats per minute (BPM), bandwidth, pitch, odd-to-even harmonic energy ratio, spectral energy bands (e.g. spectrum sparsity), spectral and tonal complexities, spectral roll-off point (intended as any percentile of the power spectral distribution), spectral centroid (defined as the center of gravity of the magnitude spectrum), spectral "flux" (intended as the squared difference between the normalized magnitudes of successive spectral distributions), time domain Zero-Crossing Rate, Cepstrum Resynthesis Residual Magnitude, Mel-Frequency Cepstral Coefficients, band periodicity (defined as the periodicity of a sub-band and derived by sub-band correlation analysis).
  • For example, the at least one audio signal feature can be compared with one or more predetermined thresholds or one or more predetermined ranges of values and each class can be defined by a predetermined set of values that can be taken by the at least one audio signal feature.
  • The plurality of classes can be stored in a suitable class database 17 internal (as shown in figure 1) or external to the encoding device 10.
  • In a preferred embodiment (not shown), before assigning the audio signal to a class, the encoding device 10 is configured to split the audio signal into sub-signals of shorter duration (e.g. from a few tenths of a second to a few tens of seconds) and the classification unit 12 is configured to classify each sub-signal independently from the other sub-signals.
  • In a preferred embodiment, the classification unit 12 is configured to classify the audio signals (or sub-signals) by analysing the spectrum sparsity of their energy spectrum.
  • The spectrum sparsity is an audio signal feature indicative of the energy concentration in a sub-band compared to the energy in the whole audio signal (or sub-signal) band.
  • The energy spectrum of the audio signal (or sub-signal) is considered sparse (or colored) if most part of its energy is concentrated in a small spectrum sub-band, otherwise it is considered non-sparse (or noise-like).
  • For example, figure 2 shows the energy spectrum of four audio signals having a different semantic content: speech, rock, jazz, piano solo.
  • For example, in case of figure 2, three different classes can be defined by analyzing the spectrum sparsity feature and, in the example, by comparing the fraction of signal energy (normalized to the total energy) contained in the 0-1000 Hz sub-band with two threshold levels SL and SH. If said fraction of energy in the 0-1000 Hz sub-band is lower than SL, the audio signal (or sub-signal) can be classified into a "low sparse" class; if said fraction of energy in the 0-1000 Hz sub-band is between SL and SH, the audio signal (or sub-signal) can be classified into a "medium sparse" class; if it is higher than SH, the audio signal can be classified into a "high sparse" class.
  • In the example of figure 2, by setting SL = 0.85 and SH = 0.90, talk and rock signals are classified as "low sparse" signals, jazz signal is classified as "medium sparse" signal, and piano solo signal is classified as "high sparse" signal.
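The sparsity-based classification of this example can be sketched as follows. The thresholds SL = 0.85 and SH = 0.90 are taken from the text above, while the spectrum representation (equally spaced energy bins) and the function names are assumptions:

```python
# Sketch of the sparsity-based classification of figure 2. The energy
# fraction computation and the names are assumptions; the thresholds
# SL = 0.85 and SH = 0.90 come from the example in the text.
S_L, S_H = 0.85, 0.90

def low_band_fraction(bin_energies, bin_width_hz, cutoff_hz=1000.0):
    """Fraction of total energy in the 0..cutoff_hz sub-band, for a
    spectrum given as energies of equally spaced frequency bins."""
    n_low = int(cutoff_hz / bin_width_hz)
    total = sum(bin_energies)
    return sum(bin_energies[:n_low]) / total if total else 0.0

def sparsity_class(bin_energies, bin_width_hz):
    frac = low_band_fraction(bin_energies, bin_width_hz)
    if frac < S_L:
        return "low sparse"     # e.g. talk, rock
    if frac <= S_H:
        return "medium sparse"  # e.g. jazz
    return "high sparse"        # e.g. piano solo
```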
  • It is noted that most audio signals have their energy concentrated in the 0-1000 Hz sub-band. Even slight differences (e.g. of about 0.05) between threshold levels SL and SH can thus be significant in such sub-band.
  • The plurality of classes used by the classification unit 12 are associated with a corresponding plurality of watermark profiles in a suitable watermark profile database 19 internal (as shown in figure 1) or external to the encoding device 10.
  • It is noted that, even if database 17 and database 19 are shown in the figures as two distinct entities, they can also be implemented into a single database.
  • Each watermark profile is defined by a set of parameters used for embedding the watermark into the audio signal according to a predetermined watermarking technique.
  • The watermarking technique can be a technique known in the art as, for example, a spread spectrum watermarking technique (e.g. wherein the watermark is spread over many frequency bins so that the energy in one bin is very small and undetectable), an echo hiding watermarking technique (e.g. wherein the watermark is embedded into an audio signal by introducing one or more echoes that are offset in time from the audio signal by an offset value associated with the data value of the bit), a phase coding technique (e.g. wherein phase differences between selected frequency component portions of an audio signal are modified to embed the watermark in the audio signal) or any other watermarking technique known in the art.
  • The set of parameters can comprise at least one parameter selected from the group comprising: watermark bit rate; frequency range hosting the watermark; Document to Watermark Ratio (DWR); watermark frame length; masking threshold modulation factor F; channel coding scheme; number, amplitude, offset and decay rate of echo pulses; spreading factor.
  • The plurality of watermark profiles associated with the plurality of classes can relate to a single watermarking technique or to a plurality of watermarking techniques.
  • In case of a single watermarking technique, the watermark profiles are all defined by the same set of parameters and differ from each other in the values taken by at least one of the parameters.
  • In case of a plurality of watermarking techniques, the watermark profiles relating to different watermarking techniques are defined by different sets of parameters. The watermark profiles can thus differ from each other in at least one parameter and/or in at least one value taken by a common parameter.
  • According to the present disclosure, within the watermark profile database 19, each class is associated with a corresponding watermark profile that makes it possible to optimize the trade-off among watermark imperceptibility, robustness and payload for that class, depending on the watermark application.
  • In fact, the applicant observed that it is possible to obtain different optimized trade-offs for the audio signals, depending on the semantic content of each audio signal.
  • For example, as visible in figure 2, talk and rock signals, classified as "low sparse" signals, allow the introduction of a higher level of distortion than jazz signals, classified as "medium sparse" signals, and piano solo signals, classified as "high sparse" signals.
  • According to the invention, the level of distortion which is actually "available" for each audio signal class is advantageously exploited in order to optimize trade-off among watermark imperceptibility, robustness and payload, depending on the watermark application. For example, the higher level of distortion available for the "low sparse" and "medium sparse" signals, compared with the "high sparse" signals, could be exploited to maximize, for each class, one or two features among imperceptibility, robustness and payload, by keeping unchanged the other feature(s) among the classes.
  • For example, when the audio signal is intended to be transmitted in a low noise channel and/or played in a low-noise ambient (e.g. domestic ambient), payload could be maximized for each class, by keeping the same robustness and imperceptibility levels among the classes. On the other hand, when the audio signal is intended to be transmitted in a high noise channel and/or played in a high-noise environment (e.g. public ambient as a train station or airport), robustness could be maximized for each class, by keeping the same payload and imperceptibility level among the classes. Otherwise, when the audio signal is intended to be played in a low-noise ambient, imperceptibility could be maximized for each class, by keeping the same payload and robustness level among the classes.
  • Within the set of parameters defining a watermark profile, payload could be optimized by acting on the watermark bit rate; imperceptibility could be optimized by acting on the masking threshold modulation factor F; and robustness could be optimized by acting on at least one of: frequency range hosting the watermark; Document to Watermark Ratio; watermark frame length; channel coding scheme; number, amplitude, offset and decay rate of echo pulses; and spreading factor.
  • For example, in the case of figure 2, assuming the same robustness and imperceptibility levels are kept across the classes, the level of distortion actually "available" for each audio signal class could be exploited to maximize the payload for each class, by associating a watermark profile with a higher bit rate with the low sparse class, a watermark profile with an intermediate bit rate with the medium sparse class, and a watermark profile with a lower bit rate with the high sparse class. On the other hand, assuming the same payload and imperceptibility levels are kept across the classes, the robustness could be maximized for each class, by associating a more robust watermark profile with the low sparse class, a watermark profile of intermediate robustness with the medium sparse class, and a less robust watermark profile with the high sparse class. Otherwise, assuming the same payload and robustness levels are kept across the classes, the imperceptibility could be maximized for each class, by associating, for example, a watermark profile having a masking threshold modulation factor F ≥ 1 with the low sparse class, a watermark profile having a masking threshold modulation factor F = 1 with the medium sparse class, and a watermark profile having a masking threshold modulation factor F ≤ 1 with the high sparse class.
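The class-to-profile association of the watermark profile database 19 amounts to a lookup table; a minimal sketch for the payload-maximizing case follows, where the parameter values are illustrative placeholders and not values taken from the disclosure:

```python
# Hypothetical watermark profiles for the payload-maximizing case of figure 2:
# same robustness and imperceptibility (F = 1) for all classes, with the bit
# rate scaled according to the level of distortion "available" for each class.
WATERMARK_PROFILES = {
    "low sparse":    {"bit_rate_bps": 40, "masking_factor_F": 1.0},
    "medium sparse": {"bit_rate_bps": 20, "masking_factor_F": 1.0},
    "high sparse":   {"bit_rate_bps": 10, "masking_factor_F": 1.0},
}

def get_profile(audio_class):
    """Watermark profile unit 14: retrieve the profile associated with a class."""
    return WATERMARK_PROFILES[audio_class]
```

The robustness- or imperceptibility-maximizing variants would change which parameter varies across the table while the others are held constant.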
  • As to the masking threshold modulation factor F, it is also observed that it could be set to different values, depending on the frequency ranges of the audio signal.
  • As also explained in more detail below, in case of F = 1, the watermark is shaped by the embedding unit 16 according to a masking threshold as computed by the masking unit 18, on the basis of a psycho-acoustic model. In case of F > 1, the watermark is shaped according to a higher masking threshold, whereby the imperceptibility level of the watermark is decreased with respect to the level set according to the psycho-acoustic model. In case of F < 1, the watermark is shaped according to a lower masking threshold, whereby the imperceptibility level of the watermark is increased with respect to the level set according to the psycho-acoustic model.
  • This is advantageous because the psychoacoustic model, as such, produces a representation of sound perception based on the average human auditory system, without taking high-level or low-level psychoacoustic effects into account. Indeed, the masking threshold computed according to the psychoacoustic model can be too strict in some situations (e.g. in the case of rock music, a noisy signal, a conversation or a loud quarrel) or too lenient in other situations (e.g. in the case of classical or instrumental music, or an expert listener).
  • Accordingly, the masking threshold modulation factor F makes it possible to vary the amplitude of the masking threshold, as computed according to the psycho-acoustic model, depending on the semantic content of the audio signal. In this way, the imperceptibility level of the watermark can be finely tuned, with respect to the level set according to the psycho-acoustic model, depending on the semantic content of the audio signal and on the watermark application.
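The effect of the modulation factor F on a computed masking threshold can be sketched as a simple per-band scaling (the function name is an assumption; the threshold values are illustrative):

```python
def modulate_threshold(masking_threshold, F):
    """Scale a per-band masking threshold by the modulation factor F.

    F = 1 keeps the threshold as computed by the psychoacoustic model;
    F > 1 raises it (decreased imperceptibility, stronger watermark);
    F < 1 lowers it (increased imperceptibility, weaker watermark).
    F may also be given per frequency band, as noted in the disclosure.
    """
    if isinstance(F, (int, float)):
        return [F * t for t in masking_threshold]
    # per-band modulation factors
    return [f * t for f, t in zip(F, masking_threshold)]
```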
  • Once the audio signal is assigned to a class by the classification unit 12, the watermark profile unit 14 is configured to retrieve from the watermark profile database 19 the watermark profile associated with said class and to provide it to the embedding unit 16.
  • The masking unit 18 is configured to compute a masking threshold of the audio signal according to a psycho-acoustic model and to provide it to the embedding unit 16.
  • The psychoacoustic model can be any psychoacoustic model known in the art.
  • Preferably, the psychoacoustic model calculates the masking threshold in the time and/or frequency domain and is based on one of the following analyses: block-based FFT (Fast Fourier Transform), block-based DCT (Discrete Cosine Transform), block-based MDCT (Modified Discrete Cosine Transform), block-based MCLT (Modified Complex Lapped Transform), block-based STFT (Short-Time Fourier Transform), or sub-band or wavelet packet analysis.
  • Preferably, the masking unit 18 is configured to split the audio signal into suitable time windows (e.g. of a few milliseconds) and to compute a masking threshold for each time window.
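The windowing step can be sketched as follows; the per-window threshold here is a deliberately crude stand-in (RMS-proportional, with an assumed 0.1 scale factor), whereas a real psychoacoustic model as listed above would produce a per-frequency-band threshold:

```python
import math

def split_windows(samples, window_len):
    """Masking unit 18: split the signal into consecutive analysis windows."""
    return [samples[i:i + window_len]
            for i in range(0, len(samples) - window_len + 1, window_len)]

def crude_masking_threshold(window):
    """Toy stand-in for a psychoacoustic model: one threshold per window,
    proportional to the window's RMS energy. A real FFT- or sub-band-based
    model yields a per-band threshold instead of a single scalar."""
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    return 0.1 * rms
```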
  • In the embodiment shown, the masking threshold computation is performed in parallel to audio signal classification.
  • The embedding unit 16 is configured to embed the watermark into the audio signal by using the watermark profile obtained by the watermark profile unit 14 so as to provide a watermarked audio signal.
  • The embedding unit 16 is also configured to shape the energy of the watermark according to the masking threshold computed by the masking unit 18.
  • When the set of parameters defining the watermark profile obtained by the watermark profile unit 14 comprises the masking threshold modulation factor F, the embedding unit 16 is also preferably configured to shape the watermark so as to reduce its energy below the masking threshold computed by the masking unit 18, multiplied by the masking threshold modulation factor F.
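The shaping performed by the embedding unit 16 can be sketched as follows; a real embedder shapes the watermark's spectral energy per band, so the peak-amplitude limiting used here is a simplified proxy, and the function names are assumptions:

```python
def shape_watermark(wm_samples, threshold, F=1.0):
    """Embedding unit 16: attenuate the watermark so its peak amplitude stays
    below the masking threshold multiplied by the modulation factor F.
    (Simplified: a real embedder shapes spectral energy per band.)"""
    limit = F * threshold
    peak = max(abs(x) for x in wm_samples)
    if peak <= limit:
        return list(wm_samples)
    gain = limit / peak
    return [gain * x for x in wm_samples]

def embed(audio, shaped_wm):
    """Additive embedding of the shaped watermark into the audio signal."""
    return [a + w for a, w in zip(audio, shaped_wm)]
```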
  • In an embodiment (not shown) the embedding unit 16 can comprise a plurality of embedding sub-units, one for each different watermark profile of said plurality of watermarking profiles or one for each of the watermarking techniques to which the plurality of watermarking profiles relates.
  • Figure 3 shows an embodiment wherein the system 1 comprises the encoding device 10, a communication network 30 and a decoding device 20.
  • As far as the encoding device 10 is concerned, reference is made to what has been disclosed above.
  • The communication network 30 and the decoding device 20 comprise hardware and/or software and/or firmware configured to implement the method of the present disclosure.
  • The communication network 30 can be any type of communication network adapted to transmit the watermarked audio signal.
  • The decoding device 20 is configured to receive the watermarked audio signal and to extract the watermark from it.
  • Preferably, the watermark is extracted by using the same watermark profile used for embedding the watermark into the audio signal. In view of this, the decoding device 20 needs to know the watermark profile used for embedding the watermark.
  • Figure 4 shows a first embodiment of the decoding device 20 comprising a classification unit 22, a watermark profile unit 24, an extraction unit 26, a class database 27 and a watermark profile database 29.
  • The classification unit 22 is configured to assign the watermarked audio signal to a class, depending upon the semantic content of the audio signal, in the same way as disclosed above with reference to the classification unit 12 of the encoding device 10. The class assigned to the watermarked audio signal will thus be the same as that assigned in the encoding device 10.
  • The class database 27 (which could also be external to the decoding device 20) stores the plurality of classes in which the audio signal can be classified.
  • The watermark profile database 29 (which could also be external to the decoding device 20) stores an association between the plurality of classes and the corresponding plurality of watermark profiles.
  • It is noted that, even if database 27 and database 29 are shown in the figures as two distinct entities, they can also be implemented into a single database.
  • Once the audio signal is assigned to a class by the classification unit 22, the watermark profile unit 24 is configured to retrieve from the watermark profile database 29 the watermark profile associated with said class and to provide it to the extraction unit 26.
  • The association in the watermark profile database 29 is the same as that in the watermark profile database 19 of the encoding device 10. The watermark profile retrieved by the watermark profile unit 24 will thus be the same as that used in the encoding device 10.
  • The extraction unit 26 is configured to use the watermark profile retrieved by watermark profile unit 24 for extracting the watermark from the watermarked audio signal.
  • Figure 5 shows a second embodiment of the decoding device 20 comprising an extraction unit 26. In this embodiment, watermarked audio signal classification is not performed in decoding device 20.
  • According to a first implementation of this embodiment, extraction unit 26 is configured to try in sequence the plurality of watermark profiles for extracting the watermark till the watermark is successfully extracted from the watermarked audio signal. The extraction unit 26 can comprise a plurality of extraction sub-units (not shown), one for each watermark profile or for each watermarking technique, for trying the plurality of watermark profiles at least partly (or wholly) in parallel.
  • According to a second implementation, the class of the audio signal can be inserted into the audio signal by the embedding unit 16 of the encoding device 10, by embedding a second watermark (containing said class) into the audio signal with a predefined watermark profile, common to all audio signals independently of their class. The second watermark can be embedded into the already watermarked audio signal. Alternatively, the two watermarks can be embedded into different sub-bands of the audio signal.
  • In this second implementation, the extraction unit 26 is preferably configured to first use the predefined common watermark profile to extract the second watermark from the watermarked audio signal, thereby retrieving the class of the audio signal. Then, the extraction unit 26 is configured to obtain (e.g., from a watermark profile database - not shown in figure 5 - similar to watermark profile database 29) the watermark profile associated with the retrieved class and to extract the watermark from the watermarked audio signal with the obtained watermark profile.
  • Figure 6 shows an exemplary implementation of the system 1 of audio signal watermarking.
  • According to this exemplary implementation, the encoding device 10 is deployed on an entity 50 for embedding watermarks into audio signals.
  • The entity 50 can be, for example, a recording industry, a music producer or a service supply company providing services to the users.
  • The watermark can comprise data relating to signature information, copyright information, serial numbers of broadcast audio signals, product identification in audio signal broadcasting, and the like.
  • The audio signal can comprise music or speech such as, for example, talks from movies, from TV or radio programs, from TV or radio advertisements, from TV or radio talk shows, and the like.
  • The decoding device 20 is deployed on a user device 60.
  • The user device 60 can be, for example, a PC, a smart phone, a tablet, a portable media player (e.g. an iPod®), or other similar device.
  • The user device 60 can be adapted to download or stream a video from the internet, to detect audio signals of a TV program broadcast on TV, or to detect audio signals of a movie played by means of a DVD player, a VHS player, a decoder or the like.
  • A media provider 40 can obtain watermarked audio signals from entity 50 and supply them to user device 60, through communication network 30.
  • The watermarked audio signals can be, for example, supplied to the user device 60 by means of broadcasting (e.g. from a TV or radio station), streaming (e.g. from the internet) or downloading (e.g. from a PC).
  • The media provider 40 can be, for example, a TV station, a radio station, a PC or other similar device.
  • User device 60, equipped with the decoding device 20, will be configured to extract the watermark from the watermarked audio signals.
  • For example, the watermark can comprise information enabling the user device 60 to connect, through communication network 30, to a service provider 70 that supplies predetermined services to users. In this case, the audio signals can be, for example, audio signals of a TV talk show or movie (or the like) and the user service can involve the provision of information to users about the TV images the users are watching (e.g. information about the actors, about items of clothing and/or furnishing, about the movies or talk shows, about the set, and the like).

Claims (15)

  1. Method of watermarking an audio signal comprising:
    - assigning the audio signal to a class, among a plurality of classes, depending on the semantic content of the audio signal, the plurality of classes being associated with a corresponding plurality of watermark profiles;
    - obtaining the watermark profile associated with the class assigned to the audio signal;
    - embedding a watermark into the audio signal by using the obtained watermark profile so as to provide a watermarked audio signal.
  2. Method according to claim 1, wherein each watermark profile is associated with a corresponding class so that trade-off among watermark imperceptibility, robustness and payload is optimized for said class, depending on watermarking application.
  3. Method according to claim 1 or 2, wherein the plurality of watermark profiles relate to a single watermarking technique or to a plurality of watermarking techniques.
  4. Method according to any of claims 1 to 3, wherein each watermark profile is defined by a set of parameters and the watermark profiles differ from each other in the value taken by at least one parameter of said set of parameters and/or in at least one parameter of said set of parameters.
  5. Method according to any of claims 1 to 4, further comprising: computing a masking threshold of the audio signal according to a psychoacoustic model.
  6. Method according to claim 5, wherein embedding the watermark into the audio signal comprises the step of shaping the energy of the watermark according to the computed masking threshold.
  7. Method according to claim 6, wherein each watermark profile is defined by a set of parameters comprising a masking threshold modulation factor F, and the energy of the watermark is shaped according to the computed masking threshold, multiplied by the masking threshold modulation factor F.
  8. Method according to any of claims 1 to 7, wherein assigning the audio signal to a class, depending on the semantic content of the audio signal, is performed based upon analysis of at least one audio signal feature related to time, frequency, energy or cepstrum domain of the audio signal.
  9. Method according to any of claims 1 to 8, further comprising a decoding process comprising: extracting the watermark from the watermarked audio signal by using the same watermark profile used for embedding the watermark into the audio signal.
  10. Method according to claim 9, wherein the decoding process comprises:
    - assigning the watermarked audio signal to a class, among said plurality of classes, depending on the semantic content of the watermarked audio signal;
    - obtaining the watermark profile associated with the class assigned to the watermarked audio signal; and
    - extracting the watermark from the watermarked audio signal by using the obtained watermark profile.
  11. Method according to claim 9, wherein the decoding process comprises: trying in sequence said plurality of watermark profiles till the watermark is successfully extracted from the watermarked audio signal.
  12. Method according to claim 9, wherein embedding the watermark into the audio signal comprises: embedding a second watermark into the audio signal, comprising the class assigned to the audio signal, by using a common watermark profile; and the decoding process comprises:
    - extracting the second watermark from the watermarked audio signal by using the common watermark profile so as to retrieve the class of the audio signal,
    - obtaining the watermark profile associated with the retrieved class, and
    - extracting the watermark from the watermarked audio signal with the obtained watermark profile.
  13. System (1) of watermarking an audio signal comprising an encoding device (10) comprising:
    - a classification unit (12) configured to assign the audio signal to a class, among a plurality of classes, depending on the semantic content of the audio signal, the plurality of classes being associated with a corresponding plurality of watermark profiles;
    - a watermark profile unit (14) configured to obtain the watermark profile associated with the class assigned to the audio signal;
    - an embedding unit (16) configured to embed a watermark into the audio signal by using the watermark profile obtained by watermark profile unit so as to provide a watermarked audio signal.
  14. System (1) according to claim 13, further comprising a database (19) storing the plurality of classes associated with the corresponding plurality of watermark profiles.
  15. System (1) according to claim 13 or 14, further comprising a decoding device (20) configured to extract the watermark from the watermarked audio signal, by using the same watermark profile used by the embedding unit (16) for embedding the watermark into the audio signal.
EP13162596.4A 2013-04-05 2013-04-05 Method and system of audio signal watermarking Withdrawn EP2787503A1 (en)


Publications (1)

Publication Number Publication Date
EP2787503A1 true EP2787503A1 (en) 2014-10-08

Family

ID=48045325


Country Status (1)

Country Link
EP (1) EP2787503A1 (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6456726B1 (en) * 1999-10-26 2002-09-24 Matsushita Electric Industrial Co., Ltd. Methods and apparatus for multi-layer data hiding
US6674861B1 (en) * 1998-12-29 2004-01-06 Kent Ridge Digital Labs Digital audio watermarking using content-adaptive, multiple echo hopping
US20040059918A1 (en) * 2000-12-15 2004-03-25 Changsheng Xu Method and system of digital watermarking for compressed audio
EP1594122A1 (en) 2004-05-06 2005-11-09 Deutsche Thomson-Brandt Gmbh Spread spectrum watermarking
US7127065B1 (en) * 1999-11-23 2006-10-24 Koninklijke Philips Electronics N.V. Watermark embedding and detection
US20100057231A1 (en) * 2008-09-01 2010-03-04 Sony Corporation Audio watermarking apparatus and method


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104700841A (en) * 2015-02-10 2015-06-10 浙江省广电科技股份有限公司 Watermark embedding and detecting method based on audio content classification
EP3109860A1 (en) * 2015-06-26 2016-12-28 Thomson Licensing Method and apparatus for increasing the strength of phase-based watermarking of an audio signal
US9922658B2 (en) 2015-06-26 2018-03-20 Thomson Licensing Method and apparatus for increasing the strength of phase-based watermarking of an audio signal
CN106504757A (en) * 2016-11-09 2017-03-15 天津大学 A kind of adaptive audio blind watermark method based on auditory model
EP3769281A4 (en) * 2018-03-21 2022-01-12 The Nielsen Company (US), LLC Methods and apparatus to identify signals using a low power watermark
CN111292756A (en) * 2020-01-19 2020-06-16 成都嗨翻屋科技有限公司 Compression-resistant audio silent watermark embedding and extracting method and system
CN111292756B (en) * 2020-01-19 2023-05-26 成都潜在人工智能科技有限公司 Compression-resistant audio silent watermark embedding and extracting method and system
CN111968654A (en) * 2020-08-24 2020-11-20 成都潜在人工智能科技有限公司 Self-adaptive mixed domain audio watermark embedding method
EP4258139A1 (en) * 2022-04-07 2023-10-11 Siemens Aktiengesellschaft Method for preventing the theft of machine learning modules and prevention system


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20130405

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20150409