WO2009074773A1

WO2009074773A1 - Processing a content signal

Info

Publication number: WO2009074773A1
Application number: PCT/GB2008/003999
Authority: WO
Inventors: Serverius Petrus Paulus Pronk; Johannes Henricus Maria Korst
Original assignee: Ambx Uk Limited
Priority date: 2007-12-11
Filing date: 2008-12-04
Publication date: 2009-06-18
Also published as: GB201009206D0; GB2467273A; GB2467273B

Abstract

A method of processing a content signal comprises receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints, a best match, the identifying comprising for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate. Advantageously, the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal.

Description

Processing a content signal

FIELD OF THE INVENTION

This invention relates to a method and system for processing a content signal. The method provides for the identification of video streams based on fingerprints.

BACKGROUND OF THE INVENTION

Traditional consumer electronics devices such as televisions are being augmented, either directly with new features, or indirectly with additional devices that provide additional content to the local environments. One example of such an augmented television is the Ambilight television, from Philips, which was introduced into the market in 2004. This television has additional lights that provide background lighting to augment the user's experience of watching the television. The colours that the lights show can be derived from the television signal.

This device is only a first step towards creating a richer user experience when enjoying conventional video content. Additional effects such as rumbling, wind, temperature changes in the room, odours, sophisticated light effects, can be played in concert with the video stream, which can be modelled as a succession of video frames. These effects can, for instance, be realized by synchronously executing a script containing descriptions of these effects and this is used to control the appropriate hardware peripherals based on these descriptions. Typically, an effect can be triggered at a certain frame in the video and be instantaneous, last for a specified number of frames, fade out, et cetera. Such scripts can be hand-crafted and be very rich in terms of the bouquet of effects they can describe.

One of the problems that occur in this context is that the system that is providing the augmentation in the user's environment is quite often given an unknown video stream, at an unknown position, i.e. frame, within this stream, and the system needs to identify the video stream as well as know the exact position within the video stream, in order to be able to synchronise the augmentation effects correctly with the video content. For example, if a user turns to a new television channel to watch a film that is being shown, there has to be a method of identifying the film and the specific point within the film that has been reached. Currently, fingerprinting technology is being used to solve this problem. For example, International Patent Application Publication WO 2002/065782 discloses a method of generating and matching hashes of multimedia content. Hashes are short summaries or signatures of data files which can be used to identify the file. Hashing multimedia content (audio, video, images) is difficult because the hash of original content and processed (e.g. compressed) content may differ significantly. The method of this document generates robust hashes for multimedia content, for example, audio clips. The audio clip is divided into successive (preferably overlapping) frames. For each frame, the frequency spectrum is divided into bands. A robust property of each band (e.g. energy) is computed and represented by a respective hash bit. An audio clip is thus represented by a concatenation of binary hash words, one for each frame. To identify a possibly compressed audio signal, a block of hash words derived therefrom is matched by a computer with a large database. Such matching strategies are also disclosed. In an advantageous embodiment, the extraction process also provides information as to which of the hash bits are the least reliable. Flipping these bits considerably improves the performance of the matching process.

This technique can also be used for video frames. In this case, a fingerprint is a small digest of a single frame, based on average luminance values, and is typically 32 bits long and includes, for each bit, whether it is reliable or not. A number of fingerprints from successive frames of the given video stream are sent by a client device such as a set-top box, to an identification server, which searches a large database of complete sequences of fingerprints from a large collection of video streams (for example, movies) to identify the video stream and indicate the position within this stream where the fingerprints came from. The latter comes virtually for free when the identification has been done. A typical value for the number of successive fingerprints required for reliable identification is 250, corresponding to ten seconds of video material. This means that there is a large amount of processing required on the identification server and that there is a delay of at least ten seconds before identification is performed.

SUMMARY OF THE INVENTION

It is therefore an object of the invention to improve upon the known art.

According to a first aspect of the present invention, there is provided a method of processing a content signal comprising receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.

According to a second aspect of the present invention, there is provided a system for processing a content signal comprising a client device arranged to receive a content signal comprising a series of frames, to select a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, and to extract a fingerprint for each selected frame to create a set of fingerprints, and an identification server arranged to match a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and to identify from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.

Owing to the invention, it is possible to diminish the amount of processing required by not using a complete block of successive fingerprints, but instead only a subset of selected fingerprints from this block, based on a number of predetermined criteria, to calculate the quality of a match. The amount of processing required is reduced, and the amount of traffic that must travel between the client device and the identification server is also reduced.

Advantageously, the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal. The use of frames that are spaced apart from each other to create the fingerprints produces a set of fingerprints that will still be able to be used to identify the content signal, but have reduced redundancy compared to a complete set of fingerprints for a given set of frames.

Preferably, the predefined algorithm uses a predefined distance d, where d is an integer greater than 1, and the step of selecting the plurality of frames comprises selecting frames that are the distance d apart in the content signal. The use of a constant d provides a simple method of determining which frames to use to produce the fingerprints. For example, d could be set to be equal to 5, and so, if the start frame is frame 0 and the end frame is frame 250, then every 5th frame will be selected, and have its fingerprint extracted.

In an alternative embodiment, the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are no more than the distance k apart in the content signal. In this case a pseudo-random selection of frames can be made, with a limit though on the maximum spacing between the frames that are selected. For example k could be set to be equal to 8, which means that when the frames are selected for fingerprint extraction, a pair of frames can be no more than 8 frames apart. This supports a selection of frames that is not as rigid as the "every dth frame" of the previous paragraph.
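The two selection schemes above can be illustrated with a minimal sketch. The function names, the inclusive treatment of the end frame, the seeded random generator and the choice to draw steps from 2..k (so that selected frames are non-successive) are assumptions made for illustration and are not taken from the specification.

```python
import random


def select_fixed_distance(start, end, d):
    """Return frame indices from start to end (inclusive) that are d apart."""
    if d <= 1:
        raise ValueError("d must be an integer greater than 1")
    # Note: the end frame itself is only hit when (end - start) is a multiple of d.
    return list(range(start, end + 1, d))


def select_max_gap(start, end, k, seed=0):
    """Pseudo-randomly pick frames so that neighbouring picks are at most k apart.

    Steps are drawn from 2..k (an assumption, to keep frames non-successive);
    the final step is clipped so that the end frame is always selected.
    """
    if k <= 1:
        raise ValueError("k must be an integer greater than 1")
    rng = random.Random(seed)
    selected = [start]
    while selected[-1] < end:
        step = rng.randint(2, k)
        selected.append(min(selected[-1] + step, end))
    return selected


if __name__ == "__main__":
    fixed = select_fixed_distance(0, 250, 5)
    print(len(fixed), fixed[:4], fixed[-1])
    print(select_max_gap(0, 40, 8))
```

With d = 5 over frames 0 to 250 the first function selects 51 frames, compared with the 251 frames of the complete block.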

A further process by which the selection of fingerprints for matching could be made further comprises extracting a fingerprint for each frame from the start frame to the end frame, and calculating a reliability indicator for each extracted fingerprint, and wherein the creation of the set of fingerprints comprises selecting those extracted fingerprints whose reliability indicator is below a predetermined threshold. In this case, a larger number of fingerprints are calculated (across the whole of a chunk of the content signal), but not all of these fingerprints are actually used in the matching process. The selection of which fingerprints to use is made on a quality basis, with only those fingerprints that have their reliability indicator below a predetermined threshold being used in the matching part of the operation. This methodology sacrifices some time, but improves the quality of the matching. Advantageously, the step of selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, comprises selecting a predefined number of frames. However the start and end frames are selected, the selection of the frames between these two frames is limited to a predetermined fixed number, which is set according to the number needed to perform the necessary matching.
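The reliability-based selection can be sketched as follows. The `Fingerprint` record, its field names, and the use of the count of unreliable bits as the reliability indicator are illustrative assumptions; the description only states later that the indicator can be a measure of the number of unreliable bits.

```python
from typing import NamedTuple


class Fingerprint(NamedTuple):
    frame: int       # index of the frame in the content signal
    bits: int        # 32-bit fingerprint value
    unreliable: int  # 32-bit mask, a set bit marks an unreliable fingerprint bit


def reliability_indicator(fp):
    """Assumed indicator: the number of unreliable bits (lower is better)."""
    return bin(fp.unreliable).count("1")


def select_reliable(fingerprints, threshold):
    """Keep only fingerprints whose indicator is below the threshold."""
    return [fp for fp in fingerprints if reliability_indicator(fp) < threshold]


if __name__ == "__main__":
    extracted = [
        Fingerprint(0, 0xDEADBEEF, 0b000),
        Fingerprint(1, 0x12345678, 0b111),
        Fingerprint(2, 0x0F0F0F0F, 0b001),
    ]
    print([fp.frame for fp in select_reliable(extracted, threshold=2)])
```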

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:-

Fig. 1 is a schematic diagram of a system for processing a content signal,

Fig. 2 is a schematic diagram of the content signal,

Fig. 3 is a schematic diagram of an identification server,

Fig. 4 is a flowchart of a method of processing the content signal,

Fig. 5 is a further flowchart of a method of processing the content signal, and

Fig. 6 is a view similar to Fig. 2, of the content signal.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Fig. 1 shows a system that comprises a client device 10, an identification server 12, a script server 14 and rendering devices 16. The client device 10 and the rendering devices 16 are located in the same environment, such as a room in a user's house. The identification server 12 and the script server 14 are located remotely from the client device 10 and rendering devices 16, and are connected to the client device 10 via a wide area network such as the Internet. The two servers 12 and 14 are not necessarily at the same physical location; it is sufficient that they be available to the client device 10 as and when they are needed.

The client device 10 is a standalone device which has processing, storage and communication functions. The device 10 has a broadband Internet connection, to communicate with the servers 12 and 14, and has a wireless transceiver to communicate with the rendering devices 16. The rendering devices 16 comprise lights, fans, heating devices, and rumble pads etc., all of which are used to provide elements of an ambient environment. The client device 10 receives a content signal 18. This content signal could be an audio signal, a video signal, or a combination of the two, for example. A simple configuration of the system of Fig. 1 is for the client device 10 to be connected to a video out socket of a device such as a DVD player. In this case, the client device 10 receives the same content signal 18 that is currently being displayed to the user via a conventional display device (not shown).

The client device 10 extracts fingerprints 20 from the content signal 18 (described in detail below) and transmits these to the identification server 12. The identification server 12 uses the fingerprints 20 to identify the content signal 18 (again described below in more detail) and transmits back an ID reference 22, which may also include a position reference stating where in the content signal the fingerprints are located. This is because it is obviously useful to know, in addition to the identity of the content signal 18, where in the content signal 18 the user is currently located (two minutes from the start, or ten minutes from the start etc.).

The client device 10 then makes a request 24 to the script server 14, using the ID 22. The script server 14, in reply, supplies a script 26 which can be used by the client device 10 to control the various rendering devices 16. The script 26 will contain details of effects to render in concert with the content signal 18, to augment the user's experience of the content that they are currently watching. The content signal 18 is supplied to a display device such as a television, and the script is supplied to the augmentation devices.

Referring to the architecture shown in Fig. 1, in summary, it depicts the basic building blocks of the system. As the video signal 18 arrives at the client device 10, this client device 10 extracts a number of fingerprints 20 according to the invention, whereupon it sends them to the identification server 12. This identification server 12 will return a content ID (with, optionally, a position) if it can locate the provided fingerprints 20 with sufficient accuracy. Otherwise it returns null. If the client device 10 does obtain an ID and, optionally, a position, it sends a request to the script server 14, which returns, if available, a script 26 corresponding to the identified content. The client device 10 next feeds the script 26 as well as the content signal 18 to a number of rendering devices 16. Although in this system, the content 18 considered is video, it also applies to other content, such as audio. In this embodiment, the script 26 is shown as received by the client device 10 and passed directly to the rendering devices 16. In other embodiments, the client device 10 will process the script 26, and feed appropriate commands to the rendering devices 16.

Fig. 2 shows the content signal 18 in more detail, as it is handled by the client device 10. The client device 10 is arranged to receive the content signal 18, which comprises a series of frames 28, and to select a plurality of frames 28 in the content signal 18 according to a predefined algorithm. The selected frames 28 comprise a start frame, an end frame, and only a portion of the frames 28 between the start frame and the end frame. In this embodiment, the selecting of the plurality of frames 28 comprises selecting frames 28 that are non-successive in the content signal 18. To achieve this, the predefined algorithm uses a predefined distance d, where d is an integer > 1, equal to 3 in Fig. 2, and the selecting of the plurality of frames 28 comprises selecting frames 28 that are the distance d apart in the content signal 18. Once the frames 28 have been selected, the client device 10 is further arranged to extract a fingerprint 20 for each selected frame 28 to create a set 30 of fingerprints 20.

In the above-mentioned prior art, a number m of successive frames, numbered i, i + 1, ... , i + m - 1 and abbreviated as [i..i + m), of a piece of an unidentified video stream is used to compute a corresponding number of fingerprints f[0..m), one for each of the frames, each fingerprint being characterized by a bit string typically 32 bits long, each bit also carrying an indication of whether it is reliable or not. In one embodiment, this information is next used to identify the video stream in the following way. It is assumed that a server stores the fingerprints of a large number of movies to serve as ground truth, and all of the bits in these fingerprints are reliable.

Initially, one of the computed fingerprints, for instance f(j), is used to identify all movies and positions within these movies where the computed fingerprint matches the stored fingerprints at all reliable-bit positions. For each movie-position pair (v, i) found, the computed fingerprints are used to compute a bit-error rate (BER) between these fingerprints and the stored fingerprints at positions [i - j..i + m - j) in v, again considering only the reliable bits. The movie-position pair with the lowest BER, if sufficiently low, is returned. If there is no such pair, then other computed fingerprints are used successively to repeat the same procedure until a movie-position pair with a sufficiently low BER is found and this pair is returned. If no movie-position pair yields a sufficiently low BER, then none is returned.
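The bit-error rate over reliable bits only can be sketched as below. The representation of a computed fingerprint as a pair (bits, unreliable mask), the function name, and the convention of returning 1.0 when no reliable bit is available are assumptions for illustration, not details given in the description.

```python
MASK32 = 0xFFFFFFFF


def bit_error_rate(computed, stored):
    """BER between computed and stored fingerprints, counting reliable bits only.

    computed: list of (bits, unreliable_mask) pairs, one per frame
    stored:   list of 32-bit integers at the corresponding positions
    """
    errors = 0
    compared = 0
    for (bits, unreliable), reference in zip(computed, stored):
        reliable = ~unreliable & MASK32          # bits that may be compared
        errors += bin((bits ^ reference) & reliable).count("1")
        compared += bin(reliable).count("1")
    # Assumed convention: with nothing to compare, report the worst case.
    return errors / compared if compared else 1.0


if __name__ == "__main__":
    computed = [(0b1010, 0b0001), (0xFFFFFFFF, 0)]
    stored = [0b1011, 0xFFFFFFFE]
    print(bit_error_rate(computed, stored))
```

In this usage example the differing bit of the first fingerprint is flagged as unreliable and therefore ignored, so only the single differing bit of the second fingerprint counts, out of 63 compared bits.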

In the actual prior art implementation, finding all movie-position pairs as described above can be done iteratively by successively matching all 2ⁿ fingerprints that result from flipping all n unreliable bits independently, whereby matching is performed on all bits, reliable or not. At each of the iterations, the movie-position pairs found are used to compute a BER as described above. The order in which the fingerprints are chosen to find movie-position pairs can be optimized by selecting the fingerprints in order of increasing number of unreliable bits.
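The enumeration of the 2ⁿ variants can be sketched as follows; the function names are hypothetical and a 32-bit fingerprint is assumed. The second helper illustrates the ordering by increasing number of unreliable bits, which keeps the number of variants to try small for the fingerprints tried first.

```python
from itertools import product


def candidate_fingerprints(bits, unreliable):
    """Yield every value obtained by setting the unreliable bits independently."""
    positions = [p for p in range(32) if (unreliable >> p) & 1]
    base = bits & ~unreliable            # clear all unreliable bit positions
    for choice in product((0, 1), repeat=len(positions)):
        value = base
        for position, bit in zip(positions, choice):
            value |= bit << position
        yield value


def order_by_unreliable_bits(fingerprints):
    """Sort (bits, unreliable_mask) pairs by increasing number of unreliable bits."""
    return sorted(fingerprints, key=lambda fp: bin(fp[1]).count("1"))


if __name__ == "__main__":
    # Two unreliable bits give 2**2 = 4 variants to look up.
    print(sorted(candidate_fingerprints(0b1000, 0b0011)))
```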

The system of Figs. 1 and 2 is characterized in that not all fingerprints 20 are used to perform the matching, but only a selected number. The matching occurs at the identification server 12, shown in more detail in Fig. 3. The server 12 comprises a network interface 32, a processor 34 and a database 36. The identification server 12 is arranged to match a fingerprint 20 from the set 30 of fingerprints 20 to a plurality of fingerprints stored in the database 36, and to identify from the plurality of matched fingerprints, a best match. The identifying of the best match comprises, for further fingerprints 20 in the set 30 of fingerprints 20, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints 20 has the lowest error rate. In one embodiment, illustrated in Fig. 2, only fingerprints 20 at a fixed distance larger than 1 from each other and relative to the initially used fingerprint are used. To be more precise, if fingerprint f(j) has been used to find a number of movie-position pairs, then only fingerprints ..., f(j - 2k), f(j - k), f(j), f(j + k), ... within the block of m successive fingerprints 20 are used in the matching process. Since the client device 10 decides which fingerprints 20 are used to find a number of movie-position pairs, only the fingerprints 20 actually used need to be transferred to the server 12, which reduces the communication effort as well as the computational effort. Although this subsampling may adversely affect the quality of the result returned by the server, a slight increase in the size m of the block of fingerprints 20 will offset this loss of quality.
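The set of indices ..., j - 2k, j - k, j, j + k, ... that fall inside the block [0..m) can be computed directly, as in this minimal sketch (the function name is hypothetical):

```python
def subsampled_indices(j, k, m):
    """Indices in [0..m) of the form j + n*k for any integer n."""
    if k <= 1:
        raise ValueError("k must be an integer greater than 1")
    if not 0 <= j < m:
        raise ValueError("j must lie inside the block [0..m)")
    # All such indices share the remainder j % k, so start from the smallest one.
    return list(range(j % k, m, k))


if __name__ == "__main__":
    print(subsampled_indices(j=7, k=5, m=20))  # [2, 7, 12, 17]
```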

Fig. 4 summarises the method of processing the content signal 18, which is carried out by the client device 10 and the identification server 12. The flowchart of Fig. 4 sketches the overall algorithm to identify the content signal 18 and, optionally, the position within this content signal 18 to which the extracted fingerprints 20 relate. As the content signal 18 arrives at the client device 10, a number of fingerprints 20 are extracted, yielding an array f[0..k) of fingerprints 20 and (in this embodiment) a distance d between the fingerprints 20. This information is next sent to the identification server, which executes the following commands. First, an index j in [0..k) is chosen to find all content-position pairs m[0..n), p[0..n) where this fingerprint 20 is found. Next, at most one index i is returned that yields a content-position pair m[i], p[i] with the lowest bit-error rate (BER). This m[i] and, optionally, p[i], or null if there is no such pair, is returned to the client device 10.

The step "find best index i" shown in Fig. 4 is shown in more detail in Fig. 5. Referring to this second flowchart, a bit-error rate (BER) is calculated for all content-position pairs m[0..n), p[0..n). One pair with the lowest BER is selected. If this BER is sufficiently low, then the corresponding index is returned, otherwise it is returned that no index has been found. The identification server 12 thus identifies, from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing the BER, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest BER. The index corresponding to a movie-position pair with the lowest BER is returned, provided that this BER is sufficiently low. Otherwise it is returned that no index has been found.
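The flow of Figs. 4 and 5 can be sketched on the server side as below. This is a simplified illustration under stated assumptions, not the implementation of the identification server 12: the database is modelled as a hypothetical dict mapping a content ID to its list of per-frame fingerprints, the initial lookup is an exact match that ignores bit reliability, and the acceptance threshold `max_ber` is an arbitrary illustrative value.

```python
def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def identify(query, d, database, j=0, max_ber=0.25, bits_per_fp=32):
    """Return (content_id, position of query[j]) with the lowest BER, or None.

    query:    fingerprints f[0..k) taken from frames that are d apart
    database: dict content_id -> list of stored fingerprints, one per frame
    """
    # Step 1 (Fig. 4): find all content-position pairs where f[j] occurs.
    candidates = [
        (content_id, position)
        for content_id, stored in database.items()
        for position, fingerprint in enumerate(stored)
        if fingerprint == query[j]
    ]
    # Step 2 (Fig. 5): compute a BER per pair and keep the lowest one.
    best, best_ber = None, None
    for content_id, position in candidates:
        stored = database[content_id]
        errors = 0
        for i, fingerprint in enumerate(query):
            reference = position + (i - j) * d
            if not 0 <= reference < len(stored):
                break  # query block does not fit at this position
            errors += hamming(fingerprint, stored[reference])
        else:
            ber = errors / (bits_per_fp * len(query))
            if best_ber is None or ber < best_ber:
                best, best_ber = (content_id, position), ber
    # Only report the pair if its BER is sufficiently low.
    return best if best is not None and best_ber <= max_ber else None


if __name__ == "__main__":
    database = {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [3, 0, 0, 0xFFFFFFFF, 0, 0, 0],
    }
    print(identify([3, 6, 9], d=3, database=database))  # ('A', 2)
```

In the usage example the first query fingerprint occurs in both stored items, and the remaining two fingerprints decide in favour of item "A".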

To illustrate the above matching process, consider the simplest example, in which two fingerprints 20 are calculated from the content signal 18, these two fingerprints 20 being for two frames 28 that are spaced apart by a distance of five frames. Of course in a practical embodiment, there will be perhaps fifty fingerprints spread over 250 frames. In this example though, the client device 10 will transmit these two selected fingerprints 20 to the identification server 12. This server 12 will take one of these fingerprints 20 (for example the earliest in the original signal 18), which constitutes fingerprint j, and perform a table look-up to immediately identify comparable fingerprints stored in the database 36. Every piece of content within the database 36 will have a fingerprint 20 calculated for every frame 28. The fingerprint j is compared with all of these stored fingerprints.

Now suppose that the fingerprint j matches with three different fingerprints stored in the database 36. This multiple matching will occur due to the similarity of fingerprints for frames in different content. The identification server 12 must then determine which of these three is the best match; the step "find best index i". To execute the best matching process, the server 12 takes the second fingerprint (of the two originally received) which is j+5, and compares that fingerprint with the corresponding fingerprints five frames forward in each of the stored content that matched with fingerprint j. The error rate of the j+5 matching is computed (the BER) and the index i with the lowest BER is the best match. This identifies the content signal 18 that the user is currently viewing, and also the position within that content.
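The two-fingerprint example can be worked through numerically. All values below, including the content names, positions and fingerprint values, are invented purely for illustration.

```python
BITS = 32


def ber(a, b):
    """Bit-error rate between two 32-bit fingerprints."""
    return bin(a ^ b).count("1") / BITS


# Second received fingerprint, taken five frames after fingerprint j.
fp_j_plus_5 = 0x0F0F0F0F

# Stored fingerprint found five frames after each of the three matches of j.
stored_five_frames_on = {
    ("film_a", 1200): 0x0F0F0F0E,  # differs in 1 bit
    ("film_b", 48): 0xF0F0F0F0,    # differs in all 32 bits
    ("film_c", 9731): 0x0F0F00FF,  # differs in 8 bits
}

rates = {pair: ber(fp_j_plus_5, fp) for pair, fp in stored_five_frames_on.items()}
best_pair = min(rates, key=rates.get)

if __name__ == "__main__":
    for pair, rate in rates.items():
        print(pair, rate)
    print("best match:", best_pair)
```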

In the above discussion, it is assumed that the frames 28 that are used to generate the fingerprints 20 are spaced apart by the fixed distance d. In another embodiment, a number of fingerprints 20 at variable distances from each other are used. For example, only fingerprints f(j_0), f(j_1), ..., f(j_{l-1}), with j_i - j_{i-1} > k, for some k > 1 and for all i = 1, 2, ..., l - 1, are used, one of which is used to find a number of movie-position pairs. Such a sequence of fingerprints 20 may be obtained, for instance, by only using those fingerprints 20 that have at most a specified number of unreliable bits and are not from successive frames 28. This is illustrated in Fig. 6.

Further schemes for selecting the frames 28 to use could be based around a random selection of frames 28, using a constant number as a minimum distance between the selected frames 28. In this case, the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are at least the distance j apart in the content signal. The use of a sufficiently large value for j ensures that the frames 28 that are selected are likely to be sufficiently different from one another that a good range of fingerprints 20 are returned, assisting the matching process.
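A random selection with a minimum spacing can be sketched as follows. The upper bound `max_gap` on each random step, the seeded generator and the function name are assumptions added to make the example concrete; the description only requires the minimum distance j (called `min_gap` here).

```python
import random


def select_min_gap(start, end, min_gap, max_gap, seed=0):
    """Randomly pick frames in [start, end] that are at least min_gap apart."""
    if min_gap <= 1 or max_gap < min_gap:
        raise ValueError("need 1 < min_gap <= max_gap")
    rng = random.Random(seed)
    selected = [start]
    while True:
        candidate = selected[-1] + rng.randint(min_gap, max_gap)
        if candidate > end:
            break  # the last frame picked so far acts as the end frame
        selected.append(candidate)
    return selected


if __name__ == "__main__":
    print(select_min_gap(0, 60, min_gap=4, max_gap=10))
```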

In the embodiment of Fig. 6, the client device 10 is arranged to extract a fingerprint 20 for each frame 28 from the start frame to the end frame, and to calculate a reliability indicator 38 for each extracted fingerprint 20. In this case, the creation of the set 30 of fingerprints 20 comprises selecting those extracted fingerprints 20 whose reliability indicator 38 is below a predetermined threshold. The indicator 38 can be a measure of the number of unreliable bits in the respective fingerprint 20. The threshold can be determined dynamically, to ensure that a sufficient number of fingerprints 20 are selected to be transferred to the identification server 12.
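One possible way to determine the threshold dynamically is to raise it until enough fingerprints qualify. The sketch below assumes that the indicator 38 is the number of unreliable bits in a 32-bit fingerprint; the function name and the search strategy are illustrative choices, not taken from the description.

```python
def dynamic_threshold(unreliable_counts, needed, max_bits=32):
    """Smallest threshold t such that at least `needed` counts are below t.

    Returns None when fewer than `needed` fingerprints exist at all.
    """
    # A count can be at most max_bits, so t = max_bits + 1 admits everything.
    for threshold in range(1, max_bits + 2):
        selected = sum(1 for count in unreliable_counts if count < threshold)
        if selected >= needed:
            return threshold
    return None


if __name__ == "__main__":
    counts = [0, 3, 1, 5, 2, 0]
    print(dynamic_threshold(counts, needed=4))  # 3
```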

The indicator 38 can also be used on the server side. For example, the fingerprints 20 can be sorted in order of increasing number of unreliable bits and only a first fraction of these ordered fingerprints 20 are used to calculate a BER, when searching for the best match. This will also result in the transfer of a number of fingerprints at different distances in the movie to the server. In yet another embodiment, the lowest BER computed thus far is maintained as more fingerprints are used to generate movie-position pairs, and this lowest BER is used to stop matching around a movie-position pair to calculate the BER when it can be concluded that the final BER for this movie-position pair cannot be lower than the lowest BER found thus far. Of course, the abovementioned embodiments can be combined in various ways, as can easily be seen by a person of moderate skill in the art.
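The early-stopping idea can be sketched as below. It assumes that every candidate is compared over the same number of bits, so that comparing accumulated error counts is equivalent to comparing BERs; the function name and the dict of pre-aligned stored fingerprints are hypothetical.

```python
def best_candidate(query, candidates, bits_per_fp=32):
    """Return (name, BER) of the best candidate, abandoning hopeless ones early.

    candidates: dict name -> stored fingerprints aligned one-to-one with query
    """
    if not query:
        raise ValueError("query must contain at least one fingerprint")
    best_name, best_errors = None, None
    for name, stored in candidates.items():
        errors = 0
        for fingerprint, reference in zip(query, stored):
            errors += bin(fingerprint ^ reference).count("1")
            # Errors never decrease, so this candidate cannot become lower.
            if best_errors is not None and errors >= best_errors:
                break
        else:
            best_name, best_errors = name, errors
    if best_name is None:
        return None, None
    return best_name, best_errors / (bits_per_fp * len(query))


if __name__ == "__main__":
    query = [0b1111, 0b0000]
    candidates = {
        "x": [0b0000, 0b0000],
        "y": [0b1110, 0b0001],
        "z": [0b1111, 0b1111],
    }
    print(best_candidate(query, candidates))
```

In the usage example, candidate "z" is abandoned after its second fingerprint because its error count already reaches that of the best candidate found so far.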

Claims

CLAIMS:

1. A method of processing a content signal comprising receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints, a best match, the identifying comprising

- for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and

- determining which of the plurality of matched fingerprints has the lowest error rate.

2. A method according to claim 1, wherein the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal.

3. A method according to claim 2, wherein the predefined algorithm uses a predefined distance d, where d is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are the distance d apart in the content signal.

4. A method according to claim 2, wherein the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are no more than the distance k apart in the content signal.

5. A method according to claim 2, wherein the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are at least the distance j apart in the content signal.

6. A method according to claim 1, and further comprising extracting a fingerprint for each frame from the start frame to the end frame, and calculating a reliability indicator for each extracted fingerprint, and wherein the creation of the set of fingerprints comprises selecting those extracted fingerprints whose reliability indicator is below a predetermined threshold.

7. A method according to any preceding claim, wherein the step of selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, comprises selecting a predefined number of frames.

8. A system for processing a content signal comprising a client device arranged to receive a content signal comprising a series of frames, to select a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, and to extract a fingerprint for each selected frame to create a set of fingerprints, and an identification server arranged to match a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and to identify from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.

9. A system according to claim 8, wherein the client device is arranged, when selecting the plurality of frames, to select frames that are non-successive in the content signal.

10. A system according to claim 9, wherein the predefined algorithm uses a predefined distance d, where d is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are the distance d apart in the content signal.

11. A system according to claim 9, wherein the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are no more than the distance k apart in the content signal.

12. A system according to claim 9, wherein the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are at least the distance j apart in the content signal.

13. A system according to claim 8, wherein the client device is further arranged to extract a fingerprint for each frame from the start frame to the end frame, and to calculate a reliability indicator for each extracted fingerprint, and, when creating the set of fingerprints, to select those extracted fingerprints whose reliability indicator is below a predetermined threshold.

14. A system according to any one of claims 8 to 13, wherein the client device is arranged, when selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, to select a predefined number of frames.