WO2009074773A1 - Processing a content signal - Google Patents

Info

Publication number
WO2009074773A1
Authority
WO
WIPO (PCT)
Prior art keywords
frames
fingerprints
content signal
selecting
frame
Prior art date
Application number
PCT/GB2008/003999
Other languages
French (fr)
Inventor
Serverius Petrus Paulus Pronk
Johannes Henricus Maria Korst
Original Assignee
Ambx Uk Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ambx Uk Limited
Priority to GB1009206.2A, patent GB2467273B
Publication of WO2009074773A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/68 Systems specially adapted for using specific information, e.g. geographical or meteorological information
    • H04H60/73 Systems specially adapted for using specific information, e.g. geographical or meteorological information using meta-information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H20/00 Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/28 Arrangements for simultaneous broadcast of plural pieces of information
    • H04H20/30 Arrangements for simultaneous broadcast of plural pieces of information by a single channel
    • H04H20/31 Arrangements for simultaneous broadcast of plural pieces of information by a single channel using in-band signals, e.g. subsonic or cue signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/29 Arrangements for monitoring broadcast services or broadcast-related services
    • H04H60/31 Arrangements for monitoring the use made of the broadcast services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/35 Arrangements for identifying or recognising characteristics with a direct linkage to broadcast information or to broadcast space-time, e.g. for identifying broadcast stations or for identifying users
    • H04H60/37 Arrangements for identifying or recognising characteristics with a direct linkage to broadcast information or to broadcast space-time, e.g. for identifying broadcast stations or for identifying users for identifying segments of broadcast information, e.g. scenes or extracting programme ID
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/56 Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54

Definitions

  • This invention relates to a method and system for processing a content signal.
  • the method provides for the identification of video streams based on fingerprints.
  • This device is only a first step towards creating a richer user experience when enjoying conventional video content. Additional effects such as rumbling, wind, temperature changes in the room, odours, sophisticated light effects, can be played in concert with the video stream, which can be modelled as a succession of video frames. These effects can, for instance, be realized by synchronously executing a script containing descriptions of these effects and this is used to control the appropriate hardware peripherals based on these descriptions. Typically, an effect can be triggered at a certain frame in the video and be instantaneous, last for a specified number of frames, fade out, et cetera. Such scripts can be hand-crafted and be very rich in terms of the bouquet of effects they can describe.
  • Hashes are short summaries or signatures of data files which can be used to identify the file. Hashing multimedia content (audio, video, images) is difficult because the hash of original content and processed (e.g. compressed) content may differ significantly.
  • the method of this document generates robust hashes for multimedia content, for example, audio clips.
  • the audio clip is divided into successive (preferably overlapping) frames. For each frame, the frequency spectrum is divided into bands. A robust property of each band (e.g. energy) is computed and represented by a respective hash bit.
  • An audio clip is thus represented by a concatenation of binary hash words, one for each frame.
  • a block of hash words derived therefrom is matched by a computer with a large database.
  • Such matching strategies are also disclosed.
  • the extraction process also provides information as to which of the hash bits are the least reliable. Flipping these bits considerably improves the performance of the matching process.
  • a fingerprint is a small digest of a single frame, based on average luminance values, and is typically 32 bits long and includes, for each bit, whether it is reliable or not.
  • a number of fingerprints from successive frames of the given video stream are sent by a client device such as a set-top box, to an identification server, which searches a large database of complete sequences of fingerprints from a large collection of video streams (for example, movies) to identify the video stream and indicate the position within this stream where the fingerprints came from. The latter comes virtually for free when the identification has been done.
  • a typical value for the number of successive fingerprints required for reliable identification is 250, corresponding to ten seconds of video material. This means that there is a large amount of processing required on the identification server and that there is a delay of at least ten seconds before identification is performed.
  • a method of processing a content signal comprising receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.
  • a system for processing a content signal comprising a client device arranged to receive a content signal comprising a series of frames, to select a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, and to extract a fingerprint for each selected frame to create a set of fingerprints, and an identification server arranged to match a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and to identify from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.
  • Owing to the invention, it is possible to diminish the amount of processing required by not using a complete block of successive fingerprints, but instead only a subset of selected fingerprints from this block, based on a number of predetermined criteria, to calculate the quality of a match.
  • the amount of processing required is reduced, and the amount of traffic that must travel between the client device and the identification server is also reduced.
  • the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal.
  • the use of frames that are spaced apart from each other to create the fingerprints produces a set of fingerprints that will still be able to be used to identify the content signal, but have reduced redundancy compared to a complete set of fingerprints for a given set of frames.
  • the predefined algorithm uses a predefined distance d, where d is an integer greater than 1, and the step of selecting the plurality of frames comprises selecting frames that are the distance d apart in the content signal.
  • d could be set to be equal to 5, and so, if the start frame is frame 0 and the end frame is frame 250, then every 5th frame will be selected, and have its fingerprint extracted.
  • the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are no more than the distance k apart in the content signal.
  • a pseudo-random selection of frames can be made, with a limit though on the maximum spacing between the frames that are selected. For example k could be set to be equal to 8, which means that when the frames are selected for fingerprint extraction, a pair of frames can be no more than 8 frames apart. This supports a selection of frames that is not as rigid as the "every dth frame" of the previous paragraph.
  • a further process by which the selection of fingerprints for matching could be made further comprises extracting a fingerprint for each frame from the start frame to the end frame, and calculating a reliability indicator for each extracted fingerprint, and wherein the creation of the set of fingerprints comprises selecting those extracted fingerprints whose reliability indicator is below a predetermined threshold.
  • a larger number of fingerprints are calculated (across the whole of a chunk of the content signal), but not all of these fingerprints are actually used in the matching process.
  • the selection of which fingerprints to use is made on a quality basis, with only those fingerprints that have their reliability indicator below a predetermined threshold being used in the matching part of the operation. This methodology sacrifices some time, but improves the quality of the matching.
  • the step of selecting a plurality of frames in the content signal according to a predefined algorithm comprises selecting a predefined number of frames.
  • Once the start and end frames are selected, the selection of the frames between these two frames is limited to a predetermined fixed number, which is set according to the number needed to perform the necessary matching.
  • Fig. 1 is a schematic diagram of a system for processing a content signal.
  • Fig. 2 is a schematic diagram of the content signal.
  • Fig. 3 is a schematic diagram of an identification server.
  • Fig. 4 is a flowchart of a method of processing the content signal.
  • Fig. 5 is a further flowchart of a method of processing the content signal.
  • Fig. 6 is a view similar to Fig. 2, of the content signal.
  • Fig. 1 shows a system that comprises a client device 10, an identification server 12, a script server 14 and rendering devices 16.
  • the client device 10 and the rendering devices 16 are located in the same environment, such as a room in a user's house.
  • the identification server 12 and the script server 14 are located remotely from the client device 10 and rendering devices 16, and are connected to the client device 10 via a wide area network such as the Internet.
  • the two servers 12 and 14 are not necessarily at the same physical location; it is sufficient that they be available to the client device 10, as and when they are needed.
  • the client device 10 is a standalone device which has processing, storage and communication functions.
  • the device 10 has a broadband Internet connection, to communicate with the servers 12 and 14, and has a wireless transceiver to communicate with the rendering devices 16.
  • the rendering devices 16 comprise lights, fans, heating devices, and rumble pads etc., all of which are used to provide elements of an ambient environment.
  • the client device 10 receives a content signal 18.
  • This content signal could be an audio signal, a video signal, or a combination of the two, for example.
  • a simple configuration of the system of Fig. 1 is for the client device 10 to be connected to a video out socket of a device such as a DVD player. In this case, the client device 10 receives the same content signal 18 that is currently being displayed to the user via a conventional display device (not shown).
  • the client device 10 extracts fingerprints 20 from the content signal 18 (described in detail below) and transmits these to the identification server 12.
  • the identification server 12 uses the fingerprints 20 to identify the content signal 18 (again described below in more detail) and transmits back an ID reference 22, which may also include a position reference stating where in the content signal the fingerprints are located. This is because it is obviously useful to know, in addition to the identity of the content signal 18, where in the content signal 18 the user is currently located (two minutes from the start, or ten minutes from the start etc.).
  • the client device 10 then makes a request 24 to the script server 14, using the ID 22.
  • the script server 14, in reply, supplies a script 26 which can be used by the client device 10 to control the various rendering devices 16.
  • the script 26 will contain details of effects to render in concert with the content signal 18, to augment the user's experience of the content that they are currently watching.
  • the content signal 18 is supplied to a display device such as a television, and the script is supplied to the augmentation devices.
  • Fig. 1 depicts the basic building blocks of the system.
  • this client device 10 extracts a number of fingerprints 20 according to the invention, whereupon it sends them to the identification server 12.
  • This identification server 12 will return a content ID (with, optionally, a position) if it can locate the provided fingerprints 20 with sufficient accuracy. Otherwise it returns null.
  • If the client device 10 does obtain an ID and, optionally, a position, it sends a request to the script server 14, which returns, if available, a script 26 corresponding to the identified content.
  • the client device 10 next feeds the script 26 as well as the content signal 18 to a number of rendering devices 16.
  • Although the content 18 considered is video, the approach also applies to other content, such as audio. In this embodiment, the script 26 is shown as received by the client device 10 and passed directly to the rendering devices 16. In other embodiments, the client device 10 will process the script 26, and feed appropriate commands to the rendering devices 16.
  • Fig. 2 shows the content signal 18 in more detail, as it is handled by the client device 10.
  • the client device 10 is arranged to receive the content signal 18, which comprises a series of frames 28, and to select a plurality of frames 28 in the content signal 18 according to a predefined algorithm.
  • the selected frames 28 comprise a start frame, an end frame, and only a portion of the frames 28 between the start frame and the end frame.
  • the selecting of the plurality of frames 28 comprises selecting frames 28 that are non-successive in the content signal 18.
  • the predefined algorithm uses a predefined distance d, where d is an integer > 1, equal to 3 in Fig. 2, and the selecting of the plurality of frames 28 comprises selecting frames 28 that are the distance d apart in the content signal 18.
  • the client device 10 is further arranged to extract a fingerprint 20 for each selected frame 28 to create a set 30 of fingerprints 20.
  • a number m of successive frames, numbered i, i + 1, ... , i + m - 1 and abbreviated as [i..i + m), of a piece of an unidentified video stream is used to compute a corresponding number of fingerprints f[0..m), one for each of the frames, each fingerprint being characterized by a bit string of typically 32 bits long, each bit also containing an indication of whether it is reliable or not.
  • this information is next used to identify the video stream in the following way. It is assumed that a server stores the fingerprints of a large number of movies to serve as ground truth, and all of the bits in these fingerprints are reliable.
  • one of the computed fingerprints is used to identify all movies and positions within these movies where the computed fingerprint matches the stored fingerprints at all reliable-bit positions.
  • the computed fingerprints are used to compute a bit-error rate (BER) between these fingerprints and the stored fingerprints at positions [i - j..i + m - j) in v, again considering only the reliable bits.
  • the movie-position pair with the lowest BER, if sufficiently low, is returned. If there is no such pair, then other computed fingerprints are used successively to repeat the same procedure until a movie-position pair with a sufficiently low BER is found and this pair is returned. If no movie-position pair yields a sufficiently low BER, then none is returned.
  • finding all movie-position pairs as described above can be done iteratively by successively matching all 2^n fingerprints that result from flipping all n unreliable bits independently, whereby matching is performed on all bits, reliable or not.
  • the movie-position pairs found are used to compute a BER as described above.
  • the order in which the fingerprints are chosen to find movie-position pairs can be optimized by selecting the fingerprints in order of increasing number of unreliable bits.
  • the system of Figs. 1 and 2 is characterized in that not all fingerprints 20 are used to perform the matching, but only a selected number.
  • the matching occurs at the identification server 12, shown in more detail in Fig. 3.
  • the server 12 comprises a network interface 32, a processor 34 and a database 36.
  • the identification server 12 is arranged to match a fingerprint 20 from the set 30 of fingerprints 20 to a plurality of fingerprints stored in the database 36, and to identify from the plurality of matched fingerprints, a best match.
  • the identifying of the best match comprises, for further fingerprints 20 in the set 30 of fingerprints 20, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints 20 has the lowest error rate.
  • fingerprints 20 at a fixed distance larger than 1 from each other and relative to the initially used fingerprint are used.
  • fingerprint f(j) has been used to find a number of movie-position pairs
  • only fingerprints ..., f(j - 2k), f(j - k), f(j), f(j + k), ... within the block of m successive fingerprints 20 are used in the matching process. Since the client device 10 decides which fingerprints 20 are used to find a number of movie-position pairs, only the fingerprints 20 actually used need to be transferred to the server 12, which reduces communication effort as well as computational effort. Although this subsampling may adversely affect the quality of the result returned by the server, a slight increase in the size m of the block of fingerprints 20 will offset this loss of quality.
  • Fig. 4 summarises the method of processing the content signal 18, which is carried out by the client device 10 and the identification server 12. The flowchart of Fig. 4 sketches the overall algorithm to identify the content signal 18 and, optionally, the position within this content signal 18 to which the extracted fingerprints 20 relate. As the content signal 18 arrives at the client device 10, a number of fingerprints 20 are extracted, yielding an array f[0..k) of fingerprints 20 and (in this embodiment) a distance d between the fingerprints 20. This information is next sent to the identification server, which executes the following commands.
  • an index j in [0..k) is chosen to find all content-position pairs m[0..n), p[0..n) where this fingerprint 20 is found.
  • at most one index i is returned that yields a content-position pair m[i], p[i] with the lowest bit-error rate (BER).
  • a bit-error rate (BER) is calculated for all content-position pairs m[0..n), p[0..n). One pair with the lowest BER is selected. If this BER is sufficiently low, then the corresponding index is returned, otherwise it is returned that no index has been found.
  • the identification server 12 identifies, from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing the BER, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest BER. The index corresponding to a movie-position pair with the lowest BER is returned, provided that this BER is sufficiently low. Otherwise it is returned that no index has been found.
  • the client device 10 will transmit these two selected fingerprints 20 to the identification server 12.
  • This server 12 will take one of these fingerprints 20 (for example the earliest in the original signal 18), which constitutes fingerprint j, and perform a table look-up to immediately identify comparable fingerprints stored in the database 36. Every piece of content within the database 36 will have a fingerprint 20 calculated for every frame 28. The fingerprint j is compared with all of these stored fingerprints.
  • the identification server 12 must then determine which of these three is the best match; the step "find best index i". To execute the best matching process, the server 12 takes the second fingerprint (of the two originally received) which is j+5, and compares that fingerprint with the corresponding fingerprints five frames forward in each of the stored content that matched with fingerprint j. The error rate of the j+5 matching is computed (the BER) and the index i with the lowest BER is the best match. This identifies the content signal 18 that the user is currently viewing, and also the position within that content.
  • the frames 28 that are used to generate the fingerprints 20 are spaced apart by the fixed distance d.
  • Such sequence of fingerprints 20 may be obtained, for instance, by only using those fingerprints 20 that have at most a specified number of unreliable bits and are not from successive frames 28. This is illustrated in Fig. 6.
  • the predefined algorithm uses a predefined constant j, where j is an integer > 1
  • the step of selecting the plurality of frames comprises selecting frames that are at least the distance j apart in the content signal.
  • the use of a sufficiently large value for j ensures that the frames 28 that are selected are likely to be sufficiently different from one another that a good range of fingerprints 20 are returned, assisting the matching process.
  • the client device 10 is arranged to extract a fingerprint 20 for each frame 28 from the start frame to the end frame, and to calculate a reliability indicator 38 for each extracted fingerprint 20.
  • the creation of the set 30 of fingerprints 20 comprises selecting those extracted fingerprints 20 whose reliability indicator 38 is below a predetermined threshold.
  • the indicator 38 can be a measure of the number of unreliable bits in the respective fingerprint 20.
  • the threshold can be determined dynamically, to ensure that a sufficient number of fingerprints 20 are selected to be transferred to the identification server 12.
  • the indicator 38 can also be used on the server side.
  • the fingerprints 20 can be sorted in order of increasing number of unreliable bits and only a first fraction of these ordered fingerprints 20 are used to calculate a BER, when searching for the best match. Also this will result in the transfer of a number of fingerprints at different distances in the movie to the server.
  • the lowest BER computed thus far is maintained as more fingerprints are used to generate movie-position pairs, and this lowest BER is used to stop the BER calculation for a movie-position pair as soon as it can be concluded that the final BER for this pair cannot be lower than the lowest BER found thus far.
  • the abovementioned embodiments can be combined in various ways, as can easily be seen by a person of moderate skill in the art.
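
The reliability-based selection described in the bullets above (an indicator counting unreliable bits, a threshold that can be set dynamically, and frames that are not successive) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `unreliable_count` and `select_reliable`, the use of a 32-bit mask with 1 meaning "reliable", the "at most threshold" comparison, and the greedy earliest-frame choice are all assumptions made for the example.

```python
def unreliable_count(reliable_mask: int, bits: int = 32) -> int:
    """Reliability indicator: the number of unreliable bits in one fingerprint.
    Assumes a mask in which a 1 bit marks a reliable fingerprint bit."""
    return bits - bin(reliable_mask & ((1 << bits) - 1)).count("1")


def select_reliable(masks, min_count, min_gap=2, bits=32):
    """masks[i] is the reliability mask of frame i.

    Raises the threshold step by step until at least min_count frames
    qualify, keeping selected frames at least min_gap frames apart.
    Returns (threshold_used, selected_frame_numbers).
    """
    indicators = [unreliable_count(m, bits) for m in masks]
    chosen = []
    for threshold in range(bits + 1):
        chosen = []
        for frame, indicator in enumerate(indicators):
            far_enough = not chosen or frame - chosen[-1] >= min_gap
            if indicator <= threshold and far_enough:
                chosen.append(frame)  # greedy: take the earliest qualifying frame
        if len(chosen) >= min_count:
            return threshold, chosen
    return bits, chosen  # not enough frames even with the loosest threshold


# Hypothetical reliability masks for eight successive frames.
masks = [0xFFFFFFFF, 0xFFFFFFFE, 0xFFFF0000, 0xFFFFFFFF,
         0xFFFFFFFC, 0xFFFFFFFF, 0x0000FFFF, 0xFFFFFFFE]
three = select_reliable(masks, min_count=3)
four = select_reliable(masks, min_count=4)
```

Asking for one more fingerprint forces the threshold up by one unreliable bit, which mirrors the trade-off between the number of fingerprints transferred and their quality.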

Abstract

A method of processing a content signal comprises receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints, a best match, the identifying comprising for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate. Advantageously, the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal.

Description

Processing a content signal
FIELD OF THE INVENTION
This invention relates to a method and system for processing a content signal. The method provides for the identification of video streams based on fingerprints.
BACKGROUND OF THE INVENTION
Traditional consumer electronics devices such as televisions are being augmented, either directly with new features, or indirectly with additional devices that provide additional content to the local environments. One example of such an augmented television is the Ambilight television, from Philips, which was introduced into the market in 2004. This television has additional lights that provide background lighting to augment the user's experience of watching the television. The colours that the lights show can be derived from the television signal.
This device is only a first step towards creating a richer user experience when enjoying conventional video content. Additional effects such as rumbling, wind, temperature changes in the room, odours, sophisticated light effects, can be played in concert with the video stream, which can be modelled as a succession of video frames. These effects can, for instance, be realized by synchronously executing a script containing descriptions of these effects and this is used to control the appropriate hardware peripherals based on these descriptions. Typically, an effect can be triggered at a certain frame in the video and be instantaneous, last for a specified number of frames, fade out, et cetera. Such scripts can be hand-crafted and be very rich in terms of the bouquet of effects they can describe.
One of the problems that occur in this context is that the system that is providing the augmentation in the user's environment is quite often given an unknown video stream, at an unknown position, i.e. frame, within this stream, and the system needs to identify the video stream as well as know the exact position within the video stream, in order to be able to synchronise the augmentation effects correctly with the video content. For example, if a user turns to a new television channel to watch a film that is being shown, there has to be a method of identifying the film and the specific point within the film that has been reached. Currently, fingerprinting technology is being used to solve this problem. For example, International Patent Application Publication WO 2002/065782 discloses a method of generating and matching hashes of multimedia content. Hashes are short summaries or signatures of data files which can be used to identify the file. Hashing multimedia content (audio, video, images) is difficult because the hash of original content and processed (e.g. compressed) content may differ significantly. The method of this document generates robust hashes for multimedia content, for example, audio clips. The audio clip is divided into successive (preferably overlapping) frames. For each frame, the frequency spectrum is divided into bands. A robust property of each band (e.g. energy) is computed and represented by a respective hash bit. An audio clip is thus represented by a concatenation of binary hash words, one for each frame. To identify a possibly compressed audio signal, a block of hash words derived therefrom is matched by a computer with a large database. Such matching strategies are also disclosed. In an advantageous embodiment, the extraction process also provides information as to which of the hash bits are the least reliable. Flipping these bits considerably improves the performance of the matching process.
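The bit-flipping idea in the last two sentences can be illustrated with a minimal Python sketch. It assumes that a hash word is held as an integer together with a list of the bit positions judged unreliable; the function name `candidate_hashes` and this representation are illustrative assumptions and are not taken from WO 2002/065782.

```python
from itertools import product


def candidate_hashes(hash_word: int, unreliable_bits: list[int]) -> list[int]:
    """Return every variant of hash_word obtained by flipping any subset of
    the unreliable bit positions (2**n variants for n unreliable bits)."""
    candidates = []
    for flips in product((0, 1), repeat=len(unreliable_bits)):
        variant = hash_word
        for bit, flip in zip(unreliable_bits, flips):
            if flip:
                variant ^= 1 << bit  # toggle this unreliable bit
        candidates.append(variant)
    return candidates


# Hypothetical 8-bit hash word whose bits 0 and 3 are flagged as unreliable.
variants = candidate_hashes(0b10110010, [0, 3])
```

Each variant can then be looked up exactly in the database, so the cost grows as 2^n with the number n of unreliable bits; this is why hash words with few unreliable bits make the cheapest starting points for a search.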
This technique can also be used for video frames. In this case, a fingerprint is a small digest of a single frame, based on average luminance values, and is typically 32 bits long and includes, for each bit, whether it is reliable or not. A number of fingerprints from successive frames of the given video stream are sent by a client device such as a set-top box, to an identification server, which searches a large database of complete sequences of fingerprints from a large collection of video streams (for example, movies) to identify the video stream and indicate the position within this stream where the fingerprints came from. The latter comes virtually for free when the identification has been done. A typical value for the number of successive fingerprints required for reliable identification is 250, corresponding to ten seconds of video material. This means that there is a large amount of processing required on the identification server and that there is a delay of at least ten seconds before identification is performed.
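The figures in this paragraph imply a frame rate of 25 frames per second (250 frames in ten seconds). A rough estimate of the data sent per identification request follows, assuming, purely for illustration, that each fingerprint travels as 32 value bits plus one reliability flag per bit; the text does not specify a wire format.

```python
FINGERPRINT_BITS = 32        # from the text: a fingerprint is typically 32 bits long
RELIABILITY_MASK_BITS = 32   # assumption: one reliability flag per fingerprint bit
BLOCK_SIZE = 250             # successive fingerprints needed for reliable identification
BLOCK_SECONDS = 10           # duration the text gives for that block

frames_per_second = BLOCK_SIZE / BLOCK_SECONDS
payload_bytes = BLOCK_SIZE * (FINGERPRINT_BITS + RELIABILITY_MASK_BITS) // 8

print(f"implied frame rate: {frames_per_second:.0f} fps")
print(f"payload per identification request: {payload_bytes} bytes")
```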
SUMMARY OF THE INVENTION
It is therefore an object of the invention to improve upon the known art.
According to a first aspect of the present invention, there is provided a method of processing a content signal comprising receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.
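The matching and best-match steps of this first aspect can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the claimed method itself: the database layout (a dictionary of stream titles to lists of 32-bit fingerprints), the function names, the exact lookup of the first fingerprint without any flipping of unreliable bits, and the `max_ber` threshold of 0.1 are all hypothetical choices made for the example.

```python
from collections import defaultdict

MASK32 = 0xFFFFFFFF  # reliability mask with all 32 bits marked reliable


def build_index(database):
    """Map each stored fingerprint value to the (title, position) pairs where it occurs."""
    index = defaultdict(list)
    for title, prints in database.items():
        for pos, value in enumerate(prints):
            index[value].append((title, pos))
    return index


def bit_error_rate(further, stored, seed_offset, pos):
    """Error rate over reliable bits, with the seed fingerprint aligned at stored[pos]."""
    errors = compared = 0
    for offset, value, reliable in further:
        p = pos + (offset - seed_offset)
        if not 0 <= p < len(stored):
            return 1.0  # the query would run off this stream: treat as a non-match
        errors += bin((value ^ stored[p]) & reliable).count("1")
        compared += bin(reliable).count("1")
    return errors / compared if compared else 1.0


def identify(query, database, index, max_ber=0.1):
    """query: list of (frame_offset, fingerprint, reliable_mask) tuples.
    Returns (title, position of the first query fingerprint) or None."""
    seed_offset, seed_value, _ = query[0]
    best = None
    for title, pos in index.get(seed_value, []):
        ber = bit_error_rate(query[1:], database[title], seed_offset, pos)
        if best is None or ber < best[0]:
            best = (ber, title, pos)
    if best is None or best[0] > max_ber:
        return None
    return best[1], best[2]


# Two hypothetical streams that share one fingerprint value at position 1.
database = {
    "stream_a": [0x11111111, 0xAAAA0000, 0x22222222, 0x33333333, 0x44444444],
    "stream_b": [0x55555555, 0xAAAA0000, 0x66666666, 0x77777777, 0x88888888],
}
index = build_index(database)
# Fingerprints of two selected frames, two frames apart; the second has one bit in error.
query = [(0, 0xAAAA0000, MASK32), (2, 0x33333332, MASK32)]
result = identify(query, database, index)
```

Both streams match the first fingerprint, so the further fingerprint decides: it differs from `stream_a` in one bit out of 32 and from `stream_b` in nine, and `stream_a` at position 1 is returned.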
According to a second aspect of the present invention, there is provided a system for processing a content signal comprising a client device arranged to receive a content signal comprising a series of frames, to select a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, and to extract a fingerprint for each selected frame to create a set of fingerprints, and an identification server arranged to match a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and to identify from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.
Owing to the invention, it is possible to diminish the amount of processing required by not using a complete block of successive fingerprints, but instead only a subset of selected fingerprints from this block, based on a number of predetermined criteria, to calculate the quality of a match. The amount of processing required is reduced, and the amount of traffic that must travel between the client device and the identification server is also reduced.
Advantageously, the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal. The use of frames that are spaced apart from each other to create the fingerprints produces a set of fingerprints that will still be able to be used to identify the content signal, but have reduced redundancy compared to a complete set of fingerprints for a given set of frames.
Preferably, the predefined algorithm uses a predefined distance d, where d is an integer greater than 1, and the step of selecting the plurality of frames comprises selecting frames that are the distance d apart in the content signal. The use of a constant d provides a simple method of determining which frames to use to produce the fingerprints. For example, d could be set to be equal to 5, and so, if the start frame is frame 0 and the end frame is frame 250, then every 5th frame will be selected, and have its fingerprint extracted.
In an alternative embodiment, the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are no more than the distance k apart in the content signal. In this case a pseudo-random selection of frames can be made, with a limit though on the maximum spacing between the frames that are selected. For example k could be set to be equal to 8, which means that when the frames are selected for fingerprint extraction, a pair of frames can be no more than 8 frames apart. This supports a selection of frames that is not as rigid as the "every dth frame" of the previous paragraph.
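The two selection rules described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names are hypothetical, and two details are assumptions made here, namely that the end frame is always appended when it does not fall on a multiple of d, and that the pseudo-random step is drawn from 2 to k so that selected frames are normally non-successive.

```python
import random


def select_fixed_distance(start, end, d):
    """Return frame indices from start to end (inclusive), d frames apart."""
    if d <= 1:
        raise ValueError("d must be an integer greater than 1")
    frames = list(range(start, end + 1, d))
    # Assumption: the end frame is always part of the selection, even when
    # (end - start) is not a multiple of d.
    if frames[-1] != end:
        frames.append(end)
    return frames


def select_max_gap(start, end, k, rng=None):
    """Pseudo-randomly pick frames so that neighbours are at most k apart."""
    if k <= 1:
        raise ValueError("k must be an integer greater than 1")
    if rng is None:
        rng = random.Random()
    frames = [start]
    while frames[-1] < end:
        # Assumption: steps of 2..k; the final step is clipped to the end frame.
        step = rng.randint(2, k)
        frames.append(min(frames[-1] + step, end))
    return frames
```

With d = 5, a start frame of 0 and an end frame of 250, the first function selects frames 0, 5, 10, and so on up to 250.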
A further process by which the selection of fingerprints for matching could be made further comprises extracting a fingerprint for each frame from the start frame to the end frame, and calculating a reliability indicator for each extracted fingerprint, and wherein the creation of the set of fingerprints comprises selecting those extracted fingerprints whose reliability indicator is below a predetermined threshold. In this case, a larger number of fingerprints are calculated (across the whole of a chunk of the content signal), but not all of these fingerprints are actually used in the matching process. The selection of which fingerprints to use is made on a quality basis, with only those fingerprints that have their reliability indicator below a predetermined threshold being used in the matching part of the operation. This methodology sacrifices some time, but improves the quality of the matching. Advantageously, the step of selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, comprises selecting a predefined number of frames. However the start and end frames are selected, the selection of the frames between these two frames is limited to a predetermined fixed number, which is set according to the number needed to perform the necessary matching.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:-
Fig. 1 is a schematic diagram of a system for processing a content signal,
Fig. 2 is a schematic diagram of the content signal,
Fig. 3 is a schematic diagram of an identification server,
Fig. 4 is a flowchart of a method of processing the content signal,
Fig. 5 is a further flowchart of a method of processing the content signal, and
Fig. 6 is a view similar to Fig. 2, of the content signal.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Fig. 1 shows a system that comprises a client device 10, an identification server 12, a script server 14 and rendering devices 16. The client device 10 and the rendering devices 16 are located in the same environment, such as a room in a user's house. The identification server 12 and the script server 14 are located remotely from the client device 10 and rendering devices 16, and are connected to the client device 10 via a wide area network such as the Internet. The two servers 12 and 14 are not necessarily at the same physical location; it is sufficient that they be available to the client device 10, as and when they are needed.
The client device 10 is a standalone device which has processing, storage and communication functions. The device 10 has a broadband Internet connection, to communicate with the servers 12 and 14, and has a wireless transceiver to communicate with the rendering devices 16. The rendering devices 16 comprise lights, fans, heating devices, and rumble pads etc., all of which are used to provide elements of an ambient environment. The client device 10 receives a content signal 18. This content signal could be an audio signal, a video signal, or a combination of the two, for example. A simple configuration of the system of Fig. 1 is for the client device 10 to be connected to a video out socket of a device such as a DVD player. In this case, the client device 10 receives the same content signal 18 that is currently being displayed to the user via a conventional display device (not shown).
The client device 10 extracts fingerprints 20 from the content signal 18 (described in detail below) and transmits these to the identification server 12. The identification server 12 uses the fingerprints 20 to identify the content signal 18 (again described below in more detail) and transmits back an ID reference 22, which may also include a position reference stating where in the content signal the fingerprints are located. This is because it is obviously useful to know, in addition to the identity of the content signal 18, where in the content signal 18 the user is currently located (two minutes from the start, or ten minutes from the start etc.).
The client device 10 then makes a request 24 to the script server 14, using the ID 22. The script server 14, in reply, supplies a script 26 which can be used by the client device 10 to control the various rendering devices 16. The script 26 will contain details of effects to render in concert with the content signal 18, to augment the user's experience of the content that they are currently watching. The content signal 18 is supplied to a display device such as a television, and the script is supplied to the augmentation devices.
In summary, the architecture shown in Fig. 1 depicts the basic building blocks of the system. As the video signal 18 arrives at the client device 10, this client device 10 extracts a number of fingerprints 20 according to the invention, whereupon it sends them to the identification server 12. This identification server 12 will return a content ID (with, optionally, a position) if it can locate the provided fingerprints 20 with sufficient accuracy. Otherwise it returns null. If the client device 10 does obtain an ID and, optionally, a position, it sends a request to the script server 14, which returns, if available, a script 26 corresponding to the identified content. The client device 10 next feeds the script 26 as well as the content signal 18 to a number of rendering devices 16. Although in this system the content 18 considered is video, it also applies to other content, such as audio. In this embodiment, the script 26 is shown as received by the client device 10 and passed directly to the rendering devices 16. In other embodiments, the client device 10 will process the script 26, and feed appropriate commands to the rendering devices 16.
Fig. 2 shows the content signal 18 in more detail, as it is handled by the client device 10. The client device 10 is arranged to receive the content signal 18, which comprises a series of frames 28, and to select a plurality of frames 28 in the content signal 18 according to a predefined algorithm. The selected frames 28 comprise a start frame, an end frame, and only a portion of the frames 28 between the start frame and the end frame. In this embodiment, the selecting of the plurality of frames 28 comprises selecting frames 28 that are non-successive in the content signal 18. To achieve this, the predefined algorithm uses a predefined distance d, where d is an integer > 1, equal to 3 in Fig. 2, and the selecting of the plurality of frames 28 comprises selecting frames 28 that are the distance d apart in the content signal 18. Once the frames 28 have been selected, the client device 10 is further arranged to extract a fingerprint 20 for each selected frame 28 to create a set 30 of fingerprints 20.
In the above mentioned prior art, a number m of successive frames, numbered i, i + 1, ... , i + m - 1 and abbreviated as [i..i + m), of a piece of an unidentified video stream is used to compute a corresponding number of fingerprints f[0..m), one for each of the frames, each fingerprint being characterized by a bit string of typically 32 bits long, each bit also containing an indication of whether it is reliable or not. In one embodiment, this information is next used to identify the video stream in the following way. It is assumed that a server stores the fingerprints of a large number of movies to serve as ground truth, and all of the bits in these fingerprints are reliable.
Initially, one of the computed fingerprints, for instance f(j), is used to identify all movies and positions within these movies where the computed fingerprint matches the stored fingerprints at all reliable-bit positions. For each movie-position pair (v, i) found, the computed fingerprints are used to compute a bit-error rate (BER) between these fingerprints and the stored fingerprints at positions [i - j..i + m - j) in v, again considering only the reliable bits. The movie-position pair with the lowest BER, if sufficiently low, is returned. If there is no such pair, then other computed fingerprints are used successively to repeat the same procedure until a movie-position pair with a sufficiently low BER is found and this pair is returned. If no movie-position pair yields a sufficiently low BER, then none is returned.
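The reliable-bit matching and the bit-error rate described above can be sketched as follows. This is a hedged illustration only: the representation of a computed fingerprint as a pair (bits, reliable_mask) of integers, and all function names, are assumptions introduced here and do not come from the described implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def matches_on_reliable_bits(bits, reliable_mask, stored_bits):
    """True if the computed and stored fingerprints agree at every reliable bit."""
    return ((bits ^ stored_bits) & reliable_mask) == 0


def ber_for_candidate(query, j, stored, i):
    """Bit-error rate for a movie-position pair found with fingerprint f(j).

    query  : list of (bits, reliable_mask) pairs, the computed fingerprints f[0..m)
    stored : list of integers, the stored fingerprints of one movie
    i      : position in the movie at which f(j) matched
    The query is compared against stored positions [i - j .. i + m - j),
    counting only the reliable bits. Returns None if that range falls
    outside the stored sequence.
    """
    offset = i - j
    if offset < 0 or offset + len(query) > len(stored):
        return None
    errors = 0
    compared = 0
    for t, (bits, mask) in enumerate(query):
        errors += popcount((bits ^ stored[offset + t]) & mask)
        compared += popcount(mask)
    # Assumption: a query with no reliable bits at all is treated as a worst case.
    return errors / compared if compared else 1.0
```

For instance, two 4-bit query fingerprints that differ from the stored ones in a single reliable bit out of eight give a BER of 0.125.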
In the actual prior art implementation, finding all movie-position pairs as described above can be done iteratively by successively matching all 2^n fingerprints that result from flipping all n unreliable bits independently, whereby matching is performed on all bits, reliable or not. At each of the iterations, the movie-position pairs found are used to compute a BER as described above. The order in which the fingerprints are chosen to find movie-position pairs can be optimized by selecting the fingerprints in order of increasing number of unreliable bits.
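The enumeration of the 2^n candidate fingerprints, and the ordering by number of unreliable bits, can be sketched as below. The (bits, reliable_mask) representation, the `width` parameter and the function names are assumptions made for illustration only.

```python
from itertools import product


def candidate_fingerprints(bits, reliable_mask, width=32):
    """Yield every fingerprint obtained by flipping the unreliable bits.

    With n unreliable bits this yields 2**n candidates, each of which can
    then be looked up with an exact match on all bits.
    """
    unreliable = [p for p in range(width) if not (reliable_mask >> p) & 1]
    for flips in product((0, 1), repeat=len(unreliable)):
        candidate = bits
        for position, flip in zip(unreliable, flips):
            if flip:
                candidate ^= 1 << position
        yield candidate


def order_by_unreliable_bits(query, width=32):
    """Indices of the query fingerprints, fewest unreliable bits first."""
    def unreliable_count(index):
        _, mask = query[index]
        return width - bin(mask & ((1 << width) - 1)).count("1")
    return sorted(range(len(query)), key=unreliable_count)
```

Trying the fingerprints with the fewest unreliable bits first keeps the number of look-ups per fingerprint small, since that number doubles with every unreliable bit.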
The system of Figs. 1 and 2 is characterized in that not all fingerprints 20 are used to perform the matching, but only a selected number. The matching occurs at the identification server 12, shown in more detail in Fig. 3. The server 12 comprises a network interface 32, a processor 34 and a database 36. The identification server 12 is arranged to match a fingerprint 20 from the set 30 of fingerprints 20 to a plurality of fingerprints stored in the database 36, and to identify from the plurality of matched fingerprints, a best match. The identifying of the best match comprises, for further fingerprints 20 in the set 30 of fingerprints 20, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints 20 has the lowest error rate. In one embodiment, illustrated in Fig. 2, only fingerprints 20 at a fixed distance larger than 1 from each other and relative to the initially used fingerprint are used. To be more precise, if fingerprint f(j) has been used to find a number of movie-position pairs, then only fingerprints ..., f(j - 2k), f(j - k), f(j), f(j + k), ... within the block of m successive fingerprints 20 are used in the matching process. Since the client device 10 decides which fingerprints 20 are used to find a number of movie-position pairs, only the fingerprints 20 actually used need to be transferred to the server 12, which reduces the communication effort as well as the computational effort. Although this subsampling may adversely affect the quality of the result returned by the server, a slight increase in the size m of the block of fingerprints 20 will offset this loss of quality.
Fig. 4 summarises the method of processing the content signal 18, which is carried out by the client device 10 and the identification server 12. The flowchart of Fig. 4 sketches the overall algorithm to identify the content signal 18 and, optionally, the position within this content signal 18 to which the extracted fingerprints 20 relate. As the content signal 18 arrives at the client device 10, a number of fingerprints 20 are extracted, yielding an array f[0..k) of fingerprints 20 and (in this embodiment) a distance d between the fingerprints 20. This information is next sent to the identification server, which executes the following commands. First, an index j in [0..k) is chosen to find all content-position pairs m[0..n), p[0..n) where this fingerprint 20 is found. Next, at most one index i is returned that yields a content-position pair m[i], p[i] with the lowest bit-error rate (BER). This m[i] and, optionally, p[i], or null if there is no such pair, is returned to the client device 10.
The step "find best index i" shown in Fig. 4 is shown in more detail in Fig. 5. Referring to this second flowchart, a bit-error rate (BER) is calculated for all content-position pairs m[0..n), p[0..n). One pair with the lowest BER is selected. If this BER is sufficiently low, then the corresponding index is returned, otherwise it is returned that no index has been found. The identification server 12 is identifying from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing the BER, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest BER. The index corresponding to a movie-position pair with the lowest BER is returned, provided that this BER is sufficiently low. Otherwise it is returned that no index has been found.
To illustrate the above matching process, consider the simplest example, in which two fingerprints 20 are calculated from the content signal 18, these two fingerprints 20 being for two frames 28 that are spaced apart by a distance of five frames. Of course in a practical embodiment, there will be perhaps fifty fingerprints spread over 250 frames. In this example though, the client device 10 will transmit these two selected fingerprints 20 to the identification server 12. This server 12 will take one of these fingerprints 20 (for example the earliest in the original signal 18), which constitutes fingerprint j, and perform a table look-up to immediately identify comparable fingerprints stored in the database 36. Every piece of content within the database 36 will have a fingerprint 20 calculated for every frame 28. The fingerprint j is compared with all of these stored fingerprints.
Now suppose that the fingerprint j matches with three different fingerprints stored in the database 36. This multiple matching will occur due to the similarity of fingerprints for frames in different content. The identification server 12 must then determine which of these three is the best match; the step "find best index i". To execute the best matching process, the server 12 takes the second fingerprint (of the two originally received) which is j+5, and compares that fingerprint with the corresponding fingerprints five frames forward in each of the stored content that matched with fingerprint j. The error rate of the j+5 matching is computed (the BER) and the index i with the lowest BER is the best match. This identifies the content signal 18 that the user is currently viewing, and also the position within that content.
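The two-fingerprint example above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the table look-up uses an exact match on all 32 bits (reliability information is ignored), the acceptance threshold `max_ber` and its value are hypothetical stand-ins for "sufficiently low", and the function and variable names are invented for the sketch.

```python
from collections import defaultdict


def build_index(database):
    """Map each stored fingerprint value to its (content id, position) pairs."""
    index = defaultdict(list)
    for content_id, fingerprints in database.items():
        for position, fingerprint in enumerate(fingerprints):
            index[fingerprint].append((content_id, position))
    return index


def find_best(database, index, first_fp, second_fp, d, max_ber=0.35):
    """Return the (content id, position) whose frame d further on best matches.

    first_fp is looked up in the table; second_fp is then compared with the
    stored fingerprint d frames forward of every hit. Returns None when no
    candidate has a BER of at most max_ber (a hypothetical threshold).
    """
    best = None
    for content_id, position in index.get(first_fp, []):
        stored = database[content_id]
        if position + d >= len(stored):
            continue  # no stored frame d positions further on
        ber = bin(second_fp ^ stored[position + d]).count("1") / 32
        if best is None or ber < best[0]:
            best = (ber, content_id, position)
    if best is None or best[0] > max_ber:
        return None
    return best[1], best[2]
```

In a usage such as the example, three pieces of content may all contain the first fingerprint, and the fingerprint five frames later decides between them.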
In the above discussion, it is assumed that the frames 28 that are used to generate the fingerprints 20 are spaced apart by the fixed distance d. In another embodiment, a number of fingerprints 20 at variable distances from each other are used. For example, only fingerprints f(j_0), f(j_1), ..., f(j_{l-1}), with j_i - j_{i-1} > k, for some k > 1 and for all i = 1, 2, ..., l - 1, are used, one of which is used to find a number of movie-position pairs. Such a sequence of fingerprints 20 may be obtained, for instance, by only using those fingerprints 20 that have at most a specified number of unreliable bits and are not from successive frames 28. This is illustrated in Fig. 6.
Further schemes for selecting the frames 28 to use could be based around a random selection of frames 28, using a constant number as a minimum distance between the selected frames 28. In this case, the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are at least the distance j apart in the content signal. The use of a sufficiently large value for j ensures that the frames 28 that are selected are likely to be sufficiently different from one another that a good range of fingerprints 20 are returned, assisting the matching process.
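One possible way to draw a random selection with a minimum spacing j is sketched below. The sampling method (drawing positions from a shortened range and then spreading them out) is a choice made for this illustration and is not stated in the text; the function name and the `count` parameter, which fixes the number of selected frames, are likewise assumptions.

```python
import random


def select_min_gap(start, end, j, count, rng=None):
    """Randomly pick `count` frames in [start, end], neighbours at least j apart.

    The start frame and the end frame are always included.
    """
    if j <= 1 or count < 2:
        raise ValueError("need j > 1 and count >= 2")
    if rng is None:
        rng = random.Random()
    # Shrink the range by the spacing that will be re-inserted afterwards.
    m = (end - start) - (count - 1) * (j - 1) + 1
    if m < count:
        raise ValueError("range too short for this count and minimum gap")
    interior = sorted(rng.sample(range(1, m - 1), count - 2))
    shrunk = [0] + interior + [m - 1]
    # Distinct shrunk positions differ by >= 1, so after adding i * (j - 1)
    # consecutive frames differ by at least j.
    return [start + x + i * (j - 1) for i, x in enumerate(shrunk)]
```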
In the embodiment of Fig. 6, the client device 10 is arranged to extract a fingerprint 20 for each frame 28 from the start frame to the end frame, and to calculate a reliability indicator 38 for each extracted fingerprint 20. In this case, the creation of the set 30 of fingerprints 20 comprises selecting those extracted fingerprints 20 whose reliability indicator 38 is below a predetermined threshold. The indicator 38 can be a measure of the number of unreliable bits in the respective fingerprint 20. The threshold can be determined dynamically, to ensure that a sufficient number of fingerprints 20 are selected to be transferred to the identification server 12.
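A minimal sketch of this reliability-based selection follows, taking the indicator to be the number of unreliable bits. The dynamic rule used here (raise the threshold one step at a time until enough fingerprints qualify) is only one possible reading of "determined dynamically"; the tuple layout (frame index, bits, reliable mask) and the names are assumptions.

```python
def unreliable_count(reliable_mask, width=32):
    """Reliability indicator: the number of unreliable bits in a fingerprint."""
    return width - bin(reliable_mask & ((1 << width) - 1)).count("1")


def select_reliable(fingerprints, min_count, width=32):
    """Select fingerprints whose indicator is below a dynamically chosen threshold.

    fingerprints : list of (frame_index, bits, reliable_mask) tuples
    Returns (threshold, selected); the threshold is the smallest one for which
    at least min_count fingerprints have an indicator below it.
    """
    scores = [unreliable_count(mask, width) for _, _, mask in fingerprints]
    for threshold in range(1, width + 2):
        selected = [fp for fp, score in zip(fingerprints, scores)
                    if score < threshold]
        if len(selected) >= min_count:
            return threshold, selected
    # Fewer than min_count fingerprints exist at all: return every one of them.
    return width + 1, list(fingerprints)
```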
The indicator 38 can also be used on the server side. For example, the fingerprints 20 can be sorted in order of increasing number of unreliable bits and only a first fraction of these ordered fingerprints 20 are used to calculate a BER, when searching for the best match. This will also result in the transfer of a number of fingerprints at different distances in the movie to the server. In yet another embodiment, the lowest BER computed thus far is maintained as more fingerprints are used to generate movie-position pairs and this lowest BER is used to stop matching around a movie-position pair to calculate the BER when it can be concluded that the final BER for this movie-position pair cannot be lower than the lowest BER found thus far. Of course, the abovementioned embodiments can be combined in various ways, as can easily be seen by a person of moderate skill in the art.
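The early-stopping idea can be sketched as follows. The sketch assumes that every candidate is compared over the same set of reliable bits, so that BERs can be compared through their raw error counts; the data layout (a dictionary of stored sequences already aligned with the query) and the names are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def best_candidate(query, candidates):
    """Find the candidate with the lowest BER, abandoning hopeless ones early.

    query      : list of (bits, reliable_mask) pairs
    candidates : dict mapping a label to the stored fingerprints aligned
                 with the query (same length as the query)
    Returns (label, ber), or None if there is nothing to compare.
    """
    total = sum(popcount(mask) for _, mask in query)
    if total == 0 or not candidates:
        return None
    best_label, best_errors = None, None
    for label, stored in candidates.items():
        errors = 0
        for (bits, mask), reference in zip(query, stored):
            errors += popcount((bits ^ reference) & mask)
            if best_errors is not None and errors >= best_errors:
                break  # the final BER can no longer beat the best so far
        else:
            # Loop finished without a break: this candidate is the new best.
            best_label, best_errors = label, errors
    return best_label, best_errors / total
```

Because the error count only grows as more fingerprints are compared, a candidate can be dropped as soon as its running count reaches the best count found so far.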

Claims

CLAIMS:
1. A method of processing a content signal comprising receiving a content signal comprising a series of frames, selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, extracting a fingerprint for each selected frame to create a set of fingerprints, matching a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and identifying from the plurality of matched fingerprints, a best match, the identifying comprising
- for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and
- determining which of the plurality of matched fingerprints has the lowest error rate.
2. A method according to claim 1, wherein the step of selecting the plurality of frames comprises selecting frames that are non-successive in the content signal.
3. A method according to claim 2, wherein the predefined algorithm uses a predefined distance d, where d is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are the distance d apart in the content signal.
4. A method according to claim 2, wherein the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are no more than the distance k apart in the content signal.
5. A method according to claim 2, wherein the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the step of selecting the plurality of frames comprises selecting frames that are at least the distance j apart in the content signal.
6. A method according to claim 1, and further comprising extracting a fingerprint for each frame from the start frame to the end frame, and calculating a reliability indicator for each extracted fingerprint, and wherein the creation of the set of fingerprints comprises selecting those extracted fingerprints whose reliability indicator is below a predetermined threshold.
7. A method according to any preceding claim, wherein the step of selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, comprises selecting a predefined number of frames.
8. A system for processing a content signal comprising a client device arranged to receive a content signal comprising a series of frames, to select a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, and to extract a fingerprint for each selected frame to create a set of fingerprints, and an identification server arranged to match a fingerprint from the set of fingerprints to a plurality of stored fingerprints, and to identify from the plurality of matched fingerprints, a best match, the identifying comprising, for further fingerprints in the set of fingerprints, computing an error rate, against corresponding stored fingerprints, and determining which of the plurality of matched fingerprints has the lowest error rate.
9. A system according to claim 8, wherein the client device is arranged, when selecting the plurality of frames, to select frames that are non-successive in the content signal.
10. A system according to claim 9, wherein the predefined algorithm uses a predefined distance d, where d is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are the distance d apart in the content signal.
11. A system according to claim 9, wherein the predefined algorithm uses a predefined constant k, where k is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are no more than the distance k apart in the content signal.
12. A system according to claim 9, wherein the predefined algorithm uses a predefined constant j, where j is an integer > 1, and the client device is arranged, when selecting the plurality of frames, to select frames that are at least the distance j apart in the content signal.
13. A system according to claim 8, wherein the client device is further arranged to extract a fingerprint for each frame from the start frame to the end frame, and to calculate a reliability indicator for each extracted fingerprint, and, when creating the set of fingerprints, to select those extracted fingerprints whose reliability indicator is below a predetermined threshold.
14. A system according to any one of claims 8 to 13, wherein the client device is arranged, when selecting a plurality of frames in the content signal according to a predefined algorithm, the selected frames comprising a start frame, an end frame, and only a portion of the frames between the start frame and the end frame, to select a predefined number of frames.
PCT/GB2008/003999 2007-12-11 2008-12-04 Processing a content signal WO2009074773A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1009206.2A GB2467273B (en) 2007-12-11 2008-12-04 Processing a content signal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07122901.7 2007-12-11
EP07122901 2007-12-11

Publications (1)

Publication Number Publication Date
WO2009074773A1 true WO2009074773A1 (en) 2009-06-18

Family

ID=40404814

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2008/003999 WO2009074773A1 (en) 2007-12-11 2008-12-04 Processing a content signal

Country Status (2)

Country Link
GB (1) GB2467273B (en)
WO (1) WO2009074773A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160234205A1 (en) * 2015-02-11 2016-08-11 Electronics And Telecommunications Research Institute Method for providing security service for wireless device and apparatus thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4739398A (en) * 1986-05-02 1988-04-19 Control Data Corporation Method, apparatus and system for recognizing broadcast segments
WO2003067466A2 (en) * 2002-02-05 2003-08-14 Koninklijke Philips Electronics N.V. Efficient storage of fingerprints
US20060271947A1 (en) * 2005-05-23 2006-11-30 Lienhart Rainer W Creating fingerprints

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAITSMA J ET AL: "Robust Audio Hashing For Content Identification", PROCEEDINGS INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING CBMI'01, BRESCIA, ITALY, 19 September 2001 (2001-09-19) - 21 September 2001 (2001-09-21), pages 1 - 8, XP002264398 *
OOSTVEEN J ET AL: "FEATURE EXTRACTION AND A DATABASE STRATEGY FOR VIDEO FINGERPRINTING", LECTURE NOTES IN COMPUTER SCIENCE, SPRINGER VERLAG, BERLIN; DE, vol. 2314, 11 March 2002 (2002-03-11), pages 117 - 128, XP009017770, ISSN: 0302-9743 *

Also Published As

Publication number Publication date
GB201009206D0 (en) 2010-07-14
GB2467273A (en) 2010-07-28
GB2467273B (en) 2013-01-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08859609

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 1009206

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20081204

WWE Wipo information: entry into national phase

Ref document number: 1009206.2

Country of ref document: GB

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08859609

Country of ref document: EP

Kind code of ref document: A1