Device for capturing audio/video data and metadata
The invention relates to a device and a method of capturing video data and associated metadata.
The invention relates more particularly to the addition of metadata to audio/video data running in real time.
Content indexing has become a necessity in devices supporting large storage capacities. The emergence of digital products and the integration in such products of high-capacity storage means, such as hard disks or optical disks, has led to data indexing requirements which enable high-speed access to the stored data.
In known systems, video clips were indexed manually by keywords, but the accumulation of digital data has made it necessary to develop robust tools for automatically analysing videos by their content, in other words using automatically extracted attributes.
In other systems, the images are indexed automatically according to their content. The images are analysed with respect to certain attributes that can be low level, such as colour or texture, or of a semantic type, such as the presence of landscapes or people. Such systems are therefore not always suited to the requirements of the users.
Patent Application US 2002/0001395, published on 3 January 2002 and registered in the name of Digimarc Corporation, proposes a device for authenticating metadata and including authentication information in the data. However, such a device is dedicated to indexing devices and does not apply to real-time devices. The added metadata is not specifically representative of actions in the video clip.
The invention relates more specifically to the insertion of metadata concerning an event which takes place live in the associated video data and also relates to the triggering of actions in a post-production studio according to the metadata associated with the video data.
The invention proposes a device for capturing audio/video data
representative of an event, said device comprising means of adding metadata to the captured data, characterized in that the means of adding metadata associate a predefined metadata item with a type of event.
The invention can therefore make it possible to mark the captured audio/video data according to its content. This means that, if necessary, in a subsequent processing step, the captured data can be manipulated more easily to edit, modify or save it for example.
The invention will be better understood and illustrated by means of exemplary and advantageous embodiments, by no means limiting, given with reference to the appended figures in which:
- Figure 1 represents a system comprising a device according to an embodiment of the invention,
- Figure 2 represents an embodiment of an application of the invention.
According to Figure 1, the device 1 is, in the preferred embodiment, a video camera comprising a user interface 3 with three buttons 6, 7 and 8.
The buttons (or keys) 6, 7 and 8 enable the user to add information to the video that is being filmed in real time by the use of audio/video capture means 2 which are the conventional capture means of a standard video camera.
The standard configuration of a video camera is well known to those skilled in the art. The camera comprises an optical system, an image sensor and control means such as a microprocessor, storage means, and various means of communication with the external environment. The microprocessor uses a known operating system such as Windows CE marketed by Microsoft.
The memory can include ROM (read-only memory) or RAM (random access memory) type memories, or memory cards in PCMCIA format, for example.
The camera is distinguished from the conventional cameras known to those skilled in the art by the fact that it also comprises a user interface for adding information directly to the content of the filmed video.
The user interface is represented in Figure 1 by three buttons 6, 7 and 8.
The buttons 6, 7 and 8 are each associated with different events.
When an event that is noteworthy with respect to the event being filmed occurs, the cameraman presses one of the buttons associated with the event.
Each event is then transcribed into an MXF (material exchange format) type stream and transferred, within that stream, with the video data to which the event relates.
The MXF standard is described in the SMPTE standard document 380M.
The representative events are events that can be contained in "shot" type fields:
- the "shot start position" field is associated with the time at which the event occurs,
- the "shot duration" field is not used,
- the "shot track Ids" field is incremented with each noteworthy event,
- the "shot description" field comprises the event type, in other words "goal", "red card", etc.
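The mapping above can be sketched as follows. This is an illustrative sketch only: the field names follow the "shot" structure of the SMPTE 380M descriptive metadata scheme described above, but the dictionary-based encoding and the helper name are assumptions, not a real MXF library.

```python
def make_shot_metadata(event_type, timecode, track_id):
    """Build one descriptive metadata item for a noteworthy event."""
    return {
        "shot start position": timecode,   # time at which the event occurs
        "shot duration": None,             # not used
        "shot track Ids": track_id,        # incremented with each noteworthy event
        "shot description": event_type,    # e.g. "goal", "red card"
    }

# Example: the cameraman presses the "goal" button; the item is created
# with the current timecode and the incremented track identifier.
item = make_shot_metadata("goal", "00:42:10:05", track_id=1)
print(item["shot description"])  # → goal
```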
According to the type of event being filmed, it is also possible to modify the functions associated with these buttons. To this end, the user interface 3 is provided with an additional button, a control knob for example, not shown, for indicating the type of event being captured.
Each camera is thus not dedicated to a single type of event but can be configured according to the event.
In the case of an event corresponding to a football match, the three buttons 6, 7 and 8 correspond, for example, to the following functions:
Button 6: goal.
Button 7: card (red or yellow).
Button 8: fight.
In the case of an event corresponding to a tennis match, the three buttons correspond, for example, to the following functions:
Button 6: set.
Button 7: point.
Button 8: injury.
Each button 6, 7 and 8 can also correspond not to a single function but to several functions. It is possible, for example, to press the buttons for a longer or shorter time, or even to press them a number of times to obtain another function.
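The configurable mapping described above can be sketched as follows. This is a hypothetical sketch: the control knob selects an event profile, and a long press or a repeated press selects a secondary function for the same button. All names and thresholds are illustrative assumptions.

```python
# Event profiles selected by the control knob (assumed representation).
PROFILES = {
    "football": {6: "goal", 7: "card", 8: "fight"},
    "tennis":   {6: "set",  7: "point", 8: "injury"},
}

LONG_PRESS_S = 1.0  # assumed threshold separating short and long presses

def resolve_function(profile, button, press_duration_s, press_count=1):
    """Map a button press to an event label for the selected profile."""
    base = PROFILES[profile][button]
    if press_count > 1:
        return base + " (secondary)"   # repeated press = another function
    if press_duration_s >= LONG_PRESS_S:
        return base + " (extended)"    # long press = another function
    return base

print(resolve_function("football", 6, 0.2))  # → goal
print(resolve_function("tennis", 7, 1.5))    # → point (extended)
```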
The number of buttons is of course given by way of indication and does not constitute a limitation of the invention. It is, of course, possible to have wider capabilities in user interface terms.
In other embodiments, the user interface 3 can also be provided with sound pick-up means, independently of the conventional sound pick-up means of a video camera included in the means 2. Thus, the cameraman can add representative words for the event, such as "goal", "red card", "yellow card", "substitution", in the context of a football match.
It is also possible to associate a set of authorized words with each event, the cameraman then using only those words.
The remote editing device 5 then searches for the words associated with the type of event in the MXF stream received and sets up the video editing in the same way as it does when the information is entered via the buttons.
It is also possible, in other embodiments, to combine a user interface having both buttons and sound pick-up means.
The number of buttons and the layout of the buttons must also be designed according to the ergonomics required for the camera. Too many buttons can lead to operating difficulties on the part of the user. The ergonomics of this interface are not the concern of this patent application.
The information received from the means 3 and the means 2 is then transmitted to means 4 of creating MXF streams.
The functions associated with the buttons or the sound pick-up means are
then converted into MXF data according to the event indicated by the control knob.
The conversion of the functions associated with the buttons into MXF data is achieved through means available in the camera, such as programmable-type circuits wired to the buttons 6, 7 and 8 and/or to the sound pick-up means, or even through processors.
The MXF data is normally used to transmit information linked to the shot parameters.
The added camera parameters are linked to the lens (aperture, depth of field) to which can be added metadata values such as the time code.
The stream creation means 4 create the MXF stream.
Figure 2 represents an embodiment of the device according to the invention, the event corresponding to a football-match-type event.
Two actions are illustrated, one representing a storage action and the other an alarm action.
Cameras 10, 11 and 12 film a football match 9. The cameras 10, 11 and 12 are arranged around a football pitch and designed to display all of the pitch area.
The cameras are connected to parsers 13, 14, 15, 17, 18 and 19. The parsers 13, 14, 15, 17, 18 and 19 receive the video streams from the cameras and analyse them. The video streams are then transmitted to the wall of video screens 16 which displays the videos from the different cameras on a plurality of screens.
The metadata is representative of an important action and representative of a highlight of the event.
The reception device will act according to the metadata.
Several types of action are provided for according to the variety of metadata.
The reception devices are also designed to rank the metadata.
The metadata is classified according to its order of importance for the current event. For example, a metadata item carrying goal information is metadata of the highest importance for a football match. The action of the reception device will therefore in this case be to transmit the information coming from the camera displaying this goal and not from the camera showing any player on the pitch or a fight on the terraces. To this end, the reception device is provided with selection means enabling it to select the video information. Another action that can be considered in the case of a goal is to retransmit a slow-motion replay of the last few seconds of video representative of the action.
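The selection means described above can be sketched as follows. Each received metadata item carries a camera identifier and an event type; the reception device ranks the events by their importance for the current event and selects the camera showing the top-ranked action. The importance table and function names are assumptions for illustration.

```python
# Assumed importance ranking for a football match: a goal outranks
# a red card, which outranks a fight on the terraces.
IMPORTANCE = {"goal": 3, "red card": 2, "fight": 1}

def select_camera(metadata_items):
    """Return the camera id carrying the most important current action."""
    if not metadata_items:
        return None
    best = max(metadata_items, key=lambda m: IMPORTANCE.get(m["event"], 0))
    return best["camera"]

items = [
    {"camera": 11, "event": "fight"},
    {"camera": 10, "event": "goal"},
]
print(select_camera(items))  # → 10, the camera displaying the goal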
There now follows a description of the first application: the conditional alert.
The reception device constantly analyses the metadata that it receives from the various capture devices.
The reception device is programmed to generate an alarm on reception of a particular metadata item and, in this case, in the case of a football match, a goal.
When a goal is detected, the reception device transmits the video images received from the camera having captured the goal. Naturally, when talking of a goal, this can comprise the action preceding the goal, in other words, a situation that may perhaps lead to a goal. In this case, the cameraman presses the corresponding button and the alarm is triggered on the reception device.
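The alert mechanism above can be sketched as follows. The reception device scans each incoming metadata item and raises an alarm when a configured event type, here "goal", appears, switching the output to the camera that captured it. The function and parameter names are illustrative assumptions.

```python
# Assumed set of event types that trigger an alarm on the reception device.
ALARM_EVENTS = {"goal"}

def process_metadata(item, on_alarm):
    """Call on_alarm(camera_id) when an alarm-worthy event is received."""
    if item["event"] in ALARM_EVENTS:
        on_alarm(item["camera"])
        return True
    return False

triggered = []
process_metadata({"camera": 12, "event": "goal"}, triggered.append)
print(triggered)  # → [12]
```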
There now follows a description of the second application: conditional storage.
It is possible to store in memory only video sequences associated with metadata having a predefined value, this predefined value, furthermore, possibly also being dependent on the event displayed. The mechanism implemented is similar to that implemented in the conditional alarm application. The data is stored only if it is associated with a particular
metadata item, or with several particular metadata items. This application can then be used for teaching purposes.
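The conditional storage described above can be sketched as follows. Only the video sequences whose associated metadata matches one of the predefined values for the current event are kept in memory. The sequence representation and names are assumptions for illustration.

```python
def filter_sequences(sequences, wanted_events):
    """Keep only the sequences tagged with one of the wanted metadata values."""
    return [s for s in sequences if s["event"] in wanted_events]

sequences = [
    {"id": "seq1", "event": "goal"},
    {"id": "seq2", "event": "throw-in"},
    {"id": "seq3", "event": "red card"},
]
kept = filter_sequences(sequences, {"goal", "red card"})
print([s["id"] for s in kept])  # → ['seq1', 'seq3']
```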
Other actions can, of course, be considered and fall within the context of this invention. Worthy of note, but in a non-exhaustive manner, is the example of video editing which consists in remodelling the video information, retaining, for example, only the highlights that are therefore associated with metadata. This could, for example, lead to the automatic or non-automatic creation of audio and/or video summaries.