CN101178924B - System and method for inserting a description of images into audio recordings - Google Patents


Info

Publication number
CN101178924B
CN101178924B CN2007101692692A CN200710169269A
Authority
CN
China
Prior art keywords
image
audio
audio clips
keyword
clips
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101692692A
Other languages
Chinese (zh)
Other versions
CN101178924A (en)
Inventor
彼德·C.·伯伊勒
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN101178924A
Application granted
Publication of CN101178924B


Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102 - Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/105 - Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018 - Audio watermarking, i.e. embedding inaudible data in the audio signal

Abstract

There is disclosed a system and method for interpreting and describing graphic images. In an embodiment, the method of inserting a description of an image into an audio recording includes: interpreting an image and producing a word description of the image including at least one image keyword; parsing an audio recording into a plurality of audio clips, and producing a transcription of each audio clip, each audio clip transcription including at least one audio keyword; calculating a similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and selecting the audio clip transcription having a shortest similarity distance to the at least one image keyword as a location to insert the word description of the image. The word description of the image can then be appended to the selected audio clip to produce an augmented audio recording including the interpreted word description of the image.

Description

System and method for inserting a description of images into audio recordings
Copyright notice
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to facsimile reproduction of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Technical field
The present invention relates to a system and method for inserting a description of an image into an audio recording.
Background technology
Recording a lecture or presentation in audio format can be a convenient and effective way to disseminate information beyond those attending in person. However, if the presentation includes images, charts, and figures, the lack of visual content may significantly reduce the effectiveness of the communication. What is needed is a system and method for inserting a description of an image into an audio recording.
Summary of the invention
The present invention relates to a system and method for inserting a description of an image into an audio recording.
In an embodiment, the method begins by interpreting an image: decoding non-text content, compiling any meta-tag information, collecting optical character recognition (OCR) data, and so on. The method then aggregates, filters, and prioritizes this information to create a useful and concise non-visual (for example, audio or text) description of the image. The result of this image interpretation and description adds non-visual content where the image itself is unavailable, such as when listening to an audio recording or when a text-to-speech system reads a document aloud. For example, the system can interpret images from common presentation and graphics programs, such as Microsoft PowerPoint™ and Visio™, and insert descriptions of those images into an audio recording of the presentation.
In one aspect, there is provided a method of inserting a description of an image into an audio recording, comprising: interpreting an image and producing a word description of the image including at least one image keyword; parsing an audio recording into a plurality of audio clips and producing a transcription of each audio clip, each audio clip transcription including at least one audio keyword; calculating a similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and selecting the audio clip transcription having the shortest similarity distance to the at least one image keyword as the location at which to insert the word description of the image.
In one embodiment, the method further comprises appending the word description of the image to the selected audio clip, to produce an augmented audio recording that includes at least one interpreted word description of an image.
In another embodiment, the method further comprises providing at least one image interpretation template, the at least one template including at least one image interpretation component used to produce the word description of the image.
In another embodiment, the method further comprises providing at least one of an optical character recognition (OCR) technique, an edge detection technique, a color edge detection technique, a curve detection technique, a shape detection technique, and a contrast detection technique as an image interpretation component in the at least one template.
In another embodiment, the method further comprises parsing the audio recording into a plurality of audio clips of substantially equal length, and adjusting the length of each audio clip so that it ends at a natural pause in the speech.
In another embodiment, the method further comprises calculating the similarity distance between the image and an audio clip by calculating the similarity distance between the at least one image keyword of the image and the at least one audio keyword of the audio clip.
In another embodiment, the method further comprises obtaining the similarity distance between the at least one image keyword and the at least one audio keyword by calculating the path between those keywords in a hierarchical semantic electronic dictionary.
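The patent does not name a particular semantic dictionary; as an illustration of the path-based distance above, the sketch below walks two keywords up a small hand-built hypernym hierarchy to their nearest common ancestor. The `HIERARCHY` contents and all names are assumptions for illustration only.

```python
# Hypothetical sketch: similarity distance as the shortest path between two
# keywords in a tiny hand-built hypernym hierarchy (child -> parent).
HIERARCHY = {
    "revenue": "money", "profit": "money", "money": "abstraction",
    "chart": "image", "image": "abstraction", "abstraction": None,
}

def ancestors(word):
    """Return the chain [word, parent, grandparent, ...] up to the root."""
    chain = []
    while word is not None:
        chain.append(word)
        word = HIERARCHY.get(word)
    return chain

def path_distance(a, b):
    """Edges from a up to the nearest common ancestor, plus edges down to b."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for i, node in enumerate(chain_a):
        if node in chain_b:
            return i + chain_b.index(node)
    return len(chain_a) + len(chain_b)  # no common ancestor: maximal distance

print(path_distance("revenue", "profit"))  # -> 2 (revenue -> money -> profit)
```

A production system would use a full dictionary such as a WordNet-style database rather than this toy table, but the path computation has the same shape.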
In another aspect, there is provided a system for inserting a description of an image into an audio recording, comprising: interpreting means for interpreting an image and producing a word description of the image including at least one image keyword; parsing means for parsing an audio recording into a plurality of audio clips and producing a transcription of each audio clip, each audio clip transcription including at least one audio keyword; calculating means for calculating the similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and selecting means for selecting the audio clip transcription having the shortest similarity distance to the at least one image keyword as the location at which to insert the word description of the image.
In one embodiment, the system further comprises appending means for appending the word description of the image to the selected audio clip, to produce an augmented audio recording that includes at least one interpreted word description of an image.
In another embodiment, the system further comprises at least one image interpretation template, the at least one template including at least one image interpretation component used to produce the word description of the image.
In another embodiment, the system further comprises at least one of an optical character recognition (OCR) technique, an edge detection technique, a color edge detection technique, a curve detection technique, a shape detection technique, and a contrast detection technique as an image interpretation component in the at least one template.
In another embodiment, the system is configured to parse the audio recording into a plurality of audio clips of substantially equal length, and to adjust the length of each audio clip so that it ends at a natural pause in the speech.
In another embodiment, the system is configured to calculate the similarity distance between the image and an audio clip by calculating the similarity distance between the at least one image keyword of the image and the at least one audio keyword of the audio clip.
In another embodiment, the system is configured to calculate the similarity distance between the at least one image keyword and the at least one audio keyword according to the path between those keywords in a hierarchical semantic electronic dictionary.
In another aspect, there is provided a data processor readable medium storing data processor code that, when loaded into a data processing device, adapts the device to insert a description of an image into an audio recording, the data processor readable medium comprising: code for interpreting an image and producing a word description of the image including at least one image keyword; code for parsing an audio recording into a plurality of audio clips and producing a transcription of each audio clip, each audio clip transcription including at least one audio keyword; code for calculating the similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and code for selecting the audio clip transcription having the shortest similarity distance to the at least one image keyword as the location at which to insert the word description of the image.
In one embodiment, the data processor readable medium further comprises code for appending the word description of the image to the selected audio clip, to produce an augmented audio recording that includes at least one interpreted word description of an image.
In one embodiment, the data processor readable medium further comprises code for providing at least one image interpretation template, the at least one template including at least one image interpretation component used to produce the word description of the image.
In one embodiment, the data processor readable medium further comprises code for providing at least one of an optical character recognition (OCR) technique, an edge detection technique, a color edge detection technique, a curve detection technique, a shape detection technique, and a contrast detection technique as an image interpretation component in the at least one template.
In one embodiment, the data processor readable medium further comprises code for parsing the audio recording into a plurality of audio clips of substantially equal length, and for adjusting the length of each audio clip so that it ends at a natural pause in the speech.
In one embodiment, the data processor readable medium further comprises code for calculating the similarity distance between the image and an audio clip by calculating the similarity distance between the at least one image keyword of the image and the at least one audio keyword of the audio clip.
In one embodiment, the data processor readable medium further comprises code for obtaining the similarity distance between the at least one image keyword and the at least one audio keyword by calculating the path between those keywords in a hierarchical semantic electronic dictionary.
These and other aspects of the invention will become apparent from the following more particular description of exemplary embodiments.
Brief description of the drawings
In the figures, which illustrate exemplary embodiments of the invention:
Fig. 1 is a schematic diagram of a generic data processing system that may provide a suitable operating environment;
Fig. 2 is a schematic flowchart of an image interpretation method in accordance with an embodiment;
Figs. 3A and 3B are schematic flowcharts of a source determination and pre-processing method in accordance with an embodiment;
Fig. 4 shows an image file processing method in accordance with an embodiment;
Figs. 5A and 5B are schematic flowcharts of a component assembly method in accordance with an embodiment;
Fig. 6 shows a schematic flowchart of a sound recording pre-processing method in accordance with an embodiment;
Fig. 7 shows a schematic flowchart of an image insertion location search method in accordance with an embodiment;
Fig. 8 shows a schematic flowchart of an image insertion method in accordance with an embodiment; and
Fig. 9 shows an illustrative example of an image that may be recognized and described in accordance with an embodiment of the invention.
Detailed description
As noted above, the present invention relates to a system and method for interpreting and describing graphic images.
The present invention may be practiced in various embodiments. A suitably configured data processing system, together with associated communications networks, devices, software, and firmware, may provide a platform for enabling one or more of these systems and methods. By way of example, Fig. 1 shows a generic data processing system 100 that may include a central processing unit ("CPU") 102 connected to a storage unit 104 and to random access memory 106. The CPU 102 may process an operating system 101, application programs 103, and data 123. The operating system 101, application programs 103, and data 123 may be stored in the storage unit 104 and loaded into memory 106 as required. An operator 107 may interact with the data processing system 100 using a video display 108 connected through a video interface 105, and various input/output devices, such as a keyboard 110, a mouse 112, and a disk drive 114, connected through an I/O interface 109. In known manner, the mouse 112 may be configured to control movement of a cursor on the video display 108 and to operate, with a mouse button, the various graphical user interface ("GUI") controls appearing on the video display 108. The disk drive 114 may be configured to accept data processing system readable media 116. The data processing system 100 may form part of a network via a network interface 111, allowing the data processing system 100 to communicate with other suitably configured data processing systems (not shown). The particular configurations shown by way of example in this description are not meant to be limiting.
More generally, a method in accordance with an embodiment may include interpreting and describing an image, and synchronizing the audio or text description with a logical insertion point in an audio or text transcription.
When interpreting a chart or figure, image pattern recognition techniques may be used to identify the content. Image processing techniques may be used to extract text such as titles and annotations. Meta-tagging may be applied by the author or by a transcriber, and these tags may be used to augment and standardize the interpretation. Examples of meta-tags may include, for example, identification of the x and y axes, the chart type, chart segments, and legends.
Filtering techniques may also be used to eliminate some data (such as page numbers, headers, and footers) and to highlight other information, such as the chart title. OCR techniques may also be used to determine other textual content. This OCR information can yield not only the text content but also its position, orientation, size, and font, and this information may be used in the filtering and prioritization processes described further below.
Speech recognition techniques may be used to access the original source content, both to extract content that helps describe the figure and to extract information that helps register the description of the image to the original source content. Translation techniques may be utilized to reword content from one context to another so that it is better suited to its final purpose.
In accordance with another embodiment, the method may analyze other source content with respect to the interpreted image in order to align the two content types. Natural language processing and a semantic electronic dictionary may be used to measure the semantic similarity distance between the image and the other source content. The location in the other source content having the shortest similarity distance may be used to place the image description. Since most presentations follow a logical order, once a correct reference point is established it becomes easier to place the interpreted image description back into the presentation.
Separate control of the descriptive extensions may allow a user to separate the images from the original source content before this method is applied. This helps the decoding system register to the original audio or text, which can then be used as reference points for continued decoding and alignment. The alignment process need only be performed once, since a user can download the annotated version of the presentation rather than loading the source and the augmenting information separately.
An illustrative method 200 is now described with reference to Fig. 2. As shown, method 200 begins at block 202 by receiving a series of images (for example, as used in a presentation) as input. Method 200 then proceeds to block 204 where, for each image, method 200 determines the image type. At block 206, method 200 pre-processes the image according to its type (as described in more detail below with reference to Figs. 3A and 3B), and then proceeds to decision block 208 to evaluate whether the image type was determined successfully. If the answer at decision block 208 is no, method 200 proceeds to block 210 where, possibly using meta-tags and pattern mapping, further pre-processing is performed, and then to block 212, where method 200 may learn a new pattern. Method 200 then returns to block 204 to continue pre-processing with this new information.
If the answer at decision block 208 is yes, method 200 proceeds to block 214, where method 200 processes the image and generates a series of image keywords associated with the image. Method 200 then proceeds to block 216, where method 200 may eliminate irrelevant words (for example, page numbers and copyright statements). Method 200 then proceeds to block 218, where method 200 generates a description of the image from the image keywords. Method 200 then proceeds to block 220, where method 200 determines whether there are any more images. If so, method 200 returns to block 204 and continues. If not, method 200 proceeds to block D (Fig. 6).
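Blocks 214 to 218 can be sketched as a keyword filter followed by a simple description generator. The stop-word patterns and the one-sentence template below are illustrative assumptions; the patent leaves both open.

```python
import re

# Hypothetical stop patterns for irrelevant words (block 216): page numbers,
# copyright statements, and similar boilerplate picked up by OCR.
IRRELEVANT = [re.compile(p, re.IGNORECASE)
              for p in (r"^page\s*\d+$", r"^\d+$", r"copyright|\(c\)|©")]

def filter_keywords(keywords):
    """Drop keywords matching any irrelevant pattern."""
    return [k for k in keywords
            if not any(p.search(k) for p in IRRELEVANT)]

def describe_image(title, keywords):
    """Generate a one-sentence word description from the image keywords."""
    kept = filter_keywords(keywords)
    return f"Image '{title}' showing: {', '.join(kept)}."

print(describe_image("Q3 results",
                     ["revenue", "Page 12", "Copyright 2007 IBM", "profit"]))
# -> Image 'Q3 results' showing: revenue, profit.
```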
Figs. 3A and 3B show a schematic flowchart of a data source determination and pre-processing method 300 in accordance with an embodiment. Method 300 begins at block 302 and, at block 304, receives source data or an image. At decision block 306, method 300 determines whether the source is an image file (e.g. jpeg, pdf) or a data file (e.g. ppt, vsd). If it is a data file, method 300 proceeds to block 308, where the data file is expected to have additional information stored digitally within it (e.g. doc, ppt, vsd, xls, 123, and so on). Method 300 then proceeds to block 310, where method 300 determines whether the data file contains additional meta-tags to aid image interpretation. If not, method 300 proceeds directly to block 502 via connector C. If so, method 300 proceeds to block 312, where the meta-tags are parsed and interpreted. These meta-tags may be industry standard, or may be tags specific to the source file type. Method 300 then proceeds to block 314 and to block C (Figs. 5A and 5B).
If, at decision block 306, the source is an image file, method 300 proceeds to block 316 (Fig. 3B). Since an image file typically has less retrievable source data, method 300 proceeds to block 318, where method 300 prepares the image file for other types of parsing. This preparation may include, for example, de-skewing, noise reduction, and signal-to-noise averaging.
Method 300 then proceeds to block 320, where patterns obtained from the prepared image may be compared with patterns or templates stored in a pattern folder to determine the likely type of the source image. For example, a pattern or template match may indicate that the source image is a bar chart, a pie chart, a text table, a line chart, or the like. (Various image analysis techniques that may be used in this method are discussed generally at http://en.wikipedia.org/wiki/Computer vision. For example, various methods for noise reduction are described at http://www.mathtool.net/Java/Image Processing/. Graphics and image processing, including de-skewing, automatic cropping, boundary extraction, and noise and distortion removal, is described at http://www.sharewareriver.com/products/6116.htm. Optical character recognition (OCR) techniques are described at http://www.nuance.com/omnipage/professional/ and http://www.csc.liv.ac.uk/~wda2003/Papers/Section IV/Paper 14.pdf. Using contrast techniques to segment items from an image is described at http://www.ph.tn.tudelft.nl/Courses/FIP/noframes/fip-Segmenta.h tml. Circle and curve determination techniques are described at http://homepages.inf.ed.ac.uk/cgi/rbf/CVONLINE/entries.p17TAG38 2. Figure-to-data line conversion techniques are described at http://ichemed.chem.wisc.edu/iournal/issues/2003/Sep/abs1093 2.html. Color edge detection techniques for bar charts, pie charts, and the like are described at http://ai.stanford.edu/~ruzon/compass/color.html. Volume determination (for Venn diagrams, pie charts, and the like) is described at http://www.spl/harvard.edu:8000/pages/papers/guttmann/ms/guttma nn_rev.html.)
Method 300 then proceeds to block 322, where method 300 processes the source image file according to its likely type. For example, if the source content is a bar chart, the corresponding template for bar charts may be retrieved, and the template may be used to interpret and describe the rest of the bar chart content.
Referring now to Fig. 4, shown is an image file processing method 400 in accordance with an embodiment. Method 400 begins at block 402 and proceeds to decision block 404 to determine whether a pattern in the pattern folder exceeds a predetermined threshold, suggesting that the source image file type has been matched. If so, method 400 proceeds to block C (Figs. 5A and 5B). If not, method 400 proceeds to block 406, where method 400 pre-processes and compares the image file with the "best fit" patterns from the existing pattern folder. Method 400 then proceeds to decision block 408.
At decision block 408, if a minimum threshold cannot be met, the image cannot be interpreted and described (for example, the image may be an abstract painting or a rough freehand sketch), and method 400 returns to block 302 via connector A. If the minimum threshold can be met at block 408, method 400 proceeds to block 410. At block 410, the system may record the image as a potential new pattern and, without any further processing, return to block 302 via connector A. At the end of the process, the list of potential new pattern images may be reviewed again (for example, by an analyst), and new templates for pattern-based data extraction may be generated. These new templates may be stored in the pattern folder so that they can be used in the next round of automated processing.
Referring now to Figs. 5A and 5B, shown is a schematic flowchart of a component assembly method 500 in accordance with an embodiment. Method 500 begins at block 502 and proceeds to decision block 504, where method 500 determines whether the source file is an image file (e.g. jpeg, pdf) or a data file (e.g. ppt, vsd).
If it is a data file, method 500 proceeds to block 506, where method 500 applies templates to extract content from the data, including attributes, context, numeric values, and so on. For example, a template for an x-y plot may extract information such as the chart title, the x-axis title, the y-axis title, and the details of the lines drawn on the chart together with any labels for those lines. It will be appreciated that a template may be drafted for each particular type of data file in order to extract the key information.
Method 500 then proceeds to block 508, where method 500 may construct logical text structures and populate them with the data extracted using the templates. For example, to describe an x-y plot, the text structures may include the title, the x-axis title, the y-axis title, and a text structure describing the lines on the x-y plot by their slope and relative position.
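As a sketch of blocks 506 and 508, a populated logical text structure for an x-y plot might look like the following. The field names, the `XYChart` class, and the sentence templates are assumptions for illustration, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class XYChart:
    """Logical text structure for an x-y plot, populated from template extraction."""
    title: str
    x_axis: str
    y_axis: str
    lines: list = field(default_factory=list)  # (label, slope description) pairs

    def describe(self):
        """Render the populated structure as a word description."""
        parts = [f"An x-y chart titled '{self.title}', "
                 f"with {self.x_axis} on the x axis and {self.y_axis} on the y axis."]
        for label, slope in self.lines:
            parts.append(f"The line '{label}' is {slope}.")
        return " ".join(parts)

chart = XYChart("Sales 2007", "month", "units sold",
                [("Widgets", "rising steadily"), ("Gadgets", "roughly flat")])
print(chart.describe())
```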
Method 500 then proceeds to block 510, where method 500 may store the results of the segmented processing as identifiable components in the logical structure. Method 500 then proceeds to block A (Fig. 3A) via connector A.
Fig. 5B shows the steps of method 500 if, at decision block 504, the source file is an image file. Method 500 proceeds to block 514, where the selected pattern or template is used to segment the image file into components (for example, legends, axes, and titles).
Method 500 then proceeds to one or more of blocks 516, 518, 520, 522, 524, and 526 to interpret the image file. For example, at block 516, method 500 may use OCR to determine textual content. At block 518, method 500 may use edge detection techniques to find line graph components. At block 520, method 500 may use color edge techniques to find line graph components. At block 522, method 500 may use curve detection techniques to find curved graph components. At block 524, method 500 may use circle, ellipse, and bubble detection techniques to find 2D graphics components. At block 526, method 500 may use contrast detection techniques to find bar segments, pie segments, and the like.
Method 500 then proceeds to block 528, where method 500 may interpret each found object, deriving numbers, labels, or other attributes, such as the relative positions of bars from left to right, the relative percentages of pie segments, and so on.
Method 500 then proceeds to block 530, where method 500 may document the segmented units found using one or more of the analysis techniques described above. Method 500 then proceeds to block 532, where method 500 may coordinate and align the components. Method 500 then proceeds to block 508 (Fig. 5A), described above, and continues.
Referring now to Fig. 6, shown is a schematic flowchart of an audio pre-processing method 600. Method 600 begins at block 602 and proceeds to block 604 to receive an audio recording as input. Method 600 then proceeds to block 606, where method 600 divides the audio program into a vector of audio clips, each audio clip ending at a natural pause in the speech, such as the end of a sentence, and approaching a fixed length (for example, 30 seconds).
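Block 606 can be sketched as follows: given the timestamps of detected pauses, greedily cut the recording at the pause closest to each multiple of the target clip length. The greedy strategy and all names are assumptions; the patent only requires clips of roughly fixed length ending at natural pauses.

```python
def split_at_pauses(pause_times, total_length, target=30.0):
    """Split a recording into clips that end at the natural pause closest to
    each multiple of the target length (a simple greedy sketch)."""
    boundaries, start = [], 0.0
    candidates = sorted(pause_times)
    while total_length - start > target:
        goal = start + target
        # Pick the pause closest to the ideal cut point after the clip start.
        cut = min((p for p in candidates if p > start),
                  key=lambda p: abs(p - goal), default=None)
        if cut is None:
            break
        boundaries.append(cut)
        start = cut
    return boundaries

# Pauses detected at sentence ends (seconds) in a 100-second recording.
print(split_at_pauses([12.0, 29.5, 33.0, 58.0, 64.0, 91.0], 100.0))
# -> [29.5, 58.0, 91.0]
```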
Method 600 then proceeds to block 608, to continue for each audio clip. Method 600 then proceeds to block 610, where speech recognition techniques may be used to convert the audio clip to text. At block 612, method 600 may then apply a natural language parser to parse the converted text. Method 600 may then produce a noun phrase vector containing 0 to n noun phrases extracted from the audio clip. Method 600 then proceeds to block 616, where method 600 converts certain common names or titles not found in the dictionary into words that are in the dictionary. Method 600 then proceeds to block 618, where method 600 calculates an importance value for each noun phrase and removes the less significant phrases. Method 600 then proceeds to block 620, where method 600 produces an audio clip keyword vector containing 0 to n keywords. Method 600 then proceeds to decision block 622 to determine whether there are any more audio clips. If so, method 600 returns to block 608 and continues. If not, method 600 proceeds to method 700 of Fig. 7 via connector E.
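Blocks 618 and 620 can be sketched with a frequency-based importance score. The patent does not specify how importance is computed, so the scoring below is an assumption; any weighting (e.g. TF-IDF) could be substituted.

```python
from collections import Counter

def keyword_vector(noun_phrases, top_n=5):
    """Score each noun phrase by frequency (an assumed importance measure;
    the patent leaves the scoring open) and keep the top_n as keywords."""
    counts = Counter(p.lower() for p in noun_phrases)
    return [phrase for phrase, _ in counts.most_common(top_n)]

phrases = ["revenue", "third quarter", "revenue", "profit margin",
           "revenue", "profit margin", "weather"]
print(keyword_vector(phrases, top_n=3))
# -> ['revenue', 'profit margin', 'third quarter']
```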
Referring now to Fig. 7, shown is a schematic flowchart of an image insertion location search method 700 in accordance with an embodiment. Method 700 begins at block 702 and proceeds to block 704, where method 700 receives as input a pre-processed image, represented by an image keyword vector containing 0 to n keywords, and a pre-processed audio program, represented by a vector of audio clip keyword vectors (each audio clip keyword vector representing one audio clip).
Method 700 then proceeds to block 706, to continue for each audio clip in the audio program. At block 708, method 700 continues for each keyword in the image keyword vector. Method 700 then proceeds to block 710, to continue for each keyword in the audio keyword vector representing the audio clip. Method 700 then proceeds to block 712, where method 700 calculates the similarity distance between the current image keyword and the current audio keyword. At block 714, method 700 updates the shortest distance between this image keyword and the audio keywords and, by returning to block 710, proceeds to the next keyword in the audio clip, if any. If there are none left, method 700 proceeds to block 716, where method 700 designates this shortest distance value as the similarity distance between this image keyword and the audio clip. Method 700 then proceeds to block 718, where method 700 updates the shortest distance between the image keywords and the audio clip and, by returning to block 708, proceeds to the next image keyword, if any. If there are none left, method 700 proceeds to block 720, where method 700 designates this shortest distance value as the similarity distance between the image and the audio clip.
Method 700 then advances to block 722, where method 700 records the audio clip with the shortest distance and, by returning to block 706, advances to the next audio clip, if any. If there is none, method 700 advances to block 724, where method 700 identifies the audio clip with the shortest similarity distance to the image as the place at which to insert the image. Method 700 then advances to method 800 of Fig. 8 via connector F.
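The shortest-distance search of blocks 706 to 724 can be sketched as follows. This is a toy under stated assumptions: a small hand-built is-a hierarchy stands in for the hierarchical semantic electronic dictionary (e.g. a WordNet-style network) that the embodiments assume, and the similarity distance between two keywords is taken as the number of edges on the path joining them through their nearest common ancestor.

```python
# Hypothetical is-a hierarchy: child -> parent.
HIERARCHY = {
    "line": "shape", "curve": "shape", "shape": "graphic",
    "graph": "graphic", "chart": "graphic", "graphic": "entity",
    "lecture": "speech", "speech": "entity",
}

def ancestors(word):
    """Chain from a word up to the root of the hierarchy."""
    chain = [word]
    while word in HIERARCHY:
        word = HIERARCHY[word]
        chain.append(word)
    return chain

def keyword_distance(a, b):
    """Block 712: path length between two keywords via their
    nearest common ancestor."""
    ca, cb = ancestors(a), ancestors(b)
    for i, node in enumerate(ca):
        if node in cb:
            return i + cb.index(node)
    return len(ca) + len(cb)  # no common ancestor: maximal distance

def best_clip(image_keywords, clips):
    """Blocks 706-724: index of the clip whose keywords lie closest
    to the image keywords (minimum over all keyword pairs)."""
    best, best_dist = None, float("inf")
    for idx, clip_keywords in enumerate(clips):
        dist = min(keyword_distance(i, c)
                   for i in image_keywords for c in clip_keywords)
        if dist < best_dist:            # blocks 714, 718, 722
            best, best_dist = idx, dist
    return best

clips = [["lecture", "speech"], ["graph", "line"], ["speech"]]
insertion_clip = best_clip(["chart", "curve"], clips)
```

Here the image keywords "chart" and "curve" are closest to the second clip's "graph" and "line" (two edges apart through "graphic" and "shape" respectively), so `insertion_clip` is 1.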
Referring now to Fig. 8, an image insertion method 800 according to an embodiment is shown. Method 800 begins at block 802 and advances to block 804 to receive as input a series of images, each image represented by an image keyword vector and a corresponding insertion point. Method 800 then advances to block 806, where method 800 proceeds for each sound clip in the sound recording. Method 800 then advances to block 808, where the sound clip is appended to the resulting image-description-augmented sound recording.
Method 800 then advances to block 810 to proceed for each image in the series of images. Method 800 then advances to decision block 812 to determine whether the image should be inserted after the current sound clip. If not, method 800 returns to block 810. If so, method 800 advances to block 814 to generate an image description audio clip from the image keywords using a speech generation tool. Method 800 then advances to block 816, where method 800 appends the newly generated image description audio clip at the identified insertion point. Method 800 then advances to decision block 818 to determine whether to return to block 810 for the next image; when no images remain, method 800 determines whether to return to block 806 for the next sound clip or to end.
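The assembly loop of blocks 806 to 816 can be sketched as follows, assuming method 700 has already assigned each image description an insertion point (here, simply the index of the sound clip it should follow) and stubbing the speech generation tool of block 814 with a hypothetical placeholder function.

```python
def synthesize(keywords):
    # Hypothetical stand-in for the speech generation tool of block 814,
    # which would render the image description as audio.
    return "[description: " + ", ".join(keywords) + "]"

def augment_recording(sound_clips, descriptions):
    """Blocks 806-816: append each sound clip, then append any image
    description whose insertion point follows that clip."""
    output = []
    for idx, clip in enumerate(sound_clips):
        output.append(clip)                      # block 808
        for insert_after, keywords in descriptions:
            if insert_after == idx:              # decision block 812
                output.append(synthesize(keywords))  # blocks 814-816
    return output

augmented = augment_recording(
    ["intro", "graph talk", "summary"],
    [(1, ["graph", "line"])],
)
```

In this sketch the generated description clip lands immediately after the clip at index 1, yielding a four-clip augmented recording.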
As can be seen, the method described above identifies images, describes them in text and audio, locates a suitable insertion point in the original audio recording using similarity distances calculated from keywords, and inserts the image description at the identified location. An image invisible to the listener of the audio recording is thus described in an image description audio clip inserted into, and augmenting, the original sound recording.
Example
Fig. 9 shows a graph image 900, adapted and simplified from a chart by Clayton M. Christensen, as an illustrative example of a graph that can be identified and described using the method described above.
For example, in an audio recording of a lecture, the lecturer may refer to a number of figures or charts, such as the graph image 900 shown in Fig. 9. At some point in the lecture, for example at a time reference of 10:25 am, the lecturer may refer to a chart entitled "disruptors". Then, at 10:30 am, he may say "graph" and "line", which may be interpreted as referring to a line graph. At 10:35 am he may also say specifically, "To keep the graph simple, I depict the ability to use improvements as only a single line ...".
According to an embodiment, the system may insert an explanation of the chart 900 at 10:30 am, which may be set forth as follows: title "disruptors"; X-axis: "time"; Y-axis: "performance". Line A has a slope of about 10° and is entitled "ability to use improvements". Line B has a slope of about 25° and is entitled "innovation". Line B intersects line A at time D. Line C has a slope of about 25° and is entitled "disruption". Line C intersects line A at time E.
As can be seen, a system and method that can explain a chart such as chart 900 and convey it verbally can give the listener more context for understanding the lecture than a system and method that does not provide such information.
While various illustrative embodiments of the invention have been described above, it will be apparent to those skilled in the art that changes and modifications may be made. Therefore, the scope of the invention is defined by the following claims.

Claims (14)

1. A method of inserting a description of an image into an audio recording, comprising:
interpreting an image and generating a description of the image comprising at least one image keyword;
parsing the audio recording into a plurality of audio clips and generating a transcript of each audio clip, each audio clip transcript comprising at least one audio keyword;
calculating a similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and
selecting the audio clip transcript having the shortest similarity distance to the at least one image keyword as a location for inserting the description of the image.
2. according to the method for claim 1, also comprise: the explanatory note of image is appended to the audio clips of selection, comprise the audio recording of increase of at least one explanatory explanatory note of image with generation.
3. according to the method for claim 1, also comprise: the template of at least one interpretation of images is provided, and this at least one template comprises at least one image interpretation parts, is used to produce the explanatory note of image.
4. according to the method for claim 3, also comprise: the image interpretation parts of at least one technology conduct in optical character recognition, edge searching technology, colour edging searching technology, curve searching technology, shape searching technology and the contrast searching technology in this at least one template are provided.
5. according to the method for claim 1, also comprise: audio recording is resolved to a plurality of audio clips of substantially the same length, and the length of regulating each audio clips finishes with natural pause place at voice.
6. according to the method for claim 1, also comprise: calculate similarity distance between image and audio clips by calculating similarity distance between at least one audio frequency key word of at least one image keyword of image and audio clips.
7. according to the method for claim 6, also comprise: obtain similarity distance between at least one image keyword and at least one audio frequency key word by calculating in the semantic electronic dictionary of the hierarchy path between at least one image keyword and at least one audio frequency key word.
8. A system for inserting a description of an image into an audio recording, comprising:
interpreting means for interpreting an image and generating a description of the image comprising at least one image keyword;
parsing means for parsing the audio recording into a plurality of audio clips and generating a transcript of each audio clip, each audio clip transcript comprising at least one audio keyword;
calculating means for calculating a similarity distance between the at least one image keyword and the at least one audio keyword of each audio clip; and
selecting means for selecting the audio clip transcript having the shortest similarity distance to the at least one image keyword as a location for inserting the description of the image.
9. The system of claim 8, further comprising appending means for appending the description of the image to the selected audio clip to generate an augmented audio recording comprising at least one description of an image.
10. The system of claim 8, further comprising at least one template for interpreting images, the at least one template comprising at least one image interpretation component for generating the description of the image.
11. The system of claim 10, wherein the at least one image interpretation component in the at least one template comprises at least one of an optical character recognition technique, an edge finding technique, a colour edge finding technique, a curve finding technique, a shape finding technique, and a contrast finding technique.
12. The system of claim 8, wherein the parsing means is configured to parse the audio recording into a plurality of audio clips of substantially equal length and to adjust the length of each audio clip to end at a natural pause in speech.
13. The system of claim 8, wherein the calculating means is configured to calculate the similarity distance between an image and an audio clip by calculating the similarity distance between the at least one image keyword of the image and the at least one audio keyword of the audio clip.
14. The system of claim 13, wherein the system is configured to calculate the similarity distance between the at least one image keyword and the at least one audio keyword according to a path between the at least one image keyword and the at least one audio keyword in a hierarchical semantic electronic dictionary.
CN2007101692692A 2006-11-09 2007-11-08 System and method for inserting a description of images into audio recordings Expired - Fee Related CN101178924B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA2567505 2006-11-09
CA002567505A CA2567505A1 (en) 2006-11-09 2006-11-09 System and method for inserting a description of images into audio recordings

Publications (2)

Publication Number Publication Date
CN101178924A CN101178924A (en) 2008-05-14
CN101178924B true CN101178924B (en) 2010-09-15

Family

ID=39367169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101692692A Expired - Fee Related CN101178924B (en) 2006-11-09 2007-11-08 System and method for inserting a description of images into audio recordings

Country Status (3)

Country Link
US (1) US7996227B2 (en)
CN (1) CN101178924B (en)
CA (1) CA2567505A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2572116A1 (en) * 2006-12-27 2008-06-27 Ibm Canada Limited - Ibm Canada Limitee System and method for processing multi-modal communication within a workgroup
US7890512B2 (en) * 2008-06-11 2011-02-15 Microsoft Corporation Automatic image annotation using semantic distance learning
US8605982B2 (en) * 2008-08-11 2013-12-10 Hyland Software, Inc. Check boundary detection by string literal analysis
US8386255B2 (en) * 2009-03-17 2013-02-26 Avaya Inc. Providing descriptions of visually presented information to video teleconference participants who are not video-enabled
JP4930564B2 (en) * 2009-09-24 2012-05-16 カシオ計算機株式会社 Image display apparatus and method, and program
JP5534840B2 (en) * 2010-02-03 2014-07-02 キヤノン株式会社 Image processing apparatus, image processing method, image processing system, and program
US9075760B2 (en) 2012-05-07 2015-07-07 Audible, Inc. Narration settings distribution for content customization
US8972265B1 (en) * 2012-06-18 2015-03-03 Audible, Inc. Multiple voices in audio content
US9472113B1 (en) 2013-02-05 2016-10-18 Audible, Inc. Synchronizing playback of digital content with physical content
US9317486B1 (en) 2013-06-07 2016-04-19 Audible, Inc. Synchronizing playback of digital content with captured physical content
WO2016105972A1 (en) * 2014-12-23 2016-06-30 General Electric Company Report generation in medical imaging
CN111125384B (en) * 2018-11-01 2023-04-07 阿里巴巴集团控股有限公司 Multimedia answer generation method and device, terminal equipment and storage medium
CN110413819B (en) * 2019-07-12 2022-03-29 深兰科技(上海)有限公司 Method and device for acquiring picture description information
CN111611505B (en) * 2020-05-19 2023-08-29 掌阅科技股份有限公司 Method for accessing multimedia resources in electronic book, computing device and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
US6480669B1 (en) * 1999-05-12 2002-11-12 Kabushiki Kaisha Toshiba Digital video recording/playback system with entry point processing function
US6651120B2 (en) * 2000-09-14 2003-11-18 Fujitsu Limited Image data converting system and a storage medium thereof
CN1574049A (en) * 2003-05-30 2005-02-02 佳能株式会社 Reproducing apparatus for data stored in disk-shape storage media
US6941509B2 (en) * 2001-04-27 2005-09-06 International Business Machines Corporation Editing HTML DOM elements in web browsers with non-visual capabilities
CN1795506A (en) * 2003-05-26 2006-06-28 皇家飞利浦电子股份有限公司 System and method for generating audio-visual summaries for audio-visual program content

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US5677739A (en) * 1995-03-02 1997-10-14 National Captioning Institute System and method for providing described television services
US5729741A (en) * 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
US6243713B1 (en) * 1998-08-24 2001-06-05 Excalibur Technologies Corp. Multimedia document retrieval by application of multimedia queries to a unified index of multimedia data for a plurality of multimedia data types
US6901585B2 (en) * 2001-04-12 2005-05-31 International Business Machines Corporation Active ALT tag in HTML documents to increase the accessibility to users with visual, audio impairment
US20040141630A1 (en) * 2003-01-17 2004-07-22 Vasudev Bhaskaran Method and apparatus for augmenting a digital image with audio data
US20040181412A1 (en) * 2003-02-26 2004-09-16 Wido Menhardt Medical imaging analysis using speech synthesis
JP3848319B2 (en) * 2003-11-11 2006-11-22 キヤノン株式会社 Information processing method and information processing apparatus
DE60318450T2 (en) * 2003-11-12 2008-12-11 Sony Deutschland Gmbh Apparatus and method for segmentation of audio data in meta-patterns
US9236043B2 (en) * 2004-04-02 2016-01-12 Knfb Reader, Llc Document mode processing for portable reading machine enabling document navigation


Also Published As

Publication number Publication date
US7996227B2 (en) 2011-08-09
US20080114601A1 (en) 2008-05-15
CA2567505A1 (en) 2008-05-09
CN101178924A (en) 2008-05-14

Similar Documents

Publication Publication Date Title
CN101178924B (en) System and method for inserting a description of images into audio recordings
CN110770735B (en) Transcoding of documents with embedded mathematical expressions
US9268753B2 (en) Automated addition of accessiblity features to documents
KR102473543B1 (en) Systems and methods for digital ink interaction
US7643687B2 (en) Analysis hints
US20090144277A1 (en) Electronic table of contents entry classification and labeling scheme
US11031003B2 (en) Dynamic extraction of contextually-coherent text blocks
KR20080100179A (en) Detection of lists in vector graphics documents
CN113157959A (en) Cross-modal retrieval method, device and system based on multi-modal theme supplement
JPH11184894A (en) Method for extracting logical element and record medium
US9141867B1 (en) Determining word segment boundaries
JP2003281165A (en) Document summarization method and system
CN103680503A (en) Semantic identification method
US7308398B2 (en) Translation correlation device
AU2005230005B2 (en) Analysis alternates in context trees
US11663398B2 (en) Mapping annotations to ranges of text across documents
JP4515186B2 (en) Speech dictionary creation device, speech dictionary creation method, and program
JP2005129086A (en) Document editing device, and recording medium with document editing processing program recorded thereon
JPH0748217B2 (en) Document summarization device
CN115995087B (en) Document catalog intelligent generation method and system based on fusion visual information
KR102639320B1 (en) Electronic Lab Notebook platform with automated content classification and technical document analysis artificial intelligence using thereof based on automatic recognition, abstraction and translation artificial intelligence for laboratories and operating method of the platform
US20230305863A1 (en) Self-Supervised System for Learning a User Interface Language
CN117633214A (en) Article outline generation method, device and storage medium
CN116227437A (en) Text conversion method, electronic device, and readable storage medium
KR20220142901A (en) Method and system for extracting information from semi-structured documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100915

Termination date: 20181108