CA2077274C - Method and apparatus for summarizing a document without document image decoding - Google Patents
Method and apparatus for summarizing a document without document image decoding
- Publication number
- CA2077274C (application CA2077274A)
- Authority
- CA
- Canada
- Prior art keywords
- image
- document
- units
- image units
- significant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
A method and apparatus for excerpting and summarizing an undecoded document image, without first converting the document image to optical character codes such as ASCII text, identifies significant words, phrases and graphics in the document image using automatic or interactive morphological image recognition techniques. Document summaries or indices are produced based on the identified significant portions of the document image. The disclosed method is particularly suited to improving reading machines for the blind.
Description
METHOD AND APPARATUS FOR SUMMARIZING A
DOCUMENT WITHOUT DOCUMENT IMAGE DECODING
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to improvements in methods and apparatuses for automatic document processing, and more particularly to improvements in methods and apparatuses for recognizing semantically significant words, characters, images, or image segments in a document image without first decoding the document image and automatically creating a summary version of the document contents.
2. Background
It has long been the goal in computer based electronic document processing to be able, easily and reliably, to identify, access and extract information contained in electronically encoded data representing documents; and to summarize and characterize the information contained in a document or corpus of documents which has been electronically stored. For example, to facilitate review and evaluation of the information content of a document or corpus of documents to determine the relevance of same for a particular user's needs, it is desirable to be able to identify the semantically most significant portions of a document, in terms of the information they contain; and to be able to present those portions in a manner which facilitates the user's recognition and appreciation of the document contents. However, the problem of identifying the significant portions within a document is particularly difficult when dealing with images of the documents (bitmap image data), rather than with code representations thereof (e.g., coded representations of text such as ASCII). As opposed to ASCII text files, which permit users to perform operations such as Boolean algebraic key word searches in order to locate text of interest, electronic documents which have been produced by scanning an original without decoding to produce document images are difficult to evaluate without exhaustive viewing of each document image, or without hand-crafting a summary of the document for search purposes. Of course, document viewing or creation of a document summary requires extensive human effort.
On the other hand, current image recognition methods, particularly involving textual material, generally involve dividing an image segment to be analyzed into individual characters which are then deciphered or decoded and matched to characters in a character library.
One general class of such methods includes optical character recognition (OCR) techniques. Typically, OCR techniques enable a word to be recognized only after each of the individual characters of the word has been decoded, and a corresponding word image retrieved from a library.
Moreover, optical character recognition decoding operations generally require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing, especially with regard to word recognition. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and identified in a decision making process as a distinct character in a predetermined set of characters. Further, the image quality of the original document and noise inherent in the generation of a scanned image contribute to uncertainty regarding the actual appearance of the bitmap for a character. Most character identifying processes assume that a character is an independent set of connected pixels. When this assumption fails due to the quality of the image, identification also fails.
4. References
European patent application number 0-361-464 by Doi describes a method and apparatus for producing an abstract of a document with correct meaning precisely indicative of the content of the document. The method includes listing hint words which are preselected words indicative of the presence of significant phrases that can reflect content of the document, searching all the hint words in the document, extracting sentences of the document in which any one of the listed hint words is found by the search, and producing an abstract of the document by juxtaposing the extracted sentences. Where the number of hint words produces a lengthy excerpt, a morphological language analysis of the abstracted sentences is performed to delete unnecessary phrases and focus on the phrases using the hint words as the right part of speech according to a dictionary containing the hint words.
10"A Business Intelligence System" by Luhn, IBM
~ Journal, October 1958 describes a system which in part, auto-abstracts a document, by ascertaining the most frequently occurring words (significant words) and analyzes all sentences in the text conta-ining such words.
A relative value of the sentence significance is then established by a formula which reflects the number of significant words contained in a sentence and the prox-imity of these words to each other within the sentence.
Several sentences which rank highest in value of signifi-cance are then extracted from the text to constitute theauto-abstract.
SUMMARY OF THE INVENTION
Accordingly, it is an object of an aspect of the invention to provide a method and apparatus for automatically excerpting and summarizing a document image without decoding or otherwise understanding the contents thereof.
It is an object of an aspect of the invention to provide a method and apparatus for automatically generating ancillary document images reflective of the contents of an entire primary document image.
It is an object of an aspect of the invention to provide a method and apparatus for the type described for automatically extracting summaries of material and providing links from the summary back to the original document.
It is an object of an aspect of the invention to provide a method and apparatus of the type described for producing Braille document summaries or speech synthesized summaries of a document.
It is an object of an aspect of the invention to provide a method and apparatus of the type described which is useful for enabling document browsing through the development of image gists, or for document categorization through the use of lexical gists.
It is an object of an aspect of the invention to provide a method and apparatus of the type described that does not depend upon statistical properties of large, pre-analyzed document corpora.
The invention provides a method and apparatus for segmenting an undecoded document image into undecoded image units, identifying semantically significant image units based on an evaluation of predetermined image characteristics of the image units, without decoding the document image or reference to decoded image data, and utilizing the identified significant image units to create an ancillary document image of abbreviated information content which is reflective of the subject matter content of the original document image. In accordance with one aspect of the invention, the ancillary document image is a condensation or summarization of the original document image which facilitates browsing. In accordance with another aspect of the invention, the identified significant image units are presented as an index of key words, which may be in decoded form, to permit document categorization.
Thus, in accordance with one aspect of the invention, a method is presented for excerpting information from a document image containing word image units. According to the invention, the document image is segmented into word image units (word units), and the word units are evaluated in accordance with morphological image properties of the word units, such as word shape. Significant word units are then identified, in accordance with one or more predetermined or user selected significance criteria, and the identified significant word units are outputted.
In accordance with another aspect of the invention, an apparatus is provided for excerpting information from a document containing a word unit text. The apparatus includes an input means for inputting the document and producing a document image electronic representation of the document, and a data processing system for performing data driven processing and which comprises execution processing means for performing functions by executing program instructions in a predetermined manner contained in a memory means. The program instructions operate the execution processing means to identify significant word units in accordance with predetermined significance criteria from morphological properties of the word units, and to output selected ones of the identified significant word units. The output of the selected significant word units can be to an electrostatographic reproduction machine, a speech synthesizer means, a Braille printer, a bitmap display, or other appropriate output means.
Other aspects of this invention are as follows:
A method for electronically processing an electronic document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and creating an abbreviated document image based on said identified significant image units.
A method of excerpting significant informa-tion from an undecoded document image without decoding the document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and outputting selected ones of said identified significant image units for further processing.
An apparatus for automatically summarizing the information content of an undecoded document image without decoding the document image, comprising:
means for segmenting the document image into image units without decoding the document image;
means for evaluating selected image units according to at least one morphological image characteristic thereof to identify significant image units; and
means for creating a supplemental document image based on said identified significant image units.
These and other objects, features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of the invention, when read in conjunction with the accompanying drawings and appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
A preferred embodiment of the invention is illustrated in the accompanying drawing, in which:
Figure 1 is a flow chart of a method of the invention.
Figure 2 is a block diagram of an apparatus according to the invention for carrying out the method of Figure 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In contrast to prior techniques, such as those described above, the invention is based upon the recognition that scanned image files and character code files exhibit important differences for image processing, especially in data retrieval. The method of a preferred embodiment of the invention capitalizes on the visual properties of text contained in paper documents, such as the presence or frequency of linguistic terms (such as words of importance like "important", "significant", "crucial", or the like) used by the author of the text to draw attention to a particular phrase or a region of the text; the structural placement within the document image of section titles and page headers, and the placement of graphics; and so on. A preferred embodiment of the method of the invention is illustrated in the flow chart of Figure 1, and an apparatus for performing the method is shown in Figure 2. For the sake of clarity, the invention will be described with reference to the processing of a single document. However, it will be appreciated that the invention is applicable to the processing of a corpus of documents containing a plurality of documents. More particularly, the invention provides a method and apparatus for automatically excerpting semantically significant information from the data or text of a document based on certain morphological (structural) image characteristics of image units corresponding to units of understanding contained within the document image. The excerpted information can be used, among other things, to automatically create a document index or summary. The selection of image units for summarization can be based on frequency of occurrence, or predetermined or user selected selection criteria, depending upon the particular application in which the method and apparatus of the invention is employed.
The invention is not limited to systems utilizing document scanning. Rather, other systems such as a bitmap workstation (i.e., a workstation with a bitmap display) or a system using both bitmapping and scanning would work equally well for the implementation of the methods and apparatus described herein.
With reference first to Figure 2, the method is performed on an electronic image of an original document 5, which may include lines of text 7, titles, drawings, figures 8, or the like, contained in one or more sheets or pages of paper 10 or other tangible form. The electronic document image to be processed is created in any conventional manner, for example, by a conventional scanning means such as those incorporated within a document copier or facsimile machine, a Braille reading machine, or by an electronic beam scanner or the like. Such scanning means are well known in the art, and thus are not described in detail herein. An output derived from the scanning is digitized to produce undecoded bit mapped image data representing the document image for each page of the document, which data is stored, for example, in a memory 15 of a special or general purpose digital computer data processing system 13. The data processing system 13 can be a data driven processing system which comprises sequential execution processing means 16 for performing functions by executing program instructions in a predetermined sequence contained in a memory, such as the memory 15. The output from the data processing system 13 is delivered to an output device 17, such as, for example, a memory or other form of storage unit; an output display 17A as shown, which may be, for instance, a CRT display; a printer device 17B as shown, which may be incorporated in a document copier machine or a Braille or standard form printer; a facsimile machine, speech synthesizer or the like.
Through use of equipment such as illustrated in Figure 2, the identified word units are detected based on significant morphological image characteristics inherent in the image units, without first converting the scanned document image to character codes.
The method by which such image unit identification may be performed is described with reference now to Figure 1. The first phase of the image processing technique of the invention involves a low level document image analysis in which the document image for each page is segmented into undecoded information containing image units (step 20) using conventional image analysis techniques; or, in the case of text documents, preferably using the bounding box method described in U.S. Patent No. 5,321,770, issued June 4, 1994, to Huttenlocher and Hopcroft, and entitled "Method for Determining Boundaries of Words in Text". The locations of and spatial relationships between the image units on a page are then determined (step 25). For example, an English language document image can be segmented into word image units based on the relative difference in spacing between characters within a word and the spacing between words. Sentence and paragraph boundaries can be similarly ascertained. Additional region segmentation image analysis can be performed to generate a physical document structure description that divides page images into labelled regions corresponding to auxiliary document elements like figures, tables, footnotes and the like. Figure regions can be distinguished from text regions based on the relative lack of image units arranged in a line within the region, for example. Using this segmentation, knowledge of how the documents being processed are arranged (e.g., left-to-right, top-to-bottom), and, optionally, other inputted information such as document style, a "reading order" sequence for word images can also be generated. The term "image unit" is thus used herein to denote an identifiable segment of an image such as a number, character, glyph, symbol, word, phrase or other unit that can be reliably extracted. Advantageously, for purposes of document review and evaluation, the document image is segmented into sets of signs, symbols or other elements, such as words, which together form a single unit of understanding. Such single units of understanding are generally characterized in an image as being separated by a spacing greater than that which separates the elements forming a unit, or by some predetermined graphical emphasis, such as, for example, a surrounding box image or other graphical separator, which distinguishes one or more image units from other image units in the scanned document image. Such image units representing single units of understanding will be referred to hereinafter as "word units."
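The spacing-based segmentation described above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the list-of-strings bitmap, the '#' ink convention and the `word_gap` threshold are all assumptions chosen for the example.

```python
# Illustrative sketch: segmenting one binarized text line into word-unit
# column spans by comparing intra-word (character) gaps with the wider
# inter-word gaps, without decoding any characters.

def column_has_ink(bitmap, col):
    """True if any row of the bitmap has an ink pixel in this column."""
    return any(row[col] == '#' for row in bitmap)

def segment_words(bitmap, word_gap=3):
    """Return (start, end) column spans of word units.

    Runs of fewer than `word_gap` blank columns are treated as character
    spacing inside a word; runs of `word_gap` or more separate words.
    """
    width = len(bitmap[0])
    spans = []
    start = last_ink = None
    for col in range(width):
        if column_has_ink(bitmap, col):
            # A blank run longer than word_gap closes the previous word.
            if start is not None and col - last_ink > word_gap:
                spans.append((start, last_ink + 1))
                start = col
            if start is None:
                start = col
            last_ink = col
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans
```

With a larger `word_gap`, the same ink is read as a single word unit, mirroring how the threshold separating character spacing from word spacing controls the segmentation.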
Advantageously, a discrimination step 30 is next performed to identify the image units which have insufficient information content to be useful in evaluating the subject matter content of the document being processed. One preferred method is to use the morphological function or stop word detection techniques disclosed in U.S. Patent No. 5,455,871, issued October 3, 1995, to D. Bloomberg et al., and entitled "Detecting Function Words Without Converting a Scanned Document to Character Codes".
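The cited stop-word detection operates morphologically; as a loose stand-in, the hypothetical sketch below discards image units whose bounding boxes are too narrow to carry much content, since short function words ("a", "of", "the") tend to produce narrow boxes. The width threshold, the tuple layout and the decoded labels (used here only as identifiers) are invented for illustration and are not the Bloomberg et al. technique itself.

```python
# Hypothetical discrimination step: drop image units likely to be
# function words, judged purely by bounding-box width.

def discriminate(units, min_width_em=2.5, x_height=10):
    """Partition (label, width_px) image units into (kept, discarded).

    Units narrower than `min_width_em` times the estimated x-height are
    assumed to be short function words and dropped from further analysis.
    """
    threshold = min_width_em * x_height
    kept = [u for u in units if u[1] >= threshold]
    discarded = [u for u in units if u[1] < threshold]
    return kept, discarded
```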
Next, in step 40, selected image units, e.g., the image units not discriminated in step 30, are evaluated, without decoding the image units being classified or reference to decoded image data, based on an evaluation of predetermined morphological (structural) image characteristics of the image units. The evaluation entails a determination (step 41) of the image characteristics and a comparison (step 42) of the determined image characteristics for each image unit with the determined image characteristics of the other image units.
One preferred method for defining the image unit image characteristics to be evaluated is to use the word shape derivation techniques disclosed in the copending Canadian Patent Application No. 2,077,969, filed September 10, 1992, by D. Huttenlocher and Hopcroft, and entitled "A Method of Deriving Wordshapes for Subsequent Comparison". As described in the aforesaid application, at least one, one-dimensional signal characterizing the shape of a word unit is derived; or an image function is derived defining a boundary enclosing the word unit, and the image function is augmented so that an edge function representing edges of the character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit.
The determined morphological image characteristic(s) or derived image unit shape representations of each selected image unit are compared, as noted above (step 42), either with the determined morphological image characteristic(s) or derived image unit shape representations of the other selected image units (step 42A), or with predetermined/user-selected image characteristics to locate specific types of image units (step 42B). The determined morphological image characteristics of the selected image units are advantageously compared with each other for the purpose of identifying equivalence classes of image units such that each equivalence class contains most or all of the instances of a given image unit in the document, and the relative frequencies with which image units occur in a document can be determined, as is set forth more fully in U.S. Patent No. 5,325,444, issued June 28, 1994 (Canadian Application No. 2,077,604, filed September 4, 1992), to Cass et al., and entitled "Method and Apparatus for Determining the Frequency of Words in a Document Without Document Image Decoding". Image units can then be classified or identified as significant according to the frequency of their occurrence, as well as other characteristics of the image units, such as their length.
For example, it has been recognized that a useful combination of selection criteria for business communications written in English is to select the medium frequency word units.
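A toy version of equivalence classing and medium-frequency selection might look like the sketch below. The shape key (quantized width plus ascender and descender counts) is a crude invented surrogate for the word-shape comparisons the cited patents perform, and the frequency band is likewise arbitrary; decoded labels appear only to make the test data readable.

```python
# Illustrative sketch: group word units into equivalence classes by a
# crude shape signature, then keep the medium-frequency classes.
from collections import Counter

def shape_key(unit):
    """Invented stand-in for a word-shape signature: quantized bounding-box
    width plus ascender and descender counts. Units sharing a key are
    treated as instances of the same word, without decoding either one."""
    _label, width, ascenders, descenders = unit
    return (width // 5, ascenders, descenders)

def medium_frequency_classes(units, lo=2, hi=5):
    """Return the shape keys whose occurrence count falls in [lo, hi],
    i.e. the medium-frequency word units suggested above as useful
    significance selectors."""
    counts = Counter(shape_key(u) for u in units)
    return {key for key, n in counts.items() if lo <= n <= hi}
```

Very frequent classes (typically function words) and one-off classes are both excluded, leaving the medium-frequency units as candidates for the summary.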
It will also be appreciated that the selection process can be extended to phrases comprising identified significant image units and adjacent image units linked together in reading order sequence. The frequency of occurrence of such phrases can also be determined such that the portions of the source document which are selected for summarization correspond with phrases exceeding a predetermined frequency threshold, e.g., five occurrences. A preferred method for determining phrase frequency through image analysis without document decoding is disclosed in U.S. Patent No. 5,369,714, issued November 29, 1994, to Withgott et al., and entitled "Method and Apparatus for Determining the Frequency of Phrases in a Document Without Document Image Decoding".
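Once image units have been assigned equivalence-class labels in reading order, phrase frequency reduces to n-gram counting, as this hedged sketch suggests. The class labels are plain strings here for readability; the cited method operates on undecoded shape classes, and the threshold of five simply echoes the example above.

```python
# Illustrative sketch: find phrases (adjacent image-unit classes in
# reading order) that meet a frequency threshold.
from collections import Counter

def frequent_phrases(class_sequence, n=2, threshold=5):
    """Count n-grams over a reading-order sequence of equivalence-class
    labels and return those occurring at least `threshold` times."""
    grams = Counter(
        tuple(class_sequence[i:i + n])
        for i in range(len(class_sequence) - n + 1))
    return {gram for gram, count in grams.items() if count >= threshold}
```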
It will be appreciated that the specification of the image characteristics for titles, headings, captions, linguistic criteria or other significance indicating features of a document image can be predetermined and selected by the user to determine the selection criteria defining a "significant" image unit. For example, titles are usually set off above names or paragraphs in boldface or italic typeface, or are in larger font than the main text. A related convention for titles is the use of a special location on the page for information such as the main title or headers. Comparing the image characteristics of the selected image units of the document image for matches with the image characteristics associated with the selection criteria, or otherwise recognizing those image units having the specified image characteristics, permits the significant image units to be readily identified without any document decoding.
Any of a number of different methods of comparison can be used. One technique that can be used, for example, is correlating the raster images of the extracted image units using decision networks, such technique being described in a Research Report entitled "Unsupervised Construction of Decision Networks for Pattern Classification" by Casey et al., IBM Research Report, 1984.
Preferred techniques that can be used to identify equivalence classes of word units are the word shape comparison techniques disclosed in Canadian Patent Application No. 2,077,970, filed September 10, 1992, by Huttenlocher and Hopcroft, and entitled "Optical Word Recognition By Examination of Word Shape". Depending on the particular application, and the relative importance of processing speed versus accuracy, for example, comparisons of different degrees of precision can be performed. For example, useful comparisons can be based on length, width or some other measurement dimension of the image unit (or derived image unit shape representation, e.g., the largest figure in a document image); the location or region of the image unit in the document (including any selected figure or paragraph of a document image, e.g., headings, initial figures, one or more paragraphs or figures); font; typeface; cross-section (a cross-section being a sequence of pixels of similar state in an image unit); the number of ascenders; the number of descenders; the average pixel density; the length of a top line contour, including peaks and troughs; the length of a base contour, including peaks and troughs; the location of image units with respect to neighboring image units; vertical position; horizontal inter-image unit spacing; and combinations of such classifiers. Thus, for example, if a selection criteria is chosen to produce a document summary from titles in the document, only title information in the document need be retrieved by the image analysis processes described above. On the other hand, if a more comprehensive evaluation of the document contents is desired, then more comprehensive identification techniques would need to be employed.
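Several of the classifiers just listed can be measured directly from a binary word-unit bitmap. The sketch below is illustrative only: it approximates ascender and descender counts as the number of columns with ink above the x-height line or below the baseline, which is simpler than what a production word-shape system would do, and the row-index conventions are assumptions of the example.

```python
# Illustrative sketch: measure a few morphological classifiers (pixel
# density, ascender and descender column counts) from a word-unit
# bitmap given as equal-length strings with '#' marking ink.

def unit_features(bitmap, x_line, baseline):
    """Return density, ascender and descender measures for one word unit.

    Rows with index < x_line lie above the x-height line; rows with
    index > baseline lie below the baseline.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    ink = sum(row.count('#') for row in bitmap)
    density = ink / (rows * cols)
    # Approximate stroke counts as ink-bearing columns in each zone.
    ascenders = sum(
        1 for c in range(cols)
        if any(bitmap[r][c] == '#' for r in range(x_line)))
    descenders = sum(
        1 for c in range(cols)
        if any(bitmap[r][c] == '#' for r in range(baseline + 1, rows)))
    return {"density": density, "ascenders": ascenders,
            "descenders": descenders}
```

Such feature vectors can then be compared between image units at whatever precision the application demands, as the passage above notes.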
In addition, morphological image recognition techniques such as those disclosed in U.S. Patent No. 5,384,363, issued January 24, 1995 (Canadian Application Serial No. 2,077,563, filed September 4, 1992), to Bloomberg et al., and entitled "Methods and Apparatus for Automatic Modification of Selected Semantically Significant Portions of a Document Without Document Image Decoding", can be used to recognize specialized fonts and typefaces within the document image.
A salient feature provided by the method of the invention is that the initial processing and identification of significant image units is accomplished without an accompanying requirement that the content of the image units be decoded, or that the information content of the document image otherwise be understood. More particularly, to this stage in the process, the actual content of the word units is not required to be specifically determined. Thus, for example, in such applications as copier machines or electronic printers that can print or reproduce images directly from one document to another without regard to ASCII or other encoding/decoding requirements, image units can be identified and processed using one or more morphological image characteristics or properties of the image units. The image units of unknown content can then be further optically or electronically processed.
One of the advantages that results from the ability to perform such image unit processing without having to decode the image unit contents at this stage of the process is that the overall speed of image handling and manipulation can be significantly increased.
The second phase of the document analysis of the invention involves processing (step 50) the identified significant image units to produce an auxiliary or supplemental document image reflective of the contents of the source document image. It will be appreciated that the format in which the identified significant image units are presented can be varied as desired. Thus, the identified significant image units could be presented in reading order to form one or more phrases, or presented in a listing in order of relative frequency of occurrence.
Likewise, the supplemental document image need not be limited to just the identified significant image units.
If desired, the identified significant image units can be presented in the form of phrases including adjacent image units presented in reading order sequence, as determined from the document location information derived during the document segmentation and structure determination steps 20 and 25 described above. Alternatively, a phrase frequency analysis as described above can be conducted to limit the presented phrases to only the most frequently occurring phrases.
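Ordering the selected units for presentation, either in reading order or by frequency of occurrence, is straightforward once page positions and class counts are known. This sketch assumes each significant unit carries a (row, column) position and an occurrence count; the tuple layout and function name are invented for the example.

```python
# Illustrative sketch: arrange identified significant units for the
# supplemental document image, per the two presentation formats above.

def summary_order(sig_units, by="reading"):
    """Order significant units for output.

    'reading' sorts by page position (row, then column); any other value
    sorts by descending occurrence count. Each unit is a
    (label, (row, col), count) tuple.
    """
    if by == "reading":
        return [u[0] for u in sorted(sig_units, key=lambda u: u[1])]
    return [u[0] for u in sorted(sig_units, key=lambda u: -u[2])]
```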
The present invention is similarly not limited with respect to the form of the supplemental document image. One application for which the information retrieval technique of the invention is particularly suited is for use in reading machines for the blind. One embodiment supports the designation by a user of key words, for example, on a key word list, to designate likely points of interest in a document. Using the user designated key words, occurrences of the word can be found in the document of interest, and regions of text forward and behind the key word can be retrieved and processed using the techniques described above. Or, as mentioned above, significant key words can be automatically selected according to prescribed criteria, such as frequency of occurrence, or other similar criteria, using the morphological image recognition techniques described above; and a document automatically summarized using the determined words.
Another embodiment supports an automatic location of significant segments of a document according to other predefined criteria, for example, document segments that are likely to have high informational value such as titles, regions containing special font information such as italics and boldface, or phrases that receive linguistic emphasis. The location of significant words or segments of a document may be accomplished using the morphological image recognition techniques described above. The words thus identified as significant words or word units can then be decoded using optical character recognition techniques, for example, for communication to the blind user in a Braille or other form which the blind user can comprehend. For example, the words which have been identified or selected by the techniques described above can either be printed in Braille form using an appropriate Braille format printer, such as a printer using plastic-based ink; or communicated orally to the user using a speech synthesizer output device.
Once a condensed document is communicated, the user may wish to return to the original source to have a full text rendition printed or read aloud. This may be achieved in a number of ways. One method is for the associated synthesizer or Braille printer to provide source information, for example, "on top of page 2 is an article entitled ...." The user would then return to the point of interest.
Two classes of apparatus extend this capability through providing the possibility of user interaction while the condensed document is being communicated. One ~ type of apparatus is a simple index marker. This can be, for instance, a hand held device with a button that the user depresses whenever he or she hears a title of inter-est, or, for instance, an N-way motion detector in a mouse 19 (Figure 2) for registering a greater variety of com-mands. The reading machine records such marks of interest and returns to the original article after a complete summarization is communicated.
Another type of apparatus makes use of the tech-nology of touch-sensitive screens. Such an apparatus operates by requiring the user to lay down a Braille summarization sheet 41 on a horizontal display. The user then touches the region of interest on the screen 42 in order to trigger either a full printout or synthesized reading. The user would then indicate to the monitor when a new page was to be processed.
It will be appreciated that the method of the invention as applied to a reading machine for the blind reduces the amount of material presented to the user for evaluation, and thus is capable of circumventing many problems inherent in the use of current reading technology for the blind and others, such as the problems associated with efficient browsing of a document corpus, using synthesized speech, and the problems created by the bulk and expense of producing Braille paper translations, and the time and effort required by the user to read such copies.
The present invention is useful for forming ~077274 abbreviated document images for browsing (image gists). A
reduced representation of a document is created using a bitmap image of important terms in the document. This enables a user to quickly browse through a scanned docu-ment library, either electronically, or manually ifsummary cards are printed out on a medium such as paper.
The invention can also be useful for document categori-zation (lexical gists). In this instance, key terms can be automatically associated with a document. The user may then browse through the key terms, or the terms may be ~ further processed, such as by decoding using optical character recognition.
Although the invention has been described and illustrated with a certain degree of particularity, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the combination and arrangement of parts can be resorted to by those skilled in the art without departing from the spirit and scope of the invention, as hereinafter claimed.
hand-crafting a summary of the document for search purposes. Of course, document viewing or creation of a document summary requires extensive human effort.
On the other hand, current image recognition methods, particularly those involving textual material, generally involve dividing an image segment to be analyzed into individual characters which are then deciphered or decoded and matched to characters in a character library. One general class of such methods includes optical character recognition (OCR) techniques. Typically, OCR techniques enable a word to be recognized only after each of the individual characters of the word has been decoded and a corresponding word image retrieved from a library.
Moreover, optical character recognition decoding operations generally require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing, especially with regard to word recognition. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and identified, in a decision-making process, as a distinct character in a predetermined set of characters. Further, the image quality of the original document and noise inherent in the generation of a scanned image contribute to uncertainty regarding the actual appearance of the bitmap for a character. Most character identifying processes assume that a character is an independent set of connected pixels. When this assumption fails due to the quality of the image, identification also fails.
4. References
European patent application number 0-361-464 by Doi describes a method and apparatus for producing an abstract of a document with correct meaning precisely indicative of the content of the document. The method includes listing hint words which are preselected words indicative of the presence of significant phrases that can reflect content of the document, searching for all the hint words in the document, extracting sentences of the document in which any one of the listed hint words is found by the search, and producing an abstract of the document by juxtaposing the extracted sentences. Where the number of hint words produces a lengthy excerpt, a morphological language analysis of the abstracted sentences is performed to delete unnecessary phrases and focus on the phrases using the hint words as the right part of speech according to a dictionary containing the hint words.
"A Business Intelligence System" by Luhn, IBM Journal, October 1958, describes a system which, in part, auto-abstracts a document by ascertaining the most frequently occurring words (significant words) and analyzing all sentences in the text containing such words. A relative value of sentence significance is then established by a formula which reflects the number of significant words contained in a sentence and the proximity of these words to each other within the sentence. Several sentences which rank highest in value of significance are then extracted from the text to constitute the auto-abstract.
SUMMARY OF THE INVENTION
Accordingly, it is an object of an aspect of the invention to provide a method and apparatus for automatically excerpting and summarizing a document image without decoding or otherwise understanding the contents thereof.
It is an object of an aspect of the invention to provide a method and apparatus for automatically generating ancillary document images reflective of the contents of an entire primary document image.
It is an object of an aspect of the invention to provide a method and apparatus of the type described for automatically extracting summaries of material and providing links from the summary back to the original document.
It is an object of an aspect of the invention to provide a method and apparatus of the type described for producing Braille document summaries or speech synthesized summaries of a document.
It is an object of an aspect of the invention to provide a method and apparatus of the type described which is useful for enabling document browsing through the development of image gists, or for document categorization through the use of lexical gists.
It is an object of an aspect of the invention to provide a method and apparatus of the type described that does not depend upon statistical properties of large, pre-analyzed document corpora.
The invention provides a method and apparatus for segmenting an undecoded document image into undecoded image units, identifying semantically significant image units based on an evaluation of predetermined image characteristics of the image units, without decoding the document image or reference to decoded image data, and utilizing the identified significant image units to create an ancillary document image of abbreviated information content which is reflective of the subject matter content of the original document image. In accordance with one aspect of the invention, the ancillary document image is a condensation or summarization of the original document image which facilitates browsing. In accordance with another aspect of the invention, the identified significant image units are presented as an index of key words, which may be in decoded form, to permit document categorization.
Thus, in accordance with one aspect of the invention, a method is presented for excerpting information from a document image containing word image units. According to the invention, the document image is segmented into word image units (word units), and the word units are evaluated in accordance with morphological image properties of the word units, such as word shape. Significant word units are then identified, in accordance with one or more predetermined or user selected significance criteria, and the identified significant word units are outputted.
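The two-step approach just stated — segment the image into word units, evaluate morphological properties, and output the significant units — can be sketched as follows. This is a minimal illustration, not the patented implementation: the `signature` callable, the `min_count` threshold, and the use of strings as stand-ins for undecoded word-unit bitmaps are all assumptions for the example.

```python
from collections import Counter

def excerpt_document(word_units, signature, min_count=2):
    """Characterize each undecoded word unit by a morphological signature
    and output the units whose signature recurs at least min_count times.

    word_units are opaque objects (in practice, bitmaps); signature is a
    caller-supplied callable returning a hashable shape descriptor, so no
    unit is ever decoded into character codes.
    """
    sigs = [signature(u) for u in word_units]
    counts = Counter(sigs)
    return [u for u, s in zip(word_units, sigs) if counts[s] >= min_count]
```

With an identity signature over string stand-ins, `excerpt_document(["cat", "dog", "cat", "ox"], signature=lambda u: u)` keeps only the recurring unit, illustrating frequency-based significance without any decoding step.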
In accordance with another aspect of the invention, an apparatus is provided for excerpting information from a document containing a word unit text. The apparatus includes an input means for inputting the document and producing a document image electronic representation of the document, and a data processing system for performing data driven processing and which comprises execution processing means for performing functions by executing, in a predetermined manner, program instructions contained in a memory means. The program instructions operate the execution processing means to identify significant word units in accordance with predetermined significance criteria from morphological properties of the word units, and to output selected ones of the identified significant word units. The output of the selected significant word units can be to an electrostatographic reproduction machine, a speech synthesizer means, a Braille printer, a bitmap display, or other appropriate output means.
Other aspects of this invention are as follows:
A method for electronically processing an electronic document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and creating an abbreviated document image based on said identified significant image units.
A method of excerpting significant information from an undecoded document image without decoding the document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and outputting selected ones of said identified significant image units for further processing.
An apparatus for automatically summarizing the information content of an undecoded document image without decoding the document image, comprising:
means for segmenting the document image into image units without decoding the document image;
means for evaluating selected image units according to at least one morphological image characteristic thereof to identify significant image units; and means for creating a supplemental document image based on said identified significant image units.
These and other objects, features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of the invention, when read in conjunction with the accompanying drawings and appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
A preferred embodiment of the invention is illustrated in the accompanying drawing, in which:
Figure 1 is a flow chart of a method of the invention.
Figure 2 is a block diagram of an apparatus according to the invention for carrying out the method of Figure 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In contrast to prior techniques, such as those described above, the invention is based upon the recognition that scanned image files and character code files exhibit important differences for image processing, especially in data retrieval. The method of a preferred embodiment of the invention capitalizes on the visual properties of text contained in paper documents, such as the presence or frequency of linguistic terms (such as words of importance like "important", "significant", "crucial", or the like) used by the author of the text to draw attention to a particular phrase or a region of the text; the structural placement within the document image of section titles and page headers, and the placement of graphics; and so on. A preferred embodiment of the method of the invention is illustrated in the flow chart of Figure 1, and an apparatus for performing the method is shown in Figure 2. For the sake of clarity, the invention will be described with reference to the processing of a single document. However, it will be appreciated that the invention is applicable to the processing of a corpus of documents containing a plurality of documents. More particularly, the invention provides a method and apparatus for automatically excerpting semantically significant information from the data or text of a document based on certain morphological (structural) image characteristics of image units corresponding to units of understanding contained within the document image. The excerpted information can be used, among other things, to automatically create a document index or summary. The selection of image units for summarization can be based on frequency of occurrence, or predetermined or user selected selection criteria, depending upon the particular application in which the method and apparatus of the invention is employed.
The invention is not limited to systems utilizing document scanning. Rather, other systems such as a bitmap workstation (i.e., a workstation with a bitmap display) or a system using both bitmapping and scanning would work equally well for the implementation of the methods and apparatus described herein.
With reference first to Figure 2, the method is performed on an electronic image of an original document 5, which may include lines of text 7, titles, drawings, figures 8, or the like, contained in one or more sheets or pages of paper 10 or other tangible form. The electronic document image to be processed is created in any conventional manner, for example, by a conventional scanning means such as those incorporated within a document copier or facsimile machine, a Braille reading machine, or by an electronic beam scanner or the like. Such scanning means are well known in the art, and thus are not described in detail herein. An output derived from the scanning is digitized to produce undecoded bit mapped image data representing the document image for each page of the document, which data is stored, for example, in a memory 15 of a special or general purpose digital computer data processing system 13. The data processing system 13 can be a data driven processing system which comprises sequential execution processing means 16 for performing functions by executing program instructions in a predetermined sequence contained in a memory, such as the memory 15. The output from the data processing system 13 is delivered to an output device 17, such as, for example, a memory or other form of storage unit; an output display 17A as shown, which may be, for instance, a CRT display; a printer device 17B as shown, which may be incorporated in a document copier machine or a Braille or standard form printer; a facsimile machine, speech synthesizer or the like.
Through use of equipment such as illustrated in Figure 2, the identified word units are detected based on significant morphological image characteristics inherent in the image units, without first converting the scanned document image to character codes.
The method by which such image unit identification may be performed is described with reference now to Figure 1. The first phase of the image processing technique of the invention involves a low level document image analysis in which the document image for each page is segmented into undecoded information containing image units (step 20) using conventional image analysis techniques; or, in the case of text documents, preferably using the bounding box method described in U.S. Patent No. 5,321,770, issued June 4, 1994, to Huttenlocher and Hopcroft, and entitled "Method for Determining Boundaries of Words in Text". The locations of and spatial relationships between the image units on a page are then determined (step 25). For example, an English language document image can be segmented into word image units based on the relative difference in spacing between characters within a word and the spacing between words. Sentence and paragraph boundaries can be similarly ascertained. Additional region segmentation image analysis can be performed to generate a physical document structure description that divides page images into labelled regions corresponding to auxiliary document elements like figures, tables, footnotes and the like. Figure regions can be distinguished from text regions based on the relative lack of image units arranged in a line within the region, for example. Using this segmentation, knowledge of how the documents being processed are arranged (e.g., left-to-right, top-to-bottom), and, optionally, other inputted information such as document style, a "reading order" sequence for word images can also be generated. The term "image unit" is thus used herein to denote an identifiable segment of an image such as a number, character, glyph, symbol, word, phrase or other unit that can be reliably extracted. Advantageously, for purposes of document review and evaluation, the document image is segmented into sets of signs, symbols or other elements, such as words, which together form a single unit of understanding. Such single units of understanding are generally characterized in an image as being separated by a spacing greater than that which separates the elements forming a unit, or by some predetermined graphical emphasis, such as, for example, a surrounding box image or other graphical separator, which distinguishes one or more image units from other image units in the scanned document image. Such image units representing single units of understanding will be referred to hereinafter as "word units."
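The spacing-based segmentation described above — a small gap separates characters within a word, a larger gap marks a word boundary — can be sketched as follows. The bounding-box representation, the gap threshold, and the function name are illustrative assumptions, not the method of the referenced bounding box patent.

```python
def segment_into_word_units(char_boxes, gap_threshold=4):
    """Group character bounding boxes on one text line into word units.

    char_boxes: list of (x0, x1) horizontal extents, sorted left to right.
    A gap wider than gap_threshold pixels starts a new word unit.
    Returns a list of word units, each a list of its member boxes.
    """
    words = []
    current = []
    prev_x1 = None
    for x0, x1 in char_boxes:
        if prev_x1 is not None and x0 - prev_x1 > gap_threshold:
            words.append(current)  # wide gap: close the current word unit
            current = []
        current.append((x0, x1))
        prev_x1 = x1
    if current:
        words.append(current)
    return words
```

For example, boxes at x-extents (0, 3), (4, 7), (15, 18), (19, 22) with a threshold of 4 yield two word units, since only the 8-pixel gap between 7 and 15 exceeds the threshold. In a real system the threshold would be estimated from the document itself (e.g., from the gap-width histogram) rather than fixed.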
Advantageously, a discrimination step 30 is next performed to identify the image units which have insufficient information content to be useful in evaluating the subject matter content of the document being processed. One preferred method is to use the morphological function or stop word detection techniques disclosed in U.S. Patent No. 5,455,871, issued October 3, 1995, to D. Bloomberg et al., and entitled "Detecting Function Words Without Converting a Scanned Document to Character Codes".
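A crude stand-in for this discrimination step can filter units by width, since very narrow units are disproportionately function words ("a", "of", "the") with little subject-matter content. The referenced patent uses morphological detection; the width heuristic, the data layout, and the threshold here are assumed simplifications for illustration only.

```python
def discriminate_units(units, min_width=20):
    """Drop image units too narrow to carry useful content.

    units: list of (unit_id, width_in_pixels) pairs; unit_id is any
    opaque handle for an undecoded image unit. Units narrower than
    min_width are discarded as probable function words.
    """
    return [u for u in units if u[1] >= min_width]
```

The surviving units are the "selected image units" passed on to the evaluation of step 40.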
Next, in step 40, selected image units, e.g., the image units not discriminated in step 30, are evaluated, without decoding the image units being classified or reference to decoded image data, based on an evaluation of predetermined morphological (structural) image characteristics of the image units. The evaluation entails a determination (step 41) of the image characteristics and a comparison (step 42) of the determined image characteristics for each image unit with the determined image characteristics of the other image units.
One preferred method for defining the image unit image characteristics to be evaluated is to use the word shape derivation techniques disclosed in the copending Canadian Patent Application No. 2,077,969, filed September 10, 1992, by D. Huttenlocher and Hopcroft, and entitled "A Method of Deriving Wordshapes for Subsequent Comparison". As described in the aforesaid application, at least one one-dimensional signal characterizing the shape of a word unit is derived; or an image function is derived defining a boundary enclosing the word unit, and the image function is augmented so that an edge function representing edges of the character string detected within the boundary is defined over its entire domain by a single independent variable within the closed boundary, without individually detecting and/or identifying the character or characters making up the word unit.
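A one-dimensional shape signal of the kind just described can be illustrated as a top-contour profile: for each pixel column of a word unit's bounding box, the distance from the top of the box down to the first "on" pixel. This is a hypothetical simplification of the referenced word-shape derivation, shown only to make the idea of a 1-D signal concrete.

```python
def word_shape_signal(bitmap):
    """Derive a one-dimensional top-contour signal for a word-unit bitmap.

    bitmap: list of rows, each a list of 0/1 pixel values, covering the
    unit's bounding box. For each column, the signal records how far down
    the first ink pixel lies (the box height if the column is blank).
    """
    height = len(bitmap)
    signal = []
    for col in range(len(bitmap[0])):
        top = next((row for row in range(height) if bitmap[row][col]), height)
        signal.append(top)
    return signal
```

Two word images with matching signals are candidates for the same equivalence class, without either image ever being decoded into characters.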
The determined morphological image characteristic(s) or derived image unit shape representations of each selected image unit are compared, as noted above (step 42), either with the determined morphological image characteristic(s) or derived image unit shape representations of the other selected image units (step 42A), or with predetermined/user-selected image characteristics to locate specific types of image units (step 42B). The determined morphological image characteristics of the selected image units are advantageously compared with each other for the purpose of identifying equivalence classes of image units such that each equivalence class contains most or all of the instances of a given image unit in the document, and the relative frequencies with which image units occur in a document can be determined, as is set forth more fully in U.S. Patent No. 5,325,444, issued June 28, 1994 (Canadian Application No. 2,077,604, filed September 4, 1992), to Cass et al., and entitled "Method and Apparatus for Determining the Frequency of Words in a Document Without Document Image Decoding". Image units can then be classified or identified as significant according to the frequency of their occurrence, as well as other characteristics of the image units, such as their length. For example, it has been recognized that a useful combination of selection criteria for business communications written in English is to select the medium frequency word units.
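Equivalence classing followed by medium-frequency selection can be sketched as below, under the (unrealistic) assumption that signals of equivalent units match exactly; real comparisons must tolerate scanner noise, and the frequency band used here is an illustrative assumption, not a figure from the patent.

```python
from collections import Counter

def medium_frequency_classes(signals, low=2, high=4):
    """Group word-unit shape signals into equivalence classes and keep
    the medium-frequency classes.

    signals: list of 1-D shape signals (one per unit occurrence); exact
    equality stands in for the noise-tolerant shape comparison of a real
    system. Returns the set of class signatures whose occurrence count
    falls in the [low, high] band.
    """
    counts = Counter(tuple(s) for s in signals)
    return {sig for sig, n in counts.items() if low <= n <= high}
```

Very frequent classes tend to be function words and very rare ones incidental, which is why the medium band is the interesting one for English business text, as noted above.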
It will also be appreciated that the selection process can be extended to phrases comprising identified significant image units and adjacent image units linked together in reading order sequence. The frequency of occurrence of such phrases can also be determined such that the portions of the source document which are selected for summarization correspond with phrases exceeding a predetermined frequency threshold, e.g., five occurrences. A preferred method for determining phrase frequency through image analysis without document decoding is disclosed in U.S. Patent No. 5,369,714, issued November 29, 1994, to Withgott et al., and entitled "Method and Apparatus for Determining the Frequency of Phrases in a Document Without Document Image Decoding".
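The phrase-frequency extension might be sketched as follows, with class labels standing in for matched image-unit equivalence classes in reading order; the phrase lengths are an assumption, and the five-occurrence threshold follows the example above. This is not the method of the Withgott et al. patent, only an illustration of counting phrases over undecoded unit classes.

```python
from collections import Counter

def frequent_phrases(class_labels, max_len=3, threshold=5):
    """Count phrases of adjacent image-unit classes in reading order.

    class_labels: the document's units in reading order, each replaced
    by its equivalence-class label (still undecoded). Returns phrases of
    length 2..max_len occurring at least threshold times.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(class_labels) - n + 1):
            counts[tuple(class_labels[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= threshold}
```

Phrases that clear the threshold mark the source-document regions to carry into the summary.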
It will be appreciated that the specification of the image characteristics for titles, headings, captions, linguistic criteria or other significance indicating features of a document image can be predetermined and selected by the user to determine the selection criteria defining a "significant" image unit. For example, titles are usually set off above names or paragraphs in boldface or italic typeface, or are in larger font than the main text. A related convention for titles is the use of a special location on the page for information such as the main title or headers. Comparing the image characteristics of the selected image units of the document image for matches with the image characteristics associated with the selection criteria, or otherwise recognizing those image units having the specified image characteristics, permits the significant image units to be readily identified without any document decoding.
Any of a number of different methods of comparison can be used. One technique that can be used, for example, is correlating the raster images of the extracted image units using decision networks, such technique being described in a Research Report entitled "Unsupervised Construction of Decision Networks for Pattern Classification" by Casey et al., IBM Research Report, 1984.
Preferred techniques that can be used to identify equivalence classes of word units are the word shape comparison techniques disclosed in Canadian Patent Application No. 2,077,970, filed September 10, 1992, by Huttenlocher and Hopcroft, and entitled "Optical Word Recognition By Examination of Word Shape". Depending on the particular application, and the relative importance of processing speed versus accuracy, for example, comparisons of different degrees of precision can be performed. For example, useful comparisons can be based on length, width or some other measurement dimension of the image unit (or derived image unit shape representation, e.g., the largest figure in a document image); the location or region of the image unit in the document (including any selected figure or paragraph of a document image, e.g., headings, initial figures, one or more paragraphs or figures); font; typeface; cross-section (a cross-section being a sequence of pixels of similar state in an image unit); the number of ascenders; the number of descenders; the average pixel density; the length of a top line contour, including peaks and troughs; the length of a base contour, including peaks and troughs; the location of image units with respect to neighboring image units; vertical position; horizontal inter-image unit spacing; and combinations of such classifiers. Thus, for example, if a selection criteria is chosen to produce a document summary from titles in the document, only title information in the document need be retrieved by the image analysis processes described above. On the other hand, if a more comprehensive evaluation of the document contents is desired, then more comprehensive identification techniques would need to be employed.
In addition, morphological image recognition techniques such as those disclosed in U.S. Patent No.
5,384,363, issued January 24, 1995 (Canadian Application Serial No. 2,077,563, filed September 4, 1992), to Bloomberg et al., and entitled "Methods and Apparatus for Automatic Modification of Selected Semantically Significant Portions of a Document Without Document Image Decoding", can be used to recognize specialized fonts and typefaces within the document image.
A salient feature provided by the method of the invention is that the initial processing and identification of significant image units is accomplished without an accompanying requirement that the content of the image units be decoded, or that the information content of the document image otherwise be understood. More particularly, to this stage in the process, the actual content of the word units is not required to be specifically determined. Thus, for example, in such applications as copier machines or electronic printers that can print or reproduce images directly from one document to another without regard to ASCII or other encoding/decoding requirements, image units can be identified and processed using one or more morphological image characteristics or properties of the image units. The image units of unknown content can then be further optically or electronically processed.
One of the advantages that results from the ability to perform such image unit processing without having to decode the image unit contents at this stage of the process is that the overall speed of image handling and manipulation can be significantly increased.
The second phase of the document analysis of the invention involves processing (step 50) the identified significant image units to produce an auxiliary or supplemental document image reflective of the contents of the source document image. It will be appreciated that the format in which the identified significant image units are presented can be varied as desired. Thus, the identified significant image units could be presented in reading order to form one or more phrases, or presented in a listing in order of relative frequency of occurrence.
Likewise, the supplemental document image need not be limited to just the identified significant image units.
If desired, the identified significant image units can be presented in the form of phrases including adjacent image units presented in reading order sequence, as determined from the document location information derived during the document segmentation and structure determination steps 20 and 25 described above. Alternatively, a phrase frequency analysis as described above can be conducted to limit the presented phrases to only the most frequently occurring phrases.
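The two presentation formats described — reading-order sequence or a listing by relative frequency — can be sketched as below. The triple layout and function name are assumptions; the positions and frequencies would come from the segmentation and equivalence-class steps described earlier.

```python
def build_summary(units, order="reading"):
    """Assemble the supplemental document from significant image units.

    units: list of (reading_position, frequency, unit) triples, where
    unit is an undecoded image unit (a label stands in for it here).
    order: "reading" for reading-order presentation, "frequency" for a
    listing by descending frequency of occurrence.
    """
    if order == "frequency":
        ranked = sorted(units, key=lambda u: -u[1])
    else:
        ranked = sorted(units, key=lambda u: u[0])
    return [u[2] for u in ranked]
```

The returned sequence of (still undecoded) unit images is what gets composed into the abbreviated document image, or handed to OCR, a Braille printer, or a speech synthesizer in the applications discussed below.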
The present invention is similarly not limited with respect to the form of the supplemental document image. One application for which the information retrieval technique of the invention is particularly suited is for use in reading machines for the blind. One embodiment supports the designation by a user of key words, for example, on a key word list, to designate likely points of interest in a document. Using the user designated key words, occurrences of each word can be found in the document of interest, and regions of text before and after the key word can be retrieved and processed using the techniques described above. Or, as mentioned above, significant key words can be automatically selected according to prescribed criteria, such as frequency of occurrence, or other similar criteria, using the morphological image recognition techniques described above, and a document automatically summarized using the determined words.
Another embodiment supports an automatic location of significant segments of a document according to other predefined criteria, for example, document segments that are likely to have high informational value such as titles, regions containing special font information such as italics and boldface, or phrases that receive linguistic emphasis. The location of significant words or segments of a document may be accomplished using the morphological image recognition techniques described above. The words thus identified as significant words or word units can then be decoded using optical character recognition techniques, for example, for communication to the blind user in a Braille or other form which the blind user can comprehend. For example, the words which have been identified or selected by the techniques described above can either be printed in Braille form using an appropriate Braille format printer, such as a printer using plastic-based ink, or communicated orally to the user using a speech synthesizer output device.
Once a condensed document is communicated, the user may wish to return to the original source to print or hear a full text rendition. This may be achieved in a number of ways. One method is for the associated synthesizer or Braille printer to provide source information, for example, "on top of page 2 is an article entitled ...." The user would then return to the point of interest.
Two classes of apparatus extend this capability by providing the possibility of user interaction while the condensed document is being communicated. One type of apparatus is a simple index marker. This can be, for instance, a hand held device with a button that the user depresses whenever he or she hears a title of interest, or, for instance, an N-way motion detector in a mouse 19 (Figure 2) for registering a greater variety of commands. The reading machine records such marks of interest and returns to the original article after a complete summarization is communicated.
Another type of apparatus makes use of the technology of touch-sensitive screens. Such an apparatus operates by requiring the user to lay down a Braille summarization sheet 41 on a horizontal display. The user then touches the region of interest on the screen 42 in order to trigger either a full printout or synthesized reading. The user would then indicate to the monitor when a new page was to be processed.
It will be appreciated that the method of the invention as applied to a reading machine for the blind reduces the amount of material presented to the user for evaluation, and thus is capable of circumventing many problems inherent in the use of current reading technology for the blind and others, such as the problems associated with efficiently browsing a document corpus using synthesized speech, the problems created by the bulk and expense of producing Braille paper translations, and the time and effort required by the user to read such copies.
The present invention is useful for forming abbreviated document images for browsing (image gists). A reduced representation of a document is created using a bitmap image of important terms in the document. This enables a user to quickly browse through a scanned document library, either electronically, or manually if summary cards are printed out on a medium such as paper.
The invention can also be useful for document categorization (lexical gists). In this instance, key terms can be automatically associated with a document. The user may then browse through the key terms, or the terms may be further processed, such as by decoding using optical character recognition.
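As a rough illustration of the lexical-gist idea, the following sketch is hypothetical and not the patented implementation: word image units are clustered on a crude shape signature (bounding box plus ink count) so that repeated occurrences of the same printed word tend to collide without any character decoding, and representatives of the most frequent signatures are kept as key terms.

```python
from collections import Counter

def shape_signature(unit):
    """Crude stand-in for morphological matching: a word image unit
    (a binary bitmap as a list of rows, 1 = ink) is reduced to
    (width, height, ink pixel count), so two occurrences of the same
    printed word tend to produce the same signature."""
    ink = sum(row.count(1) for row in unit)
    return (len(unit[0]), len(unit), ink)

def lexical_gist(units, top_n=3):
    """Rank undecoded word image units by frequency of their shape
    signature and return one representative bitmap per frequent
    signature, most frequent first."""
    counts = Counter(shape_signature(u) for u in units)
    reps = {}
    for u in units:
        # Keep the first occurrence as the representative bitmap.
        reps.setdefault(shape_signature(u), u)
    return [reps[sig] for sig, _ in counts.most_common(top_n)]
```

A real system would use more discriminating morphological features than this three-number signature; the point is only that ranking and selection can proceed entirely on undecoded bitmaps.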
Although the invention has been described and illustrated with a certain degree of particularity, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the combination and arrangement of parts can be resorted to by those skilled in the art without departing from the spirit and scope of the invention, as hereinafter claimed.
Claims (18)
1. A method for electronically processing an electronic document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and
creating an abbreviated document image based on said identified significant image units.
2. The method of claim 1 wherein said step of identifying significant image units comprises classifying said image units according to frequency of occurrence.
3. The method of claim 1 wherein said step of identifying significant image units comprises classifying said image units according to location within the document image.
4. The method of claim 1 wherein said selected morphological image characteristics include image characteristics defining image units having predetermined linguistic criteria.
5. The method of claim 1 wherein said selected morphological image characteristics include at least one of an image unit shape dimension, font, typeface, number of ascender elements, number of descender elements, pixel density, pixel cross-sectional characteristic, the location of image units with respect to neighboring image units, vertical position, horizontal interimage unit spacing, and contour characteristic of said image units.
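Several of the shape characteristics enumerated in claim 5 can be computed directly from a binary word-unit bitmap. The sketch below is a simplified illustration under assumed conventions (a list-of-rows bitmap with 1 = ink, and a hypothetical x-height row band that a real system would obtain from text-line baseline estimation), not the patented implementation:

```python
def morphological_features(unit, x_height_rows=None):
    """Compute a few of the undecoded shape features named in claim 5:
    bounding-box dimensions, pixel density, and counts of ascender and
    descender columns relative to an assumed x-height band."""
    height = len(unit)
    width = len(unit[0]) if height else 0
    ink = sum(row.count(1) for row in unit)
    density = ink / float(width * height) if width and height else 0.0
    if x_height_rows is None:
        # Hypothetical default band; real systems estimate baselines.
        x_height_rows = (height // 4, 3 * height // 4)
    top, bottom = x_height_rows
    # A column contributes an ascender if it has ink above the x-height
    # band, and a descender if it has ink below it.
    ascenders = sum(
        1 for col in range(width)
        if any(unit[r][col] for r in range(0, top)))
    descenders = sum(
        1 for col in range(width)
        if any(unit[r][col] for r in range(bottom, height)))
    return {"width": width, "height": height, "density": density,
            "ascender_cols": ascenders, "descender_cols": descenders}
```

Features such as these could then feed the frequency or location classification of claims 2 and 3 without ever decoding characters.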
6. A method of excerpting significant information from an undecoded document image without decoding the document image, comprising:
segmenting the document image into image units without decoding the document image;
identifying significant ones of said image units in accordance with selected morphological image characteristics; and
outputting selected ones of said identified significant image units for further processing.
7. The method of claim 6 wherein said step of outputting selected ones of identified significant image units comprises generating a document index based on said selected ones of identified significant image units.
8. The method of claim 6 wherein said step of outputting selected ones of identified significant image units comprises producing a speech synthesized output corresponding to said selected ones of identified significant image units.
9. The method of claim 6 wherein said step of outputting selected ones of identified significant image units comprises producing said selected ones of identified significant image units in printed Braille format.
10. The method of claim 6 wherein said step of outputting said selected ones of identified significant image units comprises generating a document summary from said selected ones of identified significant image units.
11. A method for electronically processing an undecoded document image containing word text, comprising:
segmenting the document image into word image units without decoding the document image;
evaluating selected word image units according to at least one morphological image characteristic thereof without decoding the word image units to identify significant word image units;
forming phrase image units based on selected identified significant word units, said phrase image units each incorporating one of said selected identified significant word units and adjacent word image units linked in reading order sequence; and
outputting said phrase image units.
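The phrase-forming step of claim 11 can be illustrated schematically: each identified significant word unit is extended with its neighbours in reading order to form a phrase image unit. In this hypothetical sketch a phrase is simply the list of member word units; a real implementation would paste the corresponding bitmaps side by side.

```python
def form_phrases(word_units, significant_idx, window=1):
    """Sketch of claim 11's phrase formation: for each index of a
    significant word unit, gather a reading-order window of adjacent
    word units into one phrase image unit (here, a list of members)."""
    phrases = []
    for i in sorted(significant_idx):
        lo = max(0, i - window)
        hi = min(len(word_units), i + window + 1)
        phrases.append(word_units[lo:hi])
    return phrases
```

The `window` parameter is an assumption for illustration; the claim itself only requires that adjacent word image units be linked in reading order sequence.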
12. An apparatus for automatically summarizing the information content of an undecoded document image without decoding the document image, comprising:
means for segmenting the document image into image units without decoding the document image;
means for evaluating selected image units according to at least one morphological image characteristic thereof to identify significant image units; and
means for creating a supplemental document image based on said identified significant image units.
13. The apparatus of claim 12 wherein said means for segmenting the document image, said means for identifying significant word units, and said means for creating a supplemental document image comprise a programmed digital computer.
14. The apparatus of claim 13 further comprising scanning means for scanning an original document to produce said document image, said scanning means being incorporated in a document copier machine which produces printed document copies; and means for controlling said document copier machine to produce a printed document copy of said supplemental document image.
15. The apparatus of claim 13 further comprising scanning means for scanning an original document to produce said document image, said scanning means being incorporated in a reading machine for the blind having means for communicating data to the user; and means for controlling said reading machine communication means to communicate the contents of said supplemental document image.
16. The apparatus of claim 15 wherein said communicating means comprises a printer for producing document copies in Braille format.
17. The apparatus of claim 15 wherein said communicating means comprises a speech synthesizer for producing synthesized speech output corresponding to said supplemental document image.
18. The apparatus of claim 15 wherein said reading machine includes operator responsive means for accessing the scanned document or a selected portion thereof corresponding to a supplemental document image following communication of the supplemental document image to the user.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US79454391A | 1991-11-19 | 1991-11-19 | |
US794,543 | 1991-11-19 |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2077274A1 CA2077274A1 (en) | 1993-05-20 |
CA2077274C true CA2077274C (en) | 1997-07-15 |
Family
ID=25162943
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002077274A Expired - Fee Related CA2077274C (en) | 1991-11-19 | 1992-09-01 | Method and apparatus for summarizing a document without document image decoding |
Country Status (5)
Country | Link |
---|---|
US (1) | US5491760A (en) |
EP (1) | EP0544432B1 (en) |
JP (1) | JP3292388B2 (en) |
CA (1) | CA2077274C (en) |
DE (1) | DE69229537T2 (en) |
Families Citing this family (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5590317A (en) * | 1992-05-27 | 1996-12-31 | Hitachi, Ltd. | Document information compression and retrieval system and document information registration and retrieval method |
US5701500A (en) * | 1992-06-02 | 1997-12-23 | Fuji Xerox Co., Ltd. | Document processor |
ATE279758T1 (en) * | 1992-06-19 | 2004-10-15 | United Parcel Service Inc | METHOD AND DEVICE FOR ADJUSTING A NEURON |
US5850490A (en) * | 1993-12-22 | 1998-12-15 | Xerox Corporation | Analyzing an image of a document using alternative positionings of a class of segments |
EP0677817B1 (en) * | 1994-04-15 | 2000-11-08 | Canon Kabushiki Kaisha | Page segmentation and character recognition system |
DE69525401T2 (en) * | 1994-09-12 | 2002-11-21 | Adobe Systems Inc | Method and device for identifying words described in a portable electronic document |
CA2154952A1 (en) * | 1994-09-12 | 1996-03-13 | Robert M. Ayers | Method and apparatus for identifying words described in a page description language file |
IL113204A (en) * | 1995-03-30 | 1999-03-12 | Advanced Recognition Tech | Pattern recognition system |
US5689716A (en) * | 1995-04-14 | 1997-11-18 | Xerox Corporation | Automatic method of generating thematic summaries |
US5918240A (en) * | 1995-06-28 | 1999-06-29 | Xerox Corporation | Automatic method of extracting summarization using feature probabilities |
US5778397A (en) * | 1995-06-28 | 1998-07-07 | Xerox Corporation | Automatic method of generating feature probabilities for automatic extracting summarization |
US6078915A (en) * | 1995-11-22 | 2000-06-20 | Fujitsu Limited | Information processing system |
US5892842A (en) * | 1995-12-14 | 1999-04-06 | Xerox Corporation | Automatic method of identifying sentence boundaries in a document image |
US5848191A (en) * | 1995-12-14 | 1998-12-08 | Xerox Corporation | Automatic method of generating thematic summaries from a document image without performing character recognition |
US5850476A (en) * | 1995-12-14 | 1998-12-15 | Xerox Corporation | Automatic method of identifying drop words in a document image without performing character recognition |
US7051024B2 (en) * | 1999-04-08 | 2006-05-23 | Microsoft Corporation | Document summarizer for word processors |
JPH09322089A (en) | 1996-05-27 | 1997-12-12 | Fujitsu Ltd | Broadcasting program transmitter, information transmitter, device provided with document preparation function and terminal equipment |
JP3875310B2 (en) * | 1996-05-27 | 2007-01-31 | 富士通株式会社 | Broadcast program information transmitter |
JP3530308B2 (en) | 1996-05-27 | 2004-05-24 | 富士通株式会社 | Broadcast program transmission device and terminal device connected thereto |
US5956468A (en) * | 1996-07-12 | 1999-09-21 | Seiko Epson Corporation | Document segmentation system |
GB9808712D0 (en) * | 1997-11-05 | 1998-06-24 | British Aerospace | Automatic target recognition apparatus and process |
US5999664A (en) * | 1997-11-14 | 1999-12-07 | Xerox Corporation | System for searching a corpus of document images by user specified document layout components |
US6562077B2 (en) | 1997-11-14 | 2003-05-13 | Xerox Corporation | Sorting image segments into clusters based on a distance measurement |
US6665841B1 (en) | 1997-11-14 | 2003-12-16 | Xerox Corporation | Transmission of subsets of layout objects at different resolutions |
US6533822B2 (en) * | 1998-01-30 | 2003-03-18 | Xerox Corporation | Creating summaries along with indicators, and automatically positioned tabs |
JPH11306197A (en) * | 1998-04-24 | 1999-11-05 | Canon Inc | Processor and method for image processing, and computer-readable memory |
US6317708B1 (en) | 1999-01-07 | 2001-11-13 | Justsystem Corporation | Method for producing summaries of text document |
US6337924B1 (en) * | 1999-02-26 | 2002-01-08 | Hewlett-Packard Company | System and method for accurately recognizing text font in a document processing system |
US7475334B1 (en) * | 2000-01-19 | 2009-01-06 | Alcatel-Lucent Usa Inc. | Method and system for abstracting electronic documents |
ES2208164T3 (en) * | 2000-02-23 | 2004-06-16 | Ser Solutions, Inc | METHOD AND APPLIANCE FOR PROCESSING ELECTRONIC DOCUMENTS. |
US6581057B1 (en) | 2000-05-09 | 2003-06-17 | Justsystem Corporation | Method and apparatus for rapidly producing document summaries and document browsing aids |
US6941513B2 (en) | 2000-06-15 | 2005-09-06 | Cognisphere, Inc. | System and method for text structuring and text generation |
US7302637B1 (en) * | 2000-07-24 | 2007-11-27 | Research In Motion Limited | System and method for abbreviating information sent to a viewing device |
US7386790B2 (en) * | 2000-09-12 | 2008-06-10 | Canon Kabushiki Kaisha | Image processing apparatus, server apparatus, image processing method and memory medium |
US7221810B2 (en) * | 2000-11-13 | 2007-05-22 | Anoto Group Ab | Method and device for recording of information |
WO2002099739A1 (en) * | 2001-06-05 | 2002-12-12 | Matrox Electronic Systems Ltd. | Model-based recognition of objects using a calibrated image system |
US6708894B2 (en) | 2001-06-26 | 2004-03-23 | Xerox Corporation | Method for invisible embedded data using yellow glyphs |
US20040034832A1 (en) * | 2001-10-19 | 2004-02-19 | Xerox Corporation | Method and apparatus for foward annotating documents |
US7712028B2 (en) * | 2001-10-19 | 2010-05-04 | Xerox Corporation | Using annotations for summarizing a document image and itemizing the summary based on similar annotations |
JP2003196270A (en) * | 2001-12-27 | 2003-07-11 | Sharp Corp | Document information processing method, document information processor, communication system, computer program and recording medium |
US7136082B2 (en) | 2002-01-25 | 2006-11-14 | Xerox Corporation | Method and apparatus to convert digital ink images for use in a structured text/graphics editor |
US7139004B2 (en) * | 2002-01-25 | 2006-11-21 | Xerox Corporation | Method and apparatus to convert bitmapped images for use in a structured text/graphics editor |
US7590932B2 (en) | 2002-03-16 | 2009-09-15 | Siemens Medical Solutions Usa, Inc. | Electronic healthcare management form creation |
US7734627B1 (en) | 2003-06-17 | 2010-06-08 | Google Inc. | Document similarity detection |
CA2544017A1 (en) * | 2003-10-29 | 2005-05-12 | Michael W. Trainum | System and method for managing documents |
WO2007024216A1 (en) * | 2005-08-23 | 2007-03-01 | The Mazer Corporation | Test scoring system and method |
US7454063B1 (en) * | 2005-09-22 | 2008-11-18 | The United States Of America As Represented By The Director National Security Agency | Method of optical character recognition using feature recognition and baseline estimation |
JP2007304864A (en) * | 2006-05-11 | 2007-11-22 | Fuji Xerox Co Ltd | Character recognition processing system and program |
US7711192B1 (en) * | 2007-08-23 | 2010-05-04 | Kaspersky Lab, Zao | System and method for identifying text-based SPAM in images using grey-scale transformation |
US7706613B2 (en) * | 2007-08-23 | 2010-04-27 | Kaspersky Lab, Zao | System and method for identifying text-based SPAM in rasterized images |
JP5132416B2 (en) * | 2008-05-08 | 2013-01-30 | キヤノン株式会社 | Image processing apparatus and control method thereof |
US8233722B2 (en) * | 2008-06-27 | 2012-07-31 | Palo Alto Research Center Incorporated | Method and system for finding a document image in a document collection using localized two-dimensional visual fingerprints |
US8144947B2 (en) * | 2008-06-27 | 2012-03-27 | Palo Alto Research Center Incorporated | System and method for finding a picture image in an image collection using localized two-dimensional visual fingerprints |
US8233716B2 (en) * | 2008-06-27 | 2012-07-31 | Palo Alto Research Center Incorporated | System and method for finding stable keypoints in a picture image using localized scale space properties |
EP2449531B1 (en) * | 2009-07-02 | 2017-12-20 | Hewlett-Packard Development Company, L.P. | Skew detection |
US8548193B2 (en) * | 2009-09-03 | 2013-10-01 | Palo Alto Research Center Incorporated | Method and apparatus for navigating an electronic magnifier over a target document |
US9003531B2 (en) * | 2009-10-01 | 2015-04-07 | Kaspersky Lab Zao | Comprehensive password management arrangment facilitating security |
US9514103B2 (en) * | 2010-02-05 | 2016-12-06 | Palo Alto Research Center Incorporated | Effective system and method for visual document comparison using localized two-dimensional visual fingerprints |
US8086039B2 (en) * | 2010-02-05 | 2011-12-27 | Palo Alto Research Center Incorporated | Fine-grained visual document fingerprinting for accurate document comparison and retrieval |
EP2383970B1 (en) | 2010-04-30 | 2013-07-10 | beyo GmbH | Camera based method for text input and keyword detection |
US8787673B2 (en) | 2010-07-12 | 2014-07-22 | Google Inc. | System and method of determining building numbers |
US8750624B2 (en) | 2010-10-19 | 2014-06-10 | Doron Kletter | Detection of duplicate document content using two-dimensional visual fingerprinting |
US8554021B2 (en) | 2010-10-19 | 2013-10-08 | Palo Alto Research Center Incorporated | Finding similar content in a mixed collection of presentation and rich document content using two-dimensional visual fingerprints |
US9058352B2 (en) | 2011-09-22 | 2015-06-16 | Cerner Innovation, Inc. | System for dynamically and quickly generating a report and request for quotation |
JP5884560B2 (en) * | 2012-03-05 | 2016-03-15 | オムロン株式会社 | Image processing method for character recognition, and character recognition device and program using this method |
EP2637128B1 (en) | 2012-03-06 | 2018-01-17 | beyo GmbH | Multimodal text input by a keyboard/camera text input module replacing a conventional keyboard text input module on a mobile device |
US11176364B2 (en) * | 2019-03-19 | 2021-11-16 | Hyland Software, Inc. | Computing system for extraction of textual elements from a document |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3659354A (en) * | 1970-10-21 | 1972-05-02 | Mitre Corp | Braille display device |
FR2453451B1 (en) * | 1979-04-04 | 1985-11-08 | Lopez Krahe Jaime | READING MACHINE FOR THE BLIND |
US4685135A (en) * | 1981-03-05 | 1987-08-04 | Texas Instruments Incorporated | Text-to-speech synthesis system |
JPS57199066A (en) * | 1981-06-02 | 1982-12-06 | Toshiyuki Sakai | File forming system for cutting of newspaper and magazine |
JPS5998283A (en) * | 1982-11-27 | 1984-06-06 | Hitachi Ltd | Pattern segmenting and recognizing system |
JPS59135576A (en) * | 1983-01-21 | 1984-08-03 | Nippon Telegr & Teleph Corp <Ntt> | Registering and retrieving device of document information |
JPS60114967A (en) * | 1983-11-28 | 1985-06-21 | Hitachi Ltd | Picture file device |
JPH07120355B2 (en) * | 1986-09-26 | 1995-12-20 | 株式会社日立製作所 | Image information memory retrieval method |
US4972349A (en) * | 1986-12-04 | 1990-11-20 | Kleinberger Paul J | Information retrieval system and method |
JPS63223964A (en) * | 1987-03-13 | 1988-09-19 | Canon Inc | Retrieving device |
US4752772A (en) * | 1987-03-30 | 1988-06-21 | Digital Equipment Corporation | Key-embedded Braille display system |
US4994987A (en) * | 1987-11-20 | 1991-02-19 | Minnesota Mining And Manufacturing Company | Image access system providing easier access to images |
JPH01150973A (en) * | 1987-12-08 | 1989-06-13 | Fuji Photo Film Co Ltd | Method and device for recording and retrieving picture information |
JP2783558B2 (en) * | 1988-09-30 | 1998-08-06 | 株式会社東芝 | Summary generation method and summary generation device |
JPH0371380A (en) * | 1989-08-11 | 1991-03-27 | Seiko Epson Corp | Character recognizing device |
JPH03218569A (en) * | 1989-11-28 | 1991-09-26 | Oki Electric Ind Co Ltd | Index extraction device |
US5181255A (en) * | 1990-12-13 | 1993-01-19 | Xerox Corporation | Segmentation of handwriting and machine printed text |
US5202933A (en) * | 1989-12-08 | 1993-04-13 | Xerox Corporation | Segmentation of text and graphics |
US5131049A (en) * | 1989-12-08 | 1992-07-14 | Xerox Corporation | Identification, characterization, and segmentation of halftone or stippled regions of binary images by growing a seed to a clipping mask |
US5048109A (en) * | 1989-12-08 | 1991-09-10 | Xerox Corporation | Detection of highlighted regions |
US5216725A (en) * | 1990-10-31 | 1993-06-01 | Environmental Research Institute Of Michigan | Apparatus and method for separating handwritten characters by line and word |
US5384863A (en) * | 1991-11-19 | 1995-01-24 | Xerox Corporation | Methods and apparatus for automatic modification of semantically significant portions of a document without document image decoding |
CA2077604C (en) * | 1991-11-19 | 1999-07-06 | Todd A. Cass | Method and apparatus for determining the frequency of words in a document without document image decoding |
-
1992
- 1992-09-01 CA CA002077274A patent/CA2077274C/en not_active Expired - Fee Related
- 1992-11-12 JP JP30272692A patent/JP3292388B2/en not_active Expired - Lifetime
- 1992-11-16 DE DE69229537T patent/DE69229537T2/en not_active Expired - Fee Related
- 1992-11-16 EP EP92310433A patent/EP0544432B1/en not_active Expired - Lifetime
-
1994
- 1994-05-09 US US08/240,284 patent/US5491760A/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
US5491760A (en) | 1996-02-13 |
CA2077274A1 (en) | 1993-05-20 |
EP0544432A2 (en) | 1993-06-02 |
EP0544432A3 (en) | 1993-12-22 |
DE69229537D1 (en) | 1999-08-12 |
JPH05242142A (en) | 1993-09-21 |
JP3292388B2 (en) | 2002-06-17 |
DE69229537T2 (en) | 1999-11-25 |
EP0544432B1 (en) | 1999-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2077274C (en) | Method and apparatus for summarizing a document without document image decoding | |
US5748805A (en) | Method and apparatus for supplementing significant portions of a document selected without document image decoding with retrieved information | |
CA2077313C (en) | Methods and apparatus for selecting semantically significant images in a document image without decoding image content | |
JP3282860B2 (en) | Apparatus for processing digital images of text on documents | |
EP0544433B1 (en) | Method and apparatus for document image processing | |
US5384863A (en) | Methods and apparatus for automatic modification of semantically significant portions of a document without document image decoding | |
Mao et al. | Document structure analysis algorithms: a literature survey | |
US5164899A (en) | Method and apparatus for computer understanding and manipulation of minimally formatted text documents | |
JP2973944B2 (en) | Document processing apparatus and document processing method | |
US7712028B2 (en) | Using annotations for summarizing a document image and itemizing the summary based on similar annotations | |
Lu et al. | Information retrieval in document image databases | |
Chen et al. | Summarization of imaged documents without OCR | |
EP1304625B1 (en) | Method and apparatus for forward annotating documents and for generating a summary from a document image | |
WO2007070010A1 (en) | Improvements in electronic document analysis | |
CN100444194C (en) | Automatic extraction device, method and program of essay title and correlation information | |
JP3841318B2 (en) | Icon generation method, document search method, and document server | |
EP0692768A2 (en) | Full text storage and retrieval in image at OCR and code speed | |
Andreev et al. | Hausdorff distances for searching in binary text images | |
JPH0589279A (en) | Character recognizing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request | ||
MKLA | Lapsed |