WO2001013279A2 - Word searchable database from high volume scanning of newspaper data - Google Patents


Info

Publication number
WO2001013279A2
WO2001013279A2 (PCT/US2000/022492)
Authority
WO
WIPO (PCT)
Prior art keywords
ocr
image
data
text
newsprint
Prior art date
Application number
PCT/US2000/022492
Other languages
French (fr)
Other versions
WO2001013279A9 (en)
WO2001013279A3 (en)
Inventor
John R. Yokley
Don Nissen
Erik Schwartz
Bryan Kornele
Ed Lee
Kevin Kapel
Original Assignee
Ptfs, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ptfs, Inc. filed Critical Ptfs, Inc.
Priority to AU70605/00A priority Critical patent/AU7060500A/en
Publication of WO2001013279A2 publication Critical patent/WO2001013279A2/en
Publication of WO2001013279A9 publication Critical patent/WO2001013279A9/en
Publication of WO2001013279A3 publication Critical patent/WO2001013279A3/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the invention relates generally to the fields of data storage and retrieval, including databases, scanning and digitizing, image processing, and searching techniques, and relates more particularly to the various processes involved in the high volume processing of newspaper material to create a word searchable database with key fields.
  • a typical process first involves cutting newspaper articles from a newspaper, stamping a date stamp which contains the date of the paper and the source newspaper from which the article was clipped, and identifying the main subject of the article by circling, underlining, or writing on the clippings themselves. These and other markings on the clippings are referred to as marks or markings. The various markings on the clippings are invariably on the text of the article due to the small margins on newspapers, although they need not be so restricted. The process then involves collecting these articles according to subject and putting the related articles in a properly marked envelope. The envelopes, perhaps millions of them, are then stored.
  • a paper index or card catalog of the subjects is created and the envelopes are made available for research and other purposes to interested parties.
  • the creation of a fully searchable system is made difficult by the sheer volume of envelopes. Further, the articles are prone to deterioration from excessive handling. These factors can combine to limit the number of fully searchable, publicly accessible systems.
  • microfilm or microfiche also currently provides a method by which old news information can be stored and manually retrieved.
  • the microfilmed information can only be accessed by date or special indices that may have been developed for special collections of news events or articles. Even when these indices have been created, the process of obtaining the correct microfilm or microfiche, mounting it on the optical reader/printer, and locating the correct article by the frame of the microfilm is manual and time intensive.
  • An embodiment of the present invention is directed to overcoming or reducing the effects of at least one of the problems set forth above.
  • Embodiments of the present invention are directed toward providing, to varying degrees, (i) a technique for removing markings made on printed material, including newspapers, (ii) a vacuum fed, belt driven bitonal, grayscale or color scanner having auto detect sensors for start and stop of the belt, and an exit tray which obviates the need to handle the source material after scanning, (iii) a process and tool for recognizing the flow of articles in newspapers, and jumping to additional pages, and of following the story-line while performing OCR, (iv) a process and tool for performing OCR on newspaper articles which results in a quality which is acceptable for word searching, (v) a process and flow for de-columnization of newspaper articles, (vi) a process and flow for creating clipped articles from an image of the full newspaper scanned from microfilm, microfiche, or original copy, (vii) a process and tool for performing custom spell checks meeting the particular needs of newspaper articles by discerning commonly confused letters, recognizing common terms, and rejecting terms which are known never to appear in newspapers, (
  • a method of processing newsprint data which has been scanned into a digital image includes removing marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the method further includes performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the method further includes storing the OCR output in a digital storage medium and controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
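The claimed flow of removing marks, performing OCR, and then storing the output can be sketched as a small controlled pipeline. This is a minimal illustration, not the patent's actual implementation: the mid-gray band used to suppress marks and the injected OCR engine are assumptions.

```python
def remove_marks(gray_image, band=(80, 180)):
    # Illustrative grayscale-enhance step: push pixels in an assumed
    # mid-gray band (where stamps and handwriting might fall) to white.
    lo, hi = band
    return [[255 if lo <= px <= hi else px for px in row] for row in gray_image]

def process_clipping(gray_image, ocr_engine, store):
    # Controlled work flow: clean the image, then OCR, then store, in that order.
    cleaned = remove_marks(gray_image)
    text = ocr_engine(cleaned)  # ocr_engine is any callable: image -> text
    store(text)
    return text
```

Here `ocr_engine` and `store` stand in for the OCR unit and digital storage medium; any real system would substitute actual components.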
  • a method of retrieving digitally stored newsprint data includes providing a database of newsprint information, the database having been created using the method of the previous paragraph.
  • the method further includes searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
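Retrieval of text that does not exactly match the search string can be approximated with edit-distance matching. The sketch below is an illustrative stand-in, not the adaptive pattern recognition engine the patent refers to: it retrieves records containing a word within a small edit distance of the query, so an OCR error such as "ciipping" still matches "clipping".

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_search(records, query, max_dist=2):
    # return records containing a word within max_dist edits of the query
    q = query.lower()
    return [r for r in records
            if any(edit_distance(w, q) <= max_dist for w in r.lower().split())]
```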
  • a computer program product including computer readable program code for processing newsprint data which has been scanned into a digital image.
  • the computer readable program code includes a first program part, a second program part, a third program part, and a fourth program part.
  • the first program part is for removing marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the second program part is for performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the third program part is for storing the OCR output in a digital storage medium.
  • the fourth program part is for controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
  • a computer program product including computer readable program code for retrieving digitally stored newsprint data.
  • the computer readable program code includes a first program part and a second program part.
  • the first program part is for providing a database of newsprint information, the database having been created using the method described above.
  • the second program part is for searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
  • a device for processing newsprint data which has been scanned into a digital image includes an image cleaner, an OCR unit, a digital storage medium, and a coordinator.
  • the image cleaner removes marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the OCR unit performs optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the digital storage medium is for storing the OCR output.
  • the coordinator controls the work flow between the image cleaner, the OCR unit, and the digital storage medium.
  • a retrieval system including a database and a searching system.
  • the database includes newsprint information, the database having been created using the method described earlier.
  • the searching system is capable of performing adaptive pattern recognition processing and morphology such that text which does not exactly match the search string can be retrieved.
  • a method of utilizing digitally stored newsprint data includes searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the method further includes producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the method further includes displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
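The search-then-display flow just described can be sketched as a lookup that keeps OCR text and the originating scanned image keyed by the same identifier. The dictionary layout and file names are assumptions for illustration only.

```python
def search_and_display(ocr_index, image_paths, query):
    # ocr_index: image id -> OCR text; image_paths: image id -> scanned image file
    q = query.lower()
    hits = [img_id for img_id, text in ocr_index.items() if q in text.lower()]
    # return the scanned images whose text produced the search result
    return [image_paths[img_id] for img_id in hits]
```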
  • a computer program product including computer readable program code for utilizing digitally stored newsprint data.
  • the computer readable program code includes a first program part, a second program part, and a third program part.
  • the first program part is for searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the second program part is for producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the third program part is for displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
  • a retrieval system including a search engine and a user interface.
  • the search engine searches text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the search engine also produces a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the user interface displays the particular scanned image of newsprint data which corresponds to the text which produced the search result.
  • FIG. 1 shows a high level diagram of an embodiment of the present invention.
  • FIG. 2 shows a high level modular diagram of a data capture and conversion system according to an embodiment of the present invention.
  • FIG. 3 shows a flow diagram for a newspaper digitization process according to an embodiment of the present invention.
  • FIG. 4 shows a flow diagram for a newspaper digitization process with remote indexing according to an embodiment of the present invention.
  • FIG. 5 lists process steps in a clipping digitization process according to an embodiment of the present invention.
  • FIG. 6 shows a clipping envelope barcode.
  • FIG. 7 shows part of the preparation of newspaper clippings.
  • FIGS. 8A-8B show part of the process of newspaper clipping scanning.
  • FIG. 8C shows an exit tray for a scanner.
  • FIG. 8D shows a scanning production line.
  • FIG. 9 shows a Zeutchel digital scanner.
  • FIGS. 10A-10B show part of the screen and process for key field indexing.
  • FIG. 11 lists several key fields.
  • FIG. 12A shows a newspaper obituary entered into a digital system.
  • FIG. 12B contains a flow for remote data entry.
  • FIG. 13 depicts the process for three-step Grayscale Enhancement and OCR voting.
  • FIG. 14 shows an example of grayscale image enhancement.
  • FIGS. 15A, 15B, and 16 depict screens used in quality control.
  • FIG. 17 is a high level block diagram showing a cataloging system.
  • FIG. 18 is a high level block diagram showing a retrieval system.
  • FIGS. 19-20 show screens used in an editor for a cataloging system.
  • FIG. 21 shows a high-level block diagram of a device for processing newsprint data which has been scanned into a digital image.
  • the high level block diagram of FIG. 1 shows the three main components 10, 12, 14 of an embodiment of the present invention, as well as the interface 16 between them.
  • the data capture and conversion system 10 contains the functionality for taking newspaper articles and full newspaper pages and creating a word searchable database from them in high volume production.
  • the retrieval system 12 contains the functionality for storing the created database and allowing authorized users to access the database and perform searches.
  • the cataloging system 14 contains the functionality for editing and maintaining the database after it is created and sent to the retrieval system 12.
  • the interface 16 is a generic interface and its implementation can be quite varied, as indicated by the non-limiting examples which follow.
  • the interface 16 can comprise, for example, an electronic realtime connection over the Internet or other data network, including a wide area network ("WAN") or a metropolitan area network ("MAN").
  • the interface 16 could also comprise carrying a hard disk or other storage medium from the data capture and conversion system 10 to the retrieval system 12 or cataloging system 14.
  • the interface 16 can comprise, for example, a local area network ("LAN") or the internal bus of a common computer system.
  • the data capture and conversion system 10 is shown in greater detail in FIG. 2. This system 10 performs the functionality required to take either a newspaper clipping of an article, or create a clipping and associated jump from microfiche or microfilm or hardcopy, and create: an electronic image of the article, searchable text from the image, and a word searchable key field "meta data" database entry for the article.
  • a different "re-keyed" process is also automated when the image data obtained is not of sufficient quality to produce reasonable quality text from an OCR process.
  • newsprint data is defined to include, without limitation, any data from a newspaper, whether stored on paper, microfiche, microfilm, a digital storage medium such as a hard disk, or an analog storage medium such as tape.
  • Newsprint data can, therefore, include text, pictures, and other types of data.
  • FIG. 2 depicts the modular nature of the data capture and conversion system 10.
  • a variety of work performing AVATAR software modules 15 perform various work functions. Work is served out to the work performing software modules 15 by the AVATAR central coordinator 20, which controls the flow of data between them based upon the variable work flow definition defined for a particular customer's image processing project.
  • the coordinator 20 of the preferred embodiment utilizes each of the work performing modules 15, also referred to as processes, as is necessary for a particular application.
  • the flow of the process can also be changed for different imaging work flow applications.
  • the modules Manual Article Crop and Jump Connector are used to process articles obtained when digitizing microfilm but are not required when processing images obtained by scanning the hard copy clipping articles themselves.
  • One advantage of a computer controlled automated process flow is that the electronic work tasks can be tracked using a database or other computer software and these tasks are properly routed based upon data processing requirements to eliminate human error during repetitive processes.
  • the work performing modules 15 are client modules and log in to the coordinator 20, which is a server based module.
  • the coordinator 20 can be, for example, a processor such as a Pentium processor coupled to appropriate memory, or a complete computer system including PCs, mainframes, general purpose systems, and application specific systems.
  • the client operations for obtaining work from the coordinator 20 are as follows:
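The enumerated client operations are not reproduced in this excerpt. As a hedged illustration only, a client's request/check-in cycle against a coordinator-style work server might look like the following; the class and method names are invented for this sketch and are not the AVATAR API.

```python
from collections import deque

class Coordinator:
    # Minimal work-server sketch: serves work by requested type or next available.
    def __init__(self):
        self.queues = {}       # work type -> pending work items
        self.checked_out = {}  # work item id -> client currently holding it

    def add_work(self, work_type, item):
        self.queues.setdefault(work_type, deque()).append(item)

    def request_work(self, client, work_type=None):
        # serve a specific requested work type, or the next available one
        types = [work_type] if work_type else list(self.queues)
        for t in types:
            if self.queues.get(t):
                item = self.queues[t].popleft()
                self.checked_out[item["id"]] = client
                return item
        return None  # no work available

    def check_in(self, client, item):
        # client returns completed work; ownership is verified on check-in
        assert self.checked_out.pop(item["id"]) == client
```

The login/security layer and per-project filtering described in the text are omitted from this sketch.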
  • the coordinator 20, along with the correct work performing software modules 15, comprises software, and is therefore easily configurable to accommodate different applications and conditions.
  • the modular nature of the data capture and conversion system 10 is demonstrated by the preferred mechanism for interfacing each of the modules 15 to the coordinator 20.
  • the modules 15 preferably use standardized function calls to communicate with the coordinator 20.
  • When called by a work performing software module 15, the coordinator 20 provides work of a requested work type or on a next available basis. When requesting work, the work performing software 15 can request work by project or folder.
  • the coordinator 20 also has a security capability which will allow only authorized users or accounts to acquire work when requested or as it is available.
  • the coordinator 20 can also serve out work for different customers' jobs concurrently; for example, projects can be performed for two different newspapers simultaneously.
  • The process flow for an embodiment of the invention for a newspaper digitization application is shown in FIG. 3, and the process steps are listed in FIG. 5.
  • FIG. 5 lists the physical process which occurs during newspaper clipping digitization, as opposed to the AVATAR software process flow of FIG. 3. For that reason, FIGS. 3 and 5 provide slightly different information.
  • the discussion of the application will follow the outline of the process flow of FIG. 3, and will also address the process steps of FIG. 5 from within that outline.
  • the portions of the process can be performed in different physical locations. When this is done, a high bandwidth wide area telecommunications connection is preferred. When such a connection is not available, information can be transferred from one location to another via, for example, high density magnetic tape or other digital storage medium.
  • FIG. 4 shows one alternate flow process in which key field indexing is being performed remotely, image data is being transferred from the primary location via tape, and key field data, as well as any image fixes or rejections performed by the Index Operator, is streamed back in real-time over the telecommunications connection.
  • the work flow is controlled by the coordinator 20.
  • the work flow concerns the tracking, collecting, moving, and otherwise organizing and controlling of files and other data and information among the various stations or processes where it is used, generated, stored, etc.
  • This information includes the images at all stages of the process, including the scanned image, the enhanced image, the OCR'd image, the bitonal image, etc.
  • This information also includes all data which is generated such as indications of jumps, OCR accuracy, errors, etc.
  • This information also includes non-electronic data, such as the original document that is scanned.
  • Controlling the work flow includes such processes as sending the digital image to a process, sending the digital image to a station, automatically rejecting the digital image if quality is insufficient, presenting the digital image to an operator at a station, checking the digital image in after a process has been performed, checking the digital image out to a station or a process, associating a barcode with non-electronic data relating to the digital image, tracking nonelectronic data associated with the digital image using a barcode, and collecting and storing data relating to the digital image.
  • the newspaper clippings have already been cut, date stamped and hand marked, and sorted into envelopes by category, as described in the background section. These envelopes are then delivered to the data capture and conversion system 10 of FIG. 1 so that the digitization process can begin.
  • the first step in the digitization process is to capture the category information from the envelope and use it to create a folder for the clippings in each envelope.
  • a large folder that can physically house the unfolded clippings is utilized so that the clippings can be unfolded and laid flat.
  • an electronic folder can also be created for the digitized images of the clippings in each envelope. This is done in the bar-coding step 40 (see FIG. 3) by typing in the category information and creating a barcode label which can be placed on the envelope or folder and used to track it and its contents through the process.
  • FIG. 6 shows a barcode label 60 which has been created for an envelope.
  • Meta data information 62 is typically subject data and can include the subject and the dates of the clippings in the envelope, as well as biographical proper name meta data when the envelope contains information about a specific individual.
  • the meta data 62 is typed in by an operator utilizing the barcode module 40 of FIG. 3 and appears on the printed barcode label 60. Capturing the identifying information 62 electronically is referred to as envelope indexing.
  • the barcode label 60 can also be used to indicate a variety of other information.
  • the barcode label 60 can indicate in which box or shipment that envelope was received and on what date the envelope was received.
  • the label 60 can also be used to indicate whether or not there are additional envelopes, folders, or other information associated with that envelope. In the present application some of the envelopes will have an oversized folder associated with them to contain clippings which are unusually large, and this information is indicated on the barcode label 60.
  • two barcode labels are printed by the barcode module 40 after the envelope subject meta data 62 has been entered.
  • One barcode is affixed to the original envelope itself and a second is placed inside the envelope to be affixed later to a scanning separator sheet, as discussed in the section on scanning below.
  • the use of a barcode label 60 helps to eliminate human operator errors and, thus, to ensure positive envelope control during scanning and repacking of the clippings. Errors are also prevented by having the data capture and conversion system 10 oversee the flow of electronic folders. This is done by the coordinator 20 of FIG.
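Envelope indexing as described above ties the typed meta data to a barcode identifier. A toy version might look like the following; the ID scheme and record layout are invented for illustration and are not the patent's actual format.

```python
import itertools

_envelope_ids = itertools.count(1)

def index_envelope(subject, date_range, oversize=False):
    # Create the electronic folder record and the barcode value that is
    # printed twice: once for the envelope, once for the separator sheet.
    barcode = f"ENV-{next(_envelope_ids):08d}"
    return {
        "barcode": barcode,
        "subject": subject,
        "dates": date_range,
        "oversize": oversize,  # flags an associated oversized folder
    }
```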
  • the scanning step 41 is performed after the bar-coding step 40 is completed. Before the clippings can be scanned, they must be unfolded and flattened, referred to as clipping preparation in FIG. 5.
  • FIG. 7 shows part of the process of unfolding and flattening the clippings from an envelope. A special process is used to flatten the clippings, which can range in size from about one inch square to 15" x 25" or larger. Because the clippings have been folded to fit into small envelopes for up to one hundred years, a physical process is required to prepare and flatten the clippings prior to the scanning step. At this point the clippings are unfolded and taped with acid free tape if torn.
  • any continuations of a newspaper article from one page and column to another are marked. These continuations are termed jumps.
  • Red markers are used to indicate the various pages of an article. One red marker is used for the first page of the article if it contains a jump, and a second, third, and fourth marker, etc. are used for the continuations. Markers are placed as near as possible to the upper right hand corner of the clipping to provide key field operators a visual clue as to which articles contain jump continuations (see the discussion of Key Field Indexing below).
  • the articles that have continuations (jumps) are also inserted into a second folded separator sheet inside the large preparation folders. The special separation, in addition to the markers, serves to notify the scanner operator that these pages must be kept together during the scanning process.
  • the complete prepared folder is run through one of several different commercial configuration laminators set to the correct temperature and pressure to flatten the clippings and remove folds without any adverse damage.
  • FIG. 8A shows a portion of the scanner 80.
  • FIG. 8A also shows a portion of a barcode reader 82.
  • the separator sheet serves several functions: it contains the barcode, it has the original clipping envelope affixed to it for use later in the repacking step, and it separates the clippings of one envelope from another in the specially designed removable exit tray.
  • the scanner itself can process the barcode and a separate barcode scanner is not necessary.
  • the scanner 80 used to convert the newspaper clippings is a specially customized version of a commercially available scanner, the Banctec 4530. However, several scanners are used in the process and other types of scanners are used for image conversion projects using different materials.
  • the customized scanner 80 clearly performs a critical task in the process by reliably and efficiently scanning a high volume of clippings of various sizes, without damaging them.
  • the scanner 80 possesses specific features which make it particularly suited to this application. These include a vacuum feed for the clippings, a belt drive with auto-detect sensors for start and stop of the belt, and a custom-built exit tray for collecting the clippings as they exit the scanner 80. Referring to FIG. 8C, it can be seen that the exit tray allows the clippings to be retrieved without being handled another time and thus helps to increase work flow time and motion efficiency in addition to helping preserve the clippings. Scanning processes are run in parallel as shown in FIG. 8D.
  • Grayscale or color scanning would allow a variety of image processing techniques to be used to remove the date stamp and handwriting from the clippings. Therefore, an interface scanning card was built for the scanner 80 to enable grayscale scanning.
  • An IPT/GS Gray Scale Capture Board (the "grayscale board") was utilized to connect to the specially created 4530 interface card in the Banctec 4530 scanner.
  • the grayscale board is a PCI-based scanner interface card that provides grayscale, capture capability to a scanner once an appropriate interface has been built.
  • the grayscale board has a maximum data throughput of 132 MB/s, a maximum scanner speed of 25 MHz, a maximum PCI bus speed of 33 MHz, and a maximum image size of 65,536 by 65,536 pixels.
  • the grayscale board can perform the processing features of binarization, cropping, inversion, scaling, and 2 x 1 smoothing, and supports image formats of 1 bit bitonal, 8 or 16 bit grayscale, and combined grayscale and black-and-white.
  • the scanning interface hardware and software allows the scanning operator to operate the scanner from PC based controls (keyboard and mouse) in addition to foot pedals used to rotate the clipping image if its configuration dictates scanning the image upside down or reversed.
  • the scanning step 41 may also be performed with an oversize scanner if the clipping is large and not suited for the high volume production scanner 80.
  • a German-built Zeutchel overhead digital scanner, shown in FIG. 9, is used in the present application to rapidly scan large clipping images in a production environment.
  • Special application programmer interface (“API") software was written to allow the specially designed AVATAR workflow software to operate the Zeutchel scanner.
  • Other scanners of this type and configuration may also be used in the embodiment.
  • the oversize scanner is not intended to be used for high volume clippings smaller than 11" x 17", which is the maximum size the Banctec 4530 scanner can scan.
  • Other types of special purpose scanners can also be integrated into the process flow.
  • Enhancement is performed in two places in the flow process for newspaper digitization, but depending upon the requirements, specific image enhancement tasks can be performed at any location.
  • the first enhancement routine is used to improve the quality of the scanned image and to reduce the file size of the image.
  • the operations performed preferably include deskew, despeckle, and removal of lines and black portions. These functions are performed in the cropping enhancement step 42. These functions reduce the file size of the image in addition to removing unwanted image data and noise. This makes the grayscale image easier for the operator to process visually in the next work flow process, key fielding.
  • the work flow process 42 includes a Deskew function for JPEG images. This function must be performed prior to cropping. Since JPEG images are virtually impossible to deskew with any accuracy and reliability, the embodiment includes the following process for deskew of JPEG images:
  • The bitonal image is deskewed using conventional deskew software, and the angle of deskew is saved.
  • The original JPEG image is then rotated by the angle determined above.
  • JPEG cropping is performed on the properly deskewed JPEG image.
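The JPEG deskew process above hinges on estimating the skew angle from the bitonal image. One common approach, sketched here as an assumption rather than as the "conventional deskew software" the text refers to, is a projection-profile search: shear the black pixels by candidate angles and keep the angle at which the row histogram is sharpest; that saved angle is then applied to the original JPEG.

```python
import math

def profile_variance(black_pixels, angle_deg, height):
    # Shear each black pixel vertically by -x*tan(angle) and histogram by row;
    # at the true deskew angle, text lines align and the variance peaks.
    t = math.tan(math.radians(angle_deg))
    counts = [0] * height
    for x, y in black_pixels:
        yy = int(round(y - x * t))
        if 0 <= yy < height:
            counts[yy] += 1
    mean = sum(counts) / height
    return sum((c - mean) ** 2 for c in counts) / height

def estimate_skew(black_pixels, height, limit=5.0, step=0.1):
    # Search candidate angles in [-limit, limit]; the best angle is saved
    # and later used to rotate the original JPEG before cropping.
    n = int(limit / step)
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_variance(black_pixels, a, height))
```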
  • In step 42 the quality of the grayscale image is automatically detected and assessed, and if it is not adequate then the image is failed, as noted by the Fail Grayscale Image box 43, and sent to the rescan module 50. This assessment is automatic.

4. Key Field Entry
  • a variety of information can be entered during the key field entry process 44, which is illustrated in FIGS. 10A-10B.
  • Key fielding can utilize two methods, key-from-image or key off hard copy.
  • the information which is entered in this process 44 includes the publication date and the source of the newspaper, which are taken from the date stamp, and the headline or title, as shown in FIG. 10A via the key-from-image approach.
  • Additional key fields can be utilized for a variety of useful information, such as the envelope category, which is preferably captured during the bar-coding process 40.
  • the process also allows data to be keyed from hard copy when it is more convenient or cost effective to do so or when the image is not required in the final searchable database.
  • FIG. 11 shows the meta data 100 for a particular clipping.
  • This process involves a computer search for specific text strings. When located, the software copies a defined variable number of characters which make up the byline.
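A hedged sketch of that text-string search: scan the OCR text for a trigger string and copy a defined number of following characters as the byline. The trigger list and character count here are illustrative assumptions, not the patent's defined values.

```python
import re

# hypothetical trigger strings that typically introduce a byline
BYLINE_TRIGGERS = re.compile(r"\b(?:By|BY|Special to)\s+")

def extract_byline(ocr_text, max_chars=40):
    m = BYLINE_TRIGGERS.search(ocr_text)
    if m is None:
        return None
    # copy a defined number of characters after the trigger, up to line end
    segment = ocr_text[m.end():m.end() + max_chars]
    first_line = segment.splitlines()[0] if segment else ""
    return first_line.strip() or None
```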
  • the work flow software allows the key fields to be entered while the scanned grayscale image is on the screen. This process is efficient in terms of time and obviates any need for handling the actual clippings. The process can be made even more efficient by allowing the operator to drag and highlight the appropriate text with a mouse, or otherwise indicate the relevant text, and then use a simple OCR program to enter the text directly. The operator then only needs to verify that the OCR result is accurate.
  • the key field step 44 also provides another opportunity to review the image.
  • the operator can fail the image, as indicated by the Fail Enhance box 45, and send it to the rescan module 50.
  • There is an "error flag" box which an operator can check if the image has an error of any kind.
  • The operator can also indicate that the image needs to be rescanned by checking the "rescan page" box. Additionally, the operator can indicate that the image has a jump by removing the "first page" check in the check box for the second and further pages. Because most of the clippings are a single page, the "first page" default is currently set to checked. If the image is rejected, the operator can key in the specific reason.
  • The comment will be included with the rejected image. If a rejected image has been through OCR processing, then the % OCR confidence will be shown. The Project and Folder are also indicated so that the operators will know what job they are currently processing.
  • The electronic clipping is then ready for the multi-step grayscale enhancement process 46, which includes grayscale enhance functions.
  • This process 46 is further illustrated in FIG. 13. As shown in FIG. 13, four separate processes are performed and the output with the best results is selected for the final OCR voting step 48. More or fewer than four (4) processes can be utilized.
  • The first process involved is threshold enhancement 131. Since it is difficult to develop one computer algorithm that will effectively clean up and remove unwanted handwritten material from the image without removing significant parts of the original image itself, multiple algorithms were written which compete against one another to obtain the best results when processing an image. Four different algorithms are described. A JPEG image is the input; however, other formats can be used in different embodiments.
  • The leftmost processing line of process 131 in FIG. 13 selects a grayscale threshold level and then performs a simple decision based on the threshold level. All values equal to or larger than the grayscale threshold level are classified in one category, either black or white, and all values less than the grayscale threshold are classified in the other category. The end result is a black and white, bitonal, image. Step 132 is effectively performed along with step 131 for the leftmost processing line.
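As a minimal sketch of this thresholding decision (the cutoff of 128 here is a hypothetical value; the process selects its threshold level adaptively):

```python
import numpy as np

def to_bitonal(gray, threshold=128):
    """Classify each pixel against a single grayscale threshold: values
    equal to or larger than the threshold become white (255), all values
    below it become black (0), yielding a bitonal image."""
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)
```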
  • FIG. 21 shows an image cleaner 181, which comprises a processor or computer system which can run image enhancement routines or grayscale enhance functions.
  • FIG. 21 also shows the coordinator 20, a digital storage medium 183, and an OCR unit 182 which is described further in a section below.
  • The interconnections in FIG. 21 are intended to be illustrative and are not intended to be limiting.
  • The coordinator 20, image cleaner 181, and OCR unit 182 can all comprise a common processor (not shown).
  • All processing columns in FIG. 13 can provide a rejection for an unrecognizable JPEG image, indicated by the Fail Grayscale Enhance box 47. This automatic process rejects the image to rescan 50.
  • The max value of black (Bmax) is set to 10 on a scale of 0-255.
  • The max value of white (Wmax) is set to 236 on a scale of 0-255 colors.
  • A running total is kept: nTotalofIntensity = nTotalofIntensity + Average Intensity value, along with a count (n) of valid sections. The averaged sampled intensity value (Iavg) is equal to nTotalofIntensity/n.
  • Gavg is the average value of the gray pixel intensities.
  • For every pixel in the image, get the number of White Pixels (Wn), intensity values of 255; the number of Black Pixels (Bn), intensity values of 0; and the number of Gray Pixels (Gn), intensity values other than 255 or 0.
  • Step #2 (dilation of gray image) is performed again.
  • ConvertImage_ex3() changes the image to bitonal. This algorithm converts all pixels that have an intensity value greater than 1 to white (255) and less than or equal to 1 to black (0).
  • Newspaper BK6 Enhancement performs newspaper clipping image enhancement as follows: get the average intensity value of the image (Iavg) by averaging all the pixel values.
  • The computed intensity value is computed by the following.
  • The max value of black (Bmax) is set to 10 on a scale of 0-255.
  • The image is divided up into 21 separate sections.
  • If the section is a valid section:
  • nTotalofIntensity = nTotalofIntensity + Average Intensity value is kept, along with a count (n) of valid sections.
  • The computed intensity value is equal to nTotalofIntensity/n.
  • Call the ConvertImage_ex5() algorithm, which converts the image to bitonal: only the pixels with values less than 254 are converted to black and all others to white.
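A rough sketch of the section-averaging and ConvertImage_ex5-style conversion described above follows. The test for a "valid section" is not defined in the text, so the criterion used here (section mean strictly between Bmax and Wmax) is an assumption, as is the function naming:

```python
import numpy as np

def computed_intensity(gray, n_sections=21, b_max=10, w_max=236):
    """Average the mean intensities of 21 horizontal sections, skipping
    sections judged invalid.  The validity test here (section mean strictly
    between b_max and w_max) is an assumption; the source does not define it."""
    gray = np.asarray(gray, dtype=float)
    total, n = 0.0, 0
    for section in np.array_split(gray, n_sections, axis=0):
        avg = section.mean()
        if b_max < avg < w_max:  # assumed validity criterion
            total += avg
            n += 1
    return total / n if n else gray.mean()

def convert_image_ex5(gray):
    """ConvertImage_ex5-style bitonal conversion: pixels with values less
    than 254 become black (0) and all others become white (255)."""
    gray = np.asarray(gray)
    return np.where(gray < 254, 0, 255).astype(np.uint8)
```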
  • Step 132 of FIG. 13 is a bitonal conversion process.
  • The leftmost processing line of step 131 in FIG. 13 has already performed the bitonal conversion, so nothing is done in that processing line at step 132.
  • The end product of those respective enhancement processes is converted into a bitonal image.
  • The gray pixels that are close enough to black are converted to black and everything else is dropped out.
  • The level of gray that is used to become black is a variable that can be adjusted for specific requirements.
  • Sample before and after shots are shown in FIG. 14.
  • The before shot 141 is a scanned clipping which is a grayscale image.
  • The before shot 141 is the input into the threshold enhancement process 131 of FIG. 13.
  • The after shot 142 shows the result after threshold enhancement 131 has been successfully used to remove the handwritten word and circle mark, as well as the date stamp.
  • The threshold enhancement 131 has also lightened up the background, making the entire clipping easier to read.
  • Step 133 is the same for each processing line.
  • In step 133, the bitonal images, which are typically different for each processing line, are prepared for OCR.
  • OCR can be performed by an OCR unit 182, shown in FIG. 21, which contains a processor or computer system which can execute software and perform the various functions described herein.
  • The images have been treated as one single image up to this point.
  • The text zoning step 133 provides three different consecutive options, one of which is selected and applied to the next step (two-pass OCR processing 134).
  • First, a custom newspaper decolumnization program is run; if this program is successful, then the two-pass OCR 134 is initiated. If newspaper decolumnization fails, standard autozoning is next implemented; if this process also fails, then the image is treated as one zone and the total image without zoning is sent for OCR processing 134.
  • The standard auto-zoning process is a commercially available software routine that provides autozoning but is not specifically created or tuned to newspapers.
  • The newspaper decolumnization process is developed to recognize columns. This decolumnization process groups the text of a column so that the OCR module performs its functions within the specified zone only. Without zoning, the OCR software reads the image from left to right and scans across column breaks as if the sentence continues in that direction as opposed to down the column. This creates groupings of words from left to right but does not maintain the original sentence format of the newspaper. This presents a problem if the text is to be imported into another system for re-use or if a word has been hyphenated and continued on the next line. It also may present a problem if the lines of the columns are not aligned well.
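A toy gutter-finding routine illustrates the idea behind decolumnization (this is a stand-in sketch, not the custom program itself, which must also cope with headlines, rules, and ragged column edges):

```python
import numpy as np

def find_column_zones(bitonal, min_gutter=10):
    """Split a bitonal page into left-to-right column zones by finding
    gutters: runs of at least `min_gutter` ink-free pixel columns.
    bitonal: 2-D array where nonzero = black (text) pixel.
    Returns a list of (start, end) x-ranges, one per column of text."""
    has_ink = np.asarray(bitonal).sum(axis=0) > 0
    zones, start, end, gap = [], None, None, 0
    for x, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = x      # text resumes: open a new zone
            end = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gutter:      # gap wide enough: close the zone
                zones.append((start, end + 1))
                start = None
    if start is not None:
        zones.append((start, end + 1))
    return zones
```

OCR would then be run zone by zone, left to right, so that each column is read down rather than across.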
  • Steps 134 and 135 of FIG. 13 are only implemented to provide OCR confidence statistics. These steps are also the same for each of the processing lines of FIG. 13.
  • The zoned, bitonal image files are passed through two separate OCR processes to yield two separate result files for each processing column of FIG. 13.
  • The two OCR processes use different OCR applications and therefore can produce different results.
  • Each of the OCR processes attempts to recognize each character in the bitonal image and each also produces a confidence level output indicating how well the OCR process thinks that it has done on each character.
  • The winning OCR engine's output is used for each character in the bitonal image.
  • The total confidence level output for the combination of the two engine results is given as a percentage between 0 and 100.
  • If one of the two OCR engines fails, the character and confidence output from the remaining OCR engine is used. These three confidence level outputs are then compared in step 136, and the highest one is selected. The processing line that produced the winning confidence level output is flagged, and the zoned, bitonal image file from the output of the zoning process 133 of that winning processing column is used in step 137.
  • The folder can be automatically flagged for manual re-work by the QCR module 49 (see FIG. 3).
  • In QCR 49, an operator can manually perform the newspaper threshold enhancement process of step 131 in FIG. 13. If the image cannot be adequately corrected, then the image is rejected to the rescan module 50.
  • In step 137, the bitonal image file that gave rise to the winning confidence level is passed through five different OCR applications. More or fewer than 5 OCR applications can be used. Each OCR application gives a confidence level on each character of output, and voting is performed in step 138 on every character: the highest confidence character is selected and put into a separate file. The file containing this OCR text, which gave rise to the highest confidence level for each character, is then output from the voting OCR module 48 (see FIG. 3).
  • If one or more OCR engines fail, the character and confidence output from the remaining OCR engines is used to determine the winner and to determine the overall confidence rating for the ASCII text file created from the multi-pass OCR voting process.
  • The voting can be done in a variety of ways, and the threshold can be set at appropriate values to yield acceptable results.
  • The OCR engines may compute a confidence level for each individual character, and the OCR engine with the highest confidence for that character can be used for that character. The overall confidence of the resulting file is based on the confidence of each character.
  • A simple vote can be used between the OCR engines' outputs, and the most common output for a given character can be selected, with confidence levels computed from the level of agreement for a given character, and with a suitable decision process being used to resolve ties.
  • Other voting schemes will be apparent to those of ordinary skill in the art.
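The first scheme above, with each position decided by the engine reporting the highest confidence, can be sketched as follows, assuming the engine outputs have already been aligned character-for-character (alignment of disagreeing outputs is itself a non-trivial step, glossed over here):

```python
def vote_characters(engine_results):
    """engine_results: list of (text, confidences) pairs, one per OCR
    engine, assumed already aligned character-for-character.  For each
    position the character reported with the highest confidence wins;
    the overall confidence is the mean of the winning confidences."""
    length = min(len(text) for text, _ in engine_results)
    voted, confs = [], []
    for i in range(length):
        ch, conf = max(((text[i], c[i]) for text, c in engine_results),
                       key=lambda pair: pair[1])
        voted.append(ch)
        confs.append(conf)
    overall = sum(confs) / len(confs) if confs else 0.0
    return "".join(voted), overall
```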
  • If the confidence output is below a specified threshold (currently set at 75% confidence), the image is flagged for repair in the QCR module described below.
  • QCR: Quality Control and Repair
  • The quality control and repair process 49 of FIG. 3 is used to verify all data (imaging, OCR, and meta-data) that has been assembled from the data capture and conversion processes and to fix any data or image that is determined to be inferior, unless the image needs to be rescanned. If QCR cannot fix the incorrect data or poorly enhanced image, the particular image and problem are noted and the folder is rejected to re-scan 50.
  • The QCR operator, in step 49, has the capability to view the grayscale and bitonal image, the OCR text file, and the meta data, as indicated in FIG. 15A - FIG. 16.
  • The leftmost window has the scanned grayscale image,
  • the middle window has the enhanced bitonal image with the date stamp and other markings removed, and
  • the rightmost window has the OCR text.
  • The quality control module allows the operator to view thumbnails of all images in each folder for a quick visual scan of the basic integrity of all the images, as seen in FIG. 15B. This is useful if 100% QC, as opposed to QC sampling, is required.
  • the QCR module also contains a "capsule view" FIG.
  • The QCR module capsule view provides the operator information concerning the number of images in a folder left to QC.
  • A variable setting in the software allows this to be set at 100% or at some level of statistical sampling, such as 5 per folder.
  • The capsule view tells the operator how many images are left in a folder to QC to meet the required QC sampling level.
  • The QCR module 49 can perform a variety of functions to repair images, such as manual cropping or grayscale enhancement, should the automated processes discussed above not be sufficient to create a high quality image.
  • After repair, the image is sent back into the work flow to voting OCR 48. If the image is not repairable, then it is sent to the rescan module 50 and the folder with the appropriate clipping is directed to that location. The clipping is then rescanned.
  • The rescan module 50 uses a different scanner from the high volume, high speed scanner 80 described earlier.
  • The inspection is manual, but other embodiments may automate all or part of the process. Because this is designed as the final quality check, the end result, which is the electronic folder associated with that clipping, can be failed for any reason.
  • If the key field data is incorrect for any reason, it is corrected immediately by the QCR operator 49. If the image is incorrect for any reason, then it is repaired as described and the electronic folder is sent back to the OCR process 48.
  • The screen of FIG. 15B allows the operator to view and edit the OCRed text.
  • The bundling process 51 bundles the multiple images that exist, when articles include continuation "jumps," into one multipage image file. This process also combines the text of these multiple files into a single text file. As described earlier, any images that are jumps from the first page articles have been so indicated by the indexing operator by removing the "first page" indicator in step 44. This flag triggers the work flow software to join the images and text files into one file when the electronic folder is processed by the bundle jump software in step 51.
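The bundling logic, grouping each cleared-flag jump page with the preceding first page and merging images and text, can be sketched as follows (the dictionary field names are illustrative, not the actual work flow schema):

```python
def bundle_jumps(pages):
    """pages: scan-order list of dicts with 'image', 'text', and the
    'first_page' flag cleared by the indexing operator for jump pages.
    Each first page starts a bundle; following jump pages are appended
    to it, merging their images and text into one logical article."""
    bundles = []
    for page in pages:
        if page["first_page"] or not bundles:
            bundles.append({"images": [page["image"]], "text": page["text"]})
        else:
            bundles[-1]["images"].append(page["image"])
            bundles[-1]["text"] += "\n" + page["text"]
    return bundles
```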
  • The bitonal image and ASCII file are also associated with the meta data record stored in the workflow database.
  • The format of the image file can be any image file format, such as the TIFF format.
  • Composite file formats such as Adobe PDF can also be utilized.
  • PDF allows image and text files to be combined into one single file which can contain a single page or make up an entire document.
  • The final product is ready to be exported to the retrieval system 12 (see FIG. 1).
  • The export can be set up to provide data in a variety of formats for many types of manufacturers' retrieval systems. However, in the current integrated AVATAR Digital Asset Management System (ADAMS), the data is exported using a direct database and file system connection to the ADAMS ArchivalWare retrieval server 53 (see FIG. 3).
  • The final product preferably includes the ASCII text, any associated images or other digital objects (audio, video, etc.), and the meta data describing these objects. Depending on the file format and the needs of the database, this information can be in one file or separated into multiple files.
  • The clippings are no longer needed and are sent back to the customer.
  • The process has been optimized to protect the clippings from excessive handling and from inadvertent misplacement. This helps to ensure that the original source of this valuable data is not lost or destroyed.
  • The software to perform this task includes the following functions: (i) the ability to manually electronically zone/mark each article on the full newspaper page, whether the article is rectangular in shape or is made up of a series of rectangles, (ii) when an article has been manually zoned, it is marked in a colored outline so that the operator can determine when each article on the page has been zoned, and (iii) if an article has a jump, the operator has the ability to select/deselect a jump flag indicating to the work flow software in the next process that the article requires a mate.
  • The jump connector software is required when using the manual article jump software previously described.
  • The software to perform this task includes the following functions: (i) automatic selection of pages that have been flagged with jumps, (ii) a split screen that auto selects a jump first page and allows the operator to move to the correct page of the paper while viewing the jumped page on the other half of the screen, (iii) with the jump displayed on one half of the computer screen, the jump article can be effectively zoned as with the manual article crop software described above, and (iv) the article images viewed on both halves of the computer screen can be electronically connected for processing by the image and text bundle module 51.
  • The manual process described above for zoning articles from the full page can be automated.
  • The process involves invoking computer algorithms that attempt to understand the structure of the page layout and separate articles on the page.
  • The algorithms may have fixed input and tuning parameters which produce more accurate results.
  • Auto connection can be accomplished by defining a fixed methodology a newspaper or other publication uses to provide a "jump" on a different publication page. This methodology, such as "continued from page #, ARTICLE NAME", allows the correct jump process to become automated.
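A sketch of such automated jump matching, using a hypothetical regular expression for one assumed house style of continuation line:

```python
import re

# Hypothetical continuation-line pattern, e.g.:
#   Continued from page 3, "CITY COUNCIL VOTES"
# Real publications vary; the fixed methodology would be configured per paper.
JUMP_RE = re.compile(
    r'continued\s+from\s+page\s+(\d+)[,:]?\s*"?([^"]+?)"?\s*$',
    re.IGNORECASE)

def match_jump(line):
    """Return (page_number, article_name) if the line is a jump lead-in,
    else None."""
    m = JUMP_RE.search(line.strip())
    if not m:
        return None
    return int(m.group(1)), m.group(2).strip()
```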
  • The data capture and conversion system 10 can produce a large variety of reports. Particularly useful are the production reports on the throughput and efficiency of the process.
  • The system 10 can also input digital images directly with the image/object import utility.
  • The system 10 does not need to scan information in order for it to be entered.
  • The system 10 can bundle both images and text into a common file using the PDF format.
  • FIG. 12A shows an example of an obituary that has been entered. It includes the key field information in the meta data, as well as the text in a separate field below.
  • The process flow for microfilm data capture and conversion is different from that of the present newspaper scanning application.
  • The process control software can send the electronic data directly to the QC module 49 (see FIG. 3).
  • The system utilizes a real time spell check system 121 which incorporates a custom dictionary for newspaper applications.
  • The recognized and correctable entries of the preferred embodiment include (i) states in the United States and state abbreviations, (ii) cities, (iii) counties in a particular state, (iv) cemeteries, (v) churches, (vi) first names, and (vii) hard to recognize names.
  • If there is a misspelling, the operator is prompted in real time that a potential error has occurred. The operator must then check the keyed word against the original (image or hard copy data) and invoke or ignore the suggested correction.
  • The list of entries is expandable and modifiable to meet a particular application.
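The prompt-and-suggest behavior can be sketched with Python's standard difflib (the dictionary entries below are illustrative placeholders for the real lists of states, cities, counties, cemeteries, churches, and names):

```python
import difflib

# Illustrative entries; a production dictionary would hold the full lists
# of states, cities, counties, cemeteries, churches, and first names.
CUSTOM_DICTIONARY = {"maryland", "baltimore", "annapolis", "arlington"}

def check_word(word, dictionary=CUSTOM_DICTIONARY):
    """Return None if the keyed word is known; otherwise return up to three
    close matches so the operator can invoke or ignore a correction."""
    w = word.lower()
    if w in dictionary:
        return None
    return difflib.get_close_matches(w, dictionary, n=3, cutoff=0.8)
```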
  • The data entry 120 is performed remotely with labor which is cost effective, such as in offshore locations or in penal institutions.
  • The remote keying software is designed for download 124 to the final QC 125 and processing work flow software 127.
  • The next remote process is performed by remote QC personnel.
  • The keyed data is run against a specially designed keyed-data-test software program 122.
  • The program implements the following tests; when a potential problem is found, the test operator is prompted and the problem is inserted into the operator's viewing window.
  • The validity and format of the data are checked against the correct format for the application.
  • The field format delimiters contained in each keyed article (for data input) will preferably be:
  • The program checks to see that all required fields for a given article are present, that correct meta-data delimiters inserted by the Macros of 120 have not been accidentally deleted, and that the data is in the correct format for the export/loading software 128.
  • The keyed-data-test software 122 includes functions which help address problems of differentiating between I, J, L, and T. These letters are difficult to distinguish on small font-type, microfilmed, newspaper death notice data. The feature attempts to recognize when these particular letters have been confused and places the potential problem in the viewing window for the test operator to repair.
  • The keyed-data-test software 122 also includes functions that flag when words occur that should never be in the typed text. When data is being keyed by penal institution inmates, this software effectively provides a stop list of words that the test operator can verify are indeed in the newspaper. Any class of words could be put into this list, and it is often used to ensure that offensive language is not entered, whether purposely or accidentally. This gives the ultimate purveyor of the database some assurance that it will not offend others nor be embarrassed.
  • The keyed-data-test software 122 also includes a function that automatically places a period after "Mr", "Mrs", "Ms", and "Dr". Further, the keyed-data-test software 122 also includes functions that report statistics like the number of characters and articles typed. This data is used for invoicing and for tracking productivity.
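The automatic-period function can be sketched as a one-line substitution (a minimal illustration; the actual keyed-data-test software 122 is not described at this level of detail):

```python
import re

# Matches Mr, Mrs, Ms, or Dr followed by whitespace (i.e. missing its period).
HONORIFIC_RE = re.compile(r'\b(Mr|Mrs|Ms|Dr)(?=\s)')

def add_honorific_periods(text):
    """Automatically place a period after "Mr", "Mrs", "Ms", and "Dr"
    when one is missing; already-punctuated honorifics are left alone."""
    return HONORIFIC_RE.sub(r'\1.', text)
```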
  • The data is downloaded 124 from the remote keying facilities to the data processing location. This is normally performed via telecommunications methods, but the data can be moved via magnetic disk or tape.
  • Final QC 125 is performed at the data processing facility using the same keyed test program used by the remote facilities. If the data has been "cleaned up" at the remote facilities, the test program will flag few or no errors.
  • A sample of the data is reviewed manually 127 to make sure that text has not been omitted when the re-keying effort has been performed. Because the keying software 120 allows the data to be entered in the same format as the original, the QC technician scans the left and right sides of the textual columns to compare them to the original. With this technique it is easy to spot missing information.
  • When the keyed-data-load software module 122 is executed, the data is loaded into the retrieval system 128.
  • The software un-wraps the columns of text which were keyed "as is" for QC purposes.
  • Hyphenated words are reconnected.
  • The software uses a look up table to ensure that words that are to remain hyphenated do so.
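The un-wrap and re-hyphenation steps can be sketched as follows; the keep-hyphen entries are illustrative stand-ins for the real look up table:

```python
# Illustrative stand-ins for the real look up table of words that keep hyphens.
KEEP_HYPHEN = {"mother-in-law", "vice-president"}

def unwrap(lines, keep_hyphen=KEEP_HYPHEN):
    """Join column lines keyed "as is" into flowing text.  A trailing
    hyphen is treated as a line-break hyphen and removed, unless the
    rejoined word appears in the keep-hyphen look up table."""
    out = ""
    for line in lines:
        line = line.rstrip()
        if out.endswith("-"):
            head = out.rsplit(None, 1)[-1]   # e.g. "ser-"
            first = line.split(None, 1)[0]   # e.g. "vice"
            if (head + first).lower() in keep_hyphen:
                out += line                  # keep the hyphen
            else:
                out = out[:-len(head)] + head[:-1] + line  # drop it
        else:
            out += (" " if out else "") + line
    return out
```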
  • The text verification system offers a number of benefits over spell checkers as well as other data entry verification systems.
  • The text verification system speeds up data entry by recognizing and entering words before they are completely entered, it provides automatic correction of errors, it assists those for whom English is a second language, and it screens the errant entry from a disgruntled data entry person.
  • The text verification system also can include a variety of other utilities and features which may be customized to the particular application.
  • The cataloging system 14 (see FIG. 1) is delivered to the database maintainer. It is used for a variety of functions, including updating, modifying, repairing, replacing, and editing of the database entries. This can be useful, for example, in cataloging special entries by adding terms to unique enhancement term key fields.
  • The cataloging includes an image/text editor 162 and a database editor 164.
  • The image/text editor 162 allows editing of the database entries themselves, and the database editor 164 allows the cataloging technician to clean up dirty OCRed text to a 100% correct format. This is done by comparison of the image to the keyed text and manually editing and re-saving the corrected text. Small image editing functions can preferably be performed, such as image de-skew and cropping.
  • Both the image/text editor 162 and the database editor 164 are preferably contained in an editing program called AVATAR EDIT.
  • The user enters a search string in this editor, using natural language or specific terms for example, and the retrieval system 12 (described below) delivers the search results in this editor as well.
  • The meta data appears in the left-hand side of the screen,
  • the scanned article appears on the top right-hand side of the screen, and
  • the search results appear on the bottom right-hand side of the screen.
  • FIG. 20 shows another configuration in which the OCR text appears along with the actual scanned article and the search results. This is a screen which can be used to edit the OCR text.
  • The retrieval system 12 of FIG. 1 includes a number of different features and modules, as shown in FIG. 18.
  • The interconnection between the elements of FIG. 18 is intended to indicate the integrated nature of the system 12, rather than physical or logical connections between the elements.
  • The user interface 171 is the primary user interface to the retrieval system 12.
  • The user interface 171 is a Web interface which allows access to the retrieval system 12 over the world wide web, but other interfaces can also be used, such as an MS-Windows application interface.
  • The user interface 171 can be implemented with any display device controlled, for example, by a processor.
  • Security features 177, such as password protection, are in place to ensure that only registered and authorized users gain access. Users could be given access for limited periods of time or to limited parts of the database. The security features 177 can also be used for billing purposes.
  • The SQL/ODBC database 175 can hold the meta data and the actual word searchable text files and the associated image files containing the digitized clippings; however, the system also allows the text and images to be stored under the operating system's file structure 173.
  • The database 175 and file structure 173 can include any storage medium.
  • A digital storage medium such as a hard disk is used.
  • The search engine 179 allows the user to search both the SQL/ODBC database 175 and the text files 173 using sophisticated searching techniques especially configured for this application.
  • The search engine 179 preferably includes a processor running appropriate software to perform the necessary functions.
  • Searching is preferably performed by entering a search string (not shown).
  • The search engine 179 preferably uses the search string to search various text files in the SQL/ODBC database 175 and/or the text files 173 and produces at least one search result. Sample search results are listed in the bottom right of FIG. 19, and correspond in this embodiment to different newspaper articles which have been scanned.
  • The retrieval system 12 preferably includes a computer system running software which performs one or more of the functions described.
  • The search engine 179 incorporates Adaptive Pattern Recognition Processing (APRP).
  • Fuzzy searching provides techniques which are fault tolerant. It finds patterns within the search string, and within words, and matches those patterns with patterns in the meta data or the text data. This processing technique also allows for user feedback to help refine the search. Fuzzy searching provides the ability to retrieve approximations of search queries and has a natural tolerance for errors in both input data and query terms. It eliminates the need for OCR clean up, which is especially useful in applications that handle large volumes of scanned documents. High precision and recall give end-users a high level of confidence that their queries will return all of the requested information regardless of errors in spelling or in the "dirty data" which they may be searching.
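As a toy illustration of fault-tolerant matching (not APRP itself, whose internals are not described in this text), a character n-gram overlap score tolerates OCR errors in either the query or the stored text:

```python
def ngrams(s, n=3):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def fuzzy_score(query, text, n=3):
    """Fraction of the query's n-grams present in the text: 1.0 is an
    exact-substring-style hit, and an OCR error only lowers the score
    locally instead of causing a total miss."""
    q = ngrams(query, n)
    return len(q & ngrams(text, n)) / len(q) if q else 0.0
```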
  • The search engine 179 also provides semantic expansion and semantic search capability.
  • Preferable features of the Semantic Network include:
  • The baseline Semantic Network is preferably created from complete dictionaries, a thesaurus, and other semantic resources, and gives users a built-in knowledgebase of 400,000 word meanings and over 1.6 million word relationships.
  • Natural Language Processing: Users can preferably simply enter straightforward, plain English queries, which are then automatically enhanced by a rich set of related terms and concepts, to find information targeted to their specific context.
  • Morphology: The Network preferably recognizes words at the root level, which is a much more accurate approach than the simple stemming techniques characteristic of other text retrieval software. This minimizes word misses which are caused by irregular or variant spellings.
  • Idioms: The Network preferably recognizes idioms for more accurate searches, and processes phrases like "real estate" and "kangaroo court" as single units of meaning, not as individual words.
  • Semantics: The Network preferably recognizes multiple meanings of words and allows users to simply point and click to choose the meaning appropriate to their queries.
  • Multi-layered dictionary: The baseline Semantic Network preferably supports multi-layered dictionary structures that add even greater depth and flexibility. This enables integration of specialized reference works for legal, medical, finance, engineering, and other disciplines. End users can also preferably add personalized definitions and concepts without affecting the integrity of the baseline knowledgebase.
  • The functionality disclosed in this application can be, at least partially, implemented by hardware, software, or a combination of both. This may be done, for example, with a Pentium-based computer system running database and editing software, or other programs.
  • This functionality may be embodied in computer readable media or computer program products to be used in programming an information-processing apparatus to perform in accordance with the invention. Such media or products may include magnetic, magneto-optical, optical, and other types of media, including for example 3.5 inch diskettes and other digital storage media.
  • This functionality may also be embodied in computer readable media such as a transmitted waveform to be used in transmitting the information or functionality.
  • Software implementations can be written in any suitable language, including without limitation high-level programming languages such as C++, mid-level and low-level languages, assembly languages, and application-specific or device-specific languages.
  • Such software can run on a general purpose computer such as a 486 or a Pentium, an application specific piece of hardware, or other suitable device.
  • The required logic may also be performed by an application specific integrated circuit ("ASIC") or other device.
  • The technique may use analog circuitry, digital circuitry, or a combination of both.
  • Embodiments may also include various hardware components which are well known in the art, such as connectors, cables, and the like.

Abstract

A process for digitizing newsprint information from a newspaper includes scanning the information into a digital image format and then processing the image to produce searchable text. The processing includes removing date stamps and other marks that are written over the newsprint, enhancing the image using a library of image processing functions, and performing voting-OCR to select an optimal OCR output. The OCR output yields highly accurate text which can be word searched using adaptive pattern recognition processing, fuzzy logic, morphology, and other techniques to provide a word searchable database of newsprint information from newspapers. The process is software controlled so that the work flow, both electronic and non-electronic, between various processes or stations is tracked and sequenced, and appropriate data is collected and stored.

Description

WORD SEARCHABLE DATABASE FROM HIGH VOLUME SCANNING
OF NEWSPAPER DATA
RELATED APPLICATION
This application claims priority of U.S. provisional application No. 60/149,222 filed in the United States Patent and Trademark Office on August 17, 1999.
BACKGROUND OF THE INVENTION
FIELD OF THE INVENTION
The invention relates generally to the fields of data storage and retrieval, including databases, scanning and digitizing, image processing, and searching techniques, and relates more particularly to the various processes involved in the high volume processing of newspaper material to create a word searchable database with key fields.
DESCRIPTION OF THE RELATED ART
Data storage and retrieval have long been an important part of society, and have always been time consuming processes that must be repeated at regular intervals on the same data in order to preserve it. The advent of digital electronic storage media and word searchable tools has revolutionized the storage and retrieval of newly produced data. Savings in physical space and increased ease of retrieval are two of the principal advantages. Digital technologies have also held much promise for the preservation and use of non-digital data, but the processes of converting the data into digital form are laborious and the quality of the stored images is often disappointing.
The storage and retrieval of newspaper articles present a case in point. Old newspaper articles have been categorized, sorted, and stored for over a century. A typical process first involves cutting newspaper articles from a newspaper, applying a date stamp that identifies the date of the paper and the source newspaper from which the article was clipped, and identifying the main subject of the article by circling, underlining, or writing on the clippings themselves. These and other markings on the clippings are referred to as marks or markings. The various markings on the clippings are invariably on the text of the article due to the small margins on newspapers, although they need not be so restricted. The process then involves collecting these articles according to subject and putting the related articles in a properly marked envelope. The envelopes, perhaps millions of them, are then stored. Ideally, a paper index or card catalog of the subjects is created and the envelopes are made available for research and other purposes to interested parties. However, the creation of a fully searchable system is made difficult by the sheer volume of envelopes. Further, the articles are prone to deterioration from excessive handling. These factors can combine to limit the number of fully searchable, publicly accessible systems.
The storage of complete newspapers on microfilm or microfiche also currently provides a method by which old news information can be stored and manually retrieved. However, the microfilmed information can only be accessed by date or special indices that may have been developed for special collections of news events or articles. Even when these indices have been created, the process of obtaining the correct microfilm or microfiche, mounting it on the optical reader/printer, and locating the correct article by the frame of the microfilm is manual and time intensive.
The prospect of change has been held out by new digital technologies which can be used to scan the newspaper articles or microfilm thus preserving the images in a digital form. However, due to the poor quality of newspaper print in general, in which the individual letters are not always well formed, the small type, and the markings on the clippings, it is impossible to perform complete optical character recognition ("OCR") on the scanned images. The process is further hampered by the flow of a newspaper article, which typically uses single or multiple columns and jumps to additional pages. One advantage of the scanned images over data from microfilm is that they may have been filed by subject or by the name of an individual about whom the articles were written. There is a need, therefore, for a high volume system of creating a word searchable database of newspaper articles, regardless of the source of the original information, whether it be clippings, microfilm, microfiche, the original full sized hard copy paper itself, or another source.
An embodiment of the present invention is directed to overcoming or reducing the effects of at least one of the problems set forth above.
SUMMARY OF THE INVENTION
Embodiments of the present invention are directed toward providing, to varying degrees, (i) a technique for removing markings made on printed material, including newspapers, (ii) a vacuum fed, belt driven bitonal, grayscale or color scanner having auto detect sensors for start and stop of the belt, and an exit tray which obviates the need to handle the source material after scanning, (iii) a process and tool for recognizing the flow of articles in newspapers, and jumping to additional pages, and of following the story-line while performing OCR, (iv) a process and tool for performing OCR on newspaper articles which results in a quality which is acceptable for word searching, (v) a process and flow for de-columnization of newspaper articles, (vi) a process and flow for creating clipped articles from an image of the full newspaper scanned from microfilm, microfiche, or original copy, (vii) a process and tool for performing custom spell checks meeting the particular needs of newspaper articles by discerning commonly confused letters, recognizing common terms, and rejecting terms which are known never to appear in newspapers, (viii) a modular process for automating and controlling the flow of data through the various processes involved in creating a word searchable database of newspaper articles, (ix) a word searchable database of newspaper articles, and (x) the application of advanced search techniques to digitized newspaper articles.
Briefly, in accordance with one aspect of the present invention, there is provided a method of processing newsprint data which has been scanned into a digital image. The method includes removing marks in the digital image of scanned newsprint data using a grayscale enhance function. The method further includes performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output. The method further includes storing the OCR output in a digital storage medium and controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
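The sequence of steps in this aspect — remove marks, perform OCR, store the output, all under work flow control — can be illustrated by the following sketch. All names here (remove_marks, perform_ocr, WorkflowCoordinator, and so on) are hypothetical stand-ins, not the patented implementation; in particular, the mark removal assumes date-stamp ink scans as a mid-gray band, and the "OCR" merely counts dark pixels.

```python
# Illustrative sketch only. A real grayscale enhance function and a real
# OCR engine are far richer than these stand-ins.

def remove_marks(image, band=(100, 200)):
    # Assumed: overwritten stamps/handwriting fall in a mid-gray band;
    # push those pixels to white, leaving the darker printed text intact.
    lo, hi = band
    return [[255 if lo <= px <= hi else px for px in row] for row in image]

def perform_ocr(image):
    # Stand-in OCR: report the count of remaining dark (text) pixels.
    dark = sum(1 for row in image for px in row if px < 100)
    return f"<{dark} text pixels>"

class WorkflowCoordinator:
    """Tracks the work flow: each step is checked out and back in."""
    def __init__(self):
        self.log = []

    def run(self, step_name, func, data):
        self.log.append(("checked out", step_name))
        result = func(data)
        self.log.append(("checked in", step_name))
        return result

def process_newsprint(image, store):
    c = WorkflowCoordinator()
    cleaned = c.run("remove marks", remove_marks, image)
    text = c.run("OCR", perform_ocr, cleaned)
    c.run("store", store.append, text)
    return c.log
```

The coordinator's log captures the check-out/check-in pairing that the work flow control relies on.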
Briefly, in accordance with another aspect of the present invention, there is provided a method of retrieving digitally stored newsprint data. The method includes providing a database of newsprint information, the database having been created using the method of the previous paragraph. The method further includes searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
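Retrieval of text that does not exactly match the search string can be sketched as follows. This is not the patent's search engine: a Levenshtein edit distance stands in for "adaptive pattern recognition processing", and a naive suffix-stripping rule stands in for morphology. All names and the suffix list are illustrative assumptions.

```python
# Toy fuzzy retrieval: tolerates OCR character errors (edit distance)
# and inflected word forms (crude stemming).

def edit_distance(a, b):
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def stem(word):
    # Crude morphological normalization (hypothetical rules).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def fuzzy_search(query, documents, max_edits=1):
    q = stem(query.lower())
    hits = []
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if edit_distance(q, stem(word)) <= max_edits:
                hits.append(doc_id)
                break
    return hits
```

For example, a search for "mayors" can retrieve a document containing the OCR error "rnayor" (the classic m/rn confusion) when two edits are allowed.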
Briefly, in accordance with another aspect of the present invention, there is provided a computer program product including computer readable program code for processing newsprint data which has been scanned into a digital image. The computer readable program code includes a first program part, a second program part, a third program part, and a fourth program part. The first program part is for removing marks in the digital image of scanned newsprint data using a grayscale enhance function. The second program part is for performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output. The third program part is for storing the OCR output in a digital storage medium. The fourth program part is for controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
Briefly, in accordance with another aspect of the present invention, there is provided a computer program product including computer readable program code for retrieving digitally stored newsprint data. The computer readable program code includes a first program part and a second program part. The first program part is for providing a database of newsprint information, the database having been created using the method described above. The second program part is for searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
Briefly, in accordance with another aspect of the present invention, there is provided a device for processing newsprint data which has been scanned into a digital image. The device includes an image cleaner, an OCR unit, a digital storage medium, and a coordinator. The image cleaner removes marks in the digital image of scanned newsprint data using a grayscale enhance function. The OCR unit performs optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output. The digital storage medium is for storing the OCR output. The coordinator controls the work flow between the image cleaner, the OCR unit, and the digital storage medium.
Briefly, in accordance with another aspect of the present invention, there is provided a retrieval system including a database and a searching system. The database includes newsprint information, the database having been created using the method described earlier. The searching system is capable of performing adaptive pattern recognition processing and morphology such that text which does not exactly match the search string can be retrieved.
Briefly, in accordance with another aspect of the present invention, there is provided a method of utilizing digitally stored newsprint data. The method includes searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data. The method further includes producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image. The method further includes displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
Briefly, in accordance with another aspect of the present invention, there is provided a computer program product including computer readable program code for utilizing digitally stored newsprint data. The computer readable program code includes a first program part, a second program part, and a third program part. The first program part is for searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data. The second program part is for producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image. The third program part is for displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
Briefly, in accordance with another aspect of the present invention, there is provided a retrieval system including a search engine and a user interface. The search engine searches text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data. The search engine also produces a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image. The user interface displays the particular scanned image of newsprint data which corresponds to the text which produced the search result.
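The linkage described in the preceding aspects — a text hit leading back to the scanned image it came from — can be sketched minimally as below. The class and field names are illustrative assumptions, not the patented database schema.

```python
# Toy sketch: each OCR'd text record keeps a pointer to its source
# scanned image, so a search result can surface the original clipping.

class NewsprintDatabase:
    def __init__(self):
        self.records = []  # each: {"text": ..., "image_file": ...}

    def add(self, text, image_file):
        self.records.append({"text": text, "image_file": image_file})

    def search(self, term):
        # Returns the scanned images whose OCR text contains the term.
        return [r["image_file"] for r in self.records
                if term.lower() in r["text"].lower()]
```

A user interface would then display the image files returned by search(), rather than (or alongside) the OCR text itself.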
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will be explained with the aid of the following figures.
FIG. 1 shows a high level diagram of an embodiment of the present invention.
FIG. 2 shows a high level modular diagram of a data capture and conversion system according to an embodiment of the present invention.
FIG. 3 shows a flow diagram for a newspaper digitization process according to an embodiment of the present invention.
FIG. 4 shows a flow diagram for a newspaper digitization process with remote indexing according to an embodiment of the present invention.
FIG. 5 lists process steps in a clipping digitization process according to an embodiment of the present invention.
FIG. 6 shows a clipping envelope barcode.
FIG. 7 shows part of the preparation of newspaper clippings.
FIGS. 8A-8B show part of the process of newspaper clipping scanning.
FIG. 8C shows an exit tray for a scanner.
FIG. 8D shows a scanning production line.
FIG. 9 shows a Zeutchel digital scanner.
FIGS. 10A-10B show part of the screen and process for key field indexing.
FIG. 11 lists several key fields.
FIG. 12A shows a newspaper obituary entered into a digital system.
FIG. 12B contains a flow for remote data entry.
FIG. 13 depicts the process for three-step Grayscale Enhancement and OCR voting.
FIG. 14 shows an example of grayscale image enhancement.
FIGS. 15A, 15B, and 16 depict screens used in quality control.
FIG. 17 is a high level block diagram showing a cataloging system.
FIG. 18 is a high level block diagram showing a retrieval system.
FIGS. 19-20 show screens used in an editor for a cataloging system.
FIG. 21 shows a high-level block diagram of a device for processing newsprint data which has been scanned into a digital image.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Various embodiments are described in this section. The description sets forth the best mode known as of the filing of this application and does so in enough detail to enable one of ordinary skill in the relevant art to make and use the invention. The embodiments of the invention which are described are illustrative, rather than exhaustive, and accordingly do not limit the invention as claimed. The ordinary artisan will recognize many variations that can be made to the disclosed embodiments, and such variations are within the spirit and scope of this disclosure.
I. OVERALL SYSTEM
The high level block diagram of FIG. 1 shows the three main components 10, 12, 14 of an embodiment of the present invention, as well as the interface 16 between them. The data capture and conversion system 10 contains the functionality for taking newspaper articles and full newspaper pages and creating a word searchable database from them in high volume production. The retrieval system 12 contains the functionality for storing the created database and allowing authorized users to access the database and perform searches. The cataloging system 14 contains the functionality for editing and maintaining the database after it is created and sent to the retrieval system 12. Each of these systems is described in detail in the following sections. The particular selection and arrangement of functions in this embodiment is not restrictive and other embodiments may have varying configurations.
The interface 16 is a generic interface and its implementation can be quite varied, as indicated by the non-limiting examples which follow. In an embodiment in which two or more of the components 10, 12, 14 are remote from each other, the interface 16 can comprise, for example, an electronic real-time connection over the Internet or other data network, including a wide area network ("WAN") or a metropolitan area network ("MAN"). The interface 16 could also comprise carrying a hard disk or other storage medium from the data capture and conversion system 10 to the retrieval system 12 or cataloging system 14. In an embodiment in which two or more of the components 10, 12, 14 are not remote from each other, the interface 16 can comprise, for example, a local area network ("LAN") or the internal bus of a common computer system.
II. DATA CAPTURE AND CONVERSION
The data capture and conversion system 10 is shown in greater detail in FIG. 2. This system 10 performs the functionality required to take either a newspaper clipping of an article, or to create a clipping and associated jump from microfiche, microfilm, or hardcopy, and to create: an electronic image of the article, searchable text from the image, and a word searchable key field "meta data" database entry for the article. A different "re-keyed" process is also automated for cases in which the image data obtained is not of sufficient quality to produce reasonable quality text from an OCR process. For the purposes of this application, newsprint data is defined to include, without limitation, any data from a newspaper, whether stored on paper, microfiche, microfilm, a digital storage medium such as a hard disk, or an analog storage medium such as tape. Newsprint data can, therefore, include text, pictures, and other types of data.
A. WORK FLOW
FIG. 2 depicts the modular nature of the data capture and conversion system 10. As can be seen from FIG. 2, a variety of work performing AVATAR software modules 15 perform various work functions. Work is served out to the work performing software modules 15 by the AVATAR central coordinator 20, which controls the flow of data between them based upon the variable work flow definition defined for a particular customer's image processing project. The coordinator 20 of the preferred embodiment utilizes each of the work performing modules 15, also referred to as processes, as is necessary for a particular application. The flow of the process can also be changed for different imaging work flow applications. For example, the modules Manual Article Crop and Jump Connector are used to process articles obtained when digitizing microfilm but are not required when processing images obtained by scanning the hard copy clipping articles themselves. One advantage of a computer controlled automated process flow is that the electronic work tasks can be tracked using a database or other computer software and properly routed based upon data processing requirements, eliminating human error during repetitive processes.
The work performing modules 15 are client modules and log in to the coordinator 20, which is a server based module. The coordinator 20 can be, for example, a processor such as a Pentium processor coupled to appropriate memory, or a complete computer system, including PCs, mainframes, general purpose systems, and application specific systems. The client operations for obtaining work from the coordinator 20 are as follows:
  • Client module login to coordinator
  • Client module optional steps (select customer/project; default is next available)
  • Acquire folder
    • By project/folder if administration rights allow
    • Open folder manually based on folder path
  • Perform specific client module work task
  • Release or reject folder

A particular application of digitizing newspaper articles will be described below. However, that application does not utilize all of the features of the data capture and conversion system 10 which are shown in FIG. 2. Therefore, after the description, several of the remaining modules of FIG. 2 will also be described.
The coordinator 20, along with the correct work performing software modules 15, comprises software, and is therefore easily configurable to accommodate different applications and conditions. The modular nature of the data capture and conversion system 10 is demonstrated by the preferred mechanism for interfacing each of the modules 15 to the coordinator 20. The modules 15 preferably use standardized function calls to communicate with the coordinator 20. When called by a work performing software module 15, the coordinator 20 provides work of an as-requested work type or on a next available basis. When requesting work, the work performing software 15 can request work by project or folder. The coordinator 20 also has a security capability which allows only authorized users or accounts to acquire work when requested or as it is available. The coordinator 20 can also serve out work for different customers' jobs concurrently; for example, projects can be performed for two different newspapers simultaneously.
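The client operations listed above can be sketched as a simple server-side protocol. This is an illustrative model only: the class, method names, queue layout, and rejection policy are assumptions, not the AVATAR coordinator's actual interface.

```python
# Hypothetical sketch of the coordinator's folder-serving protocol:
# clients log in, acquire a folder (by project or next available),
# perform their task, then release or reject the folder.

class Coordinator:
    def __init__(self):
        self.queue = []          # (project, folder) pairs awaiting work
        self.checked_out = {}    # folder -> client currently holding it
        self.authorized = set()

    def login(self, client):
        self.authorized.add(client)

    def acquire(self, client, project=None):
        # Security check: only logged-in clients may acquire work.
        if client not in self.authorized:
            raise PermissionError("client must log in first")
        for item in self.queue:
            if project is None or item[0] == project:
                self.queue.remove(item)
                self.checked_out[item[1]] = client
                return item[1]
        return None  # no work available for this request

    def release(self, folder):
        # Work finished: check the folder back in for the next task.
        self.checked_out.pop(folder)

    def reject(self, folder):
        # Work failed: requeue the folder (illustrative policy).
        self.checked_out.pop(folder)
        self.queue.append(("rework", folder))
```

The check-out dictionary is what lets the coordinator route each electronic folder to exactly one module at a time.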
B. NEWSPAPER DIGITIZATION APPLICATION
The process flow for an embodiment of the invention for a newspaper digitization application is shown in FIG. 3, and the process steps are listed in FIG. 5. The process steps listed in FIG. 5 describe the physical process which occurs during newspaper clipping digitization, as opposed to the AVATAR software process flow in FIG. 3. For that reason, FIGS. 3 and 5 provide slightly different information. The discussion of the application will follow the outline of the process flow of FIG. 3, and will also address the process steps of FIG. 5 from within that outline. It should be noted that portions of the process can be performed in different physical locations. When this is done, a high bandwidth wide area telecommunications connection is preferred. When this is not possible, information can be transferred from one location to another via, for example, high density magnetic tape or another digital storage medium. FIG. 4 shows one alternate flow process in which key field indexing is performed remotely, image data is transferred from the primary location via tape, and key field data, as well as any image fixes or rejections performed by the Index Operator, is streamed back in real-time over the telecommunications connection.
As will be seen, the work flow is controlled by the coordinator 20. The work flow concerns the tracking, collecting, moving, and otherwise organizing and controlling of files and other data and information among the various stations or processes where it is used, generated, stored, etc. This information includes the images at all stages of the process, including the scanned image, the enhanced image, the OCR'd image, the bitonal image, etc. This information also includes all data which is generated such as indications of jumps, OCR accuracy, errors, etc. This information also includes non-electronic data, such as the original document that is scanned.
Controlling the work flow includes such processes as sending the digital image to a process, sending the digital image to a station, automatically rejecting the digital image if quality is insufficient, presenting the digital image to an operator at a station, checking the digital image in after a process has been performed, checking the digital image out to a station or a process, associating a barcode with non-electronic data relating to the digital image, tracking nonelectronic data associated with the digital image using a barcode, and collecting and storing data relating to the digital image. These will be further explained throughout the disclosure.
1. Bar-coding and Envelope Indexing
In the embodiment for this application, the newspaper clippings have already been cut, date stamped, hand marked, and sorted into envelopes by category, as described in the background section. These envelopes are then delivered to the data capture and conversion system 10 of FIG. 1 so that the digitization process can begin.
The first step in the digitization process is to capture the category information from the envelope and use it to create a folder for the clippings in each envelope. A large folder that can physically house the unfolded clippings is utilized so that the clippings can be unfolded and laid flat. Additionally, an electronic folder can also be created for the digitized images of the clippings in each envelope. This is done in the bar-coding step 40 (see FIG. 3) by typing in the category information and creating a barcode label which can be placed on the envelope or folder and used to track it and its contents through the process.
FIG. 6 shows a barcode label 60 which has been created for an envelope. Meta data information 62 is typically subject data and can include the subject and the dates of the clippings in the envelope, as well as biographical proper name meta data when the envelope contains information about a specific individual. The meta data 62 is typed in by an operator utilizing the barcode module 40 of FIG. 3 and appears on the printed barcode label 60. Capturing the identifying information 62 electronically is referred to as envelope indexing.
The barcode label 60 can also be used to indicate a variety of other information. As an example, the barcode label 60 can indicate in which box or shipment that envelope was received and on what date the envelope was received. The label 60 can also be used to indicate whether or not there are additional envelopes, folders, or other information associated with that envelope. In the present application some of the envelopes will have an oversized folder associated with them to contain clippings which are unusually large, and this information is indicated on the barcode label 60.
Preferably, two barcode labels are printed by the barcode module 40 after the envelope subject meta data 62 has been entered. One barcode is affixed to the original envelope itself and a second is placed inside the envelope to be affixed later to a scanning separator sheet, as discussed in the section on scanning below. The use of a barcode label 60 helps to eliminate human operator errors and, thus, to ensure positive envelope control during scanning and repacking of the clippings. Errors are also prevented by having the data capture and conversion system 10 oversee the flow of electronic folders. This is done by the coordinator 20 of FIG. 2, in part, by releasing an electronic folder to a particular work performing module 15 to perform a specified function, and then checking it back into the coordinator 20 after the work is finished so that the coordinator 20 can pass the electronic folder on to the next image processing work flow task as defined by a given customer's requirements.
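The envelope indexing step — capture the typed-in meta data once, then emit two identical labels, one for the envelope and one for the separator sheet — might be sketched as follows. The record layout and field separator are hypothetical illustrations, not the patent's actual barcode format.

```python
# Illustrative envelope-indexing sketch: the operator's typed meta data
# is encoded once and duplicated into two label records.

def make_labels(subject, date_range, box_id, oversize=False):
    # Assumed payload layout: pipe-separated fields, with a flag noting
    # whether an oversized companion folder exists for this envelope.
    payload = "|".join([subject, date_range, box_id,
                        "OVR" if oversize else "STD"])
    label = {"barcode": payload, "text": f"{subject} ({date_range})"}
    # One copy for the envelope, one for the scanning separator sheet.
    return [dict(label) for _ in range(2)]
```

Printing two physically separate but identical labels is what keeps the paper envelope and its electronic folder positively associated through scanning and repacking.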
2. Scan
The scanning step 41 is performed after the bar-coding step 40 is completed. Before the clippings can be scanned, they must be unfolded and flattened, referred to as clipping preparation in FIG. 5. FIG. 7 shows part of the process of unfolding and flattening the clippings from an envelope. A special process is used to flatten the clippings, which can range in size from about one inch square to 15" x 25" or larger. Due to the fact that the clippings have been folded to fit into small envelopes for a period of up to one hundred years, a physical process is required to prepare and flatten the clippings prior to the scanning step. At this point the clippings are unfolded and taped with acid free tape if torn.
At this preparation stage, any continuations of a newspaper article from one page and column to another are marked. These continuations are termed jumps. Red markers are used to indicate the various pages of an article. One red marker is used for the first page of the article if it contains a jump, and second, third, and fourth markers, etc., are used for the continuations. Markers are placed as near as possible to the upper right hand corner of the clipping to provide key field operators a visual clue as to which articles contain jump continuations (see the discussion of Key Field Indexing below). The articles that have continuations (jumps) are also inserted into a second folded separator sheet inside the large preparation folders. This special separation, in addition to the markers, serves to notify the scanner operator that these pages must be kept together during the scanning process.
After the clippings have been unfolded, marked for jumps, and replaced in the preparation folders they are ready to be flattened via a combination of heat and pressure. The complete prepared folder is run through one of several different commercial configuration laminators set to the correct temperature and pressure to flatten the clippings and remove folds without any adverse damage.
Once the clippings have been unfolded and flattened, they can be scanned, as shown in FIGS. 8A-8B. FIG. 8A shows a portion of the scanner 80. FIG. 8A also shows a portion of a barcode reader 82. To create and open an electronic folder to house the scanned clippings, the scanner operator scans the barcode on the separator sheet in the prepared scanning folder. The separator sheet serves several functions: it contains the barcode, it has the original clipping envelope affixed to it for use later in the repacking step, and it separates the clippings in one envelope from those in another in the specially designed removable exit tray. In some configurations, the scanner itself can process the barcode and a separate barcode scanner is not necessary.
The scanner 80 used to convert the newspaper clippings is a specially customized version of a commercially available scanner, the Banctec 4530. However, several scanners are used in the process and other types of scanners are used for image conversion projects using different materials. The customized scanner 80 clearly performs a critical task in the process by reliably and efficiently scanning a high volume of clippings of various sizes, without damaging them. The scanner 80 possesses specific features which make it particularly suited to this application. These include a vacuum feed for the clippings, a belt drive with auto-detect sensors for start and stop of the belt, and a custom-built exit tray for collecting the clippings as they exit the scanner 80. Referring to FIG. 8C, it can be seen that the exit tray allows the clippings to be retrieved without being handled another time and thus helps to increase work flow time and motion efficiency in addition to helping preserve the clippings. Scanning processes are run in parallel as shown in FIG. 8D.
Commercially available scanners of this quality only provided bitonal (black and white) scanning. It is important, however, to have grayscale or color scanning of the clippings. Grayscale or color scanning allows a variety of image processing techniques to be used to remove the date stamp and handwriting from the clippings. Therefore, an interface scanning card was built for the scanner 80 to enable grayscale scanning. An IPT/GS Gray Scale Capture Board (the "grayscale board") was utilized to connect to the specially created 4530 interface card in the Banctec 4530 scanner. The grayscale board is a PCI-based scanner interface card that provides grayscale capture capability to a scanner once an appropriate interface has been built. The grayscale board has a maximum data throughput of 132 MB/s, a maximum scanner speed of 25 MHz, a maximum PCI bus speed of 33 MHz, and a maximum image size of 65,536 by 65,536. The grayscale board can perform the processing features of binarization, cropping, inversion, scaling, and 2 x 1 smoothing, and supports image formats of 1 bit bitonal, 8 or 16 bit grayscale, and combined grayscale and black-and-white. In addition, the scanning interface hardware and software allows the scanning operator to operate the scanner from PC based controls (keyboard and mouse), in addition to foot pedals used to rotate the clipping image if its configuration dictates scanning the image upside down or reversed.
The scanning step 41 may also be performed with an oversize scanner if the clipping is large and not suited for the high volume production scanner 80. A German built Zeutchel overhead digital scanner, shown in FIG. 9, is used in the present application to rapidly scan large clipping images in a production environment. Special application programmer interface ("API") software was written to allow the specially designed AVATAR workflow software to operate the Zeutchel scanner. Other scanners of this type and configuration may also be used in the embodiment. The oversize scanner is not intended to be used for high volume clippings smaller than 11"x17", which is the maximum size which the Banctec 4530 scanner can scan. Other types of special purpose scanners can also be integrated into the process flow.
3. Enhancement
Enhancement is performed in two places in the flow process for newspaper digitization, but depending upon the requirements, specific image enhancement tasks can be performed at any location. For newspaper digitization, after a clipping is scanned, the first enhancement routine is used to improve the quality of the scanned image and to reduce the file size of the image. The operations performed preferably include deskew, despeckle, and removal of lines and black portions. These functions are performed in the cropping enhancement step 42. These functions reduce the file size of the image in addition to removing unwanted image data and noise. This makes the grayscale image easier for the operator to process visually in the next work flow process, key fielding.
The work flow process 42 includes a Deskew function for JPEG images. This function must be performed prior to cropping. Since JPEG images are virtually impossible to deskew with any accuracy and reliability, the embodiment includes the following process for deskew of JPEG images:
• The JPEG image is converted to bitonal using the newspaper clean algorithm of process 131 (FIG. 13).
• The bitonal image is deskewed using conventional deskew software, and the angle of deskew is saved.
• The original JPEG image is rotated by the angle determined above.
• JPEG cropping is then performed on the properly deskewed JPEG image.
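The deskew-by-proxy steps above can be sketched as follows. This is a minimal stand-in for the conventional deskew software, not the patented implementation: it assumes the bitonal image is a 2-D array with 0 for black text and 255 for white, and scores candidate angles by how sharply the sheared row profile peaks. The returned angle would then be used to rotate the original JPEG image.

```python
import numpy as np

def estimate_skew_angle(bitonal, angles=None):
    # Score each candidate angle by the variance of the sheared row
    # profile of the black pixels; text lines produce the sharpest,
    # most peaked profile when the shear cancels the skew.
    if angles is None:
        angles = np.arange(-5.0, 5.01, 0.25)
    arr = np.asarray(bitonal)
    ys, xs = np.nonzero(arr == 0)
    h = arr.shape[0]
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        shifted = ys + xs * np.tan(np.radians(angle))
        profile, _ = np.histogram(shifted, bins=h)
        score = profile.var()
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The estimated angle is then applied to the original grayscale JPEG with any standard image rotation routine, after which cropping proceeds on the level image.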
The scanned image is not reviewed by an operator until the cropping enhancement step 42 has been performed. In step 42, the quality of the grayscale image is automatically detected and assessed; if it is not adequate, then the image is failed, as noted by the Fail Grayscale Image box 43, and sent to the rescan module 50. This assessment is automatic.

4. Key Field Entry
A variety of information can be entered during the key field entry process 44, which is illustrated in FIGS. 10A-10B. Key fielding can utilize two methods, key-from-image or key-from-hard-copy. Preferably, the information which is entered in this process 44 includes the publication date and the source of the newspaper, which are taken from the date stamp, and the headline or title, as shown in FIG. 10A via the key-from-image approach. Additional key fields can be utilized for a variety of useful information, such as the envelope category, which is preferably captured during the bar-coding process 40. The process also allows data to be keyed from hard copy when it is more convenient or cost effective to do so or when the image is not required in the final searchable database.
The key field data forms part of the meta data. FIG. 11 shows the meta data 100 for a particular clipping. In addition to the key fields mentioned above, the meta data includes the barcode and file number, which are entered automatically, and the byline, which is entered using a special text parsing technique prior to the export process 52. This process involves a computer search for specific text strings; when a string is located, the software copies a defined variable number of characters which make up the byline.
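The byline parsing described above can be illustrated with a short sketch. The marker pattern and character limit here are illustrative assumptions, not the patent's actual text strings or limits.

```python
import re

def extract_byline(text, max_chars=60):
    # Search the OCR text for a byline marker such as "By Jane Doe"
    # at the start of a line, then copy a bounded number of
    # characters. Pattern and max_chars are hypothetical examples.
    match = re.search(r"^By\s+(.+)$", text,
                      flags=re.MULTILINE | re.IGNORECASE)
    if match is None:
        return None
    return match.group(1)[:max_chars].strip()
```

A production system would carry several marker patterns and per-publication character limits; this sketch shows only the search-and-copy mechanism.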
As shown in FIG. 10A, the work flow software allows the key fields to be entered while the scanned grayscale image is on the screen. This process is efficient in terms of time and obviates any need for handling the actual clippings. The process can be made even more efficient by allowing the operator to drag and highlight the appropriate text with a mouse, or otherwise indicate the relevant text, and then use a simple OCR program to enter the text directly. The operator then only needs to verify that the OCR result is accurate.
The key field step 44 also provides another opportunity to review the image. The operator can fail the image, as indicated by the Fail Enhance box 45, and send it to the rescan module 50. As shown in FIG. 10A, there is an "error flag" box which an operator can check if the image has an error of any kind. The operator can also indicate that the image needs to be rescanned by checking the "rescan page" box. Additionally, the operator can indicate that the image has a jump by removing the "first page" check in the check box for the second and further pages. Because most of the clippings are a single page, the "first page" default is currently set to checked. If the image is rejected, the operator can key in the specific reason. If the image has been rejected previously by a human or a computer, the comment will be included. If a rejected image has been through OCR processing, then the OCR confidence percentage will be shown. The Project and Folder are also indicated so that the operators will know what job they are currently processing.
Most client modules have the ability to reject work. If work is rejected prior to the OCR step 48 then it is sent to the rescan module 50. If work is rejected after OCR 48, it is sent to Quality Control and Repair (QCR) module 49. Clearly, the number of opportunities to fail the image, and the mechanism for repairing or reworking the image will vary in different embodiments.
5. Multi-Step Grayscale Enhance
After the key field data is entered in step 44, the electronic clipping is ready for the multi-step grayscale enhancement process 46, which includes grayscale enhance functions. This process 46 is further illustrated in FIG. 13. As shown in FIG. 13, four separate processes are performed and the output with the best results is selected for the final OCR voting step 48. More or fewer than four (4) processes can be utilized.
a. Threshold Enhancement
The first process involved is threshold enhancement 131. Since it is difficult to develop one computer algorithm that will effectively clean up and remove unwanted handwritten material from the image without removing significant parts of the original image itself, multiple algorithms were written which compete against each other to obtain the best results when processing an image. Four different algorithms are described. A JPEG image is the input; however, other formats can be used in different embodiments. The leftmost processing line of process 131 in FIG. 13 selects a grayscale threshold level and then performs a simple decision based on the threshold level. All values equal to or larger than the grayscale threshold level are classified in one category, either black or white, and all values less than the grayscale threshold are classified in the other category. The end result is a black and white, bitonal, image. Step 132 is effectively performed along with step 131 for the leftmost processing line.
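The leftmost processing line's simple global threshold can be expressed compactly. This sketch assumes the image is held as a 2-D array of 8-bit intensities; the threshold level of 128 is an illustrative default.

```python
import numpy as np

def threshold_to_bitonal(gray, level=128):
    # Pixels at or above the grayscale threshold become white (255);
    # all values below it become black (0), yielding a bitonal image.
    gray = np.asarray(gray)
    return np.where(gray >= level, 255, 0).astype(np.uint8)
```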
The second (from left) processing line of step 131 in FIG. 13 (Newspaper Clean) performs a newspaper threshold enhancement process. This process is configured based upon the newspaper type but is largely automatic in removing unwanted red date stamps and handwriting from the clipping images prior to conversion to bitonal. This process can be performed by an image cleaner 181, as shown in FIG. 21, which comprises a processor or computer system which can run image enhancement routines or grayscale enhance functions. FIG. 21 also shows the coordinator 20, a digital storage medium 183, and an OCR unit 182 which is described further in a section below. The interconnections in FIG. 21 are intended to be illustrative and are not intended to limit the architecture of embodiments of the system shown. Additionally, in some embodiments the different elements may share physical devices, and in one embodiment the coordinator 20, image cleaner 181, and OCR unit 182 all comprise a common processor (not shown).
b. Newspaper Clean
Preferable procedures for the Newspaper Clean enhancement process include:

• Cleans the JPEG image's pixels by replacing pixels with white pixels based on the colors of their neighbors.
• The computed intensity value is computed in the following way:
i. The average intensity lavg of the entire image is computed.
ii. Computed intensity value (CI) = (lavg * lavg) / 170.
iii. CI = CI - (lavg - 64).
• If the focus pixel being replaced and its neighbors are all above the computed intensity value, then the pixel being compared (not its neighbors) is set to white.
• If the focus pixel or any of its neighbors is less than the average intensity, then the pixel remains unchanged.
• Converts to bitonal based on lavg, turning all pixels less than lavg to black (0) and all pixels greater than or equal to lavg to white.
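A rough sketch of the Newspaper Clean pass follows. It assumes a 3x3 neighborhood (the neighborhood size is not specified in the text) and collapses the whitening rules into a single test against the computed intensity value; the constants follow the procedure above.

```python
import numpy as np

def newspaper_clean(gray):
    # gray: 2-D array of 8-bit intensities.
    gray = np.asarray(gray, dtype=np.float64)
    iavg = gray.mean()
    # Computed intensity value: CI = (lavg * lavg) / 170 - (lavg - 64)
    ci = (iavg * iavg) / 170.0 - (iavg - 64.0)
    out = gray.copy()
    h, w = gray.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            block = gray[y - 1:y + 2, x - 1:x + 2]
            # Whiten the focus pixel only when it and all of its
            # neighbors are above the computed intensity value.
            if block.min() > ci:
                out[y, x] = 255
    # Final bitonal conversion around the average intensity.
    return np.where(out < iavg, 0, 255).astype(np.uint8)
```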
All processing columns in FIG. 13 can provide a rejection for an unrecognizable JPEG image, indicated by the Fail Grayscale Enhance box 47. This automatic process rejects the image to rescan 50.
c. Newspaper BK3
The third (from left) processing line of step 131 in FIG. 13 (Newspaper BK3 Enhancement) performs newspaper clipping image enhancement as follows:
• Cleans the JPEG image's pixels by replacing pixels with white pixels based on the colors of their neighbors.
• The computed intensity value is computed by the following:
i. The max value of black (Bmax) is set to 10 on a scale of 0-255 colors, and the max value of white (Wmax) is set to 236 on a scale of 0-255 colors.
ii. The image is divided up into 21 separate sections.
iii. Each section's average intensity value is computed. If a section has all white pixels of value 255 or all black pixels of value 0, the section is ignored; go to iv. If the section's average intensity value is less than Bmax or greater than Wmax, the section is ignored; go to iv. Otherwise the section is valid: nTotalofIntensity = nTotalofIntensity + the section's average intensity value, and a count (n) of valid sections is kept. The averaged sampled intensity value (lavg) is equal to nTotalofIntensity/n.
iv. Computed intensity value (CI) = (lavg * lavg) / 170; CI = CI - (lavg - 64).
• If the focus pixel being replaced and its neighbors are all above the computed intensity value, then the pixel being compared (not its neighbors) is set to white.
• If the focus pixel or any of its neighbors is less than the average intensity, then the pixel remains unchanged.
• Performs dilation of the gray image. This algorithm takes each pixel and sets it to its darkest neighbor.
• Performs a sand operation (turning pixels white) and a fill operation (turning pixels an intensity value other than 255): takes a pixel and compares its color value to its neighbors; if its neighbors' intensity values are greater than or equal to 128, then set the pixel value to 255 (white).
• Fill routine: take the minimum color value of the pixel's neighbors and set the pixel to that value.
• Sand routine: if the pixel and its neighbors fit one of the following patterns, where the pixel in the center is the focus pixel, D is a pixel with an intensity value less than the focus pixel, W is a pixel with an intensity value greater than the focus pixel, and X is a pixel value that is not equal to white (255), then set the focus pixel to white:

Patterns:
XW    DDW    XW
WX    WDD    WX
W     WDW    XDX

XDX
WDW
W
• DarkenGrayToPercentage algorithm. This algorithm changes the focus pixel's color to black if its intensity value falls below the average value of black pixels:
i. Get the number of white pixels (Wn), intensity values of 255; the number of black pixels (Bn), intensity values of 0; and the number of gray pixels (Gn), intensity values other than 255 or 0.
ii. While Gn > (Bn * .05): Gavg = the average value of the gray pixel intensities. If the focus pixel >= Gavg, then set the focus pixel to black, an intensity value of 0. Perform this for every pixel in the image, then recompute Wn, Bn, and Gn and repeat the loop.
• FillBlackByRatio algorithm. The algorithm processes the image in 21 x 21 pixel blocks and counts the number of gray and black pixels in each block. A ratio is figured per block from the number of grays added to blacks, divided by the size of the array (which is 21 x 21 in this case). If the ratio falls below a defined fraction of .3, then all the grays in the pixel block are changed to black.
• Performs the dilation of the gray image again.
• ConvertImage_ex3() algorithm. Changes the image to bitonal. This algorithm converts all pixels that have an intensity value greater than 1 to white (255) and all pixels less than or equal to 1 to black (0).
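The gray-image dilation used by Newspaper BK3, where each pixel is set to its darkest neighbor, can be sketched as follows; a 3x3 neighborhood is assumed, since the text does not give the window size.

```python
import numpy as np

def dilate_gray(gray):
    # Each output pixel takes the minimum (darkest) intensity of its
    # 3x3 neighborhood, thickening dark strokes in the image.
    gray = np.asarray(gray, dtype=np.uint8)
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    out = np.empty_like(gray)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + 3, x:x + 3].min()
    return out
```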
d. Newspaper BK6
The far right processing line of step 131 in FIG. 13 (Newspaper BK6 Enhancement) performs newspaper clipping image enhancement as follows:

• Gets the average intensity value of the image (lavg) by averaging all the pixel values.
• Cleans the JPEG image by replacing pixels with white pixels based on the colors of their neighbors. If the pixel being replaced and its neighbors are all above the computed intensity value, then the pixel being compared (not its neighbors) is set to white. The computed intensity value is computed by the following:
a. The max value of black (Bmax) is set to 10 on a scale of 0-255 colors.
b. The max value of white (Wmax) is set to 236 on a scale of 0-255 colors.
c. The image is divided up into 21 separate sections, and each section's average intensity value is computed. If a section has all white pixels of value 255 or all black pixels of value 0, the section is ignored; go to d. If the section's average intensity value is less than Bmax or greater than Wmax, the section is ignored; go to d. Otherwise the section is valid: nTotalofIntensity = nTotalofIntensity + the section's average intensity value, and a count (n) of valid sections is kept.
d. The computed intensity value is equal to nTotalofIntensity/n.
• If lavg is greater than 128, then the NewsPaper threshold algorithm is called without the conversion to bitonal.
• If lavg is less than or equal to 100, then SubtractPercentageOfImage(1) is called, which subtracts five percent of the grays from the entire image via 1 pass starting at the white intensity side of the histogram.
• If lavg is greater than 100 and less than or equal to 128, then SubtractPercentageOfImage(3) is called, which subtracts five percent of the grays from the entire image via 3 passes starting at the white intensity side of the histogram.
• If lavg is greater than 128 and less than or equal to 170, then SubtractPercentageOfImage(6) is called, which subtracts five percent of the grays from the entire image via 6 passes starting at the white intensity side of the histogram.
• If lavg is greater than 170 and less than or equal to 213, then SubtractPercentageOfImage(8) is called, which subtracts five percent of the grays from the entire image via 8 passes starting at the white intensity side of the histogram.
• If lavg is greater than 213, then SubtractPercentageOfImage(10) is called, which subtracts five percent of the grays from the entire image via 10 passes starting at the white intensity side of the histogram.
• If lavg is greater than 128, then algorithm NewsPaperBK3 is called.
• If lavg is less than or equal to 128, then the ConvertImage_ex5() algorithm is called, which converts the image to bitonal, converting pixels with values less than 254 to black and all others to white.
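The lavg-to-pass-count dispatch for SubtractPercentageOfImage enumerated above can be summarized as a small lookup function; only the bucket boundaries and pass counts from the text are encoded here.

```python
def subtract_passes(iavg):
    # Number of SubtractPercentageOfImage passes as a function of the
    # average image intensity lavg, per the BK6 enhancement rules.
    if iavg <= 100:
        return 1
    if iavg <= 128:
        return 3
    if iavg <= 170:
        return 6
    if iavg <= 213:
        return 8
    return 10
```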
Other threshold conversion algorithms can also be utilized. The four described here are presented for illustrative purposes.
e. Bitonal Conversion
Step 132 of FIG. 13 is a bitonal conversion process. As stated earlier, the leftmost processing line of step 131 in FIG. 13 has already performed the bitonal conversion, so nothing is done in that processing line at step 132. For both the middle and the rightmost processing lines, however, the end product of those respective enhancement processes is converted into a bitonal image. To perform bitonal conversion, the gray pixels that are close enough to black are converted to black and everything else is dropped out. The level of gray that becomes black is a variable that can be adjusted for specific requirements. Sample before and after shots are shown in FIG. 14. The before shot 141 is a scanned clipping which is a grayscale image. The before shot 141 is the input into the threshold enhancement process 131 of FIG. 13. The after shot 142 shows the result after threshold enhancement 131 has been successfully used to remove the handwritten word and circle mark, as well as the date stamp. The threshold enhancement 131 has also lightened the background, making the entire clipping easier to read.
6. Voting OCR
a. Text Zoning
The rest of the process shown in FIG. 13 can be associated, as indicated in FIG. 13, with the voting OCR step 48 of FIG. 3. This division, however, is somewhat artificial because the threshold enhancement and removal of the date stamp and other markings also improve the quality of the OCR result.
As indicated in FIG. 13, step 133 is the same for each processing line. The bitonal images, which are typically different for each processing line, are prepared for OCR. OCR can be performed by an OCR unit 182, shown in FIG. 21, which contains a processor or computer system which can execute software and perform the various functions described herein.
The images have been treated as one single image up to this point. The text zoning step 133 provides three different consecutive options, one of which is selected and applied to the next step (two-pass OCR processing 134). First, a custom newspaper decolumnization program is run; if this program is successful, then the two-pass OCR 134 is initiated. If newspaper decolumnization fails, standard autozoning is implemented next; if this process also fails, then the image is treated as one zone and the total image without zoning is sent for OCR processing 134.
The standard auto-zoning process is a commercially available software routine that provides autozoning but is not specifically created or tuned to newspapers. The newspaper decolumnization process, as its name suggests, is developed to recognize columns. This decolumnization process groups the text of a column so that the OCR module performs its functions within the specified zone only. Without zoning, the OCR software reads the image from left to right and scans across column breaks as if the sentence continues in that direction as opposed to down the column. This creates groupings of words from left to right but does not maintain the original sentence format of the newspaper. This presents a problem if the text is to be imported into another system for re-use or if a word has been hyphenated and continued on the next line. It also may present a problem if the lines of the columns are not aligned well.
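The three-option fallback chain of the text zoning step can be sketched generically. The zoner callables here are placeholders, not the actual decolumnization or autozoning programs; the only behavior taken from the text is the ordered fallback ending in a single whole-image zone.

```python
class ZoningError(Exception):
    """Raised by a zoner that cannot segment the image."""

def zone_image(image, zoners):
    # zoners: ordered candidate zoning functions, e.g. a newspaper
    # decolumnizer followed by a generic autozoner (names assumed).
    # The first zoner that succeeds supplies the zones; if all fail,
    # the whole image is treated as one zone.
    for zoner in zoners:
        try:
            zones = zoner(image)
            if zones:
                return zones
        except ZoningError:
            continue
    return [image]
```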
b. Initial OCR Pass and Confidence Output
Steps 134 and 135 of FIG. 13 are only implemented to provide OCR confidence statistics. These steps are also the same for each of the processing lines of FIG. 13. The zoned, bitonal image files are passed through two separate OCR processes to yield two separate result files for each processing column of FIG. 13. The two OCR processes use different OCR applications and therefore can produce different results. Each of the OCR processes attempts to recognize each character in the bitonal image, and each also produces a confidence level output indicating how well the OCR process thinks that it has done on each character. The winning OCR engine's output is used for each character in the bitonal image. The total confidence level output for the combination of the two engine results is given as a percentage between 0 and 100. If one of the two OCR engines fails, the character and confidence output from the remaining OCR engine is used. These confidence level outputs are then compared in step 136, and the highest one is selected. The processing line that produced the winning confidence level output is flagged, and the zoned, bitonal image file from the output of the zoning process 133 of that winning processing column is used in step 137.
However, if none of the confidence levels equals a variable number such as 75% or greater, the folder can be automatically flagged for manual re-work by the QCR module (see FIG. 3, module 49). Using QCR 49, an operator can manually perform the newspaper threshold enhancement process of step 131 in FIG. 13. If the image cannot be adequately corrected, then the image is rejected to the rescan module 50.
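The two-engine, per-character winner selection and overall confidence computation might look like the following sketch; the (character, confidence) list representation of each engine's result is an assumption.

```python
def merge_ocr(results_a, results_b):
    # results_*: per-character (char, confidence 0-100) lists from
    # two OCR engines. The higher-confidence character wins at each
    # position; the overall confidence is the mean of the winners.
    merged = []
    for (ca, pa), (cb, pb) in zip(results_a, results_b):
        merged.append((ca, pa) if pa >= pb else (cb, pb))
    text = "".join(c for c, _ in merged)
    overall = sum(p for _, p in merged) / len(merged) if merged else 0.0
    return text, overall
```

If one engine fails entirely, its result list is simply absent and the surviving engine's characters and confidences are used directly, as the text describes.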
c. Five-Pass OCR and Voting
In step 137, the bitonal image file that gave rise to the winning confidence level is passed through five different OCR applications. More or fewer than five OCR applications can be used. Each OCR application gives a confidence level on each character of output, and voting is performed in step 138 on every character; the highest confidence character is selected and put into a separate file. The file containing this OCR text, which gave rise to the highest confidence level for each character, is then output from the voting OCR module 48 (see FIG. 3).

If one of the OCR engines fails, the character and confidence output from the remaining OCR engines is used to determine the winner and to determine the overall confidence rating for the ASCII text file created from the multi-pass OCR voting process.
The voting can be done in a variety of ways, and the threshold can be set at appropriate values to yield acceptable results. The OCR engines may compute a confidence level for each individual character and the OCR engine with the highest confidence for that character can be used for that character. The overall confidence of the resulting file is based on the confidence of each character. Alternatively, a simple vote can be used between the OCR engines' outputs, and the most common output for a given character can be selected, with confidence levels computed from the level of agreement for a given character, and with a suitable decision process being used to resolve ties. Other voting schemes will be apparent to those of ordinary skill in the art.
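One of the alternative voting schemes described, a simple majority vote with agreement-based confidence and a confidence tie-break, could be sketched as follows; the candidate representation is an assumption.

```python
from collections import Counter

def vote_char(candidates):
    # candidates: (char, confidence) pairs, one per OCR engine.
    # Majority vote on the character; ties broken by the summed
    # confidence of the engines backing each character.
    tally = Counter(c for c, _ in candidates)
    best = max(
        tally,
        key=lambda ch: (tally[ch],
                        sum(p for c, p in candidates if c == ch)),
    )
    # Confidence derived from the level of inter-engine agreement.
    agreement = 100.0 * tally[best] / len(candidates)
    return best, agreement
```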
If the confidence output is below a specified threshold (currently set at 75% confidence), the image is flagged for repair in the QCR module described below.
7. Quality Control and Repair ("QCR")
The quality control and repair process 49 of FIG. 3 is used to verify all data (imaging, OCR, and meta-data) that has been assembled from the data capture and conversion processes and to fix any data or image that is determined to be inferior, unless the image needs to be rescanned. If QCR cannot fix the incorrect data or poorly enhanced image, the particular image and problem is noted and the folder is rejected to re-scan 50. The QCR operator, step 49, has the capability to view the grayscale and bitonal image, the OCR text file, and the meta data, as indicated in FIG. 15A-FIG. 16. The leftmost window has the scanned grayscale image, the middle window has the enhanced bitonal image with the date stamp and other markings removed, and the rightmost window has the OCR text. Additionally, by looking at the bottom portion of the screen of FIG. 15A, the operator is able to see if the key field data has been entered properly, to inspect the OCR confidence, to tell whether or not there are jumps in the article, and to inspect the other available information. In addition, the quality control module allows the operator to view thumbnails of all images in each folder with a quick visual scan of the basic integrity of all the images, as seen in FIG. 15B. This is useful if 100% QC, as opposed to QC sampling, is required. The QCR module also contains a "capsule view," FIG. 16, that permits the operator to quickly determine information about images in a folder, including but not limited to: missing text or image files, needs OCR, flagged for manual QC, missing meta-data such as headline or date, etc. The QCR module capsule view provides the operator information concerning the number of images in a folder left to QC. A variable setting in the software allows this to be set at 100% or at some level of statistical sampling, such as 5 per folder. The capsule view tells the operator how many images are left in a folder to QC to meet the required QC sampling level.
The QCR module 49 can perform a variety of functions to repair images, such as manual cropping or grayscale enhancement, should the automated processes discussed above not be sufficient to create a high quality image. After manual enhancement, the image is sent back into the work flow to voting OCR 48. If the image is not repairable, then it is sent to the rescan module 50 and the folder with the appropriate clipping is directed to that location. The clipping is then rescanned. The rescan module 50 uses a different scanner from the high volume, high speed scanner 80 described earlier.
The inspection is manual, but other embodiments may automate all or part of the process. Because this is designed as the final quality check, the end result, which is the electronic folder associated with that clipping, can be failed for any reason.
If the key field data is incorrect for any reason, it is corrected immediately by the QCR operator 49. If the image is incorrect for any reason, then it is repaired as described and the electronic folder is sent back to the OCR process 48.
FIG. 15B also allows the operator to view and edit the OCRed text.
8. Jump Bundling and Copyright Embossing
After the electronic folder passes quality control 49, it is sent to the bundling and Copyright Embossing process 51. The bundling process 51 bundles the multiple images that exist, when articles include continuation "jumps," into one multipage image file. This process also combines the text of these multiple files into a single text file. As described earlier, any images that are jumps from the first page articles have been so indicated by the indexing operator by removing the "first page" indicator in step 44. This flag triggers the work flow software to join the images and text files into one file when the electronic folder is processed by the bundle jump software in step 51.
The bitonal image and ASCII file are also associated with the meta data record stored in the workflow database. The format of the image file can be any image file format, such as the TIFF format. In addition, composite file formats such as Adobe PDF can be utilized. The Adobe Portable Document Format (PDF) allows image and text files to be combined into one single file which can contain a single page or make up an entire document.
9. Export Data to Retrieval System
After the bundling, if necessary, is performed, the final product is ready to be exported to the retrieval system 12 (see FIG. 1). The export can be set up to provide data in a variety of formats for many types of manufacturer's retrieval systems. However, in the current integrated AVATAR Digital Asset Management System (ADAMS) the data is exported using a direct database and file system connection to the ADAMS ArchivalWare retrieval server 53 (see FIG. 3). The final product preferably includes the ASCII text, any associated images or other digital objects (audio, video, etc.), and the meta data describing these objects. Depending on the file format and the needs of the database, this information can be in one file or separated into multiple files.
10. Ship Back Clippings
After exporting the electronic information, the clippings are no longer needed and are sent back to the customer. Through the use of a variety of techniques, including bar-coding, work flow software to track and route the clippings, and the exit tray, the process has been optimized to protect the clippings from excessive handling and from inadvertent misplacement. This helps to ensure that the original source of this valuable data is not lost or destroyed.
C. Additional Modules

As mentioned above in the discussion of FIG. 2, the newspaper digitization application described does not use all of the capabilities of the data capture and conversion system 10. Some of the additional features are described below.
1. Manual Article Crop
Many newspapers have archived their materials onto microfilm or microfiche and have not performed the subject clipping over the years or have not maintained the integrity of the clipping files. For these clients a different process is used to "electronically clip" the articles from a scanned image of the complete newspaper page taken from microfilm. The operator using this client software marks/zones an entire paper based upon customer rules (for example, news only and disregard ads). The software to perform this task includes the following functions: (i) the ability to manually electronically zone/mark each article on the full newspaper page, whether the article is rectangular in shape or is made up of a series of rectangles, (ii) when an article has been manually zoned, it is marked in a colored outline so that the operator can determine when each article on the page has been zoned, and (iii) if an article has a jump, the operator has the ability to select/deselect a jump flag indicating to the work flow software in the next process that the article requires a mate.
2. Jump Connector
The jump connector software is required when using the manual article crop software previously described. When the operator of this software requests work (each task is typically a full day of the newspaper) from the coordinator, the software brings up a requested day's paper or the next available day. The software to perform this task includes the following functions: (i) automatic selection of pages that have been flagged with jumps, (ii) a split screen that auto selects a jump first page and allows the operator to move to the correct page of the paper while viewing the jumped page on the other half of the screen, (iii) with the jump displayed on one half of the computer screen, the jump article can be effectively zoned as with the manual article crop software described above, and (iv) the article images viewed on both halves of the computer screen can be electronically connected for processing by the image and text bundle module 51.
3. Auto-Zoning/ Auto Connection
The manual process described above for zoning articles from the full page can be automated. The process involves invoking computer algorithms that attempt to understand the structure of the page layout and separate the articles on the page. The algorithms may have fixed input and tuning parameters which produce more accurate results. Auto connection can be accomplished by defining the fixed methodology a newspaper or other publication uses to provide a "jump" on a different publication page. A methodology such as "continued from page #, ARTICLE NAME" allows the correct jump process to become automated.
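A hedged sketch of such automated jump matching follows; the exact jump-line wording varies by publication, so the pattern below is illustrative only.

```python
import re

# Hypothetical jump-line pattern for auto connection; real systems
# would carry one pattern per publication's house style.
JUMP_RE = re.compile(r"continued from page\s+(\d+)", re.IGNORECASE)

def find_jump_source(text):
    # Returns the page number the article continues from, or None
    # if no jump line is present in the zoned article text.
    m = JUMP_RE.search(text)
    return int(m.group(1)) if m else None
```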
4. Report Writer
This feature is utilized by the newspaper application, although it was not discussed above. The data capture and conversion system 10 can produce a large variety of reports. Particularly useful are the production reports on the throughput and efficiency of the process.
5. Image/Object Import
The system 10 can also input digital images directly with the image/object import utility. The system 10 does not need to scan information in order for it to be entered.
6. Image/Text/PDF Bundler
As described above, the system 10 can bundle both images and text into a common file using the PDF format.
7. Text/Records Import

Similar to the importing of digital images, the system 10 can also input digital data directly, without needing to scan it first. This includes the meta data and text for the obituaries and front page applications described earlier.
D. Additional Applications
Key field entry is particularly important in many additional applications, such as entering front page headlines or obituaries which are stored on microfilm. In these applications, the actual headlines from the front page and the text of the obituaries are entered manually rather than attempting to scan the material. One reason for this is that the quality of the image stored on microfilm is too poor to perform an accurate OCR. FIG. 12A shows an example of an obituary that has been entered. It includes the key field information in the meta data, as well as the text in a separate field below.
The process flow for microfilm data capture and conversion is different from that of the present newspaper scanning application. After key entry of the front page and/or an obituary or other data type, the process control software can send the electronic data directly to the QC module 49 (see FIG. 3).
Because both of these applications require more intensive key field entry, a special text verification system was developed to verify the data entry. The text verification system, portions of which are illustrated in FIG. 12B, performs several different functions.
As the data is entered 120, the system utilizes a real-time spell check system 121 which incorporates a custom dictionary for newspaper applications. The recognized and correctable entries of the preferred embodiment include (i) states in the United States and state abbreviations, (ii) cities, (iii) counties in a particular state, (iv) cemeteries, (v) churches, (vi) first names, and (vii) hard-to-recognize names. When a misspelling occurs, the operator is prompted in real time that a potential error has occurred. The operator must then check the keyed word against the original (image or hard copy data) and invoke or ignore the suggested correction. Clearly, the list of entries is expandable and modifiable to meet a particular application. Because of the laborious nature of the re-keying effort, the data entry 120 is typically performed remotely with cost-effective labor, such as in offshore locations or in penal institutions. The remote keying software is designed for download 124 to the final QC 125 and processing work flow software 127.
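A minimal sketch of the real-time dictionary check is shown below. The word list is a placeholder (the actual custom dictionary covers the categories listed above), and `difflib` stands in for whatever matcher the disclosed system actually uses.

```python
import difflib

# Placeholder entries; the real custom dictionary covers states, cities,
# counties, cemeteries, churches, first names, and hard-to-recognize names.
CUSTOM_DICTIONARY = {"maryland", "baltimore", "annapolis", "arbutus"}

def check_word(word, dictionary=CUSTOM_DICTIONARY):
    """Return None if the keyed word is recognized; otherwise return the
    closest dictionary entry to offer the operator as a suggested correction."""
    w = word.lower()
    if w in dictionary:
        return None
    matches = difflib.get_close_matches(w, dictionary, n=1)
    return matches[0] if matches else None
```

The operator would then accept or ignore the returned suggestion, exactly as the text above describes.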
The next process is performed by remote QC personnel. The keyed data is run against a specially designed keyed-data-test software program 122. The program implements the following tests; when a potential problem is found, the test operator is prompted and the problem is inserted into the operator's viewing window.
The validity and format of the data is checked against the correct format for the application. The field format delimiters contained in each keyed article (for data input) will preferably be:
HEADLINE:
BYLINE:
TEXT:
PHOTO_CAPTION:
PHOTO_CREDIT:
ENHANCEMENT_TERMS:
The program checks that all required fields for a given article are present, that correct meta-data delimiters inserted by the macros of 120 have not been accidentally deleted, and that the data is in the correct format for the export/loading software 128.
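The required-field check can be sketched as follows. Which delimiters are required varies per application, so the required list here is illustrative.

```python
# Delimiters from the field format above; which ones are mandatory for a
# given article is application-specific, so this required list is illustrative.
REQUIRED_FIELDS = ["HEADLINE:", "BYLINE:", "TEXT:"]

def validate_article(keyed_text):
    """Return a problem report for each required delimiter that is missing
    or was accidentally deleted from the keyed article."""
    return [f"missing field {f}" for f in REQUIRED_FIELDS if f not in keyed_text]
```

An empty report means the article can pass on to the export/loading software.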
The keyed-data-test software 122 includes functions which help address the problem of differentiating between I, J, L, and T. These letters are difficult to distinguish in small-font, microfilmed newspaper death notice data. The feature attempts to recognize when these particular letters have been confused and places the potential problem in the viewing window for the test operator to repair. The keyed-data-test software 122 also includes functions that flag words that should never appear in the typed text. When data is being keyed by penal institution inmates, this software effectively provides a stop list of words that the test operator can verify are indeed in the newspaper. Any class of words can be put into this list, and it is often used to ensure that offensive language is not entered, whether purposely or accidentally. This gives the ultimate purveyor of the database some assurance that it will neither offend others nor be embarrassed.
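Both checks can be approximated in a single pass over the keyed words; the stop-list contents and the flagging policy below are placeholders, not the disclosed lists.

```python
import re

STOP_LIST = {"badword"}     # placeholder; the real list is application-specific
CONFUSABLE = set("IJLT")    # capitals that are hard to read on microfilm

def flag_for_review(text):
    """Return (word, reason) pairs the test operator should verify:
    stop-list hits, and words containing the easily confused capitals."""
    flags = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in STOP_LIST:
            flags.append((word, "stop list"))
        elif CONFUSABLE & set(word):
            flags.append((word, "confusable letter"))
    return flags
```

Flagged words would be placed in the operator's viewing window for comparison against the original image.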
The keyed-data-test software 122 also includes a function that automatically places a period after "Mr", "Mrs", "Ms", and "Dr". Further, the keyed-data-test software 122 includes functions that report statistics such as the number of characters and articles typed. This data is used for invoicing and for tracking productivity.
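The period-insertion rule and the reporting counts can be sketched directly; the regular expression is an assumption about how the rule might be applied, not the disclosed implementation.

```python
import re

def add_honorific_periods(text):
    """Insert the period after Mr, Mrs, Ms, and Dr when the keyer omitted it.
    The lookahead leaves already-correct forms like "Mrs." untouched."""
    return re.sub(r"\b(Mrs|Mr|Ms|Dr)\b(?!\.)", r"\1.", text)

def keying_stats(articles):
    """Character and article counts used for invoicing and productivity tracking."""
    return {"articles": len(articles), "characters": sum(len(a) for a in articles)}
```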
Once the keyed-data-test operation 122 is complete, the data is downloaded 124 from the remote keying facilities to the data processing location. This is normally performed via telecommunications, but the data can also be moved via magnetic disk or tape.
Final QC 125 is performed at the data processing facility using the same keyed-data-test program used by the remote facilities. If the data has been "cleaned up" at the remote facilities, the test program will flag few or no errors.
In addition to the test program, a sample of the data is reviewed manually 127 to make sure that text has not been omitted during the re-keying effort. Because the keying software 120 allows the data to be entered in the same format as the original, the QC technician scans the left and right sides of the textual columns to compare them to the original. With this technique it is easy to spot missing information.
After the final QC of the data is performed, the keyed-data-load software module 122 is executed and the data is loaded into the retrieval system 128. When the text data is loaded, the software un-wraps the columns of text which were keyed "as is" for QC purposes. When the old newspaper-width columns are unwrapped, hyphenated words are reconnected. The software uses a look-up table to ensure that words that are to remain hyphenated do so.
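The un-wrap and hyphen-reconnection step might look like the following sketch, where `KEEP_HYPHENATED` stands in for the look-up table; its entries are illustrative.

```python
KEEP_HYPHENATED = {"mother-in-law"}  # illustrative look-up table entries

def unwrap_columns(lines):
    """Join column-width lines into flowing text, reconnecting words that were
    hyphenated at line breaks unless the look-up table says the hyphen belongs."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            stem = text[:-1]
            first = line.split(" ", 1)[0]
            candidate = stem.rsplit(" ", 1)[-1] + "-" + first
            if candidate.lower() in KEEP_HYPHENATED:
                text = text + line   # hyphen is part of the word; keep it
            else:
                text = stem + line   # line-break hyphen; rejoin the word
        elif text:
            text = text + " " + line
        else:
            text = line
    return text
```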
In general, the text verification system offers a number of benefits over spell checkers and other data entry verification systems. The text verification system speeds up data entry by recognizing and entering words before they are completely typed, provides automatic correction of errors, assists those for whom English is a second language, and screens out errant entries from a disgruntled data entry person. The text verification system can also include a variety of other utilities and features which may be customized to the particular application.
III. CATALOGING SYSTEM AND AVATAR EDIT
The cataloging system 14 (see FIG. 1) is delivered to the database maintainer. It is used for a variety of functions, including updating, modifying, repairing, replacing, and editing of the database entries. This can be useful, for example, in cataloging special entries by adding terms to unique enhancement term key fields. As shown in FIG. 17, the cataloging system includes an image/text editor 162 and a database editor 164. The database editor 164 allows editing of the database entries themselves, and the image/text editor 162 allows the cataloging technician to clean up dirty OCRed text to a 100% correct format. This is done by comparing the image to the keyed text and manually editing and re-saving the corrected text. Small image editing functions can preferably be performed, such as image de-skew and cropping. Both the image/text editor 162 and the database editor 164 are preferably contained in an editing program called AVATAR EDIT.
To operate AVATAR EDIT, the user enters a search string in this editor, using natural language or specific terms, for example, and the retrieval system 12 (described below) delivers the search results in this editor as well. As shown in FIG. 19, in one configuration the meta data appears on the left-hand side of the screen, the scanned article appears on the top right-hand side, and the search results appear on the bottom right-hand side. This screen can be used to edit the meta data. FIG. 20 shows another configuration in which the OCR text appears along with the actual scanned article and the search results. This screen can be used to edit the OCR text.
IV. RETRIEVAL SYSTEM
The retrieval system 12 of FIG. 1 includes a number of different features and modules, as shown in FIG. 18. The interconnection between the elements of FIG. 18 is intended to indicate the integrated nature of the system 12, rather than physical or logical connections between the elements.
The user interface 171 is the primary user interface to the retrieval system 12. Preferably, the user interface 171 is a Web interface which allows access to the retrieval system 12 over the world wide web, but other interfaces can also be used, such as an MS-Windows application interface. The user interface 171 can be implemented with any display device controlled, for example, by a processor. Security features 177, such as password protection, are in place to ensure that only registered and authorized users gain access. Users could be given access for limited periods of time or to limited parts of the database. The security features 177 can also be used for billing purposes.
The SQL/ODBC database 175 can hold the meta data, the actual word-searchable text files, and the associated image files containing the digitized clippings; however, the system also allows the text and images to be stored under the operating system's file structure 173. The database 175 and file structure 173 can include any storage medium. Preferably, a digital storage medium, such as a hard disk, is used.
The search engine 179 allows the user to search both the SQL/ODBC database 175 and the text files 173 using sophisticated searching techniques especially configured for this application. The search engine 179 preferably includes a processor running appropriate software to perform the necessary functions.
Searching is preferably performed by entering a search string (not shown). The search engine 179 preferably uses the search string to search various text files in the SQL/ODBC database 175 and/or the text files 173 and produces at least one search result. Sample search results are listed in the bottom right of FIG. 19, and correspond in this embodiment to different newspaper articles which have been scanned. In this embodiment, the retrieval system 12 preferably includes a computer system running software which performs one or more of the functions described.
Because of the potential that the OCR will not be 100% accurate, the search engine 179 incorporates Adaptive Pattern Recognition Processing (APRP). Adaptive pattern recognition processing is commonly known as fuzzy searching and provides techniques which are fault tolerant. It finds patterns within the search string, and within words, and matches those patterns with patterns in the meta data or the text data. This processing technique also allows for user feedback to help refine the search. Fuzzy searching provides the ability to retrieve approximations of search queries and has a natural tolerance for errors in both input data and query terms. It eliminates the need for OCR clean-up, which is especially useful in applications that handle large volumes of scanned documents. High precision and recall give end-users a high level of confidence that their queries will return all of the requested information regardless of errors in spelling or in the "dirty data" which they may be searching.
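APRP itself is proprietary; the trigram-overlap measure below is a toy stand-in that shows why pattern-level matching tolerates OCR errors where exact string matching fails.

```python
def trigrams(word):
    """Character trigrams of a word, padded so the word edges count."""
    w = "  " + word.lower() + " "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity(query, candidate):
    """Jaccard overlap of trigram sets: 1.0 for identical words, and still
    high when OCR has corrupted a character or two."""
    a, b = trigrams(query), trigrams(candidate)
    return len(a & b) / len(a | b)
```

Under such a measure, a query for "newspaper" still scores highly against the OCR error "newspapcr", so the dirty text is retrievable without clean-up.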
The search engine 179 also provides semantic expansion and semantic search capability. Preferable features of the Semantic Network include:
• Multiple Lexical Sources: The baseline Semantic Network is preferably created from complete dictionaries, a thesaurus, and other semantic resources, and gives users a built-in knowledgebase of 400,000 word meanings and over 1.6 million word relationships.
• Natural Language Processing: Users can preferably simply enter straightforward, plain English queries, which are then automatically enhanced by a rich set of related terms and concepts, to find information targeted to their specific context.
• Morphology: The Network preferably recognizes words at the root level, which is a much more accurate approach than the simple stemming techniques characteristic of other text retrieval software. This minimizes word misses which are caused by irregular or variant spellings.
• Idioms: The Network preferably recognizes idioms for more accurate searches, and processes phrases like "real estate" and "kangaroo court" as single units of meaning, not as individual words.
• Semantics: The Network preferably recognizes multiple meanings of words and allows users to simply point and click to choose the meaning appropriate to their queries.
• Multi-layered dictionary: The baseline Semantic Network preferably supports multi-layered dictionary structures that add even greater depth and flexibility. This enables integration of specialized reference works for legal, medical, finance, engineering, and other disciplines. End users can also preferably add personalized definitions and concepts without affecting the integrity of the baseline knowledgebase.
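The difference between root-level morphology and naive suffix stemming can be seen in miniature: a root table handles irregular variants that suffix stripping cannot. The table entries below are illustrative, not drawn from the actual Semantic Network.

```python
# Tiny illustrative root table; a real semantic network derives these
# relationships from full dictionaries rather than a hand-built map.
ROOTS = {"ran": "run", "running": "run", "runs": "run", "geese": "goose"}

def root(word):
    """Reduce a word to its root form; irregular variants like 'ran' and
    'geese' map correctly, which simple suffix stemming would miss."""
    w = word.lower()
    return ROOTS.get(w, w)
```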
V. VARIATIONS
In accordance with the present invention, the functionality disclosed in this application can be, at least partially, implemented by hardware, software, or a combination of both. This may be done, for example, with a Pentium-based computer system running database and editing software, or other programs. Moreover, this functionality may be embodied in computer readable media or computer program products to be used in programming an information-processing apparatus to perform in accordance with the invention. Such media or products may include magnetic, magnetic-optical, optical, and other types of media, including for example 3.5 inch diskettes and other digital storage media. This functionality may also be embodied in computer readable media such as a transmitted waveform to be used in transmitting the information or functionality.
Additionally, software implementations can be written in any suitable language, including without limitation high-level programming languages such as C++, mid-level and low-level languages, assembly languages, and application-specific or device-specific languages. Such software can run on a general purpose computer such as a 486 or a Pentium, an application-specific piece of hardware, or other suitable device.
In addition to using discrete hardware components in a logic circuit, the required logic may also be performed by an application specific integrated circuit ("ASIC") or other device. The technique may use analog circuitry, digital circuitry, or a combination of both. Embodiments may also include various hardware components which are well known in the art, such as connectors, cables, and the like.
The principles, preferred embodiments, and modes of operation of the present invention have been described in the foregoing specification. The invention is not to be construed as limited to the particular forms disclosed, because these are regarded as illustrative rather than restrictive. Moreover, variations and changes may be made by those of ordinary skill in the art without departing from the spirit and scope of the invention.

Claims

WE CLAIM:
1. A method of processing newsprint data which has been scanned into a digital image, the method comprising:
removing marks in the digital image of scanned newsprint data using a grayscale enhance function;
performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output;
storing the OCR output in a digital storage medium; and
controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
2. The method of claim 1, wherein the digital image comprises at least a sixty-four-level grayscale image at the beginning of the process of removing marks.
3. The method of claim 1, wherein removing marks in the digital image comprises:
dividing the digital image into grids;
determining an average grayscale density over all grids which are not either all black or all white; and
thresholding the digital image based on the average grayscale density.
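A sketch of the grid-based thresholding recited in claim 3, in Python; the grid size and the all-black/all-white cutoffs are assumptions for illustration, not values from the disclosure.

```python
def threshold_image(pixels, grid=16, white=250, black=5):
    """Grid-average thresholding: split the image into grid x grid tiles,
    average the grayscale of tiles that are neither all black nor all white,
    then binarize the whole image against that average.
    `pixels` is a list of rows of 0-255 values; cutoffs are illustrative."""
    h, w = len(pixels), len(pixels[0])
    sums, count = 0, 0
    for y in range(0, h, grid):
        for x in range(0, w, grid):
            tile = [pixels[j][i] for j in range(y, min(y + grid, h))
                                 for i in range(x, min(x + grid, w))]
            if all(p <= black for p in tile) or all(p >= white for p in tile):
                continue  # skip all-black / all-white tiles per the claim
            sums += sum(tile)
            count += len(tile)
    avg = sums / count if count else 128
    return [[255 if p > avg else 0 for p in row] for row in pixels]
```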
4. The method of claim 1,
wherein performing OCR comprises performing OCR on the digital image with a plurality of OCR engines after removing marks, to produce an OCR output for each of the plurality of OCR engines, and further comprising selecting a particular OCR output from among the plurality of OCR outputs.
5. The method of claim 4, wherein selecting the particular OCR output comprises:
voting to determine an OCR output having highest confidence; and
designating the OCR output having highest confidence as the particular OCR output.
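Per-character voting across engines, as recited in claims 5 and 9, can be sketched as a majority count; real systems must also align outputs of different lengths and weight votes by engine confidence, which this toy ignores.

```python
from collections import Counter

def vote_character(candidates):
    """The character proposed by the most engines wins; confidence weights
    could replace the simple counts used here."""
    return Counter(candidates).most_common(1)[0][0]

def vote_text(engine_outputs):
    """Combine equal-length outputs from several OCR engines character by
    character; alignment of differing lengths is glossed over in this sketch."""
    return "".join(vote_character(chars) for chars in zip(*engine_outputs))
```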
6. The method of claim 4, wherein controlling the work flow comprises:
controlling, using software, a first work flow of the digital image as it moves among the processes of removing marks and performing OCR; and
controlling, using software, a second work flow of the plurality of OCR outputs as they move among the processes of performing OCR and selecting a particular OCR output.
7. The method of claim 4, wherein performing OCR with a plurality of OCR engines comprises using at least five OCR engines.
8. The method of claim 4, wherein:
performing OCR with a plurality of OCR engines comprises producing a confidence indicator for the OCR output of each of the plurality of OCR engines; and
selecting the particular OCR output comprises
voting to determine one of the plurality of OCR outputs for which the confidence indicator is not exceeded by the other confidence indicators, and
designating the one of the plurality of OCR outputs for which the confidence indicator is not exceeded as the particular OCR output.
9. The method of claim 4, wherein selecting the particular OCR output is performed for each character of OCR input separately.
10. The method of claim 1,
wherein the OCR output of the newspaper data is automatically parsed to obtain specific pieces of metadata such as title, author, and date.
11. The method of claim 1, wherein controlling the work flow comprises at least one of sending the digital image to a process, sending the digital image to a station, automatically rejecting the digital image if quality is insufficient, presenting the digital image to an operator at a station, checking the digital image in after a process has been performed, checking the digital image out to a station or a process, associating a barcode with non-electronic data relating to the digital image, tracking non-electronic data associated with the digital image using a barcode, and collecting and storing data relating to the digital image.
12. The method of claim 8, wherein controlling the work flow comprises collecting and storing data relating to the digital image, wherein the data comprises an indicator of whether the input has a jump and an indicator of the quality of the OCR output, and wherein the data is collected from operator input or is automatically generated.
13. The method of claim 1, wherein controlling the work flow comprises:
controlling, using software, a first work flow of the digital image as it moves among the processes of removing marks and performing OCR; and
controlling, using software, a second work flow of the OCR output as it moves among the processes of performing OCR and storing.
14. A method of retrieving digitally stored newsprint data, the method comprising: providing a database of newsprint information, the database having been created using the method of claim 1; and
searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
15. The method of claim 1, where the image is taken from original archive hard copy newspaper.
16. The method of claim 1, where the image is taken from microfilm or microfiche as opposed to hardcopy newspaper or newspaper clippings.
17. The method of claim 16, where an article image is manually or automatically produced from the full page image.
18. The method of claim 17, where the "jump" continued-on image is connected to the first page of the article via automated or manual processes.
19. The method of claim 1, where the paper preparation process includes the use of commercially available lamination devices to press the folded news clipping flat prior to scanning.
20. A retrieval system comprising:
a database of newsprint information, the database having been created using the method of claim 1 ; and
a search engine capable of performing adaptive pattern recognition processing and morphology such that text which does not exactly match the search string can be retrieved.
21. A method of utilizing digitally stored newsprint data, the method comprising: searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data;
producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image; and
displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
22. The method of claim 21, further comprising displaying the text which produced the search result.
23. The method of claim 21, wherein the word searchable database has been created using the following steps: removing marks in the digital image of scanned newsprint data using a grayscale enhance function;
performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output;
storing the OCR output in a digital storage medium; and
controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
24. The method of claim 21, wherein searching the text in the word searchable database comprises using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
25. The method of claim 24, wherein digital data produced through voting to determine the OCR output having highest confidence is utilized to limit and provide boundaries for the full text.
26. A computer program product comprising computer readable program code for processing newsprint data which has been scanned into a digital image, the computer readable program code comprising:
first program part for removing marks in the digital image of scanned newsprint data using a grayscale enhance function;
second program part for performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output;
third program part for storing the OCR output in a digital storage medium; and
fourth program part for controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
27. A computer program product comprising computer readable program code for retrieving digitally stored newsprint data, the computer readable program code comprising:
first program part for providing a database of newsprint information, the database having been created using the following steps:
removing marks in the digital image of scanned newsprint data using a grayscale enhance function,
performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output,
storing the OCR output in a digital storage medium, and
controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output; and
second program part for searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
28. A computer program product comprising computer readable program code for utilizing digitally stored newsprint data, the computer readable program code comprising:
first program part for searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data;
second program part for producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image; and
third program part for displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
29. A device for processing newsprint data which has been scanned into a digital image, the device comprising:
a scanner which produces a digital image from hard copy, microfiche, or microfilm;
an image cleaner which removes marks in the digital image of scanned newsprint data using a grayscale enhance function;
an OCR unit which performs optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output;
a digital storage medium for storing the OCR output; and
a coordinator which controls the work flow between the image cleaner, the OCR unit, and the digital storage medium.
30. The method of claim 28, where the digital clipping scanner uses a vacuum-fed, belt-driven feed mechanism to allow variable-size newspaper clippings to be processed.
31. The method of claim 28, where the digital clipping scanner produces grayscale images.
32. The method of claim 28, where the digital clipping scanner incorporates a removable paper exit tray.
33. A retrieval system comprising:
a search engine, wherein the search engine
searches text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data, and
produces a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image; and
a user interface which displays the particular scanned image of newsprint data which corresponds to the text which produced the search result.
PCT/US2000/022492 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data WO2001013279A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU70605/00A AU7060500A (en) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14922299P 1999-08-17 1999-08-17
US60/149,222 1999-08-17

Publications (3)

Publication Number Publication Date
WO2001013279A2 true WO2001013279A2 (en) 2001-02-22
WO2001013279A9 WO2001013279A9 (en) 2001-06-14
WO2001013279A3 WO2001013279A3 (en) 2004-02-19

Family

ID=22529293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/022492 WO2001013279A2 (en) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Country Status (2)

Country Link
AU (1) AU7060500A (en)
WO (1) WO2001013279A2 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0539106A2 (en) * 1991-10-24 1993-04-28 AT&T Corp. Electronic information delivery system
US5402504A (en) * 1989-12-08 1995-03-28 Xerox Corporation Segmentation of text styles
US5809167A (en) * 1994-04-15 1998-09-15 Canon Kabushiki Kaisha Page segmentation and character recognition system


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1552466A4 (en) * 2002-10-18 2007-07-25 Olive Software Inc System and method for automatic preparation of data repositories from microfilm-type materials
EP1552466A1 (en) * 2002-10-18 2005-07-13 Olive Software, Inc. System and method for automatic preparation of data repositories from microfilm-type materials
EP1684199A2 (en) * 2005-01-19 2006-07-26 Olive Software, Inc. Digitization of microfiche
EP1684199A3 (en) * 2005-01-19 2008-07-09 Olive Software, Inc. Digitization of microfiche
FR2886429A1 (en) * 2005-05-27 2006-12-01 Thomas Henry SYSTEM FOR USER TO MANAGE A PLURALITY OF PAPER DOCUMENTS
WO2006125831A1 (en) * 2005-05-27 2006-11-30 Thomas Henry Devices and methods allowing a user to manage a plurality of objects in particular paper documents
WO2007024392A1 (en) * 2005-08-24 2007-03-01 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
US7539343B2 (en) 2005-08-24 2009-05-26 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
US9386877B2 (en) 2007-05-18 2016-07-12 Kraft Foods R & D, Inc. Beverage preparation machines and beverage cartridges
US10952562B2 (en) 2007-05-18 2021-03-23 Koninklijke Douwe Egberts B.V. Beverage preparation machines and beverage cartridges
US20130300562A1 (en) * 2012-05-11 2013-11-14 Sap Ag Generating delivery notification
US10380554B2 (en) 2012-06-20 2019-08-13 Hewlett-Packard Development Company, L.P. Extracting data from email attachments
US10445617B2 (en) 2018-03-14 2019-10-15 Drilling Info, Inc. Extracting well log data
US10565467B2 (en) 2018-03-14 2020-02-18 Drilling Info, Inc. Extracting well log data

Also Published As

Publication number Publication date
WO2001013279A9 (en) 2001-06-14
AU7060500A (en) 2001-03-13
WO2001013279A3 (en) 2004-02-19

Similar Documents

Publication Publication Date Title
US6243501B1 (en) Adaptive recognition of documents using layout attributes
US7050630B2 (en) System and method of locating a non-textual region of an electronic document or image that matches a user-defined description of the region
US7773822B2 (en) Apparatus and methods for management of electronic images
US5628003A (en) Document storage and retrieval system for storing and retrieving document image and full text data
US7081975B2 (en) Information input device
US5923792A (en) Screen display methods for computer-aided data entry
EP1473641B1 (en) Information processing apparatus, method, storage medium and program
US6909805B2 (en) Detecting and utilizing add-on information from a scanned document image
Papadopoulos et al. The IMPACT dataset of historical document images
US6178417B1 (en) Method and means of matching documents based on text genre
US20050289182A1 (en) Document management system with enhanced intelligent document recognition capabilities
US7548916B2 (en) Calculating image similarity using extracted data
US20010042083A1 (en) User-defined search template for extracting information from documents
US20030152277A1 (en) Method and system for interactive ground-truthing of document images
US20100128922A1 (en) Automated generation of form definitions from hard-copy forms
US20090116746A1 (en) Systems and methods for parallel processing of document recognition and classification using extracted image and text features
US20040015775A1 (en) Systems and methods for improved accuracy of extracted digital content
WO2007117334A2 (en) Document analysis system for integration of paper records into a searchable electronic database
Kim et al. Automated labeling in document images
WO2011051815A2 (en) System and method of using dynamic variance networks
US20060167899A1 (en) Meta-data generating apparatus
WO2001013279A2 (en) Word searchable database from high volume scanning of newspaper data
KR20060001392A (en) Document image storage method based on content retrieval using OCR
US20030101199A1 (en) Electronic document processing system
US20060176521A1 (en) Digitization of microfiche

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: The EPO has been informed by WIPO that EP was designated in this application
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1-43, DESCRIPTION, REPLACED BY NEW PAGES 1-43; PAGES 44-51, CLAIMS, REPLACED BY NEW PAGES 44-51; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP