WO2001013279A3 - Word searchable database from high volume scanning of newspaper data

Info

Publication number
WO2001013279A3
Authority
WO
WIPO (PCT)
Prior art keywords
newsprint
information
image
processing
word
Prior art date
Application number
PCT/US2000/022492
Other languages
French (fr)
Other versions
WO2001013279A2 (en)
WO2001013279A9 (en)
Inventor
John R Yokley
Don Nissen
Erik Schwartz
Bryan Kornele
Ed Lee
Kevin Kapel
Original Assignee
Ptfs Inc
John R Yokley
Don Nissen
Erik Schwartz
Bryan Kornele
Ed Lee
Kevin Kapel
Priority date
1999-08-17
Filing date
2000-08-17
Publication date
2004-02-19
Application filed by Ptfs Inc, John R Yokley, Don Nissen, Erik Schwartz, Bryan Kornele, Ed Lee, Kevin Kapel
Priority to AU70605/00A (AU7060500A)
Publication of WO2001013279A2
Publication of WO2001013279A9
Publication of WO2001013279A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Abstract

A process for digitizing newsprint information from a newspaper includes scanning the information into a digital image format and then processing the image to produce searchable text. The processing includes removing data stamps and other marks that are written over the newsprint, enhancing the image using a library of image processing functions, and performing voting-OCR to select an optimal OCR output. The OCR output yields highly accurate text which can be word searched using adaptive pattern recognition processing, fuzzy logic, morphology, and other techniques to provide a word searchable database of newsprint information from newspapers. The process is software controlled so that the work flow, both electronic and non-electronic, between various processes or stations is tracked and sequenced, and appropriate data is collected and stored.
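The abstract names voting-OCR and fuzzy word search but does not say how either is implemented. The following Python sketch is one possible reading, not the patented method: it assumes several OCR engines have each returned the same number of words for a line, takes a per-position majority vote, and then uses the standard library's difflib as a stand-in for the fuzzy matching techniques the abstract lists. The function names (vote_words, build_index, fuzzy_search), the page identifier, and the sample readings are all hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def vote_words(engine_outputs):
    """Pick, for each word position, the reading most engines agree on.

    Assumption: every engine produced the same number of words for the
    line. Real OCR outputs would need to be aligned first.
    """
    voted = []
    for candidates in zip(*engine_outputs):
        # most_common(1) yields the top reading; on a tie the reading
        # seen first (the earliest engine) wins.
        winner, _count = Counter(candidates).most_common(1)[0]
        voted.append(winner)
    return voted


def build_index(pages):
    """Map each lowercased word to the set of page ids containing it."""
    index = {}
    for page_id, words in pages.items():
        for word in words:
            index.setdefault(word.lower(), set()).add(page_id)
    return index


def fuzzy_search(index, query, cutoff=0.75):
    """Return page ids holding words that approximately match the query.

    difflib's similarity ratio is used here only as a simple substitute
    for the fuzzy matching mentioned in the abstract.
    """
    matches = get_close_matches(query.lower(), list(index), n=5, cutoff=cutoff)
    hits = set()
    for word in matches:
        hits |= index[word]
    return hits


if __name__ == "__main__":
    # Three hypothetical engine readings of the same scanned line.
    engines = [
        ["The", "mayor", "spoke", "today"],
        ["The", "rnayor", "spoke", "today"],
        ["Tbe", "mayor", "spoke", "today"],
    ]
    line = vote_words(engines)
    index = build_index({"page-1": line})
    print(line)
    print(fuzzy_search(index, "mayer"))
```

Voting per word position keeps the sketch short; a production system would more likely align the engine outputs at character level and weight engines by confidence, neither of which the abstract specifies.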

Priority Applications (1)

Application Number Publication Priority Date Filing Date Title
AU70605/00A AU7060500A (en) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14922299P 1999-08-17 1999-08-17
US60/149,222 1999-08-17

Publications (3)

Publication Number Publication Date
WO2001013279A2 WO2001013279A2 (en) 2001-02-22
WO2001013279A9 WO2001013279A9 (en) 2001-06-14
WO2001013279A3 WO2001013279A3 (en) 2004-02-19

Family

ID=22529293

Family Applications (1)

Application Number Publication Priority Date Filing Date Title
PCT/US2000/022492 WO2001013279A2 (en) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Country Status (2)

Country Link
AU (1) AU7060500A (en)
WO (1) WO2001013279A2 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810136B2 (en) * 2002-10-18 2004-10-26 Olive Software Inc. System and method for automatic preparation of data repositories from microfilm-type materials
US20060176521A1 (en) * 2005-01-19 2006-08-10 Olive Software Inc. Digitization of microfiche
FR2886429B1 (en) * 2005-05-27 2007-08-10 Thomas Henry SYSTEM FOR USER TO MANAGE A PLURALITY OF PAPER DOCUMENTS
US7539343B2 (en) * 2005-08-24 2009-05-26 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
GB2449213B (en) 2007-05-18 2011-06-29 Kraft Foods R & D Inc Improvements in or relating to beverage preparation machines and beverage cartridges
US20130300562A1 (en) * 2012-05-11 2013-11-14 Sap Ag Generating delivery notification
US10380554B2 (en) 2012-06-20 2019-08-13 Hewlett-Packard Development Company, L.P. Extracting data from email attachments
US10445617B2 (en) 2018-03-14 2019-10-15 Drilling Info, Inc. Extracting well log data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0539106A2 (en) * 1991-10-24 1993-04-28 AT&T Corp. Electronic information delivery system
US5402504A (en) * 1989-12-08 1995-03-28 Xerox Corporation Segmentation of text styles
US5809167A (en) * 1994-04-15 1998-09-15 Canon Kabushiki Kaisha Page segmentation and character recognition system

Also Published As

Publication number Publication date
AU7060500A (en) 2001-03-13
WO2001013279A2 (en) 2001-02-22
WO2001013279A9 (en) 2001-06-14

Similar Documents

Publication Publication Date Title
ZA200402928B (en) Digital ink database searching using handwriting feature synthesis.
EP0965943A3 (en) Optical character reading method and system for a document with ruled lines and their application
EP0841630A3 (en) Apparatus for recognizing input character strings by inference
WO2001084374A3 (en) Information access method
EP0990998A3 (en) Information search apparatus and method
EP0851382A3 (en) Apparatus and method for extracting management information from image
CN101140617A (en) Electronic equipments and text inputting method
EP1266286A4 (en) Method and system for detecting viruses on handheld computers
MY114036A (en) Character recognizing and translating system and voice recognizing and translating system
EP1403777A3 (en) Method and system for identifying a paper form using a digital pen
CA2373568A1 (en) Method of searching similar document, system for performing the same and program for processing the same
EP1197950A3 (en) Hierarchized dictionaries for speech recognition
WO2001013279A3 (en) Word searchable database from high volume scanning of newspaper data
ATE322051T1 (en) SYSTEM AND METHOD FOR AUTOMATICALLY PROCESSING AND SEARCHING SCANED DOCUMENTS
EP1519279A3 (en) Document transformation system
Maloo et al. Gujarati script recognition: a review
EP1530195A3 (en) Song search system and song search method
US6567548B2 (en) Handwriting recognition system and method using compound characters for improved recognition accuracy
JPH0675994A (en) Device for collating character string
CN104123527A (en) Mask-based image table document identification method
CA2313496A1 (en) Method of standardizing address data
CN115203474A (en) Automatic database classification and extraction technology
Couasnon et al. A real-world evaluation of a generic document recognition method applied to a military form of the 19th century
Roth An approach to recognition of printed music
Al-Khatib et al. Digital library framework for Arabic manuscripts

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1-43, DESCRIPTION, REPLACED BY NEW PAGES 1-43; PAGES 44-51, CLAIMS, REPLACED BY NEW PAGES 44-51; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP