WO2008009991A1 - Document similarity system - Google Patents
- Publication number
- WO2008009991A1 (application PCT/GB2007/050419)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- electronic
- words
- phrases
- dividing
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, the system comprising elements arranged for electronically processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
Description
DOCUMENT SIMILARITY SYSTEM
The present invention relates to a system for electronically comparing documents.
With the ever increasing dependency of organisations on computers to organise, store and communicate documentation, it is increasingly desirable to be able to control the distribution of electronic documentation. For example, it is desirable for an organisation to know who has received confidential documents, and/or to prevent unauthorised persons receiving them.
Electronic documents are usually transmitted over a network in a packet form so the actual data content of the document is mixed in each packet with other electronic information such as that which determines the type of data, routing information, check data and timing information. Thus, to determine the data content, the packets must be decoded.
One method currently used to monitor the electronic transmission of specified documents in data traffic over a network is to attach a packet sniffer (also known as network or protocol analyzer or Ethernet sniffer) to the network. Such a packet sniffer copies each of the data packets transmitted over the network, stores the packets and subsequently decodes and analyses the content for comparison to one or more predetermined files, looking for a match.
A problem with known packet sniffers is that they are only able to match identical documents. If a document is modified or paraphrased in any way, the transmitted document will not match the watched document and no action will be taken. In addition, known packet sniffers are slow and generally unable to operate fully in real time.
WO 02/010967 compares documents using a hash algorithm comprising a single hash value for each document and collection statistics.
The present invention seeks to provide a system of comparing documents and determining how similar their content is to one or more pre-selected documents.
According to the first aspect of the present invention there is provided an electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, the system comprising elements arranged for electronically processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
Preferably each document is divided at electronic signals representing one or more of the punctuation symbols such as full stop, comma, semi-colon, colon, single quote, double quote, question mark and exclamation mark. The term "words" is intended to mean any continuous sequence of alphanumeric characters such as letters A to Z and numbers 0 to 9.
The invention has particular application in computer networks for comparing documents transferred over such networks in real time.
The comparison value may be generated as a percentage for ease of interpretation and is preferably converted to a logarithmic scale. The system can be adapted automatically to take action if the comparison value is higher than a predetermined value.
A corresponding method, computer program and computer are also provided.
The invention generates a hash value for each phrase in each document and achieves a higher accuracy than WO 02/01967 which only uses a single value for each document.
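The difference between one hash per document and one hash per phrase can be illustrated with a minimal sketch. It is not the method of the cited application, which also uses collection statistics; it only contrasts the two granularities. The function names, the use of SHA-256, the case folding and the omission of glue-word removal are all assumptions made for brevity.

```python
import hashlib
import re


def whole_document_hash(text: str) -> str:
    """One hash for the entire text: any edit changes the value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def phrase_hashes(text: str) -> set:
    """One hash per punctuation-delimited phrase, words sorted alphabetically."""
    hashes = set()
    for phrase in re.split(r"[.,;:'\"?!]", text):
        # Case folding is an assumption; the source does not address case.
        words = sorted(re.findall(r"[a-z0-9]+", phrase.lower()))
        if words:
            canonical = " ".join(words)
            hashes.add(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return hashes


original = "The budget is secret. Sales rose sharply. Staff numbers fell."
edited = "The budget is secret. Sales rose sharply. Staff numbers grew."

# The single-hash comparison only reports "different".
same_whole = whole_document_hash(original) == whole_document_hash(edited)

# The per-phrase comparison still finds the two untouched phrases.
shared = phrase_hashes(original) & phrase_hashes(edited)
```

Here a one-word change defeats the whole-document comparison, while two of the three phrase hashes still match.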
The invention will now be described, by way of example, with reference to the accompanying drawing, in which the single figure is a block diagram of a method according to the present invention for comparing the similarity of electronic documents.
The single figure illustrates the steps involved in comparing electronic documents according to the system of the present invention. Electronic data representing a document to be tested is presented in a comparison step 10. The electronic data may be captured as raw packet data transmitted across a network, then decoded and recombined into its original form. If the document originates from network traffic, the comparison system should ideally be capable of processing each document in real time, so that a backlog of documents does not build up. However, the invention can also be used in stand-alone mode to compare documents.
The data representing the document is then divided in step 12 by dividing the text into phrases at punctuation symbols preferably including: full stop, comma, semi-colon, colon, single quotation marks, double quotation marks, question mark and exclamation mark.
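Step 12 could be sketched as follows, assuming the document is already available as a decoded text string. The names `PUNCTUATION` and `split_into_phrases` are illustrative, and dropping empty phrases is an assumption rather than something the description states.

```python
import re

# The eight punctuation symbols listed in the description.
PUNCTUATION = ".,;:'\"?!"
_SPLIT_PATTERN = re.compile("[" + re.escape(PUNCTUATION) + "]")


def split_into_phrases(text):
    """Split text at each listed punctuation mark, dropping empty phrases."""
    pieces = _SPLIT_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


phrases = split_into_phrases("Hello, world! Is this secret? Yes: it is.")
```

Because the single quotation mark is in the list, an apostrophe also ends a phrase, so a contraction such as "don't" is split into two phrases.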
Each of the phrases is then sub-divided into words in step 14, where words are defined as any continuous sequence of alphanumeric characters, i.e. of letters A to Z, and/or numbers 0 to 9. The phrases will thus be split into words at any character outside these ranges and all such characters are considered white-space and discarded in dividing the phrases into words.
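A minimal sketch of step 14, assuming the phrase is a plain string: a regular expression collects each maximal run of letters and digits, so every other character acts as a separator and is discarded. The function name is hypothetical, and accepting lower-case letters alongside A to Z is an assumption.

```python
import re

# A "word" is any continuous run of letters A-Z (either case) or digits 0-9.
_WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def split_into_words(phrase):
    """Return the alphanumeric words of a phrase, discarding all other characters."""
    return _WORD_PATTERN.findall(phrase)


words = split_into_words("state-of-the-art R2D2 (draft)")
```

Hyphens, brackets and spaces are all treated alike here, which is why "state-of-the-art" yields four separate words.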
Then at step 16 each of the words in each of the phrases is examined and all glue words are discarded. Glue words are those which do not add any intrinsic subject matter, for example "a", "the" and "it". This advantageously reduces the number of words without significantly affecting the content.
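Step 16 amounts to filtering each word list against a fixed set. In the sketch below the set holds only the three examples given in the description; a practical list would be longer, and its contents, the case-insensitive matching and the names used are assumptions.

```python
# Only the three examples from the description; a real list would be larger.
GLUE_WORDS = frozenset({"a", "the", "it"})


def discard_glue_words(words, glue_words=GLUE_WORDS):
    """Drop glue words, comparing case-insensitively (an assumption)."""
    return [word for word in words if word.lower() not in glue_words]


kept = discard_glue_words(["The", "report", "it", "is", "a", "draft"])
```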
Next, at step 18, the remaining words in each of the phrases are sorted into alphabetical order so that their position in the phrase is no longer important.
Each alphabetically sorted phrase is then used to generate a hash code in step 20. Any hash algorithm could be used to create the hash code; the MD4 algorithm, for example, would be suitable. It is advantageous to use a hash code to represent the contents of the phrase because it requires significantly less processing time to compare hash codes than to compare the alphabetically sorted phrases.
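The sorting and hashing of a phrase might look like the sketch below. SHA-256 is swapped in for MD4 because MD4 is not reliably available in current Python builds; the description allows any hash algorithm. Lower-casing the words and joining them with a single space before hashing are assumptions, as is the function name.

```python
import hashlib


def phrase_hash(words):
    """Hash a phrase so that word order no longer affects the result."""
    # Sort alphabetically; case folding is an assumption.
    ordered = sorted(word.lower() for word in words)
    # Join with a single space so that word boundaries stay unambiguous.
    canonical = " ".join(ordered)
    # SHA-256 stands in for MD4, which the description names as one option.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = phrase_hash(["sales", "rose", "sharply"])
reordered = phrase_hash(["sharply", "sales", "rose"])
changed = phrase_hash(["sales", "fell", "sharply"])
```

Reordering the words leaves the hash unchanged, while replacing a word produces a different hash.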
Once the hash codes for each phrase in a document have been created, they can be compared at step 22 with the hash codes of one or more other documents, for example pre-selected documents whose hash codes are already stored in the system. Such pre-selected documents may be documents that the administrator of the system has identified, for example confidential or classified documents. The pre-selected documents are processed in the same way as the document being tested, before the system receives the first document to be tested.
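One possible arrangement for the comparison is to keep the hash codes of each pre-selected document in a set and intersect the incoming document's set with each stored set. The sketch below assumes that layout; the document names and the short strings standing in for hash codes are placeholders, not real data.

```python
def count_matches(test_hashes, watched):
    """Count, per watched document, the hash codes shared with the tested document."""
    return {name: len(test_hashes & hashes) for name, hashes in watched.items()}


# Hash sets computed in advance for the pre-selected documents (placeholders).
watched = {
    "merger-plan": {"h1", "h2", "h3"},
    "payroll": {"h7", "h8"},
}

# Hash set of the document being tested (placeholder values).
incoming = {"h2", "h3", "h9"}

matches = count_matches(incoming, watched)
```

Set intersection keeps each comparison close to linear in the size of the smaller set, which fits the stated aim of comparing hash codes rather than phrases.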
The result of the comparison between documents is preferably displayed as a percentage value representing the proportion of hash codes, and hence phrases, which match in the two documents. The more matching hash codes there are in two documents being compared, the more likely it is that the documents are related. It is advantageous to display the percentages on a logarithmic scale since this makes it easier to visually identify similarities. It is also possible for the system to be configured to flag any documents which have a percentage match over a given threshold, so that appropriate action can be taken quickly, either manually, or automatically by the system, for example to block further dissemination of the document, to block future transmission of it, or to trace and log the source of the document.
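The scoring and flagging could be sketched as below. The description does not fix the denominator of the percentage, the logarithmic mapping or the threshold, so all three are assumptions here: the percentage is taken over the tested document's distinct hash codes, the scale is log10(1 + p) normalised to the range 0 to 1, and the default threshold of 50 is arbitrary.

```python
import math


def match_percentage(test_hashes, reference_hashes):
    """Percentage of the tested document's hash codes found in the reference."""
    if not test_hashes:
        return 0.0
    return 100.0 * len(test_hashes & reference_hashes) / len(test_hashes)


def to_log_scale(percentage):
    """Map 0..100 onto 0..1 logarithmically (one possible mapping)."""
    return math.log10(1.0 + percentage) / math.log10(101.0)


def should_flag(percentage, threshold=50.0):
    """Flag a document whose match percentage exceeds the threshold."""
    return percentage > threshold


score = match_percentage({"a", "b", "c", "d"}, {"a", "b", "x"})
```

With this mapping, low percentages are spread over more of the display range than high ones, which is one way small overlaps can be made easier to see.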
Claims
1. An electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, the system comprising elements arranged for electronically processing the electronic data representing each document by:
(a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark;
(b) dividing each of the phrases into words;
(c) discarding electronic signals representing glue words;
(d) within each phrase sorting the words into alphabetical order;
(e) generating a hash code for each alphabetically ordered phrase; and
(f) comparing the hash codes for the first document with the hash codes of the second document; and
(g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
2. A system according to claim 1 wherein each document is divided into phrases by dividing the electronic document text at each of the electronic signals representing the punctuation symbols: full stop, comma, semi-colon, colon, single quote, double quote, question mark and exclamation mark.
3. A system according to claim 1 or 2 wherein words are defined as any continuous sequence of alphanumeric characters comprising letters A to Z and numbers 0 to 9.
4. A system according to any one of the preceding claims wherein prior to dividing the document into phrases, the document is captured from a network.
5. A system according to claim 4 adapted to perform the comparing in real-time.
6. A system according to any one of the preceding claims comprising an additional element for taking action if the value of a document comparison is higher than a predetermined value.
7. A system according to any one of the preceding claims wherein the value is generated as a percentage.
8. A system according to claim 7 comprising an additional element for converting the percentage value results to a logarithmic scale.
9. A method of comparing the similarity of a first electronic document with a second electronic document comprising the steps of:
(a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark;
(b) dividing each of the phrases into words;
(c) discarding electronic signals representing glue words;
(d) within each phrase sorting the words into alphabetical order;
(e) generating a hash code for each alphabetically ordered phrase; and
(f) comparing the hash codes for the first document with the hash codes of the second document; and
(g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
10. A computer adapted to compare the similarity of a first electronic document with a second electronic document, comprising processing elements arranged for processing the electronic data representing each document by:
(a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark;
(b) dividing each of the phrases into words;
(c) discarding electronic signals representing glue words;
(d) within each phrase sorting the words into alphabetical order;
(e) generating a hash code for each alphabetically ordered phrase; and
(f) comparing the hash codes for the first document with the hash codes of the second document; and
(g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
11. A computer program arranged so that when loaded on a computer the computer will automatically compare the similarity of a first electronic document with a second electronic document by performing the steps of:
(a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark;
(b) dividing each of the phrases into words;
(c) discarding electronic signals representing glue words;
(d) within each phrase sorting the words into alphabetical order;
(e) generating a hash code for each alphabetically ordered phrase; and
(f) comparing the hash codes for the first document with the hash codes of the second document; and
(g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0614332.5 | 2006-07-19 | | |
GB0614332A (GB2440174A) | 2006-07-19 | 2006-07-19 | Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008009991A1 (en) | 2008-01-24 |
Family
ID=36998334
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/GB2007/050419 (WO2008009991A1) | 2006-07-19 | 2007-07-19 | Document similarity system |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2440174A (en) |
WO (1) | WO2008009991A1 (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100172547A1 (en) * | 2007-07-17 | 2010-07-08 | Toyota Jidosha Kabushiki Kaisha | On-vehicle image processing device |
US20110083187A1 (en) * | 2009-10-01 | 2011-04-07 | Aleksey Malanov | System and method for efficient and accurate comparison of software items |
US8200635B2 (en) | 2009-03-27 | 2012-06-12 | Bank Of America Corporation | Labeling electronic data in an electronic discovery enterprise system |
US8224924B2 (en) | 2009-03-27 | 2012-07-17 | Bank Of America Corporation | Active email collector |
US8244767B2 (en) | 2009-10-09 | 2012-08-14 | Stratify, Inc. | Composite locality sensitive hash based processing of documents |
US8250037B2 (en) | 2009-03-27 | 2012-08-21 | Bank Of America Corporation | Shared drive data collection tool for an electronic discovery system |
US8364681B2 (en) | 2009-03-27 | 2013-01-29 | Bank Of America Corporation | Electronic discovery system |
US8417716B2 (en) | 2009-03-27 | 2013-04-09 | Bank Of America Corporation | Profile scanner |
US8504489B2 (en) | 2009-03-27 | 2013-08-06 | Bank Of America Corporation | Predictive coding of documents in an electronic discovery system |
US8549327B2 (en) | 2008-10-27 | 2013-10-01 | Bank Of America Corporation | Background service process for local collection of data in an electronic discovery system |
US8572227B2 (en) | 2009-03-27 | 2013-10-29 | Bank Of America Corporation | Methods and apparatuses for communicating preservation notices and surveys |
US8572376B2 (en) | 2009-03-27 | 2013-10-29 | Bank Of America Corporation | Decryption of electronic communication in an electronic discovery enterprise system |
US8806358B2 (en) | 2009-03-27 | 2014-08-12 | Bank Of America Corporation | Positive identification and bulk addition of custodians to a case within an electronic discovery system |
US9053454B2 (en) | 2009-11-30 | 2015-06-09 | Bank Of America Corporation | Automated straight-through processing in an electronic discovery system |
US9330374B2 (en) | 2009-03-27 | 2016-05-03 | Bank Of America Corporation | Source-to-processing file conversion in an electronic discovery enterprise system |
US9355171B2 (en) | 2009-10-09 | 2016-05-31 | Hewlett Packard Enterprise Development Lp | Clustering of near-duplicate documents |
US9721227B2 (en) | 2009-03-27 | 2017-08-01 | Bank Of America Corporation | Custodian management system |
US11797486B2 (en) | 2022-01-03 | 2023-10-24 | Bank Of America Corporation | File de-duplication for a distributed database |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11177945B1 (en) | 2020-07-24 | 2021-11-16 | International Business Machines Corporation | Controlling access to encrypted data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6547829B1 (en) * | 1999-06-30 | 2003-04-15 | Microsoft Corporation | Method and system for detecting duplicate documents in web crawls |
US6658423B1 (en) * | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6493709B1 (en) * | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US7660819B1 (en) * | 2000-07-31 | 2010-02-09 | Alion Science And Technology Corporation | System for similar document detection |
US7356188B2 (en) * | 2001-04-24 | 2008-04-08 | Microsoft Corporation | Recognizer of text-based work |
- 2006-07-19: GB application GB0614332A (GB2440174A), status: not active, withdrawn
- 2007-07-19: WO application PCT/GB2007/050419 (WO2008009991A1), status: active, application filing
Non-Patent Citations (2)
Title |
---|
ALEXANDER BOGDANOVSKI: "An Automatic Text Summarizer", COM3010, 3 May 2006 (2006-05-03), Sheffield, UK, XP002452207, Retrieved from the Internet <URL:http://www.dcs.shef.ac.uk/intranet/teaching/projects/archive/ug2006/pdf/u2ab.pdf> [retrieved on 20070924] * |
YERRA R ET AL: "Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach", GRANULAR COMPUTING, 2005 IEEE INTERNATIONAL CONFERENCE ON BEIJING, CHINA 25-27 JULY 2005, PISCATAWAY, NJ, USA,IEEE, 25 July 2005 (2005-07-25), pages 693 - 699, XP010886118, ISBN: 0-7803-9017-2 * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100172547A1 (en) * | 2007-07-17 | 2010-07-08 | Toyota Jidosha Kabushiki Kaisha | On-vehicle image processing device |
US8549327B2 (en) | 2008-10-27 | 2013-10-01 | Bank Of America Corporation | Background service process for local collection of data in an electronic discovery system |
US8688648B2 (en) | 2009-03-27 | 2014-04-01 | Bank Of America Corporation | Electronic communication data validation in an electronic discovery enterprise system |
US9721227B2 (en) | 2009-03-27 | 2017-08-01 | Bank Of America Corporation | Custodian management system |
US9934487B2 (en) | 2009-03-27 | 2018-04-03 | Bank Of America Corporation | Custodian management system |
US8250037B2 (en) | 2009-03-27 | 2012-08-21 | Bank Of America Corporation | Shared drive data collection tool for an electronic discovery system |
US8364681B2 (en) | 2009-03-27 | 2013-01-29 | Bank Of America Corporation | Electronic discovery system |
US8417716B2 (en) | 2009-03-27 | 2013-04-09 | Bank Of America Corporation | Profile scanner |
US8504489B2 (en) | 2009-03-27 | 2013-08-06 | Bank Of America Corporation | Predictive coding of documents in an electronic discovery system |
US8200635B2 (en) | 2009-03-27 | 2012-06-12 | Bank Of America Corporation | Labeling electronic data in an electronic discovery enterprise system |
US8572227B2 (en) | 2009-03-27 | 2013-10-29 | Bank Of America Corporation | Methods and apparatuses for communicating preservation notices and surveys |
US8572376B2 (en) | 2009-03-27 | 2013-10-29 | Bank Of America Corporation | Decryption of electronic communication in an electronic discovery enterprise system |
US8805832B2 (en) | 2009-03-27 | 2014-08-12 | Bank Of America Corporation | Search term management in an electronic discovery system |
US8224924B2 (en) | 2009-03-27 | 2012-07-17 | Bank Of America Corporation | Active email collector |
US8903826B2 (en) | 2009-03-27 | 2014-12-02 | Bank Of America Corporation | Electronic discovery system |
US8868561B2 (en) | 2009-03-27 | 2014-10-21 | Bank Of America Corporation | Electronic discovery system |
US8806358B2 (en) | 2009-03-27 | 2014-08-12 | Bank Of America Corporation | Positive identification and bulk addition of custodians to a case within an electronic discovery system |
US9547660B2 (en) | 2009-03-27 | 2017-01-17 | Bank Of America Corporation | Source-to-processing file conversion in an electronic discovery enterprise system |
US9171310B2 (en) | 2009-03-27 | 2015-10-27 | Bank Of America Corporation | Search term hit counts in an electronic discovery system |
US9330374B2 (en) | 2009-03-27 | 2016-05-03 | Bank Of America Corporation | Source-to-processing file conversion in an electronic discovery enterprise system |
US9542410B2 (en) | 2009-03-27 | 2017-01-10 | Bank Of America Corporation | Source-to-processing file conversion in an electronic discovery enterprise system |
US20110083187A1 (en) * | 2009-10-01 | 2011-04-07 | Aleksey Malanov | System and method for efficient and accurate comparison of software items |
US9355171B2 (en) | 2009-10-09 | 2016-05-31 | Hewlett Packard Enterprise Development Lp | Clustering of near-duplicate documents |
US8244767B2 (en) | 2009-10-09 | 2012-08-14 | Stratify, Inc. | Composite locality sensitive hash based processing of documents |
US9053454B2 (en) | 2009-11-30 | 2015-06-09 | Bank Of America Corporation | Automated straight-through processing in an electronic discovery system |
US11797486B2 (en) | 2022-01-03 | 2023-10-24 | Bank Of America Corporation | File de-duplication for a distributed database |
Also Published As
Publication number | Publication date |
---|---|
GB0614332D0 (en) | 2006-08-30 |
GB2440174A (en) | 2008-01-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008009991A1 (en) | Document similarity system | |
US8625642B2 (en) | Method and apparatus of network artifact indentification and extraction | |
Breitinger et al. | FRASH: A framework to test algorithms of similarity hashing | |
US8484152B2 (en) | Fuzzy hash algorithm | |
US8788583B2 (en) | Sharing form training result utilizing a social network | |
EP2506154B1 (en) | Text, character encoding and language recognition | |
US20150207704A1 (en) | Public opinion information display system and method | |
JP2011129161A (en) | Duplicate document detection and presentation functions | |
CN110674529A (en) | Document auditing method and document auditing device based on data security information | |
CN110019640A (en) | Confidential document inspection method and device | |
US9235624B2 (en) | Document similarity evaluation system, document similarity evaluation method, and computer program | |
JPWO2007029348A1 (en) | Data extraction system, terminal device, terminal device program, server device, and server device program | |
Shannon | Forensic relative strength scoring: ASCII and entropy scoring | |
CN110956123A (en) | Rich media content auditing method and device, server and storage medium | |
CN113992668B (en) | Information real-time transmission method, device, equipment and medium based on multiple concurrences | |
CN108038124B (en) | PDF document acquisition and processing method, system and device based on big data | |
US7921126B2 (en) | Patent summarization systems and methods | |
CN113688240B (en) | Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium | |
CN114900492A (en) | Abnormal mail detection method, device, system and computer readable storage medium | |
AU2017248417A1 (en) | Fuzzy hash algorithm | |
US20120016890A1 (en) | Assigning visual characteristics to records | |
JP7140268B2 (en) | WARNING DEVICE, CONTROL METHOD AND PROGRAM | |
CN117494224A (en) | File analysis method, device, equipment and medium based on information intelligent check | |
CN111967240B (en) | Text parsing method, text parsing device, terminal equipment and computer readable storage medium | |
CN112256889B (en) | Knowledge graph construction method, device, equipment and medium for security entity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07766460; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| NENP | Non-entry into the national phase | Ref country code: RU |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS EPO FORM 1205A DATED 18.05.2009. |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 07766460; Country of ref document: EP; Kind code of ref document: A1 |