US20090063470A1 - Document management using business objects - Google Patents


Publication number
US20090063470A1
Authority
US
United States
Prior art keywords
data objects
document
data
objects
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/199,043
Inventor
Ariel Peled
Gilad Savion
Elad Reznikov
Yizhar Regev
Izhak Shmulewitz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nogacom Ltd
Original Assignee
Nogacom Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nogacom Ltd
Priority to US12/199,043
Assigned to NOGACOM LTD. Assignment of assignors' interest (see document for details). Assignors: REZNIKOV, ELAD; PELED, ARIEL; REGEV, YIZHAR; SAVION, GILAD; SHMULEWITZ, IZHAK
Publication of US20090063470A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • The present invention relates generally to information processing, and specifically to methods and systems for indexing and searching documents.
  • Structured data can be efficiently indexed, addressed and searched using well-known tools, such as structured query language (SQL).
  • Search tools for natural language documents are limited for the most part to keyword-based techniques. As a result, searching a corpus of textual documents for a particular occurrence of a certain data object is frequently inefficient and time-consuming and may miss relevant occurrences of an object of interest, such as a person, company or product.
  • Embodiments of the present invention provide improved methods and systems for analyzing a set of data objects in a data repository of an organization, and using these data objects in tagging, classifying and then searching a corpus of data.
  • a computer-implemented method for processing information includes collecting data objects from one or more data repositories, the data objects having respective properties, which identify the data objects.
  • the properties of the collected data objects are analyzed in order to derive respective identifiers corresponding to the data objects.
  • a text string that matches one of the identifiers of a data object is identified within a context in a document. Responsively to the context, an indication that the identified text string is a valid instance of the data object is generated, and the document is processed responsively to the indication.
  • a computer-implemented method for processing information includes collecting data objects from one or more data repositories and identifying a respective record in the repositories corresponding to each of the data objects.
  • One or more documents are processed so as to generate a listing of occurrences of the data objects in the documents.
  • the listing is automatically updated with respect to the one of the data objects.
  • the documents are processed responsively to the listing.
  • FIG. 1 is a block diagram that schematically illustrates a system for exchange and management of data, in accordance with an embodiment of the present invention.
  • FIG. 2 is a flow chart that schematically illustrates a method for classifying and tagging documents according to business objects and indexing the documents according to the tagging results, in accordance with an embodiment of the present invention.
  • FIG. 3 is a flow chart that schematically illustrates a method for searching a set of documents, in accordance with an embodiment of the present invention.
  • FIG. 4 is a flow chart that schematically illustrates a method for updating a business object, in accordance with an embodiment of the present invention.
  • FIG. 5 is a flow chart that schematically illustrates a method for scoring business objects, in accordance with an embodiment of the present invention.
  • Embodiments of the present invention that are described hereinbelow provide apparatus, methods and software for document and knowledge management within an organization.
  • the methods focus on analyzing a set of data objects in the organization's data repositories, which are then used in tagging, classifying, indexing and searching a corpus of documents that is maintained by the organization.
  • the data objects may refer to entities of importance to the business, such as employees, products, customers and suppliers of the business, departments or units of the organization, or geographical areas.
  • For convenience and clarity, the description that follows will relate to these sorts of data objects, which will be referred to hereinafter as “business objects.”
  • The principles of the present invention, however, are similarly applicable to organizations and data objects of other types.
  • Each business object is identified by properties that typically include one or more names, which may comprise multiple words.
  • each business object typically has additional properties, such as synonyms, e-mail address, physical address, job title with organization name and/or affiliation, telephone number, and ID numbers, which may be useful in identifying occurrences of the business object in documents.
  • Business objects are also dynamic, and their properties may change during their life cycle.
  • Some business object properties may be numerical or fixed strings, but many properties, such as business object names, are open, literal, natural language strings, and thus are more complex and error-prone.
  • people tend to automatically shorten natural language names and/or to create from them synonyms and other variants, according to various lexical and semantic rules.
  • the different names or parts of a name of a given business object may be used separately and independently.
  • business objects may be referred to in documents by partial names, such as the first name or nickname of a person, an abbreviation of a product, or an organization name without the usual prefix or suffix.
  • a certain text fragment such as a word
  • a text string may be found in a document that matches a name or a variant of a name of a business object, while the actual semantic meaning of the string in the specific context of the document does not refer to the business object.
  • Some embodiments of the present invention address this problem by using natural language processing to ascertain automatically, based on the context (and without human intervention in most cases), whether or not the text string in question actually refers to a certain data object.
  • business objects are identified automatically by processing data repositories of the organization.
  • the business objects are identified in sources of structured data, such as databases, CRM (Customer Relation Management) systems, or other similar organizational systems and spreadsheets.
  • business objects are also extracted from documents containing unstructured data, in which case the “record” with which the business object is associated is the document or portion of the document from which the business object was extracted.
  • different types of business objects may be managed in different systems, which may include duplicates and errors. Therefore, in the disclosed embodiments, data are collected and compared from various repositories, and are then analyzed to create a unified listing of the business objects across the organization.
  • This analysis uses natural language tools, such as lexical, linguistic and semantic analysis, to find identifiers, including variants that are different from the actual names of the business objects, that may identify the business objects in a document. As explained in detail hereinbelow, these variants may be based either on the object names or on other object properties.
  • each business object may be associated with one or more corresponding source records in the structured data.
  • a business object may be present in several organizational systems. For example, an employee may have a record both in Windows Active Directory and in the organization HR system.
  • each such source record (with its corresponding ID) is listed.
  • the business object is then updated accordingly, without any need for human intervention. If required, the relevant documents are re-tagged, so that subsequent searches use the most up-to-date information regarding all the business objects in the set.
  • FIG. 1 is a block diagram that schematically illustrates a system 20 for exchange and management of data, in accordance with an embodiment of the present invention.
  • System 20 is typically maintained by an organization, such as a business, for purposes of exchanging, storing and recalling data used by the organization.
  • a data classification and search server 22 identifies business objects and builds a listing, such as an index, for use in searching the data, as described in detail hereinbelow.
  • System 20 is typically built around an enterprise network 24 , which may comprise any suitable type or types of data communication network, and may, for example, include both intranet and extranet segments.
  • a variety of servers 26 may be connected to the network, including mail and other application servers, for instance.
  • Storage repositories 28 are also connected to the network and typically contain both structured and unstructured data.
  • the structured data may include a variety of databases, such as product databases, human resources (HR) databases containing records of personnel of the organization, and customer relations management (CRM) databases containing records of customers of the organization, as well as their orders and payment records. Additionally or alternatively, structured data may be organized and stored in other forms and formats that are known in the art, such as spreadsheets.
  • Servers 26 and repositories 28 are accessible to client computers 30 via network 24 .
  • Server 22 connects to network 24 via a suitable network interface 32 .
  • the server typically comprises one or more general-purpose computer processors, which are programmed in software to carry out the functions that are described herein.
  • This software may be downloaded to server 22 in electronic form, over a network, for example.
  • the software may be provided on tangible storage media, such as optical, magnetic or electronic memory media.
  • Although server 22 is shown in FIG. 1, for the sake of simplicity, as a single unit, in practice the functions of the server may be carried out by a number of different processors, such as a separate processor (or even a separate computer) for each of the functional blocks shown in the figure.
  • Alternatively, the functional blocks may be implemented simply as different processes running on the same computer. All such alternative configurations are considered to be within the scope of the present invention.
  • Server 22 comprises a classifier 34 , which automatically assembles a listing of business objects based on information in repositories 28 (and possibly other sources, as well), and then tags the documents in system 20 according to instances of the business objects that occur in the documents.
  • the classifier recognizes and resolves variant forms of the business object names, such as shortened names and abbreviations, using techniques of natural language processing, and may assign confidence scores to instances of the business objects depending on the level of certainty that a given variant actually refers to the business object in question.
  • a crawler 38 collects documents from system 20 , and classifier 34 builds an index of the documents, for use in subsequent search and update operations, according to occurrences of the business objects in the document text.
  • Classifier 34 stores the business object listing and index in an internal repository 36 , which typically comprises a suitable storage device or group of such devices. Details of the processes of identifying data objects and tagging documents are described further hereinbelow with reference to FIG. 2 .
  • classifier 34 may create a general index of strings appearing in the documents, for purposes of subsequent keyword-based searching, as is known in the art.
  • a searcher 40 receives requests, typically from client computers 30 , to search the documents in system 20 for a certain business object or combination or type of business objects.
  • the search queries may also specify keywords, in addition to the business objects, as well as logical operators connecting the business objects and (optional) keywords in the queries.
  • the searcher extracts documents from system 20 that contain instances of the business objects specified by the query and scores each document according to factors such as the number of occurrences of the business objects and the confidence level. The score may also reflect occurrences of specified keywords in the documents, as well as factors such as document type and metadata. Searcher 40 ranks the documents according to their scores and returns the result to the requesting client. Details of the search process are described hereinbelow with reference to FIG. 3 .
  • FIG. 2 is a flow chart that schematically illustrates a method for classifying and tagging a set of documents according to a business object set, in accordance with an embodiment of the present invention. The method is described, for the sake of clarity, with reference to the system architecture shown in FIG. 1 , but the principles of this method may similarly be applied in tagging and indexing of data objects in other applications. Examples of some of the functions shown in FIG. 2 are described below in the Appendices.
  • crawler 38 loads business objects to classifier 34 from repositories 28 of system 20 .
  • These repositories may include, for example, records maintained by applications such as a CRM system and an HR system, as well as computer system management applications, such as Microsoft Active Directory.
  • Such applications typically have an application program interface (API), which the crawler can use to access the tables of business objects and their properties.
  • Classifier 34 uses the information provided by the crawler to build a table of each type of business objects, such as customers, employees, products, etc.
  • the crawler continually samples repositories 28 in order to report changes in the business object listings.
  • Crawler 38 also loads and analyzes, for each business object, permission and access control details, specifying which users are allowed to view the business object and its details and which users are allowed to change them.
  • The crawler converts the access list into a standard Access Control List (ACL) form and saves the ACL in the business object repository.
  • Classifier 34 may also identify new business objects in unstructured documents, as described below in reference to step 58 . This identification is typically based on morphological, syntactic and semantic analysis of the document using appropriate rules.
  • classifier 34 activates a business object (BO) comparer function 50 to compare the new business object to the business objects that are already listed in repository 36 .
  • the comparer function calculates a similarity factor between the new business object and each of the existing business objects. If the factor is above a high threshold, the classifier will treat the two business objects as identical, i.e., as alternative names of the same object. The classifier will then merge the record of the new business object into the record of the existing business object that it matched.
  • the classifier adds the new business object to the list in repository 36 and records a similarity relation between the new and existing business objects. This similarity relation is used subsequently in tagging and scoring occurrences of the business objects in documents, as described hereinbelow.
  • the comparer function may also discover and record other relations between business objects. For example, it may find that two employees share a telephone number, or that two organizations share a domain name. These relations may also be used in tagging and scoring, and may in addition be queried directly by clients.
  • The scoring formula may vary depending on the type of business object involved. For example, if two business objects of type “person” have the same social security number, they can be assumed to be one and the same. If two customers have identical postal addresses, they may be considered to be the same business objects, although if they share only the same city, street and building number, they may receive a lower similarity factor.
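  • The comparison and merge step described above can be sketched roughly as follows. This is a minimal illustration only, not the comparer's actual formula: the `BusinessObject` class, the `MERGE_THRESHOLD` value, the 0.7 partial-address factor, and the name-overlap fallback are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

MERGE_THRESHOLD = 0.9  # hypothetical value for the "high threshold"


@dataclass
class BusinessObject:
    bo_type: str                      # e.g. "person" or "customer"
    names: set                        # all known names of the object
    properties: dict = field(default_factory=dict)


def similarity(a, b):
    """Return a similarity factor in [0, 1]; the rules are illustrative only."""
    if a.bo_type != b.bo_type:
        return 0.0
    if a.bo_type == "person":
        ssn = a.properties.get("ssn")
        if ssn is not None and ssn == b.properties.get("ssn"):
            return 1.0  # same social security number: assumed identical
    if a.bo_type == "customer":
        addr_a = a.properties.get("address")
        addr_b = b.properties.get("address")
        if addr_a and addr_b:
            if addr_a == addr_b:
                return 1.0  # identical postal address
            keys = ("city", "street", "building")
            if all(k in addr_a and addr_a[k] == addr_b.get(k) for k in keys):
                return 0.7  # partial address match: lower factor (assumed value)
    # Fallback (assumption): overlap between the two sets of names.
    union = a.names | b.names
    return len(a.names & b.names) / len(union) if union else 0.0


def merge_or_add(new, listing):
    """Merge `new` into a sufficiently similar existing object, else append it."""
    for existing in listing:
        if similarity(new, existing) >= MERGE_THRESHOLD:
            existing.names |= new.names
            for key, value in new.properties.items():
                existing.properties.setdefault(key, value)
            return existing
    listing.append(new)
    return new
```

  • In this sketch a merged record keeps the existing property values and only fills in properties it was missing; a real system would need a policy for conflicting values.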
  • a business object analyzer function 52 of classifier 34 uses the information provided by crawler 38 and comparer function 50 in building, for each business object, a set of identifiers, including variants, that will serve as the basis for tagging instances of the business object in the documents in system 20 .
  • Each business object is typically identified by a name and appropriate additional properties, such as ID number, telephone number, e-mail address, etc.
  • a listing of representative properties for different business object types is presented below in Appendix B.
  • the analyzer parses the name and other properties in order to create the set of partial names, synonyms and acronyms that may refer to instances of the business object in the documents in system 20 .
  • the name and properties may be specified separately in different languages if necessary, and the analyzer may automatically identify the language as part of the parsing process.
  • Classifier 34 stores the listings of business objects and their properties in internal repository 36 , as noted above. These listings are typically not static, but rather are updated continually in response to changes occurring in the records and other documents in repositories 28 . A method for updating business objects is described hereinbelow with reference to FIG. 4 .
  • Classifier 34 applies the business object listings described above in tagging instances of business objects that occur in documents 56 , which are collected by crawler 38 from system 20 .
  • a basic tagger 54 loads the list of business objects from repository 36 , including all the possible variants, and searches each document for the patterns corresponding to the business object name and variants.
  • Tagger 54 may also use other lexicons of relevant terms, such as common first names and common organizational suffixes (such as “corp.”), in addition to the business object names, as well as vocabularies and/or regular expressions.
  • Tagger 54 typically analyzes the tokens (such as words of natural language text) appearing in each document both typographically and morphologically for similarity to the names that are to be tagged.
  • the list of possible variants of a particular business object that is to be used for this purpose may be adjusted according to the language of the document that is to be tagged.
  • the tagger also checks the context of each business object name or variant that is found in the document to make sure that the reference is valid. For example, before identifying an occurrence of the name “Pandora” in a document as referring to a customer by this name, the tagger checks to ensure that “Pandora” is not part of another name, such as of a person named “Pandora Smith.”
  • the tagger tags each name that may be an instance of a given business object both with the business object name and with a confidence score.
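  • The context check in the “Pandora” example can be illustrated with a short sketch. The rule used here (reject a match when the next word is capitalized) is a deliberately crude stand-in for the tagger's context analysis, and the function name, tag layout, and confidence value are assumptions made for the example.

```python
import re


def tag_instances(text, variant, object_name, base_score):
    """Tag occurrences of `variant` that are not followed by a capitalized word."""
    tags = []
    pattern = re.compile(r"\b" + re.escape(variant) + r"\b")
    for match in pattern.finditer(text):
        following = text[match.end():]
        # Hypothetical context rule: "Pandora Smith" looks like a person's
        # name, so the match is not treated as the customer "Pandora".
        if re.match(r"\s+[A-Z][a-z]+", following):
            continue
        tags.append({
            "object": object_name,
            "offset": match.start(),
            "confidence": base_score,
        })
    return tags


text = "Pandora Smith called about the Pandora order."
tags = tag_instances(text, "Pandora", "Pandora (customer)", 0.6)
```

  • Only the second occurrence is tagged in this example; the first is rejected because it is followed by the capitalized word “Smith”.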
  • The document is converted to text and then tokenized, i.e., separated into single words.
  • The relevant features of each token are saved, such as typographical features (alphabetic or numerical token, capitalized or not, etc.) and part of speech (proper noun, noun, verb, etc.).
  • Each token may also be compared to relevant lexicons, as noted above.
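  • A rough sketch of the tokenization step with saved typographical features follows. The `Token` fields and the tiny `FIRST_NAMES` lexicon are assumptions for illustration; part-of-speech tagging is left out because it would require a separate linguistic component.

```python
import re
from dataclasses import dataclass

FIRST_NAMES = {"david", "pandora"}  # hypothetical lexicon of common first names


@dataclass
class Token:
    text: str
    offset: int           # character offset of the token in the document text
    is_alphabetic: bool
    is_numeric: bool
    is_capitalized: bool
    in_lexicon: bool


def tokenize(text, lexicon=FIRST_NAMES):
    """Split text into word tokens and record simple typographical features."""
    tokens = []
    for match in re.finditer(r"\w+", text):
        word = match.group()
        tokens.append(Token(
            text=word,
            offset=match.start(),
            is_alphabetic=word.isalpha(),
            is_numeric=word.isdigit(),
            is_capitalized=word[0].isupper(),
            in_lexicon=word.lower() in lexicon,
        ))
    return tokens
```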
  • full names receive higher scores than partial or abbreviated names, and the score may be increased or decreased based on the nature and number of variants of the business object in question that appear in the document being tagged.
  • classifier 34 may encounter business objects in documents 56 that are not included in the listings in repository 36 . To deal with such objects, as well as other object-related entities, the classifier invokes an entity extraction module 58 .
  • entity extraction module 58 applies rule-based natural language processing to identify and extract business entities such as persons and organizations, as well as ancillary data entities, such as locations, dates, telephone numbers, etc., which may refer to business objects.
  • the classifier may use the extracted entities to support identification of existing business objects or may add new business objects to the listings in repository 36 based on the extracted entities, either automatically or interactively with the support of a system manager, for example.
  • Classifier 34 also actuates a relation extraction module 60 in order to identify relations between business objects (or other entities) and other entities or properties appearing in documents 56 .
  • This module may, for example, extract relations such as company location (or headquarters), which identifies the relation between a company business object and a place; or affiliation/employment, which identifies the organization at which a person business object is employed and his position in the organization.
  • a resolver function 62 determines which business objects are actually referenced in the document.
  • the resolver is typically invoked to resolve ambiguities, which may occur when a given string may refer to more than one business object (as when two persons have the same name), or when it is not certain that a name extracted from the document actually matches a business object that it resembles.
  • the resolver computes a score for each business object to which the ambiguous entity might refer. The score may be based, for example, on how fully a partial name in the entity matches the full name of the business object or on other information appearing in the document that may be more relevant to one business object or the other.
  • the resolver chooses to tag the ambiguous string as an instance of the business object with the higher score.
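  • The resolver's choice between candidate business objects can be sketched as below. The scoring rule (fraction of the full name covered by the mention, plus a bonus per matching context word) and the `context_weight` value are assumptions; the source states only that the score may reflect name coverage and other information in the document.

```python
def name_coverage(mention, full_name):
    """Fraction of the tokens of `full_name` that appear in `mention`."""
    mention_tokens = set(mention.lower().split())
    name_tokens = full_name.lower().split()
    if not name_tokens:
        return 0.0
    return sum(t in mention_tokens for t in name_tokens) / len(name_tokens)


def resolve(mention, context_words, candidates, context_weight=0.25):
    """Pick the candidate with the highest score.

    `candidates` maps an object ID to (full_name, cue_words), where
    `cue_words` is a hypothetical set of words associated with the object.
    """
    best_id, best_score = None, float("-inf")
    for object_id, (full_name, cue_words) in candidates.items():
        score = name_coverage(mention, full_name)
        score += context_weight * len(cue_words & context_words)
        if score > best_score:
            best_id, best_score = object_id, score
    return best_id, best_score


# Two persons with the same name, distinguished only by document context.
candidates = {
    "emp-1": ("David Fisher", {"finance"}),
    "emp-2": ("David Fisher", {"engineering"}),
}
chosen, _ = resolve("David Fisher", {"engineering", "release"}, candidates)
```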
  • classifier 34 may apply score derivations 64 in order to add relevance tags to the document for other business objects that do not occur explicitly in the document.
  • the classifier typically computes relevance scores of other business objects that are related to the business objects occurring in the document.
  • Relations that may be used in score derivation include, for instance, similarity (as explained above), container relations (one entity contains another), hierarchical relations, and affinity relations (such as the affinity between a customer and an invoice issued to the customer).
  • the classifier may give the document a certain relevance score with respect to the finance department business object, even if the finance department is not mentioned in the document.
  • the related business object such as the finance department in this example, will receive a lower relevance score than the actual business object in the document.
  • the derived score that is assigned to the finance department may drop in inverse proportion to the size of the department.
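  • One way to picture score derivation is the sketch below, which assumes the document explicitly mentions an employee who belongs to the finance department. The `damping` factor and the exact formula are assumptions; the source says only that the related object receives a lower score and that the score may drop in inverse proportion to the size of the department.

```python
def derived_relevance(base_score, group_size, damping=0.8):
    """Relevance derived for a containing group from one member's score."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Inverse proportion to group size; damping keeps it below base_score.
    return base_score * damping / group_size


def add_derived_tags(explicit_scores, membership, group_sizes):
    """Compute derived scores for groups that are not mentioned explicitly.

    explicit_scores: object ID -> score of objects found in the document
    membership:      object ID -> ID of the group that contains it
    group_sizes:     group ID  -> number of members
    """
    derived = {}
    for object_id, score in explicit_scores.items():
        group = membership.get(object_id)
        if group is None or group in explicit_scores:
            continue  # no containing group, or the group is already tagged
        candidate = derived_relevance(score, group_sizes[group])
        derived[group] = max(derived.get(group, 0.0), candidate)
    return derived
```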
  • Classifier 34 classifies each document 56 according to the business objects that it has tagged in and with respect to the document, and stores the results in a classification repository 66 (which may be part of repository 36 ).
  • the classification results may be organized in an inverted index of business objects for use in subsequent searching.
  • the tagged document itself may be stored in a document repository 68 (which may also be part of repository 36 ). Rather than storing the entire document, however, it may be sufficient for the classifier to store document metadata, containing the tag information for the document and pointing to the location of the document in system 20 . Each instance is thus saved and later retrieved by the document ID of the document in which it was found and the character offset (i.e., the index of the character within the document at which the instance begins).
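  • The inverted index of business objects, with each instance stored by document ID and character offset, might look like the following sketch. The class and method names are hypothetical, and an in-memory dictionary stands in for the classification repository.

```python
from collections import defaultdict


class ClassificationIndex:
    """Inverted index: business object ID -> instances found in documents."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add_instance(self, object_id, doc_id, offset, confidence):
        """Record one tagged instance by document ID and character offset."""
        self._postings[object_id].append((doc_id, offset, confidence))

    def documents_for(self, object_id):
        """Sorted IDs of all documents containing the business object."""
        postings = self._postings.get(object_id, [])
        return sorted({doc_id for doc_id, _, _ in postings})

    def instances_in(self, object_id, doc_id):
        """(offset, confidence) pairs for the object within one document."""
        postings = self._postings.get(object_id, [])
        return [(off, conf) for d, off, conf in postings if d == doc_id]
```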
  • Appendix C below presents an example of tagging a sample document using the methods described above.
  • FIG. 3 is a flow chart that schematically illustrates a method for searching the set of documents in system 20 , in accordance with an embodiment of the present invention.
  • the search is performed by searcher 40 after the documents have been tagged and indexed according to the method of FIG. 2 .
  • Searcher 40 receives a search query, typically from one of client computers 30 , at a search input stage 70 .
  • the user of the client computer inputs the search terms and limitations via a suitable graphical user interface (GUI), and a program running on the computer converts the query to a structured form that is accepted by searcher 40 .
  • the user may compose the query directly in this structured form.
  • the user specifies one or more business objects, at an object specification step 72 .
  • These objects may be chosen by the user from a list of the objects held in repository 36 , or they may alternatively be entered manually by the user. In the latter case, the user may, for example, enter a partial name or nickname, and searcher 40 then automatically identifies the corresponding business object in repository 36 using techniques similar to those described above as part of the tagging process. Additionally or alternatively, the user may specify that the search should be conducted over all business objects in a certain group or of a certain type, such as all customers in a given geographical area or all employees in a given department.
  • the user may also specify one or more keywords, in the form of a word or a phrase, at a keyword input step 74 , as in text-based search engines known in the art.
  • the business objects that were specified at step 72 and the keywords, if any, specified at step 74 may be joined by logical operators, which are specified by the user at a logic specification step 76 .
  • Such operators may include, for example, AND, OR, NOT, and may group the search terms into sub-queries.
  • the user may also specify scoring refinements, indicating how much weight searcher 40 should give each part of the query in computing document scores.
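  • A structured query that joins business objects and keywords with logical operators could be represented and evaluated as in this sketch. The nested-tuple query format is an assumption made for illustration; the source does not specify the structured form accepted by the searcher.

```python
def matches(query, doc_objects, doc_keywords):
    """Evaluate a nested-tuple query against one document.

    Leaves are ("object", id) or ("keyword", word); inner nodes are
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    """
    kind = query[0]
    if kind == "object":
        return query[1] in doc_objects
    if kind == "keyword":
        return query[1] in doc_keywords
    if kind == "NOT":
        return not matches(query[1], doc_objects, doc_keywords)
    if kind == "AND":
        return all(matches(q, doc_objects, doc_keywords) for q in query[1:])
    if kind == "OR":
        return any(matches(q, doc_objects, doc_keywords) for q in query[1:])
    raise ValueError(f"unknown query node: {kind!r}")


# Customer "cust-1" AND (keyword "invoice" OR keyword "order") AND NOT "draft".
query = ("AND",
         ("object", "cust-1"),
         ("OR", ("keyword", "invoice"), ("keyword", "order")),
         ("NOT", ("keyword", "draft")))
```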
  • Searcher 40 scores the documents in the repository or repositories of system 20 according to the search query, at a document scoring step 78 .
  • this stage in the process uses the indices of business objects and, if appropriate, keywords that have been stored in repository 36 .
  • each instance occurring in a given document contributes to the score of that document, wherein the contribution depends, inter alia, on the level of confidence with which the business object was identified in the document.
  • If the document contains a business object that is related to one of the business objects in the query (as identified by relation extraction module 60), the related business object may also contribute to the document score.
  • the final score of each document is typically a weighted sum of the object scores, which are generated by matching the business objects in the query to the document, and of the keyword scores, due to matching of keywords in the document.
  • the object scores receive greater weight, although the weights may be adjusted based on user preference and application requirements.
  • Searcher 40 ranks the documents according to the scores, at a ranking step 80 , and returns the ranked results to the user. Typically, the searcher returns a certain number of the documents that had the highest scores, or all documents with scores above some threshold.
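  • The weighted-sum scoring and ranking described above can be sketched as follows. The weights (0.7 for objects, 0.3 for keywords) and the input layout are assumptions; the source says only that object scores typically receive greater weight and that the weights may be adjusted.

```python
def document_score(object_confidences, keyword_hits,
                   object_weight=0.7, keyword_weight=0.3):
    """Weighted sum of object scores and keyword scores for one document."""
    object_score = sum(object_confidences)  # each instance contributes
    return object_weight * object_score + keyword_weight * keyword_hits


def rank(docs, top_n=None, threshold=None):
    """Rank documents by score, highest first.

    `docs` maps a document ID to (object_confidences, keyword_hits).
    Results can be limited to the `top_n` documents, to documents scoring
    at least `threshold`, or both.
    """
    scored = [(doc_id, document_score(confs, hits))
              for doc_id, (confs, hits) in docs.items()]
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n] if top_n is not None else scored
```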
  • the search results may be filtered by searcher 40 according to applicable permission (access control) list constraints, which are saved in repository 36 for both documents and business objects.
  • The searcher checks and applies these constraints in a manner that is transparent to the user: If the user is not authorized to view or access a certain document, the document will not be included within the user's search results. If the user is not authorized to view a certain business object, that business object will not be included within the business object tree displayed to the user (although nothing will change in the business object repository itself), and the business object tagging results referring to that business object within the searched document(s), if any, will not be displayed to the user.
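  • The permission filtering described above could be sketched like this. The ACLs are modeled as plain sets of user names per document and per business object, which is an assumption; the stored ACL form is not specified in this section.

```python
def filter_results(user, ranked_doc_ids, doc_acl, object_acl, tags_by_doc):
    """Drop documents and business object tags the user may not view.

    doc_acl / object_acl: ID -> set of users allowed to view it
    tags_by_doc:          document ID -> list of tagged business object IDs
    The underlying repositories are not modified; only the view is filtered.
    """
    visible = []
    for doc_id in ranked_doc_ids:
        if user not in doc_acl.get(doc_id, set()):
            continue  # document is hidden from this user
        tags = [object_id for object_id in tags_by_doc.get(doc_id, [])
                if user in object_acl.get(object_id, set())]
        visible.append((doc_id, tags))
    return visible
```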
  • Business object analyzer function 52 (referred to hereinbelow simply as the business object analyzer) analyzes the business object name, its properties and its linked and related business objects if available in order to identify a complete list of variants (also referred to as variations).
  • the variants include all strings that may (theoretically) be used within the text of a document as a reference to the business object.
  • Each variant is a pair consisting of a search string and a set of context-based, natural language constraints that must be met in order for a given instance of the variant (a business object candidate reference) to be considered a valid reference to that business object.
  • the constraints are optional and differ from one type of business object to another. Examples of such constraints are listed below in Appendix D.
  • the business object analyzer analyzes the business object name and its properties before any document is processed and prepares the variants to be used later, when documents are actually processed.
  • The variants may be stored explicitly, as a list of pairs, each pair consisting of the search string and the required constraints. Since there are typically many business objects in repository 36, however, each with several variants, it is generally more efficient to store the variant pattern strings and, separately, the required constraints.
  • Each variant that may occur in a document is typically also assigned a score: a certainty level (expressed as a percentage, with 100% as maximum and 0% as minimum), indicating the likelihood that an instance of this variant is indeed a valid reference to the specific business object in question.
  • the shorter the variant string the more frequent it is within documents and texts in general, and the fewer the available contextual cues around the instance in a document, the lower will be the variant score.
  • the score of a variant of a person's full name e.g., David Carlisle Fisher
  • Some examples of variants and their respective scores are listed below in Appendix A.
  • When a business object candidate reference is found within the text of a document, it is linked to the business object if the candidate string is identical to a recognized variant string and if the specific instance within the text obeys the variant constraints.
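  • The linking rule above (identical string plus satisfied constraints) might be sketched as follows. This is a minimal illustration rather than the method of the embodiments: the table `VARIANTS`, the identifier `customer-17` and the single constraint shown are assumptions, and the constraint is a crude capitalization heuristic that stands in for real natural language analysis.

```python
import re


def not_part_of_longer_name(text, start, end):
    """Crude stand-in constraint: reject the instance when a capitalized
    word directly precedes or follows it (e.g. "Pandora Smith")."""
    before = re.search(r"([A-Z][\w'-]*)\s+$", text[:start])
    after = re.match(r"\s+([A-Z][\w'-]*)", text[end:])
    return before is None and after is None


# variant string -> (business object id, constraints); all values hypothetical
VARIANTS = {"Pandora": ("customer-17", [not_part_of_longer_name])}


def find_valid_references(text):
    """Return (object id, offset) for every instance that passes its constraints."""
    hits = []
    for variant, (object_id, constraints) in VARIANTS.items():
        for m in re.finditer(r"\b" + re.escape(variant) + r"\b", text):
            if all(check(text, m.start(), m.end()) for check in constraints):
                hits.append((object_id, m.start()))
    return hits
```

A heuristic this simple would also reject a valid instance that happens to follow a capitalized sentence-initial word, and would not catch idioms such as a possessive use of the name; it is shown only to make the two-part test concrete.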
  • the creation of the variants is based on analyzing the linguistic and semantic structure of the business object name and attributes. As noted earlier, in natural language texts the variant is often a shorter version of the name, instead of the full name.
  • the ways in which such variants may be created and used by document authors are based mainly on the base type of the business object. For example, a person (employee) name is usually compounded of a first name, last name and (possibly) middle name. In most references to a person, either the first name or the last name (depending on the context) is omitted.
  • the business object analyzer therefore uses rules to analyze each name lexically and linguistically, thus identifying which words or tokens are less important suffixes, which may be omitted, and which are “core” words that cannot be omitted. For example (as in Appendix C below), in the organization name Avnet Components Israel Ltd., all tokens except Avnet may be omitted. On the other hand, in Israel Corporation, no token may be omitted.
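  • A minimal sketch of the core-token idea, under the assumption that the first token of an organization name is its core word unless that token is ambiguous, might look as follows. The lexicon `AMBIGUOUS_TOKENS` and the function name are hypothetical; the actual rules of the analyzer are not reproduced here.

```python
# Hypothetical lexicon of tokens that are too ambiguous to stand alone
# (countries, very common words, and so on).
AMBIGUOUS_TOKENS = {"israel", "first", "general", "national"}


def organization_variants(name):
    """Return candidate variant strings, from the full name down to the core token.

    If the first token is ambiguous, no token may be omitted and only the
    full name is returned.
    """
    tokens = name.split()
    if not tokens:
        return []
    if tokens[0].lower() in AMBIGUOUS_TOKENS:
        return [name]
    # Drop trailing tokens one at a time; the first (core) token is always kept.
    return [" ".join(tokens[:n]) for n in range(len(tokens), 0, -1)]
```

With these assumptions, the name Avnet Components Israel Ltd. yields progressively shorter variants ending in Avnet, whereas Israel Corporation yields only the full name.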
  • each business object may contain other business objects, or be a member in another business object, or have another relation with a second business object.
  • these relations are also processed by the business object analyzer, for subsequent consideration in classifying and tagging documents 56 .
  • When the business object analyzer deals with a given business object, it also identifies and marks the related business objects, as indicated by the organizational repository.
  • each business object is searched directly according to its direct variants. After tagging is completed, however, the tagging results may also be used to derive the relevance score of the document to other, related business objects, as noted above. For example, if a document refers to David Fisher, and David Fisher is identified as working in the Finance Department, then the document is related (with lower score) to the Finance Department business object.
  • the relations between business objects may be similarity relations, container relations (such as distribution list membership), or other relations.
  • the score derivation formula may be configured according to the nature of the relation and of the business objects themselves. For example, in the case of the container relation, the derived score may be inversely proportional to the distribution list size.
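  • As one possible reading of the container example above, a derived score that is inversely proportional to the list size could be sketched as follows. The function name, the `scale` parameter and the cap at the member's own score are assumptions made for illustration; the configured formula itself is not given here.

```python
def derived_container_score(member_score, list_size, scale=1.0):
    """Score of a document with respect to a container (e.g. a distribution
    list), derived from the score of one tagged member.

    The result falls off as 1 / list_size and never exceeds the member's score.
    """
    if list_size < 1:
        raise ValueError("list_size must be at least 1")
    return min(member_score, scale * member_score / list_size)
```

For instance, a member tagged at 80% in a four-member list would contribute a derived score of 20% under the default `scale`.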
  • the score derivation algorithm is described in greater detail in Appendix F.
  • business objects are loaded into server 22 from various organizational repositories 28 , which may include old, irrelevant or even incorrect records. For example, entries such as “Build Master” found within an organization's Active Directory cannot be a valid employee name and should be excluded at an early stage. To avoid incorrect tagging and classification due to such records, the business object analyzer typically validates the business objects and their properties.
  • the business object analyzer validates business object names and properties using appropriate rules, which are written separately for each business object type according to its corresponding properties and semantic characteristics.
  • names of human beings should include only letters, possibly with some punctuation or connector characters, such as a hyphen or apostrophe.
  • the business object analyzer may use several levels of validation: The lowest level of validation is “wrong”, meaning that the business object is unacceptable and will not be used (for example, a numerical string as a person's name). A middle validation level or status may be “warning”, indicating that an important property is missing or seems to be problematic. (For example, “Build Master” as an employee name will generate a warning, since the name consists entirely of English dictionary words without a valid lexical first name.) The highest level is “correct”, for example, the employee name “David Fisher”.
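  • The three validation levels might be sketched as follows for employee names. The lexicon `KNOWN_FIRST_NAMES`, the regular expression and the function name are hypothetical, and the warning test is simplified to "the first token is not a known first name" rather than a full dictionary check.

```python
import re
from enum import Enum


class Validation(Enum):
    WRONG = 0
    WARNING = 1
    CORRECT = 2


# Hypothetical first-name lexicon; a real one would be far larger.
KNOWN_FIRST_NAMES = {"david", "robert", "jose"}

# Letters only, optionally joined by hyphens, apostrophes, periods or spaces.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[-' .]+[^\W\d_]+)*\.?")


def validate_employee_name(name):
    name = name.strip()
    if not NAME_PATTERN.fullmatch(name):
        return Validation.WRONG      # e.g. a numerical string
    first_token = name.split()[0].lower()
    if first_token not in KNOWN_FIRST_NAMES:
        return Validation.WARNING    # no valid lexical first name
    return Validation.CORRECT
```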
  • FIG. 4 is a flow chart that schematically illustrates a method for updating business objects, in accordance with an embodiment of the present invention.
  • the method begins with initial identification of a business object, at an object identification step 90 .
  • the business object is typically identified by analyzing a structured repository such as a database, a CRM or similar system or a spreadsheet that is maintained by the organization in question. Alternatively, the business object may be identified based on tagging unstructured documents. The processes by which such business objects are identified and recorded were described above in detail with reference to FIG. 2 .
  • For each new business object, classifier 34 records a link to a source of information concerning the business object, at a link recording step 92.
  • the link typically indicates the source record from which the business object was derived.
  • the source record may be an entry in a database or spreadsheet, or an Active Directory listing, or a page of a document on which the business object was found.
  • Crawler 38 periodically detects changes in the source records of business objects, at a change detection step 94 . These changes may be detected, for example, by polling the source records of the business objects that are indicated in the business object listings. Alternatively or additionally, the crawler may receive event notifications from certain data sources, such as HR and CRM databases, when changes are made. In either case, the changes in the business object records may indicate, for example, a new address or telephone number of a person or company, or a newly-discovered nickname or abbreviation. Changes may also indicate deletion of existing business objects or addition of new ones.
  • Upon receiving an indication that a business object has changed, classifier 34 updates the information regarding the business object in the listing of business objects in repository 36, at a business object update step.
  • the updated information is used in subsequent tagging and indexing of new documents, as described above with reference to FIG. 2 .
  • the classifier may use the updated business object information to update the tagging and indexing of documents that have already been indexed. For example, if the update indicates a new nickname for a given person, the classifier may tag and index occurrences of the new nickname as instances of the corresponding business object, which were not previously recognized.
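  • Change detection by polling, as described above, might be sketched as follows. Comparing content fingerprints is an assumption introduced for this illustration (the embodiments do not specify how a change is recognized), and the record identifiers and function names are hypothetical.

```python
import hashlib
import json


def record_fingerprint(record):
    """Stable hash of a source record (assumes JSON-serializable fields)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def poll_for_changes(source_records, known_fingerprints):
    """Compare current source records with the fingerprints seen at the last poll.

    Returns (changed_ids, deleted_ids) and updates known_fingerprints in place.
    New records are reported as changed.
    """
    changed = []
    for record_id, record in source_records.items():
        fingerprint = record_fingerprint(record)
        if known_fingerprints.get(record_id) != fingerprint:
            known_fingerprints[record_id] = fingerprint
            changed.append(record_id)
    deleted = [rid for rid in known_fingerprints if rid not in source_records]
    for rid in deleted:
        del known_fingerprints[rid]
    return changed, deleted
```

The identifiers returned as changed would then drive the update of the corresponding business objects and, where needed, the re-tagging of documents that were already indexed.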
  • EMPLOYEE (SUB-TYPE) BUSINESS OBJECTS: NON-NAME-BASED VARIATIONS
    Instant Variation String | Score | Example
    Employee Mail | 100% | Ron.Bleakney@radvision.com
    Employee Phone Number | 70% | 618-4534232 (personal, full number)
    Employee ID Number (or Social Security number)(6) | 90% | 009516808
  • Last names are divided into 3 groups: very common last names (about 1% of the general population or more, e.g., “Smith”, or “Cohen” in the Israeli population), common last names (about 0.2% of the general population, e.g., “Biton” in the Israeli population), and other last names.
  • Score is the respective score calculated in (3) less 10%.
  • the Employee ID must be at least six characters long or at least three characters long with the first character a letter.
  • This variation is allowed only if the first token is not one of the following: a number, a known (lexicon-based) first name, a country, state or a big city name, a nationality or another very common (junk) term.
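  • The Employee ID length rule stated above (at least six characters, or at least three characters when the first character is a letter) can be expressed directly; the function name below is hypothetical.

```python
def is_plausible_employee_id(candidate):
    """Apply the length rule for Employee ID strings."""
    candidate = candidate.strip()
    if len(candidate) >= 6:
        return True
    return len(candidate) >= 3 and candidate[0].isalpha()
```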
  • PRODUCT (SUB-TYPE) BUSINESS OBJECTS: NON-NAME-BASED VARIATIONS
    Instant Variation String | Score | Example
    Product ID Number | 90% | 003456543-9
  • The variations implemented for LOCATION business objects are solely the variations implemented for BASIC business objects. (LOCATION business objects, however, require context-based, natural language constraints not required for BASIC business objects, as listed in Appendix D below. All LOCATION variations are considered name-based variations.)
  • Appendix B Provides of Business Object Types
  • Business objects are dealt with and analyzed by type. Each type typically has its own properties and rules for analyzing the properties. Typical business object types include EMPLOYEE, ORGANIZATION, PRODUCT, LOCATION or BASIC (unknown/other base type).
  • system 20 is used by a hypothetical company (“First Sample Corporation”) selling to customers in various countries.
  • the company documents may be in different languages accordingly (English, Italian, Dutch, etc.)
  • the customers' names may be in different languages, as listed below in Table I, which is an extract from the “Customers” table in the company's CRM system.
  • the customers may be referred to in company documents by their full names, partial names or synonyms. Other properties may be used, as well, as customer references, such as an e-mail address, Web domain, customer ID number, etc.
  • First Sample Corporation may have hundreds of employees from various cultural backgrounds. Each employee may have a full name (consisting of a first name, a middle name and a last name) and possibly also a nickname. Reference to an employee may be by his or her full name, partial name (first name only, last name only) or his/her nickname. Common first names such as David are probably shared by several employees. Table II is an extract from the company's HR module, listing the company's employees:
  • Classifier 34 tags the company's documents, including the following e-mail sent by David Carlisle Fisher, the company's CFO, to his assistant, Robert Jones:
  • the e-mail includes references to five customers, four of which are listed in the above customer table: Stichting Pandora, Istituto Nazionale per la Fisica della Materia, Cruz Roja Chilena and Avnet Components Israel Ltd. One customer is not listed in the table: Grupo Anaya S.A.
  • Classifier 34 nevertheless is able to identify all four objects, since it has automatically created the valid possible variants based on the stored properties of the objects.
  • the classifier parses each name using linguistic and semantic heuristics in order to find the valid variants, and then searches the document for valid matches of these variants.
  • the variants are identified as follows:
  • Stichting Pandora—“stichting” means “foundation” in Dutch.
  • the classifier thus identifies this name as a Dutch name. In Dutch, German and other languages, such a head word is usually the first word in the name, and the classifier therefore identifies “Pandora” automatically as a valid variant or synonym. (Such a rule would not generally be correct in English). In addition, the classifier distinguishes between valid contextual instances of this variant (“Pandora” as above) and invalid instances (“It is a Pandora's box” or “Pandora Smith”).
  • the classifier identifies the reference “INFM” although it is not the full name, but rather an acronym. In this case the acronym is available within the customer listings. Even had it not been listed, however, the classifier would have created it automatically by eliminating stop-words (prepositions and articles) within the Italian name (“per”, “la”, “della”), and taking the first letters of the rest of the words. The fact that the acronym is listed in Table I, however, will raise the score (confidence level) that the classifier attaches to this reference.
  • the classifier identifies the first token of the name, “Avnet”, as a valid reference, since the name is identified as a standard English name, and the token is not identified as ambiguous (i.e., a word ordinarily used in another context).
  • the classifier identifies the first word (stichting or istituto, meaning “institute” in Italian) as a meaningful keyword, which is less likely to be used alone as a reference to the business object.
  • In the name Israel Corporation, the token Israel alone is most likely to refer to the country and not to the company known as Israel Corporation.
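  • The acronym rule described above for the Italian customer name (drop stop-words, then take the first letter of each remaining word) might be sketched as follows. The stop-word set shown is a small hypothetical sample, and the function name is an assumption.

```python
# Small illustrative sample of Italian prepositions and articles.
ITALIAN_STOP_WORDS = {"per", "la", "della", "di", "il", "e"}


def make_acronym(name, stop_words):
    """Build an acronym from the first letters of the non-stop-word tokens."""
    return "".join(
        token[0].upper()
        for token in name.split()
        if token.lower() not in stop_words
    )
```

Applied to Istituto Nazionale per la Fisica della Materia with this stop-word set, the function returns INFM, matching the acronym in the customer listing.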
  • the classifier automatically assigns for each person's name the appropriate list of possible nicknames, based on that person's full name and the corresponding nicknames that are common in various cultures.
  • the name “Dave” may theoretically refer to 2 employees: David Carlisle Fisher and David Jefferson Smith. Since the email includes a safer (more complete) version of David Fisher's name (David C. Fisher), however, and his e-mail (david.fisher@samplecorp.com), the classifier concludes that the nickname Dave should be resolved solely as an instance of David Carlisle Fisher.
  • the name Jose Rodriguez is extracted by entity extraction module 58 . Based on the context within the mail: “(his mail is renownedas@cruzroja.cl),” the classifier identifies him as a contact person or employee of the customer Cruz Roja Chilena. Once this link is extracted from the mail, it can be stored within the CRM database for future use.
  • a business object reference may also be required to obey certain constraints regarding the context within the document in which the string occurs. Examples of such context-based constraints include:
  • the variant string must not be a part of another name or entity.
  • Constraint 1 above is not required, because the morphological role of such a property is different (only numerical).
  • Other constraints, however, are usually required for numerical (or ID) properties, such as contextual cue-based constraints:
  • the reference business object candidate must be prefixed by a (lexicon-based) term indicating that this string indeed refers to the business object property.
  • the string (reference candidate) must be prefixed by one of the strings: “phone:”, “tel.”, “call us at”, “1-800-”, “1-808-”, etc.
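  • A cue-based constraint of the kind described above might be sketched as follows for telephone numbers. Only the textual cues are modeled; prefixes such as “1-800-”, which form part of the number itself, are left out of this sketch. The names `PHONE_CUES` and `has_phone_cue` and the `window` parameter are hypothetical.

```python
# Lexicon of textual cues that may precede a phone number (illustrative subset).
PHONE_CUES = ("phone:", "tel.", "call us at")


def has_phone_cue(text, start, window=20):
    """Return True if the text just before offset `start` ends with a known cue."""
    preceding = text[max(0, start - window):start].lower().rstrip()
    return any(preceding.endswith(cue) for cue in PHONE_CUES)
```

A numerical candidate that fails this check would not be linked to the business object, even if its digits match a stored phone number.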
  • Business object analyzer function 52 validates each new business object before inserting it into repository 36 in order to avoid incorrect entries that might harm the classification and tagging process (including both naive errors/junk and malicious content). For example, an entry such as “Build Master” found within the organization Active Directory cannot be a valid employee name. If the language of the business object name is specified or can be identified automatically, the analyzer uses this language in validating the business objects. Otherwise, English is used as the default language.
  • a new business object may be classified as either:
  • the Location Name (in any language) may include only letters or the following characters: -, ., (,).′,: ‘, ’, ’or ‘.
  • a business object whose name (in any language) includes such characters will be considered invalid.
  • FIG. 5 is a flow chart that schematically illustrates a method for deriving scores of business objects in a document, in accordance with an embodiment of the present invention.
  • the method uses tagging results 100 that were previously assembled by tagging business object names (including variants) in the document.
  • Server 22 gets all tagged business objects from the document, at an object fetching step 102 , and adds each business object to the appropriate listing in repository 36 , at an object recording step 104 .
  • the server checks whether there are any more objects to process, at an object checking step 106 . When all the objects have been completed, the classification results are saved, at a result storage step 108 , and the method terminates.
  • the server looks up all relations for the current business object, at a relation fetching step 110 .
  • the server checks whether the current business object has further relations to handle, at a relation checking step 112 . If there are no further relations, the process returns to step 106 to take the next business object or terminate.
  • the server checks whether the current relation is of a type that can cause the document to be assigned a score with respect to the related business object, at a relation checking step 114. If not, the method returns to step 112. If there is a score to be assigned, the server loads the relation scoring routine, at a scorer loading step 116. This routine assigns a score for the related business object, at a score assignment step 118. This score is typically the maximum between any existing score that has already been assigned to the document for the related business object and a new score that may be derived from the current business object.
  • the related business object is inserted or updated in repository 36, at an object updating step 120, and the new score for this business object is added to the repository, as well, at a score recording step 122.
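  • The loop of FIG. 5 might be sketched as follows. This is an illustration under stated assumptions, not the routine of the embodiments: the function and argument names, the relation types and the scorer shown in the test data are hypothetical, the repository is reduced to a dictionary, and scores are derived only from directly tagged objects (not transitively).

```python
def derive_related_scores(tagged_scores, relations, scorers):
    """Derive document scores for related business objects.

    tagged_scores: object id -> score of the document for that tagged object
    relations:     object id -> list of (relation_type, related_object_id)
    scorers:       relation_type -> function mapping a score to a derived score
    """
    results = dict(tagged_scores)
    for object_id, score in tagged_scores.items():                      # steps 102-106
        for relation_type, related_id in relations.get(object_id, []):  # steps 110-112
            scorer = scorers.get(relation_type)                         # steps 114-116
            if scorer is None:
                continue  # this relation type does not yield a score
            new_score = scorer(score)                                   # step 118
            # Keep the maximum of any existing score and the new one.
            results[related_id] = max(results.get(related_id, 0.0), new_score)
    return results
```

For example, a document tagged for an employee could receive a lower derived score for that employee's department through a membership relation, while a relation type with no registered scorer would be skipped.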

Abstract

A computer-implemented method for processing information includes collecting data objects from one or more data repositories, the data objects having respective properties, which identify the data objects. The properties of the collected data objects are analyzed in order to derive respective identifiers corresponding to the data objects. A text string that matches one of the identifiers of a data object is identified within a context in a document. Responsively to the context, an indication that the identified text string is a valid instance of the data object is generated, and the document is processed responsively to the indication.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application claims the benefit of U.S. Provisional Patent Application 60/968,329, filed Aug. 28, 2007, whose disclosure is incorporated herein by reference.
  • FIELD OF THE INVENTION
  • The present invention relates generally to information processing, and specifically to methods and systems for indexing and searching documents.
  • BACKGROUND OF THE INVENTION
  • Organizations, such as business enterprises, typically accumulate vast amounts of data, including both structured data, such as database and spreadsheet records, and unstructured data, in the form of natural language text (also referred to as “free text”). Structured data can be efficiently indexed, addressed and searched using well-known tools, such as structured query language (SQL). Search tools for natural language documents, however, are limited for the most part to keyword-based techniques. As a result, searching a corpus of textual documents for a particular occurrence of a certain data object is frequently inefficient and time-consuming and may miss relevant occurrences of an object of interest, such as a person, company or product.
  • SUMMARY OF THE INVENTION
  • Embodiments of the present invention provide improved methods and systems for analyzing a set of data objects in a data repository of an organization, and using these data objects in tagging, classifying and then searching a corpus of data.
  • There is therefore provided, in accordance with an embodiment of the present invention, a computer-implemented method for processing information, which includes collecting data objects from one or more data repositories, the data objects having respective properties, which identify the data objects. The properties of the collected data objects are analyzed in order to derive respective identifiers corresponding to the data objects. A text string that matches one of the identifiers of a data object is identified within a context in a document. Responsively to the context, an indication that the identified text string is a valid instance of the data object is generated, and the document is processed responsively to the indication.
  • In another embodiment of the present invention, a computer-implemented method for processing information includes collecting data objects from one or more data repositories and identifying a respective record in the repositories corresponding to each of the data objects. One or more documents are processed so as to generate a listing of occurrences of the data objects in the documents. Upon detecting a change in the respective record corresponding to one of the data objects, the listing is automatically updated with respect to the one of the data objects. The documents are processed responsively to the listing.
  • Other embodiments of the present invention provide apparatus and computer software products for processing information.
  • The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram that schematically illustrates a system for exchange and management of data, in accordance with an embodiment of the present invention;
  • FIG. 2 is a flow chart that schematically illustrates a method for classifying and tagging documents according to business objects and indexing the documents according to the tagging results, in accordance with an embodiment of the present invention;
  • FIG. 3 is a flow chart that schematically illustrates a method for searching a set of documents, in accordance with an embodiment of the present invention;
  • FIG. 4 is a flow chart that schematically illustrates a method for updating a business object, in accordance with an embodiment of the present invention; and
  • FIG. 5 is a flow chart that schematically illustrates a method for scoring business objects, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS Overview
  • Embodiments of the present invention that are described hereinbelow provide apparatus, methods and software for document and knowledge management within an organization. The methods focus on analyzing a set of data objects in the organization's data repositories, which are then used in tagging, classifying, indexing and searching a corpus of documents that is maintained by the organization. In a business, for example, the data objects may refer to entities of importance to the business, such as employees, products, customers and suppliers of the business, departments or units of the organization, or geographical areas. For convenience and clarity, the description that follows will relate to these sorts of data objects, which will be referred to hereinafter as “business objects.” The principles of the present invention, however, are similarly applicable to organizations and data objects of other types.
  • Each business object is identified by properties that typically include one or more names, which may comprise multiple words. In addition, each business object typically has additional properties, such as synonyms, e-mail address, physical address, job title with organization name and/or affiliation, telephone number, and ID numbers, which may be useful in identifying occurrences of the business object in documents. There may also be links and dependencies between the business objects. (For example, a person business object might be an employee of an organization business object, such as a customer or supplier.) Business objects are also dynamic, and their properties may change during their life cycle.
  • Some business object properties may be numerical or fixed strings, but many properties, such as business object names, are open, literal, natural language strings, and thus are more complex and error-prone. In addition, people tend to automatically shorten natural language names and/or to create from them synonyms and other variants, according to various lexical and semantic rules. Frequently the different names or parts of a name of a given business object may be used separately and independently. For example, business objects may be referred to in documents by partial names, such as the first name or nickname of a person, an abbreviation of a product, or an organization name without the usual prefix or suffix.
  • Thus, in the context of a document (and particularly a natural language document), it may not be clear whether a certain text fragment, such as a word, that is known to be related to the name of a given business object actually represents an instance of the business object. In some cases a text string may be found in a document that matches a name or a variant of a name of a business object, while the actual semantic meaning of the string in the specific context of the document does not refer to the business object. (For example, the phrase “nice systems,” without further information, could either refer to the company NICE Systems or describe the qualities of certain products unrelated to the particular company.) Some embodiments of the present invention address this problem by using natural language processing to ascertain automatically, based on the context (and without human intervention in most cases), whether or not the text string in question actually refers to a certain data object.
  • In embodiments of the present invention, business objects are identified automatically by processing data repositories of the organization. Typically, the business objects are identified in sources of structured data, such as databases, CRM (Customer Relation Management) systems, or other similar organizational systems and spreadsheets. In some embodiments, business objects are also extracted from documents containing unstructured data, in which case the “record” with which the business object is associated is the document or portion of the document from which the business object was extracted. Furthermore, in many organizations, different types of business objects may be managed in different systems, which may include duplicates and errors. Therefore, in the disclosed embodiments, data are collected and compared from various repositories, and are then analyzed to create a unified listing of the business objects across the organization. This analysis uses natural language tools, such as lexical, linguistic and semantic analysis, to find identifiers, including variants that are different from the actual names of the business objects, that may identify the business objects in a document. As explained in detail hereinbelow, these variants may be based either on the object names or on other object properties.
  • For purposes of management and updating of the centralized listing of business objects, each business object may be associated with one or more corresponding source records in the structured data. (A business object may be present in several organizational systems. For example, an employee may have a record both in Windows Active Directory and in the organization HR system.) For each business object, each such source record (with its corresponding ID) is listed. When a change occurs in a source record or when a new source record is found to correspond to an existing business object, the business object is then updated accordingly, without any need for human intervention. If required, the relevant documents are re-tagged, so that subsequent searches use the most up-to-date information regarding all the business objects in the set.
  • System Description
  • FIG. 1 is a block diagram that schematically illustrates a system 20 for exchange and management of data, in accordance with an embodiment of the present invention. System 20 is typically maintained by an organization, such as a business, for purposes of exchanging, storing and recalling data used by the organization. A data classification and search server 22 identifies business objects and builds a listing, such as an index, for use in searching the data, as described in detail hereinbelow.
  • System 20 is typically built around an enterprise network 24, which may comprise any suitable type or types of data communication network, and may, for example, include both intranet and extranet segments. A variety of servers 26 may be connected to the network, including mail and other application servers, for instance. Storage repositories 28 are also connected to the network and typically contain both structured and unstructured data. The structured data may include a variety of databases, such as product databases, human resources (HR) databases containing records of personnel of the organization, and customer relations management (CRM) databases containing records of customers of the organization, as well as their orders and payment records. Additionally or alternatively, structured data may be organized and stored in other forms and formats that are known in the art, such as spreadsheets. Servers 26 and repositories 28 are accessible to client computers 30 via network 24.
  • Server 22 connects to network 24 via a suitable network interface 32. The server typically comprises one or more general-purpose computer processors, which are programmed in software to carry out the functions that are described herein. This software may be downloaded to server 22 in electronic form, over a network, for example. Alternatively or additionally, the software may be provided on tangible storage media, such as optical, magnetic or electronic memory media. Although server 22 is shown in FIG. 1, for the sake of simplicity, as a single unit, in practice the functions of the server may be carried out by a number of different processors, such as a separate processor (or even a separate computer) for each of the functional blocks shown in the figure. Alternatively, the functional blocks may be implemented simply as different processes running on the same computer. All such alternative configurations are considered to be within the scope of the present invention.
  • Server 22 comprises a classifier 34, which automatically assembles a listing of business objects based on information in repositories 28 (and possibly other sources, as well), and then tags the documents in system 20 according to instances of the business objects that occur in the documents. The classifier recognizes and resolves variant forms of the business object names, such as shortened names and abbreviations, using techniques of natural language processing, and may assign confidence scores to instances of the business objects depending on the level of certainty that a given variant actually refers to the business object in question. A crawler 38 collects documents from system 20, and classifier 34 builds an index of the documents, for use in subsequent search and update operations, according to occurrences of the business objects in the document text. Classifier 34 stores the business object listing and index in an internal repository 36, which typically comprises a suitable storage device or group of such devices. Details of the processes of identifying data objects and tagging documents are described further hereinbelow with reference to FIG. 2.
  • In addition, classifier 34 may create a general index of strings appearing in the documents, for purposes of subsequent keyword-based searching, as is known in the art.
  • A searcher 40 receives requests, typically from client computers 30, to search the documents in system 20 for a certain business object or combination or type of business objects. The search queries may also specify keywords, in addition to the business objects, as well as logical operators connecting the business objects and (optional) keywords in the queries. The searcher extracts documents from system 20 that contain instances of the business objects specified by the query and scores each document according to factors such as the number of occurrences of the business objects and the confidence level. The score may also reflect occurrences of specified keywords in the documents, as well as factors such as document type and metadata. Searcher 40 ranks the documents according to their scores and returns the result to the requesting client. Details of the search process are described hereinbelow with reference to FIG. 3.
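  • Document ranking by searcher 40 might be sketched as follows. The scoring formula is an assumption made purely for illustration (a sum of per-instance confidence levels plus a weighted keyword count); the description names the contributing factors but does not give a formula, and factors such as document type and metadata are omitted.

```python
def score_document(occurrences, keyword_hits=0, keyword_weight=1.0):
    """occurrences: confidence values (0-100), one per tagged instance
    of a queried business object in the document."""
    return sum(c / 100.0 for c in occurrences) + keyword_weight * keyword_hits


def rank_documents(docs):
    """Sort documents by descending score; each doc is a dict with an
    'occurrences' list and an optional 'keyword_hits' count."""
    return sorted(
        docs,
        key=lambda d: score_document(d["occurrences"], d.get("keyword_hits", 0)),
        reverse=True,
    )
```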
  • Business Object Tagging and Search
  • FIG. 2 is a flow chart that schematically illustrates a method for classifying and tagging a set of documents according to a business object set, in accordance with an embodiment of the present invention. The method is described, for the sake of clarity, with reference to the system architecture shown in FIG. 1, but the principles of this method may similarly be applied in tagging and indexing of data objects in other applications. Examples of some of the functions shown in FIG. 2 are described below in the Appendices.
  • To begin the process, crawler 38 loads business objects to classifier 34 from repositories 28 of system 20. These repositories may include, for example, records maintained by applications such as a CRM system and a HR system, as well as computer system management applications, such as Microsoft Active Directory. Such applications typically have an application program interface (API), which the crawler can use to access the tables of business objects and their properties. Classifier 34 uses the information provided by the crawler to build a table of each type of business objects, such as customers, employees, products, etc. The crawler continually samples repositories 28 in order to report changes in the business object listings.
  • In some embodiments, crawler 38 also loads and analyzes, for each business object, permission and access control details, specifying which users are allowed to view the business object and its details and which users are allowed to change these details. The crawler converts the access list into a standard Access Control List (ACL) form and saves the ACL in the business object repository.
  • Classifier 34 may also identify new business objects in unstructured documents, as described below in reference to step 58. This identification is typically based on morphological, syntactic and semantic analysis of the document using appropriate rules.
  • When crawler 38 delivers a new business object from one of repositories 28, classifier 34 activates a business object (BO) comparer function 50 to compare the new business object to the business objects that are already listed in repository 36. The comparer function calculates a similarity factor between the new business object and each of the existing business objects. If the factor is above a high threshold, the classifier will treat the two business objects as identical, i.e., as alternative names of the same object. The classifier will then merge the record of the new business object into the record of the existing business object that it matched. Some examples of rules that may be used in calculating and applying similarity factors are presented in Appendix A below.
  • On the other hand, even when the similarity factor is not high enough to support a conclusion that the new object is identical to an existing object, there may still be a similarity relation between the objects if the factor is above a certain lower threshold. In this case, the classifier adds the new business object to the list in repository 36 and records a similarity relation between the new and existing business objects. This similarity relation is used subsequently in tagging and scoring occurrences of the business objects in documents, as described hereinbelow.
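  • By way of illustration only, the two-threshold decision described above may be sketched as follows. The threshold values, the function names, and the use of a simple string ratio as the similarity factor are assumptions made for this sketch; the rules actually contemplated are those of Appendix A.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: the description only states that a "high" and a
# "lower" threshold exist, not their values.
MERGE_THRESHOLD = 0.9
SIMILAR_THRESHOLD = 0.6


def similarity(a: str, b: str) -> float:
    """Stand-in similarity factor: case-insensitive string ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(new_name: str, existing: list[str]) -> tuple[str, list[str]]:
    """Return ("merge", [match]), ("similar", related_names) or ("new", [])."""
    scored = [(similarity(new_name, name), name) for name in existing]
    best = max(scored, default=(0.0, ""))
    if best[0] >= MERGE_THRESHOLD:
        # Treated as an alternative name of the same business object.
        return "merge", [best[1]]
    # Otherwise the object is added as new, with any similarity relations recorded.
    related = [name for score, name in scored if score >= SIMILAR_THRESHOLD]
    return ("similar", related) if related else ("new", [])
```

  • In this sketch, "Acme Corporation" compared against an existing "Acme Corp" falls between the two thresholds, so the new object would be added and a similarity relation recorded rather than merged.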
  • The comparer function may also discover and record other relations between business objects. For example, it may find that two employees share a telephone number, or that two organizations share a domain name. These relations may also be used in tagging and scoring, and may in addition be queried directly by clients.
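  • As a minimal sketch of how such shared-property relations might be discovered, the following groups business objects by the value of one property. The record layout (a dictionary of properties per object ID) and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_property_relations(objects: dict[str, dict[str, str]], prop: str):
    """Return (id_a, id_b, prop, value) for every pair sharing a property value."""
    by_value = defaultdict(list)
    for bo_id, props in objects.items():
        value = props.get(prop)
        if value:  # ignore objects that lack the property
            by_value[value].append(bo_id)
    relations = []
    for value, ids in by_value.items():
        for a, b in combinations(sorted(ids), 2):
            relations.append((a, b, prop, value))
    return sorted(relations)
```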
  • Various methods and considerations may be used in computing similarity factors, and the scoring formula may vary depending on the type of business object involved. For example, if two business objects of type “person” have the same social security number, they can be assumed to be one and the same. If two customers have identical postal addresses, they may be considered to be the same business objects, although if they share only the same city, street and building number, they may receive a lower similarity factor.
  • A business object analyzer function 52 of classifier 34 uses the information provided by crawler 38 and comparer function 50 in building, for each business object, a set of identifiers, including variants, that will serve as the basis for tagging instances of the business object in the documents in system 20. Each business object is typically identified by a name and appropriate additional properties, such as ID number, telephone number, e-mail address, etc. A listing of representative properties for different business object types is presented below in Appendix B. To generate the possible variants, the analyzer parses the name and other properties in order to create the set of partial names, synonyms and acronyms that may refer to instances of the business object in the documents in system 20. The name and properties may be specified separately in different languages if necessary, and the analyzer may automatically identify the language as part of the parsing process.
  • Further aspects of the operation of the business object analyzer function are described below in the section headed “Business Object Analysis and Validation.”
  • Classifier 34 stores the listings of business objects and their properties in internal repository 36, as noted above. These listings are typically not static, but rather are updated continually in response to changes occurring in the records and other documents in repositories 28. A method for updating business objects is described hereinbelow with reference to FIG. 4.
  • Classifier 34 applies the business object listings described above in tagging instances of business objects that occur in documents 56, which are collected by crawler 38 from system 20. A basic tagger 54 loads the list of business objects from repository 36, including all the possible variants, and searches each document for the patterns corresponding to the business object name and variants. Tagger 54 may also use other lexicons of relevant terms, such as common first names and common organizational suffixes (such as “corp.”), in addition to the business object names, as well as vocabularies and/or regular expressions. Tagger 54 typically analyzes the tokens (such as words of natural language text) appearing in each document both typographically and morphologically for similarity to the names that are to be tagged. The list of possible variants of a particular business object that is to be used for this purpose may be adjusted according to the language of the document that is to be tagged.
  • The tagger also checks the context of each business object name or variant that is found in the document to make sure that the reference is valid. For example, before identifying an occurrence of the name “Pandora” in a document as referring to a customer by this name, the tagger checks to ensure that “Pandora” is not part of another name, such as of a person named “Pandora Smith.”
  • The tagger tags each name that may be an instance of a given business object both with the business object name and with a confidence score. For this purpose, the document is converted to text and then tokenized, i.e., separated into single words. For each token, the relevant features are saved, such as typographical features (alphabetic token or numerical, capitalized or not, etc.) and part of speech (proper noun, noun, verb, etc.) Each token may also be compared to relevant lexicons, as noted above. Typically, full names receive higher scores than partial or abbreviated names, and the score may be increased or decreased based on the nature and number of variants of the business object in question that appear in the document being tagged.
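  • The tokenization and typographical feature extraction described above may be sketched as follows. This is an illustration under simplifying assumptions: tokens are runs of word characters, and part-of-speech tagging and lexicon lookup are omitted. The function and field names are hypothetical.

```python
import re


def token_features(text: str) -> list[dict]:
    """Split text into word tokens and record simple typographical features."""
    features = []
    for match in re.finditer(r"\w+", text):
        token = match.group()
        features.append({
            "token": token,
            "offset": match.start(),       # character offset within the text
            "alphabetic": token.isalpha(),
            "numeric": token.isdigit(),
            "capitalized": token[0].isupper(),
        })
    return features
```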
  • As noted above, classifier 34 may encounter business objects in documents 56 that are not included in the listings in repository 36. To deal with such objects, as well as other object-related entities, the classifier invokes an entity extraction module 58. This module applies rule-based natural language processing to identify and extract business entities such as persons and organizations, as well as ancillary data entities, such as locations, dates, telephone numbers, etc., which may refer to business objects. The classifier may use the extracted entities to support identification of existing business objects or may add new business objects to the listings in repository 36 based on the extracted entities, either automatically or interactively with the support of a system manager, for example.
  • Classifier 34 also actuates a relation extraction module 60 in order to identify relations between business objects (or other entities) and other entities or properties appearing in documents 56. This module may, for example, extract relations such as company location (or headquarters), which identifies the relation between a company business object and a place; or affiliation/employment, which identifies the organization at which a person business object is employed and his position in the organization.
  • After the business objects in a given document have been tagged, and entities and relations have been extracted, a resolver function 62 determines which business objects are actually referenced in the document. The resolver is typically invoked to resolve ambiguities, which may occur when a given string may refer to more than one business object (as when two persons have the same name), or when it is not certain that a name extracted from the document actually matches a business object that it resembles. The resolver computes a score for each business object to which the ambiguous entity might refer. The score may be based, for example, on how fully a partial name in the entity matches the full name of the business object or on other information appearing in the document that may be more relevant to one business object or the other. Typically, the resolver chooses to tag the ambiguous string as an instance of the business object with the higher score.
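  • One possible form of the resolver's scoring is sketched below. The scoring formula (name coverage plus a small bonus per contextual cue), the cue-word sets, and the 0.1 bonus are assumptions for illustration and are not taken from the description.

```python
def resolve(mention: str, context: str, candidates: dict[str, set[str]]) -> str:
    """Pick the candidate business object with the highest score.

    candidates maps a full business object name to a set of lowercase cue
    words (a hypothetical representation of "other information appearing in
    the document").
    """
    mention_tokens = set(mention.lower().split())
    context_tokens = set(context.lower().split())

    def score(full_name: str) -> float:
        name_tokens = set(full_name.lower().split())
        # How fully the partial name matches the full name.
        coverage = len(mention_tokens & name_tokens) / len(name_tokens)
        # Contextual evidence favouring this candidate.
        cue_hits = len(context_tokens & candidates[full_name])
        return coverage + 0.1 * cue_hits

    return max(candidates, key=score)
```

  • For instance, with candidates "David Fisher" (cue word "finance") and "David Cohen" (cue word "sales"), the mention "David" in a sentence about a finance report would resolve to "David Fisher".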
  • After tagging of entities in the text of document 56 is completed, classifier 34 may apply score derivations 64 in order to add relevance tags to the document for other business objects that do not occur explicitly in the document. For this purpose, the classifier typically computes relevance scores of other business objects that are related to the business objects occurring in the document. Relations that may be used in score derivation include, for instance, similarity (as explained above), container relations (one entity contains another), hierarchical relations, and affinity relations (such as the affinity between a customer and an invoice issued to the customer).
  • For example, if the classifier has found that a given document refers to a person who works in the finance department of the business, it may give the document a certain relevance score with respect to the finance department business object, even if the finance department is not mentioned in the document. Typically, the related business object, such as the finance department in this example, will receive a lower relevance score than the actual business object in the document. In the present example of a container relation, the derived score that is assigned to the finance department may drop in inverse proportion to the size of the department.
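  • A minimal sketch of such a derived score for a container relation follows. The base factor of 0.5 and the exact formula are assumptions; the description states only that the derived score is lower than the direct score and may fall in inverse proportion to the container size.

```python
def derived_score(member_score: float, container_size: int,
                  base_factor: float = 0.5) -> float:
    """Relevance of a document to a container (e.g., a department), derived
    from the score of one of its members that is tagged in the document."""
    if container_size < 1:
        raise ValueError("container must have at least one member")
    # Inversely proportional to the number of members in the container.
    return member_score * base_factor / container_size
```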
  • Classifier 34 classifies each document 56 according to the business objects that it has tagged in and with respect to the document, and stores the results in a classification repository 66 (which may be part of repository 36). The classification results may be organized in an inverted index of business objects for use in subsequent searching. The tagged document itself may be stored in a document repository 68 (which may also be part of repository 36). Rather than storing the entire document, however, it may be sufficient for the classifier to store document metadata, containing the tag information for the document and pointing to the location of the document in system 20. Each instance is thus saved and later retrieved by the document ID of the document in which it was found and the character offset (i.e., the index of the character within the document at which the instance begins).
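  • The inverted index of business objects, keyed by document ID and character offset, may be sketched as follows. The class and method names are hypothetical, and an in-memory dictionary stands in for classification repository 66.

```python
from collections import defaultdict


class ClassificationIndex:
    """Inverted index: business object ID -> (doc_id, offset, confidence)."""

    def __init__(self) -> None:
        self._postings: dict[str, list[tuple[str, int, float]]] = defaultdict(list)

    def add_instance(self, bo_id: str, doc_id: str, offset: int,
                     confidence: float) -> None:
        """Record one tagged instance of a business object."""
        self._postings[bo_id].append((doc_id, offset, confidence))

    def documents_for(self, bo_id: str) -> set[str]:
        """All documents in which the business object was tagged."""
        return {doc_id for doc_id, _, _ in self._postings.get(bo_id, [])}

    def instances_in(self, bo_id: str, doc_id: str) -> list[tuple[int, float]]:
        """(offset, confidence) of each instance within one document."""
        return [(off, conf) for d, off, conf in self._postings.get(bo_id, [])
                if d == doc_id]
```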
  • Appendix C below presents an example of tagging a sample document using the methods described above.
  • FIG. 3 is a flow chart that schematically illustrates a method for searching the set of documents in system 20, in accordance with an embodiment of the present invention. The search is performed by searcher 40 after the documents have been tagged and indexed according to the method of FIG. 2.
  • Searcher 40 receives a search query, typically from one of client computers 30, at a search input stage 70. Typically, the user of the client computer inputs the search terms and limitations via a suitable graphical user interface (GUI), and a program running on the computer converts the query to a structured form that is accepted by searcher 40. Alternatively, the user may compose the query directly in this structured form.
  • As part of the query, the user specifies one or more business objects, at an object specification step 72. These objects may be chosen by the user from a list of the objects held in repository 36, or they may alternatively be entered manually by the user. In the latter case, the user may, for example, enter a partial name or nickname, and searcher 40 then automatically identifies the corresponding business object in repository 36 using techniques similar to those described above as part of the tagging process. Additionally or alternatively, the user may specify that the search should be conducted over all business objects in a certain group or of a certain type, such as all customers in a given geographical area or all employees in a given department.
  • The user may also specify one or more keywords, in the form of a word or a phrase, at a keyword input step 74, as in text-based search engines known in the art. The business objects that were specified at step 72 and the keywords, if any, specified at step 74 may be joined by logical operators, which are specified by the user at a logic specification step 76. Such operators may include, for example, AND, OR, NOT, and may group the search terms into sub-queries. The user may also specify scoring refinements, indicating how much weight searcher 40 should give each part of the query in computing document scores.
  • Searcher 40 scores the documents in the repository or repositories of system 20 according to the search query, at a document scoring step 78. Typically, this stage in the process uses the indices of business objects and, if appropriate, keywords that have been stored in repository 36. As noted earlier, for each business object in the query, each instance occurring in a given document contributes to the score of that document, wherein the contribution depends, inter alia, on the level of confidence with which the business object was identified in the document. Furthermore, if the document contains a business object that is related to one of the business objects in the query (as identified by relation extraction module 60), the related business object may also contribute to the document score.
  • The final score of each document is typically a weighted sum of the object scores, which are generated by matching the business objects in the query to the document, and of the keyword scores, due to matching of keywords in the document. In general, the object scores receive greater weight, although the weights may be adjusted based on user preference and application requirements. Searcher 40 ranks the documents according to the scores, at a ranking step 80, and returns the ranked results to the user. Typically, the searcher returns a certain number of the documents that had the highest scores, or all documents with scores above some threshold.
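  • The weighted-sum scoring and ranking described above may be illustrated as follows. The weights 0.7 and 0.3, the function name, and the input layout (per-document score dictionaries) are assumptions for this sketch.

```python
def rank(object_scores: dict[str, float], keyword_scores: dict[str, float],
         object_weight: float = 0.7, keyword_weight: float = 0.3,
         top_n: int = 10) -> list[tuple[str, float]]:
    """Combine per-document object and keyword scores and return the top_n."""
    docs = set(object_scores) | set(keyword_scores)
    totals = {
        doc: object_weight * object_scores.get(doc, 0.0)
        + keyword_weight * keyword_scores.get(doc, 0.0)
        for doc in docs
    }
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

  • A score threshold could be applied in place of, or in addition to, the top_n cut-off.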
  • The search results may be filtered by searcher 40 according to applicable permission (access control) list constraints, which are saved in repository 36 for both documents and business objects. The searcher checks and applies these constraints in a manner that is transparent to the user: If the user is not authorized to view or access a certain document, the document will not be included within the user search results. If the user is not authorized to view a certain business object, that business object will not be included within the business object tree displayed to the user (although nothing will change in the business object repository itself) and the business object tagging results referring to the business object within the searched document(s), if any, will not be displayed to the user.
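  • A minimal sketch of this permission filtering is given below. The representation of the ACLs as sets of user names, and the rule that a missing ACL entry denies access, are assumptions made for illustration.

```python
def filter_results(user: str,
                   results: list[tuple[str, list[str]]],
                   doc_acl: dict[str, set[str]],
                   bo_acl: dict[str, set[str]]) -> list[tuple[str, list[str]]]:
    """Drop documents the user may not view, and hide business object tags
    the user may not view within the remaining documents."""
    visible = []
    for doc_id, tags in results:
        if user not in doc_acl.get(doc_id, set()):
            continue  # document is omitted from the results entirely
        allowed_tags = [t for t in tags if user in bo_acl.get(t, set())]
        visible.append((doc_id, allowed_tags))
    return visible
```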
  • Business Object Analysis and Validation
  • Business object analyzer function 52 (referred to hereinbelow simply as the business object analyzer) analyzes the business object name, its properties and its linked and related business objects, if available, in order to identify a complete list of variants (also referred to as variations). The variants include all strings that may (theoretically) be used within the text of a document as a reference to the business object. Each variant is a pair consisting of a search string and a set of context-based, natural language constraints that must be met in order for a given instance of the variant (a business object candidate reference) to be considered a valid reference to that business object. The constraints are optional and differ from one type of business object to another. Examples of such constraints are listed below in Appendix D. The business object analyzer analyzes the business object name and its properties before any document is processed and prepares the variants to be used later, when documents are actually processed.
  • It is possible to store each variant explicitly, as a list of pairs, each pair consisting of the search string and the required constraints. Since there are typically many business objects in repository 36, however, each with several variants, it is generally more efficient to store the variant pattern strings and, separately, the required constraints.
  • Each variant that may occur in a document is typically also assigned a score—a certainty level (expressed as a percentage, with 100% as maximum and 0% as minimum), indicating the likelihood that an instance of this variant is indeed a valid reference to the specific business object in question. Typically, the shorter the variant string, the more frequent it is within documents and texts in general, and the fewer the available contextual cues around the instance in a document, the lower will be the variant score. For example, the score of a variant of a person's full name (e.g., David Carlisle Fisher) will be higher than the score of the variant consisting of the very frequent first name David when occurring alone. Some examples of variants and their respective scores are listed below in Appendix A.
  • Once a business object candidate reference is found within the text of a document, it is linked to the business object if the candidate string is identical to a recognized variant string and if the specific instance within the text obeys the variant constraints.
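  • The linking rule above (identical string and satisfied constraints) may be sketched as follows. The sketch is limited to single-token variants, and the one constraint shown (the candidate must not be followed by a capitalized token, as in the "Pandora Smith" example) is a hypothetical stand-in for the constraints of Appendix D.

```python
from dataclasses import dataclass, field
from typing import Callable

# A constraint receives the token list and the candidate's index.
Constraint = Callable[[list[str], int], bool]


def not_followed_by_capitalized(tokens: list[str], i: int) -> bool:
    """Hypothetical constraint: reject "Pandora" inside "Pandora Smith"."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return not nxt[:1].isupper()


@dataclass
class Variant:
    search_string: str
    bo_id: str
    score: int  # certainty level, in percent
    constraints: list[Constraint] = field(default_factory=list)


def find_references(tokens: list[str], variants: list[Variant]):
    """Return (token_index, bo_id, score) for each valid candidate reference."""
    hits = []
    for i, token in enumerate(tokens):
        for v in variants:
            if token == v.search_string and all(c(tokens, i) for c in v.constraints):
                hits.append((i, v.bo_id, v.score))
    return hits
```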
  • The creation of the variants is based on analyzing the linguistic and semantic structure of the business object name and attributes. As noted earlier, in natural language texts the variant is often a shorter version of the name, instead of the full name. The ways in which such variants may be created and used by document authors are based mainly on the base type of the business object. For example, a person (employee) name is usually composed of a first name, last name and (possibly) middle name. In most references to a person, either the first name or the last name (depending on the context) is omitted.
  • In organization names, however, there is no formal division into parts. There is one name that often includes some semantically meaningful suffixes, and the business object name should be analyzed accordingly. The business object analyzer therefore uses rules to analyze each name lexically and linguistically, thus identifying which words or tokens are less important suffixes, which may be omitted, and which are “core” words that cannot be omitted. For example (as in Appendix C below), in the organization name Avnet Components Israel Ltd., all tokens except Avnet may be omitted. On the other hand, in Israel Corporation, no token may be omitted.
  • Further examples of types of variant strings and their respective scores are listed below in Appendix A, while typical constraints are given in Appendix D.
  • Analyzing Links and Relations between Business Objects
  • Since business objects are complex objects, each business object may contain other business objects, or be a member in another business object, or have another relation with a second business object. In an embodiment of the present invention, these relations are also processed by the business object analyzer, for subsequent consideration in classifying and tagging documents 56. In other words, when the business object analyzer deals with a given business object, it also identifies and marks the related business objects, as indicated by the organizational repository.
  • For example, if John Adams, director of Technical Services of Alcatel, is mentioned in an e-mail, then this e-mail may be tagged as having a reference to “Alcatel” (albeit with a relatively low score), even if Alcatel is not mentioned at all within the mail. Similarly, if a different e-mail was sent to the group “Alcatel Top Executives” of which “John Adams” is a member, then that mail should be tagged or classified with a reference to “John Adams”.
  • When documents are tagged and classified, each business object is searched directly according to its direct variants. After tagging is completed, however, the tagging results may also be used to derive the relevance score of the document to other, related business objects, as noted above. For example, if a document refers to David Fisher, and David Fisher is identified as working in the Finance Department, then the document is related (with lower score) to the Finance Department business object.
  • As noted earlier, the relations between business objects may be similarity relations, container relations (such as distribution list membership), or other relations. The score derivation formula may be configured according to the nature of the relation and of the business objects themselves. For example, in the case of the container relation, the derived score may be inversely proportional to the distribution list size. The score derivation algorithm is described in greater detail in Appendix F.
  • Business Object Name and Properties Validation
  • As noted earlier, business objects are loaded into server 22 from various organizational repositories 28, which may include old, irrelevant or even incorrect records. For example, entries such as “Build Master” found within an organization's Active Directory cannot be a valid employee name and should be excluded at an early stage. To avoid incorrect tagging and classification due to such records, the business object analyzer typically validates the business objects and their properties.
  • In an embodiment of the present invention, the business object analyzer validates business object names and properties using appropriate rules, which are written separately for each business object type according to its corresponding properties and semantic characteristics. For example, names of human beings should include only letters, possibly with some punctuation or connector characters, such as a hyphen or apostrophe.
  • The business object analyzer may use several levels of validation: The lowest level of validation is “wrong”, meaning that the business object is unacceptable and will not be used (for example, a numerical string as a person's name). A middle validation level or status may be “warning”, indicating that an important property is missing or seems to be problematic. (For example, “Build Master” as an employee name will generate a warning, since the name consists entirely of English dictionary words without a valid lexical first name.) The highest level is “correct”, for example, the employee name “David Fisher”.
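  • The three validation levels may be illustrated for person names as follows. This sketch simplifies the rule described above: a name draws a warning whenever its first token is absent from a first-name lexicon. The lexicon contents and the regular expression are assumptions for illustration.

```python
import re
from enum import Enum


class Validation(Enum):
    WRONG = "wrong"
    WARNING = "warning"
    CORRECT = "correct"


# Hypothetical, tiny first-name lexicon for illustration only.
KNOWN_FIRST_NAMES = {"david", "ronald"}

# Letters only, optionally joined by a hyphen, apostrophe or space.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[-' ][^\W\d_]+)*")


def validate_person_name(name: str) -> Validation:
    name = name.strip()
    if not NAME_PATTERN.fullmatch(name):
        return Validation.WRONG      # e.g., a numerical string
    if name.split()[0].lower() not in KNOWN_FIRST_NAMES:
        return Validation.WARNING    # no valid lexical first name
    return Validation.CORRECT
```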
  • Additional examples of validation rules are given in Appendix E below.
  • Business Object Life Cycle Management
  • FIG. 4 is a flow chart that schematically illustrates a method for updating business objects, in accordance with an embodiment of the present invention. The method begins with initial identification of a business object, at an object identification step 90. The business object is typically identified by analyzing a structured repository such as a database, a CRM or similar system or a spreadsheet that is maintained by the organization in question. Alternatively, the business object may be identified based on tagging unstructured documents. The processes by which such business objects are identified and recorded were described above in detail with reference to FIG. 2.
  • For each new business object, classifier 34 records a link to a source of information concerning the business object, at a link recording step 92. The link typically indicates the source record from which the business object was derived. For example, the source record may be an entry in a database or spreadsheet, or an Active Directory listing, or a page of a document on which the business object was found.
  • Crawler 38 periodically detects changes in the source records of business objects, at a change detection step 94. These changes may be detected, for example, by polling the source records of the business objects that are indicated in the business object listings. Alternatively or additionally, the crawler may receive event notifications from certain data sources, such as HR and CRM databases, when changes are made. In either case, the changes in the business object records may indicate, for example, a new address or telephone number of a person or company, or a newly-discovered nickname or abbreviation. Changes may also indicate deletion of existing business objects or addition of new ones.
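  • Change detection by polling may be sketched as a comparison of two snapshots of the source records. The snapshot layout (business object ID mapped to a property dictionary) and the function name are hypothetical.

```python
def diff_snapshots(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two polls of a source repository and report what changed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```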
  • Upon receiving an indication that a business object has changed, classifier 34 updates the information regarding the business object in the listing of business objects in repository 36, at a business object update step. The updated information is used in subsequent tagging and indexing of new documents, as described above with reference to FIG. 2. Furthermore, the classifier may use the updated business object information to update the tagging and indexing of documents that have already been indexed. For example, if the update indicates a new nickname for a given person, the classifier may tag and index occurrences of the new nickname as instances of the corresponding business object, which were not previously recognized. (This tagging and indexing could be performed by checking the general keyword index for occurrences of the string corresponding to the nickname, rather than going back over all the original documents.) Documents containing the nickname may subsequently be returned in searches that include this business object among the query terms.
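  • The parenthetical remark above, on using the general keyword index rather than re-reading the documents, may be sketched as follows. Both indices are represented here as plain dictionaries of (document ID, offset) postings, and the keyword index is assumed to be keyed by lowercase strings; these are assumptions for illustration.

```python
def retag_with_new_variant(keyword_index: dict[str, list[tuple[str, int]]],
                           bo_index: dict[str, list[tuple[str, int]]],
                           bo_id: str, new_variant: str) -> int:
    """Copy keyword postings of a newly learned variant (e.g., a nickname)
    into the business object index. Returns the number of postings added."""
    postings = bo_index.setdefault(bo_id, [])
    added = 0
    for posting in keyword_index.get(new_variant.lower(), []):
        if posting not in postings:  # skip instances already tagged
            postings.append(posting)
            added += 1
    return added
```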
  • It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
  • APPENDIX A
    RULES FOR CREATING VARIANTS
    BASIC (SUB-TYPE) BUSINESS OBJECT VARIATIONS
    Variation String | Instant Score | Example
    Name | 80% | Ben-Gurion Airport
    Name without all ‘-’ characters | 70% | Ben Gurion Airport
    BO Known Synonym (for each synonym, a separate variation) | 70% | BGN
  • EMPLOYEE (SUB-TYPE) BUSINESS OBJECTS: NAME-BASED
    VARIATIONS
    Variation String | Instant Score | Example
    FirstName + MiddleName + LastName | 100% | Ronald James Bleakney
    FirstName + MiddleName Initial + LastName | 90% | Ronald J Bleakney
    FirstName + MiddleName Initial + “.” + LastName | 90% | Ronald J. Bleakney
    LastName + FirstName + MiddleName Initial | 70% | Bleakney Ronald J
    FirstName + “.” + MiddleName + LastName | 70% | R. James Bleakney
    FirstName Initial + MiddleName Initial + LastName | 60% | R. J. Bleakney
    FirstName + LastName | 80% | Ronald Bleakney
    LastName + FirstName(2) | 60% | Bleakney Ronald
    Honorific(1) + LastName | 40%-60% (3) | Dr Bleakney
    Honorific(1) + “.” + LastName | 40%-60% (3) | Mr. Bleakney
    Honorific(1) + “.” + FirstName Initial + “.” + LastName | 70% | Mr. R. Bleakney
    Honorific(1) + “.” + FirstName Initial + LastName | 70% | Mr. R Bleakney
    Honorific(1) + FirstName Initial + “.” + LastName | 70% | Mr R. Bleakney
    Last Name (only) | 30%-50% (4) | Bleakney
    First Name Nick Name + Last Name (for each known nick name) | 60% | Ron Bleakney, Ronnie Bleakney
    First Name (only) | 20%-40% (5) | Ronald
    Nick Name (only), for each nick name | 20% | Ron, Ronnie . . .
  • EMPLOYEE (SUB-TYPE) BUSINESS OBJECTS: NON-NAME-BASED
    VARIATIONS
    Variation String | Instant Score | Example
    Employee Mail | 100% | Ron.Bleakney@radvision.com
    Employee Phone Number (personal, full number) | 70% | 618-4534232
    Employee ID Number (or Social Security number)(6) | 90% | 009516808
  • In addition, all the variations implemented for BASIC business objects are also implemented for EMPLOYEE business objects.
  • Notes to the above tables:
  • (1) Honorific: Mr, Mrs, Miss, Ms, Dr or Prof.
  • If Employee Gender is FEMALE, then Honorific may be only: Mrs, Miss, Ms, Dr or Prof.
  • If Employee Gender is MALE, then Honorific may be only Mr, Dr or Prof.
  • (2) This variation is not allowed if the last name is a “known” (lexicon-based) first name—in order to not confuse for example “Chaim Moshe” and “Moshe Chaim”.
  • (3) Last names are divided into 3 groups: very common last names (about 1% of the general population or more, such as “Smith”, or “Cohen” in the Israeli population), common last names (about 0.2% of the general population, such as “Biton” in the Israeli population), and other last names.
  • Score for the variation when the last name is a “very common last name”: 40% (“Mr Levi”).
  • Score for the variation when the last name is a “common last name”: 50% (“Dr Anwar”).
  • Score for the variation when the last name is other: 60%.
  • (4) Score is the respective score calculated in (3) less 10%.
  • (5) Similar division (to that described in (3)) exists for first names:
  • Score for the variation when the first name is a “very common first name”: 20% (“David”).
  • Score for the variation when the first name is a “common first name”: 30% (“Lucia”).
  • Score for the variation when the first name is other: 40%.
  • (6) The Employee ID must be at least six characters long or at least three characters long with the first character a letter.
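  • By way of illustration, a subset of the name-based EMPLOYEE variations and notes (2) through (4) may be generated as follows. The lexicons are tiny hypothetical stand-ins, and the honorific, initial-only and nickname variations are omitted for brevity.

```python
# Hypothetical lexicons; real ones would be population-specific.
VERY_COMMON_LAST_NAMES = {"smith", "cohen"}
COMMON_LAST_NAMES = {"biton"}
KNOWN_FIRST_NAMES = {"ronald", "david", "moshe", "chaim"}


def honorific_last_name_score(last: str) -> int:
    """Note (3): 40% very common, 50% common, 60% otherwise."""
    key = last.lower()
    if key in VERY_COMMON_LAST_NAMES:
        return 40
    if key in COMMON_LAST_NAMES:
        return 50
    return 60


def person_variants(first: str, last: str, middle: str = "") -> dict[str, int]:
    """Map each variant string to its instant score (in percent)."""
    variants = {f"{first} {last}": 80}
    if last.lower() not in KNOWN_FIRST_NAMES:  # note (2)
        variants[f"{last} {first}"] = 60
    if middle:
        variants[f"{first} {middle} {last}"] = 100
        variants[f"{first} {middle[0]} {last}"] = 90
        variants[f"{first} {middle[0]}. {last}"] = 90
    # Note (4): last name alone scores 10 points below the note (3) score.
    variants[last] = honorific_last_name_score(last) - 10
    return variants
```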
  • ORGANIZATION (SUB-TYPE) BUSINESS OBJECTS: NAME-BASED
    VARIATIONS
    Variation String | Instant Score | Example
    Organization (full) name | 80% | Inxight Software, Inc.
    Name with sure company suffix removed (1) | 70% | Inxight Software
    Name with unsafe company suffix removed (2) | 60% | Inxight
    Name after removing a suffix within parenthesis (3) | 60% | Nestle (Canday Division) => Nestle
    Name after removing country as suffix (possibly with “de”) (4) | 60% | Atcolx de Mexico => Atcolx
    First token of the name (5) | 50% | Agilis Computersysteme => Agilis
  • ORGANIZATION (SUB-TYPE) BUSINESS OBJECTS: NON-NAME-BASED VARIATIONS
    Variation String | Instant Score | Example
    Organization mail domain | 90% | @basistech.com
    Organization main phone number (full) | 70% | 617-386-2000
    Organization main fax number (full) | 70% | 617-386-2020
  • In addition, all the variations implemented for BASIC business objects are also implemented for ORGANIZATION business objects.
  • Notes to the above tables:
  • (1) This variation is not allowed if the resulting (remaining)string is a name of a country (example: “Israel Corporation”=>“Israel”) or another very common term (“grupo”, “asia”).
  • (2) This variation is not allowed if the resulting (remaining) string is a name of a country, a big city, state or nationality or a dictionary English word or other common term as in (1) (example: “British Telecom”=>“British” is not allowed).
  • (3) This variation is not allowed if the resulting (remaining) string is a name of a country or an English dictionary word.
  • (4) This variation is not allowed if the resulting (remaining) string is an English dictionary word.
  • (5) This variation is allowed only if the first token is not one of the following: a number, a known (lexicon-based) first name, a country, state or a big city name, a nationality or another very common (junk) term.
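The name-based organization variations and the restrictions in notes (1) to (5) can be sketched as a single function that returns each variant with its score. This is only an illustration: every lexicon below is a tiny hypothetical stand-in, the split between “sure” and “unsafe” suffixes is assumed, and the big-city, state and dictionary checks are reduced to what the small word sets cover. The name is assumed to be non-empty.

```python
import re

# Hypothetical mini-lexicons; a real system would use much larger resources.
SURE_SUFFIXES = {"inc.", "inc", "ltd.", "ltd", "corporation"}
UNSAFE_SUFFIXES = {"software", "telecom"}
COUNTRIES = {"israel", "mexico", "france"}
NATIONALITIES = {"british"}
COMMON_TERMS = {"grupo", "asia"}
FIRST_NAMES = {"david", "moshe"}
DICTIONARY = {"general", "analog"}


def org_name_variants(name):
    """Return {variant: score}, keeping the highest score for each variant."""
    variants = {name: 80}
    tokens = name.replace(",", " ").split()

    # Sure suffix removed (70%), subject to note (1).
    has_sure = len(tokens) > 1 and tokens[-1].lower() in SURE_SUFFIXES
    core = tokens[:-1] if has_sure else tokens
    rest = " ".join(core)
    if has_sure and rest.lower() not in COUNTRIES | COMMON_TERMS:
        variants.setdefault(rest, 70)

    # Unsafe suffix removed (60%), subject to note (2).
    if len(core) > 1 and core[-1].lower() in UNSAFE_SUFFIXES:
        rest = " ".join(core[:-1])
        if rest.lower() not in COUNTRIES | NATIONALITIES | COMMON_TERMS | DICTIONARY:
            variants.setdefault(rest, 60)

    # Parenthesized suffix removed (60%), subject to note (3).
    m = re.fullmatch(r"(.+?)\s*\([^()]*\)", name)
    if m and m.group(1).lower() not in COUNTRIES | DICTIONARY:
        variants.setdefault(m.group(1), 60)

    # Country removed as suffix, possibly with "de" (60%), subject to note (4).
    m = re.fullmatch(r"(.+?)(?:\s+de)?\s+(\w+)", name)
    if m and m.group(2).lower() in COUNTRIES and m.group(1).lower() not in DICTIONARY:
        variants.setdefault(m.group(1), 60)

    # First token (50%), subject to note (5).
    blocked = COUNTRIES | NATIONALITIES | COMMON_TERMS | FIRST_NAMES
    if len(tokens) > 1 and not tokens[0].isdigit() and tokens[0].lower() not in blocked:
        variants.setdefault(tokens[0], 50)
    return variants
```

Because the rules are applied from the highest score down and `setdefault` never overwrites, a variant reachable by several rules keeps its best score. For “Israel Corporation” only the full name survives, since every shorter variant collapses to a country name.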
  • PRODUCT (SUB-TYPE) BUSINESS OBJECTS: NON-NAME-BASED VARIATIONS
    Variation String | Instant Score | Example
    Product ID Number | 90% | 003456543-9
  • In addition, all the variations implemented for BASIC business objects are also implemented for PRODUCT business objects.
  • Location (Sub-Type) Business Objects: Name-Based Variations
  • The variations implemented for LOCATION business objects are solely the variations implemented for BASIC business objects. (LOCATION business objects, however, require context-based natural language constraints not required for BASIC business objects, as listed in Appendix D below. All LOCATION variations are considered name-based variations.)
  • Appendix B—Properties of Business Object Types
  • Business objects are dealt with and analyzed by type. Each type typically has its own properties and rules for analyzing the properties. Typical business object types include EMPLOYEE, ORGANIZATION, PRODUCT, LOCATION or BASIC (unknown/other base type).
  • The table below lists some of the properties of these business object types by way of example:
  • BO Attribute Name | Relevant BO Base Types | Description | Example
    BO Known Synonym | All | Known synonym of the BO | CTM 3.3 (for Click to Meet 3.3)
    Employee First Name | Employees | Employee first name | Michael
    Employee Middle Name | Employees | Employee middle name | Benjamin
    Employee Last Name | Employees | Employee last name | Jones
    Employee Mail | Employees | Employee mail | Michael.jones@radvision.com
    Employee Phone | Employees | Employee phone | 712-4534902
    Employee ID | Employees | Employee national ID number or social security number | 009516888
    Employee Gender | Employees | MALE or FEMALE | MALE
    BO Domain Name | Organizations | Organization web or email domain | videocentric.co.uk
    BO Phone Number | Organizations | Organization's main phone number | 44(0)118 9740125
    BO Fax Number | Organizations | Organization's main fax number | 44(0)118 9740126
    Product ID | Products | Product unique ID number | 003456543-9
  • Appendix C—Example of Business Object Tagging
  • In this example, system 20 is used by a hypothetical company (“First Sample Corporation”) selling to customers in various countries. The company documents may accordingly be in different languages (English, Italian, Dutch, etc.). In addition, even within English documents, the customers' names may be in different languages, as listed below in Table I, which is an extract from the “Customers” table in the company's CRM system. The customers may be referred to in company documents by their full names, partial names or synonyms. Other properties may be used as customer references as well, such as an e-mail address, Web domain, customer ID number, etc.
  • TABLE I
    CUSTOMER TABLE
    Customer ID | Customer Main Name | Customer Additional Name | Customer Internet Domain
    1173 | Stichting Pandora | | stichtingpandora.nl
    1174 | Istituto Nazionale per la Fisica della Materia | INFM | infm.it
    1175 | Avnet Components Israel Ltd. | | avnet-israel.com
    1176 | Israel Corporation | | israelcorp.com
    1177 | Cruz Roja Chilena | | cruzroja.cl
    . . .
  • Similarly, First Sample Corporation may have hundreds of employees from various cultural backgrounds. Each employee may have a full name (consisting of a first name, a middle name and a last name) and possibly also a nickname. Reference to an employee may be by his or her full name, partial name (first name only, last name only) or his/her nickname. Common first names such as David are probably shared by several employees. Table II is an extract from the company's HR module, listing the company's employees:
  • TABLE II
    EMPLOYEE TABLE
    Employee ID | First Name | Middle Name | Last Name | E-mail | Direct Phone
    17 | David | Carlisle | Fisher | david.fisher@samplecorp.com | 6394444
    18 | Robert | | Jones | bob2.jones@samplecorp.com | 6394445
    19 | David | Jefferson | Jones | dave.jones@samplecorp.com | 6394443
    20 | Francisco | | Gomez | Paco@samplecorp.com | 00-1-212-6543222
    . . .
  • Classifier 34 tags the company's documents, including the following e-mail sent by David Carlisle Fisher, the company's CFO, to his assistant, Robert Jones:
  • Hi Bob,
  • Please check what's going on with the Pandora and INFM orders. Please also write Jose Rodriguez (his mail is finanzas@cruzroja.cl) and remind him we're still awaiting payment. If this does not help, I'll ask Paco to talk with him when he gets back from Israel—you know that his English is not the best. We'll talk tomorrow about Avnet and about this new customer (Grupo Anaya S.A.).
  • Thanks,
  • Dave
  • David C. Fisher
  • Chief Financial Officer
  • First Sample Corporation
  • david.fisher@samplecorp.com
  • The boldface terms in the letter above represent entities (customers and employees), which are tagged using the business object information in repository 36.
  • Customer Names:
  • The e-mail includes references to five customers, four of which are listed in the above customer table: Stichting Pandora, Istituto Nazionale per la Fisica della Materia, Cruz Roja Chilena and Avnet Components Israel Ltd. One customer is not listed in the table: Grupo Anaya S.A.
  • For the four customers that are listed, the references used in the e-mail are different from the full customer names in the CRM table. Classifier 34 nevertheless is able to identify all four objects, since it has automatically created the valid possible variants based on the stored properties of the objects. The classifier parses each name using linguistic and semantic heuristics in order to find the valid variants, and then searches the document for valid matches of these variants. In the present example, the variants are identified as follows:
  • Stichting Pandora—“stichting” means “foundation” in Dutch. The classifier thus identifies this name as a Dutch name. In Dutch, German and other languages, such a head word is usually the first word in the name, and the classifier therefore identifies “Pandora” automatically as a valid variant or synonym. (Such a rule would not generally be correct in English). In addition, the classifier distinguishes between valid contextual instances of this variant (“Pandora” as above) and invalid instances (“It is a Pandora's box” or “Pandora Smith”).
  • Istituto Nazionale per la Fisica della Materia: The classifier identifies the reference “INFM” although it is not the full name, but rather an acronym. In this case the acronym is available within the customer listings. Even had it not been listed, however, the classifier would have created it automatically by eliminating stop-words (prepositions and articles) within the Italian name (“per”, “la”, “della”), and taking the first letters of the rest of the words. The fact that the acronym is listed in Table I, however, will raise the score (confidence level) that the classifier attaches to this reference.
  • Cruz Roja Chilena (“Chilean Red Cross” in Spanish)—This reference is identified based not on the name, but on another property: the related e-mail or internet domain, cruzroja.cl.
  • Avnet Components Israel Ltd.—The classifier identifies the first token of the name, “Avnet”, as a valid reference, since the name is identified as a standard English name, and the token is not identified as ambiguous (i.e., a word ordinarily used in another context). By contrast, in names such as Stichting Pandora or Istituto Nazionale per la Fisica della Materia, the classifier identifies the first word (stichting or istituto, meaning “institute” in Italian) as a meaningful keyword, which is less likely to be used alone as a reference to the business object. Similarly, for the name Israel Corporation, the token Israel alone is most likely to refer to the country and not to the company known as Israel Corporation.
  • Grupo Anaya S.A.—This customer is not (yet) listed but its name is extracted by entity extraction module 58.
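Two of the variant derivations described above, the acronym for Istituto Nazionale per la Fisica della Materia and the domain-based match for Cruz Roja Chilena, are simple enough to sketch. The stop-word list and function names below are hypothetical, and the domain table is a partial copy of Table I; the classifier's actual linguistic heuristics are not disclosed at this level of detail.

```python
# Hypothetical stop-word list covering the example; not a complete lexicon.
STOP_WORDS = {"per", "la", "della", "de", "of", "the", "for"}

# Partial copy of the Customer Internet Domain column of Table I.
CUSTOMER_DOMAINS = {
    "stichtingpandora.nl": 1173,
    "infm.it": 1174,
    "cruzroja.cl": 1177,
}


def acronym(name):
    """First letters of the words that remain after dropping stop-words."""
    return "".join(
        word[0].upper() for word in name.split() if word.lower() not in STOP_WORDS
    )


def customer_for_address(address, domains=CUSTOMER_DOMAINS):
    """Map an e-mail address to a customer ID through its domain, if listed."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domains.get(domain)
```

With these definitions, `acronym("Istituto Nazionale per la Fisica della Materia")` yields `"INFM"`, and `customer_for_address("finanzas@cruzroja.cl")` yields customer 1177.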
  • Employee Names and Other Person Names:
  • The referenced employees in the letter above are David Carlisle Fisher, Robert Jones and Francisco Gomez. In addition the letter contains a reference to another person, Jose Rodriguez, who is an employee of a customer (Cruz Roja Chilena).
  • All employees are referred to in the letter by their nicknames (Dave=>David, Bob=>Robert, Paco=>Francisco). The classifier automatically assigns for each person's name the appropriate list of possible nicknames, based on that person's full name and the corresponding nicknames that are common in various cultures.
  • The name “Dave” may theoretically refer to two employees: David Carlisle Fisher and David Jefferson Jones. Since the e-mail includes a safer (more complete) version of David Fisher's name (David C. Fisher), however, and his e-mail address (david.fisher@samplecorp.com), the classifier concludes that the nickname Dave should be resolved solely as an instance of David Carlisle Fisher.
  • The name Jose Rodriguez is extracted by entity extraction module 58. Based on the context within the mail: “(his mail is finanzas@cruzroja.cl),” the classifier identifies him as a contact person or employee of the customer Cruz Roja Chilena. Once this link is extracted from the mail, it can be stored within the CRM database for future use.
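The nickname disambiguation described above can be sketched as follows. The employee rows come from Table II; the nickname lexicon, the choice of “stronger” forms (e-mail address, first and last name, first name with middle initial) and the function names are assumptions made for illustration only.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    emp_id: int
    first: str
    middle: str
    last: str
    email: str


# Rows taken from Table II.
EMPLOYEES = [
    Employee(17, "David", "Carlisle", "Fisher", "david.fisher@samplecorp.com"),
    Employee(18, "Robert", "", "Jones", "bob2.jones@samplecorp.com"),
    Employee(19, "David", "Jefferson", "Jones", "dave.jones@samplecorp.com"),
    Employee(20, "Francisco", "", "Gomez", "Paco@samplecorp.com"),
]

# Hypothetical excerpt of a nickname lexicon.
NICKNAMES = {"dave": "david", "bob": "robert", "paco": "francisco"}


def stronger_forms(employee):
    """More complete references that can confirm an ambiguous nickname."""
    forms = [employee.email.lower(), f"{employee.first} {employee.last}".lower()]
    if employee.middle:
        initial = employee.middle[0]
        forms.append(f"{employee.first} {initial}. {employee.last}".lower())
    return forms


def resolve_nickname(nickname, text):
    """Return the employees a nickname may refer to in the given text."""
    first = NICKNAMES.get(nickname.lower())
    candidates = [e for e in EMPLOYEES if e.first.lower() == first]
    if len(candidates) > 1:
        lowered = text.lower()
        supported = [
            e for e in candidates if any(f in lowered for f in stronger_forms(e))
        ]
        if supported:
            return supported
    return candidates
```

Applied to the signature block of the sample e-mail, “Dave” resolves to employee 17 alone; without any supporting reference in the text, both employee 17 and employee 19 remain candidates.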
  • Appendix D—Context-Based Natural Language Constraints
  • In addition to matching a valid business object variant string, a business object reference may also be required to obey certain constraints regarding the context within the document in which the string occurs. Examples of such context-based constraints include:
  • For Employee, Organization and Location (Sub-Types) Business Objects:
  • For name-based variations:
  • 1. The morphological (and hence, syntactic and semantic) role of the string within the document must be an Entity/Object name (proper noun in linguistic terms). In English such strings are usually, but not always, noted by the use of capitalization. Other languages may have different rules.
  • Examples:
  • (i) Business Object: Analog Devices, Inc.
  • Valid Reference: “Other critical components offered by Analog Devices are . . . ”
  • Invalid reference: “Analog devices are used to measure electrical quantities.”
  • (ii) Business Object: William Jackson Smith
  • Valid Reference: “Please call Bill.”
  • Invalid Reference: “This bill is very problematic.”
  • 2. The variant string must not be a part of another name or entity.
  • Examples:
  • (i) Business Object: London (capital of U.K.)
  • Valid Reference: “I'm flying to London next week.”
  • Invalid reference: “Yaron London will participate in this show.”
  • (ii) Business Object: Williams (a major oil company)
  • Valid Reference: “In 1966, Williams paid $287 million for the country's largest petroleum products pipeline.”
  • Invalid reference: “He has a B.A. degree from Williams College.”
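The two name-based constraints can be approximated with a crude, English-only heuristic. The sketch below is an assumption-laden stand-in for the natural language analysis the text refers to: it uses case-sensitive matching in place of a proper-noun test, and it treats a capitalized neighbouring word (with no punctuation in between) as a sign that the match is part of a longer name. The function name is hypothetical.

```python
import re


def valid_name_mentions(variant, text):
    """Return start offsets of mentions that pass two crude context checks."""
    hits = []
    # Case-sensitive match: stands in for the proper-noun test (constraint 1).
    for m in re.finditer(r"\b" + re.escape(variant) + r"\b", text):
        before = re.search(r"(\w+)\s+$", text[:m.start()])
        after = re.match(r"\s+(\w+)", text[m.end():])
        # A capitalized neighbour with no punctuation in between suggests that
        # the match is part of a longer name (constraint 2).
        if before and before.group(1)[0].isupper():
            continue
        if after and after.group(1)[0].isupper():
            continue
        hits.append(m.start())
    return hits
```

The heuristic reproduces the examples above but is easy to fool: a capitalized sentence-initial word directly before a valid reference would wrongly reject it, which is one reason a real system would rely on part-of-speech and entity analysis instead.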
  • For Other (Not Name) Properties (ID Number, Phone Number Etc.)
  • Constraint 1 above is not required, because the morphological role of such a property is different (only numerical). Other constraints, however, are usually required for numerical (or ID) properties, such as contextual cue-based constraints: Typically, the reference business object candidate must be prefixed by a (lexicon-based) term indicating that this string indeed refers to the business object property.
  • Example:
  • For the property phone number, the string (reference candidate) must be prefixed by one of the strings: “phone:”, “tel.”, “call us at”, “1-800-”, “1-808-”, etc.
  • The goal of this constraint is to distinguish between:
  • “Just call us at 1-808-624-8222 and mention that you want our internet airport special of $30,”
  • and
  • “U.S. Pat. No. 6,248,222.”
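The cue-based constraint for phone numbers can be sketched as a digit-level match followed by a check for a cue string near the candidate. The cue list is the one quoted above (the original list is open-ended); the digit normalization, the 20-character look-back window and the function names are assumptions.

```python
import re

# Cue strings quoted in the text; the original list is open-ended ("etc.").
CUES = ("phone:", "tel.", "call us at", "1-800-", "1-808-")
NUMBER_SPAN = re.compile(r"\d[\d\-,.() ]*\d")


def digits(value):
    """Strip everything except digits, so that formatting differences vanish."""
    return re.sub(r"\D", "", value)


def cued_phone_mentions(phone, text, window=20):
    """Return number-like spans that end with the phone's digits and have a cue."""
    target = digits(phone)
    hits = []
    for m in NUMBER_SPAN.finditer(text):
        if not digits(m.group()).endswith(target):
            continue
        # Look for a cue shortly before the span or at the start of the span itself.
        context = text[max(0, m.start() - window):m.end()].lower()
        if any(cue in context for cue in CUES):
            hits.append(m.group())
    return hits
```

For a stored number such as 624-8222, the call-centre sentence produces a hit, while the patent number 6,248,222 has the same digits but no cue and is rejected. Plain substring matching of cues is deliberately simplistic; for example, “tel.” would also match the end of “hotel.”.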
  • Appendix E—Business Object Validation Rules
  • Business object analyzer function 52 validates each new business object before inserting it into repository 36 in order to avoid incorrect entries that might harm the classification and tagging process (including both naive errors/junk and malicious content). For example, an entry such as “Build Master” found within the organization Active Directory cannot be a valid employee name. If the language of the business object name is specified or can be identified automatically, the analyzer uses this language in validating the business objects. Otherwise, English is used as the default language.
  • Based on analyzing its properties, a new business object may be classified as either:
    • Valid—May be used for tagging and classification without any further required action.
    • Invalid—Cannot be inserted or used with its current properties. At least one property must be changed or added in order to make it valid.
    • Warning—A possible problem with a property was found, or an important property is missing. If the user wishes, however, the business object may be still inserted. There are two warning types:
      • Major Warning—The business object will not be inserted (as with invalid business objects), and a suitable message will be displayed. If authorized by the user, the business object will be inserted.
      • Minor Warning—The business object will be inserted (despite the warning), and a warning message will be displayed.
    Validation Constraints—All Business Object Types:
  • A business object that does not meet the following constraints is considered invalid:
    • The business object name must not be empty. (If the business object type is Employee, the Last Name must not be empty.)
    • The business object name must be longer than one character. (For Employee business objects: the concatenation of all three name parts: first name, middle name and last name, must be longer than one character, and the last name must be longer than one character).
    • The business object name must be no longer than 120 characters.
    Validation Constraints—Employee Business Object Type:
  • Required constraints (a business object not meeting these constraints is considered invalid):
    • The Last Name property must not be empty.
    • The First Name, Middle Name and Last Name properties may include only letters or the following characters: -, ′,: ‘, ’, ’or ‘.
    • All of the above properties may be no longer than 30 characters.
  • Optional constraints (a business object not meeting these constraints will trigger a warning):
    • Employee First Name is missing—minor warning.
    • Employee Mail(Email) is missing—minor warning.
    • Employee name seems to be a general term and not a person name (all the components of the name, including first, middle and last names, are dictionary words or organization suffixes or indicators, and the first name is not a lexicon-based first name): major warning.
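The employee validation rules can be sketched as a function that returns the most severe outcome found. This is a simplified reading: the lexicons are tiny hypothetical samples, and the set of permitted non-letter characters is assumed to be the hyphen and the apostrophe.

```python
import re

# Hypothetical mini-lexicons for illustration only.
DICTIONARY_WORDS = {"build", "master", "test", "admin"}
KNOWN_FIRST_NAMES = {"david", "robert", "francisco"}
# Assumption: letters, optionally joined by a hyphen or an apostrophe.
NAME_PART = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def validate_employee(first, middle, last, email):
    """Return 'invalid', 'major warning', 'minor warning' or 'valid'."""
    parts = [p for p in (first, middle, last) if p]
    # Required constraints: any failure makes the object invalid.
    if len(last) < 2 or len(" ".join(parts)) > 120:
        return "invalid"
    if any(len(p) > 30 or not NAME_PART.fullmatch(p) for p in parts):
        return "invalid"
    # Optional constraints: a major warning outranks a minor one.
    general_term = (
        all(p.lower() in DICTIONARY_WORDS for p in parts)
        and first.lower() not in KNOWN_FIRST_NAMES
    )
    if general_term:
        return "major warning"
    if not first or not email:
        return "minor warning"
    return "valid"
```

With these sample lexicons, the “Build Master” entry mentioned earlier triggers a major warning, while an employee record with only a last name and an e-mail address triggers a minor one.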
    Validation Constraints—Organization Business Object Type:
  • Required constraints (a business object not meeting these constraints is considered invalid):
    • The organization name cannot be a general/frequent term alone: “Systems”, “Suppliers”, etc.
  • Optional constraints (a business object not meeting these constraints will trigger a warning):
    • Organization mail/internet domain is missing—minor warning.
    • Words in the organization name indicate that it has probably been marked as inactive/irrelevant by the system user: “cancel”, “inactive”, etc.—minor warning.
    • The organization name is identical to a country name (“Israel”, “France”)—major warning.
    • The organization name appears to be a person's name (starting with a lexicon-based known first name)—major warning.
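The organization rules follow the same pattern and can be sketched in the same way. The word lists are hypothetical samples built from the examples quoted above, and the function name is an assumption.

```python
# Hypothetical mini-lexicons built from the examples in the text.
GENERAL_TERMS = {"systems", "suppliers"}
INACTIVE_MARKERS = {"cancel", "inactive"}
COUNTRY_NAMES = {"israel", "france"}
KNOWN_FIRST_NAMES = {"david", "robert", "francisco"}


def validate_organization(name, domain):
    """Return 'invalid', 'major warning', 'minor warning' or 'valid'."""
    lowered = name.lower().strip()
    # Universal length constraints and the required organization constraint.
    if not 1 < len(lowered) <= 120 or lowered in GENERAL_TERMS:
        return "invalid"
    tokens = lowered.split()
    # Major warnings outrank minor ones.
    if lowered in COUNTRY_NAMES or tokens[0] in KNOWN_FIRST_NAMES:
        return "major warning"
    if not domain or any(marker in lowered for marker in INACTIVE_MARKERS):
        return "minor warning"
    return "valid"
```

Note that “Israel Corporation” passes validation as an organization name even though its shorter variant “Israel” is excluded when variants are generated; validation and variant generation apply separate rules.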
    Validations Constraints—Location Business Object Type:
  • The Location Name (in any language) may include only letters or the following characters: -, ., (,).′,: ‘, ’, ’or ‘. A business object whose name (in any language) includes any other character will be considered invalid.
  • Appendix F—Score Derivation
  • FIG. 5 is a flow chart that schematically illustrates a method for deriving scores of business objects in a document, in accordance with an embodiment of the present invention. The method uses tagging results 100 that were previously assembled by tagging business object names (including variants) in the document. Server 22 gets all tagged business objects from the document, at an object fetching step 102, and adds each business object to the appropriate listing in repository 36, at an object recording step 104. The server checks whether there are any more objects to process, at an object checking step 106. When all the objects have been completed, the classification results are saved, at a result storage step 108, and the method terminates.
  • As long as there are still objects to process at step 106, the server looks up all relations for the current business object, at a relation fetching step 110. The server checks whether the current business object has further relations to handle, at a relation checking step 112. If there are no further relations, the process returns to step 106 to take the next business object or terminate.
  • As long as there are relations to process at step 112, the server checks whether the current relation is of a type that can cause the document to be assigned a score with respect to the related business object, at a relation checking step 114. If not, the method returns to step 112. If there is a score to be assigned, the server loads the relation scoring routine, at a scorer loading step 116. This routine assigns a score for the related business object, at a score assignment step 118. This score is typically the maximum between any existing score that has already been assigned to the document for the related business object and a new score that may be derived from the current business object. The related business object is inserted or updated in repository 36, at an object updating step 120, and the new score for this business object is added to the repository, as well, at a score recording step 122.
  • This process continues until all of the relevant business objects and their relations have been handled.
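The loop of FIG. 5 can be summarized in a few lines. The sketch below is a simplified reading under stated assumptions: the in-memory dictionaries stand in for repository 36, and the relation types and scoring functions passed in are hypothetical, since the application does not specify how a relation scoring routine derives its score.

```python
def derive_document_scores(tagged_scores, relations, scorers):
    """Propagate scores from tagged objects to related objects.

    tagged_scores: {object_id: score} for objects tagged in the document.
    relations:     {object_id: [(relation_type, related_object_id), ...]}.
    scorers:       {relation_type: function(score) -> derived score}.
    """
    results = dict(tagged_scores)                                # steps 102-104
    for obj_id, score in tagged_scores.items():                  # step 106
        for rel_type, related_id in relations.get(obj_id, []):   # steps 110-112
            scorer = scorers.get(rel_type)                       # steps 114-116
            if scorer is None:
                continue  # this relation type does not yield a document score
            derived = scorer(score)                              # step 118
            # Keep the maximum of the existing and the newly derived score,
            # then record it for the related object (steps 120-122).
            results[related_id] = max(results.get(related_id, 0), derived)
    return results                                               # step 108
```

Taking the maximum means that a related object that was itself tagged directly in the document never has its score lowered by a weaker, relation-derived score.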

Claims (24)

1. A computer-implemented method for processing information, the method comprising:
collecting data objects from one or more data repositories, the data objects having respective properties, which identify the data objects;
analyzing the properties of the collected data objects in order to derive respective identifiers corresponding to the data objects;
identifying a text string that matches one of the identifiers of a data object within a context in a document;
generating, responsively to the context, an indication that the identified text string is a valid instance of the data object; and
processing the document responsively to the indication.
2. The method according to claim 1, wherein the properties comprise names of the data objects, and wherein analyzing the properties comprises identifying variants that are different from the names.
3. The method according to claim 2, wherein the variants are selected from a group of variant types consisting of a part of a name, an abbreviation of the name, and a nickname.
4. The method according to claim 1, wherein analyzing the properties comprises applying natural language analysis to the properties in order to find the identifiers, and wherein generating the indication comprises recognizing the valid instance within the context in the document by applying natural language analysis to the document.
5. The method according to claim 1, wherein the data repository contains information in multiple different languages, and wherein analyzing the properties comprises deriving the identifiers that are respectively applicable in each of two or more of the languages, and wherein identifying the text string comprises choosing the identifiers responsively to a language of the document.
6. The method according to claim 1, wherein generating the indication comprises computing a score indicative of a level of confidence that the identified text string validly represents the data object.
7. The method according to claim 6, wherein each of the identifiers is derived from the properties of the data object using a respective rule, and wherein computing the score comprises assigning the level of confidence to each of the identifiers responsively to the respective rule.
8. The method according to claim 6, wherein processing the document comprises assigning a respective document score to the document, indicative of a relevance of the document to the data object, responsively to the score.
9. The method according to claim 1, wherein analyzing the properties comprises identifying a relation between at least first and second data objects, and wherein processing the document comprises assigning a document score to the document indicating that the document is relevant to the first data object responsively to a match between the text string and one of the identifiers of the second data object and to the relation.
10. The method according to claim 9, wherein the relation is selected from a group of relations consisting of container relations, similarity relations, hierarchical relations and affinity relations.
11. The method according to claim 1, wherein analyzing the set of data objects comprises identifying respective records in the data repository corresponding to the data objects, and wherein processing the document comprises generating a listing of occurrences of the data objects in a corpus of documents, detecting a change in one of the respective records corresponding to one of the data objects, and responsively to analyzing the change, automatically updating the listing with respect to the one of the data objects.
12. The method according to claim 1, wherein processing the document comprises generating a response to a search query based on valid instances of the data objects that occur in the document.
13. The method according to claim 1, wherein collecting the set of the data objects comprises extracting a first set of the data objects from a repository of structured data, and extracting one or more second data objects, not in the initial set, from the document, and adding the second data objects to the first set.
14. The method according to claim 13, wherein adding the second data objects comprises comparing the second data objects to the data objects in the first set, and adding the second data objects upon determining that the second data objects do not match any of the data objects in the first set.
15. The method according to claim 1, wherein analyzing the properties comprises applying predetermined rules in order to validate the collected data objects before using the data objects in processing the document.
16. The method according to claim 15, wherein applying the predetermined rules comprises making a determination selected from a group of determinations consisting of determining whether the data object should be used in processing the document, whether a property of the data object should be used in processing the document, and whether a property of the data object is missing, and wherein the determination is based on at least one of comparing the property with a lexicon, comparing the property with a vocabulary, and matching the property with one or more regular expressions.
17. The method according to claim 1, wherein collecting the data objects comprises retrieving and analyzing access control information with respect to each of at least some of the data objects and the properties of the data objects, and wherein processing the documents comprises providing an output to a user while filtering at least a portion of the output using the access control information.
18. A computer-implemented method for processing information, comprising:
collecting data objects from one or more data repositories and identifying a respective record in the repositories corresponding to each of the data objects;
processing one or more documents so as to generate a listing of occurrences of the data objects in the documents;
detecting a change in the respective record corresponding to one of the data objects;
responsively to the change in the respective record, automatically updating the listing with respect to the one of the data objects; and
processing the documents responsively to the listing.
19. The method according to claim 18, wherein detecting the change comprises polling records in at least one of the data repositories that correspond to the set of data objects.
20. The method according to claim 18, wherein detecting the change comprises receiving an event message that is indicative of the change from at least one of the data repositories.
21. Apparatus for processing information, comprising:
an interface, which is coupled to communicate with one or more data repositories; and
a processor, which is configured to collect data objects from the one or more data repositories, the data objects having respective properties, which identify the data objects, to analyze the properties of the collected data objects in order to derive respective identifiers corresponding to the data objects, to identify a text string that matches one of the identifiers of a data object within a context in a document, to generate, responsively to the context, an indication that the identified text string is a valid instance of the data object, and to process the document responsively to the indication.
22. Apparatus for processing information, comprising:
an interface, which is coupled to communicate with one or more data repositories; and
a processor, which is configured to collect data objects from the one or more data repositories while identifying a respective record in the repositories corresponding to each of the data objects, to process one or more documents so as to generate a listing of occurrences of the data objects in the documents, to detect a change in the respective record corresponding to one of the data objects, to automatically update the listing with respect to the one of the data objects responsively to the change in the respective record, and to process the documents responsively to the listing.
23. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to collect data objects from one or more data repositories, the data objects having respective properties, which identify the data objects, to analyze the properties of the collected data objects in order to derive respective identifiers corresponding to the data objects, to identify a text string that matches one of the identifiers of a data object within a context in a document, to generate, responsively to the context, an indication that the identified text string is a valid instance of the data object, and to process the document responsively to the indication.
24. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to collect data objects from one or more data repositories while identifying a respective record in the repositories corresponding to each of the data objects, to process one or more documents so as to generate a listing of occurrences of the data objects in the documents, to detect a change in the respective record corresponding to one of the data objects, to automatically update the listing with respect to the one of the data objects responsively to the change in the respective record, and to process the documents responsively to the listing.
US12/199,043 2007-08-28 2008-08-27 Document management using business objects Abandoned US20090063470A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/199,043 US20090063470A1 (en) 2007-08-28 2008-08-27 Document management using business objects

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US96832907P 2007-08-28 2007-08-28
US12/199,043 US20090063470A1 (en) 2007-08-28 2008-08-27 Document management using business objects

Publications (1)

Publication Number Publication Date
US20090063470A1 true US20090063470A1 (en) 2009-03-05

Family

ID=40409076

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/199,043 Abandoned US20090063470A1 (en) 2007-08-28 2008-08-27 Document management using business objects
US12/200,089 Expired - Fee Related US8315997B1 (en) 2007-08-28 2008-08-28 Automatic identification of document versions

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/200,089 Expired - Fee Related US8315997B1 (en) 2007-08-28 2008-08-28 Automatic identification of document versions

Country Status (1)

Country Link
US (2) US20090063470A1 (en)

US20210256071A1 (en) * 2020-02-14 2021-08-19 Coupang Corp. Systems and methods for receiving and propagating efficient search updates in real time
US11106815B2 (en) * 2012-07-24 2021-08-31 ID Insight System, method and computer product for fast and secure data searching
US11138372B2 (en) 2015-11-29 2021-10-05 Vatbox, Ltd. System and method for reporting based on electronic documents
US11188720B2 (en) * 2019-07-18 2021-11-30 International Business Machines Corporation Computing system including virtual agent bot providing semantic topic model-based response
US20230017384A1 (en) * 2021-07-15 2023-01-19 DryvIQ, Inc. Systems and methods for machine learning classification-based automated remediations and handling of data items

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
US10685177B2 (en) * 2009-01-07 2020-06-16 Litera Corporation System and method for comparing digital data in spreadsheets or database tables
US9037597B2 (en) * 2011-01-10 2015-05-19 International Business Machines Corporation Verifying file versions in a networked computing environment
US11080462B2 (en) 2017-11-13 2021-08-03 Workshare Ltd. Method of comparing two data tables and displaying the results without source formatting
US11561947B2 (en) * 2020-09-17 2023-01-24 EMC IP Holding Company LLC File lifetime tracking for cloud-based object stores

Citations (64)

Publication number Priority date Publication date Assignee Title
US5331554A (en) * 1992-12-10 1994-07-19 Ricoh Corporation Method and apparatus for semantic pattern matching for text retrieval
US5423032A (en) * 1991-10-31 1995-06-06 International Business Machines Corporation Method for extracting multi-word technical terms from text
US5463773A (en) * 1992-05-25 1995-10-31 Fujitsu Limited Building of a document classification tree by recursive optimization of keyword selection function
US5627979A (en) * 1994-07-18 1997-05-06 International Business Machines Corporation System and method for providing a graphical user interface for mapping and accessing objects in data stores
US5819261A (en) * 1995-03-28 1998-10-06 Canon Kabushiki Kaisha Method and apparatus for extracting a keyword from scheduling data using the keyword for searching the schedule data file
US5819258A (en) * 1997-03-07 1998-10-06 Digital Equipment Corporation Method and apparatus for automatically generating hierarchical categories from large document collections
US5857179A (en) * 1996-09-09 1999-01-05 Digital Equipment Corporation Computer method and apparatus for clustering documents and automatic generation of cluster keywords
US5893908A (en) * 1996-11-21 1999-04-13 Ricoh Company Limited Document management system
US5920854A (en) * 1996-08-14 1999-07-06 Infoseek Corporation Real-time document collection search engine with phrase indexing
US5983216A (en) * 1997-09-12 1999-11-09 Infoseek Corporation Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections
US6182091B1 (en) * 1998-03-18 2001-01-30 Xerox Corporation Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis
US6205456B1 (en) * 1997-01-17 2001-03-20 Fujitsu Limited Summarization apparatus and method
US6240409B1 (en) * 1998-07-31 2001-05-29 The Regents Of The University Of California Method and apparatus for detecting and summarizing document similarity within large document sets
US6295529B1 (en) * 1998-12-24 2001-09-25 Microsoft Corporation Method and apparatus for indentifying clauses having predetermined characteristics indicative of usefulness in determining relationships between different texts
US20010042085A1 (en) * 1998-09-30 2001-11-15 Mark Peairs Automatic document classification using text and images
US20020016787A1 (en) * 2000-06-28 2002-02-07 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
US6349296B1 (en) * 1998-03-26 2002-02-19 Altavista Company Method for clustering closely resembling data objects
US20020169872A1 (en) * 2001-05-14 2002-11-14 Hiroshi Nomiyama Method for arranging information, information processing apparatus, storage media and program tranmission apparatus
US20030033333A1 (en) * 2001-05-11 2003-02-13 Fujitsu Limited Hot topic extraction apparatus and method, storage medium therefor
US6523025B1 (en) * 1998-03-10 2003-02-18 Fujitsu Limited Document processing system and recording medium
US6556987B1 (en) * 2000-05-12 2003-04-29 Applied Psychology Research, Ltd. Automatic text classification system
US6571240B1 (en) * 2000-02-02 2003-05-27 Chi Fai Ho Information processing for searching categorizing information in a document based on a categorization hierarchy and extracted phrases
US20030101164A1 (en) * 2001-10-12 2003-05-29 Marc Pic Method of indexing and comparing multimedia documents
US20030140311A1 (en) * 2002-01-18 2003-07-24 Lemon Michael J. Method for content mining of semi-structured documents
US20030172058A1 (en) * 2002-03-07 2003-09-11 Fujitsu Limited Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus
US20030187834A1 (en) * 2002-03-29 2003-10-02 Fujitsu Limited Document search method
US6654744B2 (en) * 2000-04-17 2003-11-25 Fujitsu Limited Method and apparatus for categorizing information, and a computer product
US6658423B1 (en) * 2001-01-24 2003-12-02 Google, Inc. Detecting duplicate and near-duplicate files
US20040006736A1 (en) * 2002-07-04 2004-01-08 Takahiko Kawatani Evaluating distinctiveness of document
US6687689B1 (en) * 2000-06-16 2004-02-03 Nusuara Technologies Sdn. Bhd. System and methods for document retrieval using natural language-based queries
US6718333B1 (en) * 1998-07-15 2004-04-06 Nec Corporation Structured document classification device, structured document search system, and computer-readable memory causing a computer to function as the same
US20040122826A1 (en) * 2002-12-20 2004-06-24 Kevin Mackie Data model and applications
US20040181755A1 (en) * 2003-03-12 2004-09-16 Communications Research Laboratory, Independent Administrative Institution Apparatus, method and computer program for keyword highlighting, and computer-readable medium storing the program thereof
US20040230577A1 (en) * 2003-03-05 2004-11-18 Takahiko Kawatani Document and pattern clustering method and apparatus
US6850937B1 (en) * 1999-08-25 2005-02-01 Hitachi, Ltd. Word importance calculation method, document retrieving interface, word dictionary making method
US20050044426A1 (en) * 2003-08-18 2005-02-24 Matthias Vogel Data structure for access control
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
US6895552B1 (en) * 2000-05-31 2005-05-17 Ricoh Co., Ltd. Method and an apparatus for visual summarization of documents
US20050131932A1 (en) * 2003-12-15 2005-06-16 Microsoft Corporation Dynamic content clustering
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US20050234975A1 (en) * 2004-04-16 2005-10-20 Via Technologies, Inc. Related content linking managing system, method and recording medium
US6978419B1 (en) * 2000-11-15 2005-12-20 Justsystem Corporation Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments
US6990628B1 (en) * 1999-06-14 2006-01-24 Yahoo! Inc. Method and apparatus for measuring similarity among electronic documents
US20060016114A1 (en) * 2004-07-26 2006-01-26 Mark Vanderberg Photo mat with alignment grid and method of using the same
US20060026114A1 (en) * 2004-07-28 2006-02-02 Ken Gregoire Data gathering and distribution system
US20060089924A1 (en) * 2000-09-25 2006-04-27 Bhavani Raskutti Document categorisation system
US7062487B1 (en) * 1999-06-04 2006-06-13 Seiko Epson Corporation Information categorizing method and apparatus, and a program for implementing the method
US20060136411A1 (en) * 2004-12-21 2006-06-22 Microsoft Corporation Ranking search results using feature extraction
US20060212441A1 (en) * 2004-10-25 2006-09-21 Yuanhua Tang Full text query and search systems and methods of use
US20060248053A1 (en) * 2005-04-29 2006-11-02 Antonio Sanfilippo Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture
US20060288015A1 (en) * 2005-06-15 2006-12-21 Schirripa Steven R Electronic content classification
US7158961B1 (en) * 2001-12-31 2007-01-02 Google, Inc. Methods and apparatus for estimating similarity
US20070061319A1 (en) * 2005-09-09 2007-03-15 Xerox Corporation Method for document clustering based on page layout attributes
US20070067289A1 (en) * 2005-09-20 2007-03-22 International Business Machines Corporation Extracting informative phrases from unstructured text
US7213205B1 (en) * 1999-06-04 2007-05-01 Seiko Epson Corporation Document categorizing method, document categorizing apparatus, and storage medium on which a document categorization program is stored
US7260773B2 (en) * 2002-03-28 2007-08-21 Uri Zernik Device system and method for determining document similarities and differences
US20070244915A1 (en) * 2006-04-13 2007-10-18 Lg Electronics Inc. System and method for clustering documents
US20070250497A1 (en) * 2006-04-19 2007-10-25 Apple Computer Inc. Semantic reconstruction
US20070276847A1 (en) * 2005-05-26 2007-11-29 Mark Henry Butler Client and method for database
US20070294610A1 (en) * 2006-06-02 2007-12-20 Ching Phillip W System and method for identifying similar portions in documents
US20080104502A1 (en) * 2006-10-26 2008-05-01 Yahoo! Inc. System and method for providing a change profile of a web page
US20080162455A1 (en) * 2006-12-27 2008-07-03 Rakshit Daga Determination of document similarity
US7398201B2 (en) * 2001-08-14 2008-07-08 Evri Inc. Method and system for enhanced data searching
US20080250007A1 (en) * 2003-10-21 2008-10-09 Hiroaki Masuyama Document Characteristic Analysis Device for Document To Be Surveyed

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JP3598742B2 (en) 1996-11-25 2004-12-08 富士ゼロックス株式会社 Document search device and document search method
JP3726263B2 (en) 2002-03-01 2005-12-14 ヒューレット・パッカード・カンパニー Document classification method and apparatus
AU2003295358A1 (en) * 2002-10-31 2004-06-07 Arizan Corporation Methods and apparatus for summarizing document content for mobile communication devices
US7865354B2 (en) 2003-12-05 2011-01-04 International Business Machines Corporation Extracting and grouping opinions from text documents
US20080104016A1 (en) * 2006-10-30 2008-05-01 Susan Handayani Putri Atmaja Method and system for comparing data
US7702680B2 (en) * 2006-11-02 2010-04-20 Microsoft Corporation Document summarization by maximizing informative content words
US8943057B2 (en) 2007-12-12 2015-01-27 Oracle America, Inc. Method and system for distributed bulk matching and loading

Patent Citations (65)

Publication number Priority date Publication date Assignee Title
US5423032A (en) * 1991-10-31 1995-06-06 International Business Machines Corporation Method for extracting multi-word technical terms from text
US5463773A (en) * 1992-05-25 1995-10-31 Fujitsu Limited Building of a document classification tree by recursive optimization of keyword selection function
US5331554A (en) * 1992-12-10 1994-07-19 Ricoh Corporation Method and apparatus for semantic pattern matching for text retrieval
US5627979A (en) * 1994-07-18 1997-05-06 International Business Machines Corporation System and method for providing a graphical user interface for mapping and accessing objects in data stores
US5819261A (en) * 1995-03-28 1998-10-06 Canon Kabushiki Kaisha Method and apparatus for extracting a keyword from scheduling data using the keyword for searching the schedule data file
US5920854A (en) * 1996-08-14 1999-07-06 Infoseek Corporation Real-time document collection search engine with phrase indexing
US5857179A (en) * 1996-09-09 1999-01-05 Digital Equipment Corporation Computer method and apparatus for clustering documents and automatic generation of cluster keywords
US5893908A (en) * 1996-11-21 1999-04-13 Ricoh Company Limited Document management system
US6205456B1 (en) * 1997-01-17 2001-03-20 Fujitsu Limited Summarization apparatus and method
US5819258A (en) * 1997-03-07 1998-10-06 Digital Equipment Corporation Method and apparatus for automatically generating hierarchical categories from large document collections
US5983216A (en) * 1997-09-12 1999-11-09 Infoseek Corporation Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections
US6523025B1 (en) * 1998-03-10 2003-02-18 Fujitsu Limited Document processing system and recording medium
US6182091B1 (en) * 1998-03-18 2001-01-30 Xerox Corporation Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis
US6349296B1 (en) * 1998-03-26 2002-02-19 Altavista Company Method for clustering closely resembling data objects
US6718333B1 (en) * 1998-07-15 2004-04-06 Nec Corporation Structured document classification device, structured document search system, and computer-readable memory causing a computer to function as the same
US6240409B1 (en) * 1998-07-31 2001-05-29 The Regents Of The University Of California Method and apparatus for detecting and summarizing document similarity within large document sets
US20010042085A1 (en) * 1998-09-30 2001-11-15 Mark Peairs Automatic document classification using text and images
US6295529B1 (en) * 1998-12-24 2001-09-25 Microsoft Corporation Method and apparatus for indentifying clauses having predetermined characteristics indicative of usefulness in determining relationships between different texts
US7062487B1 (en) * 1999-06-04 2006-06-13 Seiko Epson Corporation Information categorizing method and apparatus, and a program for implementing the method
US7213205B1 (en) * 1999-06-04 2007-05-01 Seiko Epson Corporation Document categorizing method, document categorizing apparatus, and storage medium on which a document categorization program is stored
US6990628B1 (en) * 1999-06-14 2006-01-24 Yahoo! Inc. Method and apparatus for measuring similarity among electronic documents
US6850937B1 (en) * 1999-08-25 2005-02-01 Hitachi, Ltd. Word importance calculation method, document retrieving interface, word dictionary making method
US6571240B1 (en) * 2000-02-02 2003-05-27 Chi Fai Ho Information processing for searching categorizing information in a document based on a categorization hierarchy and extracted phrases
US6654744B2 (en) * 2000-04-17 2003-11-25 Fujitsu Limited Method and apparatus for categorizing information, and a computer product
US6556987B1 (en) * 2000-05-12 2003-04-29 Applied Psychology Research, Ltd. Automatic text classification system
US6895552B1 (en) * 2000-05-31 2005-05-17 Ricoh Co., Ltd. Method and an apparatus for visual summarization of documents
US6687689B1 (en) * 2000-06-16 2004-02-03 Nusuara Technologies Sdn. Bhd. System and methods for document retrieval using natural language-based queries
US20020016787A1 (en) * 2000-06-28 2002-02-07 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
US20060089924A1 (en) * 2000-09-25 2006-04-27 Bhavani Raskutti Document categorisation system
US6978419B1 (en) * 2000-11-15 2005-12-20 Justsystem Corporation Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments
US6658423B1 (en) * 2001-01-24 2003-12-02 Google, Inc. Detecting duplicate and near-duplicate files
US7366718B1 (en) * 2001-01-24 2008-04-29 Google, Inc. Detecting duplicate and near-duplicate files
US20030033333A1 (en) * 2001-05-11 2003-02-13 Fujitsu Limited Hot topic extraction apparatus and method, storage medium therefor
US20020169872A1 (en) * 2001-05-14 2002-11-14 Hiroshi Nomiyama Method for arranging information, information processing apparatus, storage media and program tranmission apparatus
US7398201B2 (en) * 2001-08-14 2008-07-08 Evri Inc. Method and system for enhanced data searching
US20030101164A1 (en) * 2001-10-12 2003-05-29 Marc Pic Method of indexing and comparing multimedia documents
US7158961B1 (en) * 2001-12-31 2007-01-02 Google, Inc. Methods and apparatus for estimating similarity
US20030140311A1 (en) * 2002-01-18 2003-07-24 Lemon Michael J. Method for content mining of semi-structured documents
US20030172058A1 (en) * 2002-03-07 2003-09-11 Fujitsu Limited Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus
US7260773B2 (en) * 2002-03-28 2007-08-21 Uri Zernik Device system and method for determining document similarities and differences
US20030187834A1 (en) * 2002-03-29 2003-10-02 Fujitsu Limited Document search method
US20040006736A1 (en) * 2002-07-04 2004-01-08 Takahiko Kawatani Evaluating distinctiveness of document
US20040122826A1 (en) * 2002-12-20 2004-06-24 Kevin Mackie Data model and applications
US20040230577A1 (en) * 2003-03-05 2004-11-18 Takahiko Kawatani Document and pattern clustering method and apparatus
US20040181755A1 (en) * 2003-03-12 2004-09-16 Communications Research Laboratory, Independent Administrative Institution Apparatus, method and computer program for keyword highlighting, and computer-readable medium storing the program thereof
US20050044426A1 (en) * 2003-08-18 2005-02-24 Matthias Vogel Data structure for access control
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
US20080250007A1 (en) * 2003-10-21 2008-10-09 Hiroaki Masuyama Document Characteristic Analysis Device for Document To Be Surveyed
US20050131932A1 (en) * 2003-12-15 2005-06-16 Microsoft Corporation Dynamic content clustering
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US20050234975A1 (en) * 2004-04-16 2005-10-20 Via Technologies, Inc. Related content linking managing system, method and recording medium
US20060016114A1 (en) * 2004-07-26 2006-01-26 Mark Vanderberg Photo mat with alignment grid and method of using the same
US20060026114A1 (en) * 2004-07-28 2006-02-02 Ken Gregoire Data gathering and distribution system
US20060212441A1 (en) * 2004-10-25 2006-09-21 Yuanhua Tang Full text query and search systems and methods of use
US20060136411A1 (en) * 2004-12-21 2006-06-22 Microsoft Corporation Ranking search results using feature extraction
US20060248053A1 (en) * 2005-04-29 2006-11-02 Antonio Sanfilippo Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture
US20070276847A1 (en) * 2005-05-26 2007-11-29 Mark Henry Butler Client and method for database
US20060288015A1 (en) * 2005-06-15 2006-12-21 Schirripa Steven R Electronic content classification
US20070061319A1 (en) * 2005-09-09 2007-03-15 Xerox Corporation Method for document clustering based on page layout attributes
US20070067289A1 (en) * 2005-09-20 2007-03-22 International Business Machines Corporation Extracting informative phrases from unstructured text
US20070244915A1 (en) * 2006-04-13 2007-10-18 Lg Electronics Inc. System and method for clustering documents
US20070250497A1 (en) * 2006-04-19 2007-10-25 Apple Computer Inc. Semantic reconstruction
US20070294610A1 (en) * 2006-06-02 2007-12-20 Ching Phillip W System and method for identifying similar portions in documents
US20080104502A1 (en) * 2006-10-26 2008-05-01 Yahoo! Inc. System and method for providing a change profile of a web page
US20080162455A1 (en) * 2006-12-27 2008-07-03 Rakshit Daga Determination of document similarity

Cited By (60)

Publication number Priority date Publication date Assignee Title
US8122340B2 (en) * 2008-09-29 2012-02-21 Tow Bruce System and method for management of common decentralized applications data and logic
US20100083085A1 (en) * 2008-09-29 2010-04-01 Tow Bruce System and method for management of common decentralized applications data and logic
US20100161621A1 (en) * 2008-12-19 2010-06-24 Johan Christiaan Peters Inferring rules to classify objects in a file management system
US8099419B2 (en) * 2008-12-19 2012-01-17 Sap Ag Inferring rules to classify objects in a file management system
US8972436B2 (en) * 2009-10-28 2015-03-03 Yahoo! Inc. Translation model and method for matching reviews to objects
US20110099192A1 (en) * 2009-10-28 2011-04-28 Yahoo! Inc. Translation Model and Method for Matching Reviews to Objects
US20110251837A1 (en) * 2010-04-07 2011-10-13 eBook Technologies, Inc. Electronic reference integration with an electronic reader
US20130086101A1 (en) * 2011-09-29 2013-04-04 Sap Ag Data Search Using Context Information
US9245006B2 (en) * 2011-09-29 2016-01-26 Sap Se Data search using context information
US20130145371A1 (en) * 2011-12-01 2013-06-06 Sap Ag Batch processing of business objects
US20160371375A1 (en) * 2011-12-12 2016-12-22 The Cleveland Clinic Foundation Storing structured and unstructured clinical information for information retrieval
US10445378B2 (en) * 2011-12-12 2019-10-15 The Cleveland Clinic Foundation Storing structured and unstructured clinical information for information retrieval
US11106815B2 (en) * 2012-07-24 2021-08-31 ID Insight System, method and computer product for fast and secure data searching
US20210350018A1 (en) * 2012-07-24 2021-11-11 ID Insight System, method and computer product for fast and secure data searching
US20140143734A1 (en) * 2012-11-21 2014-05-22 Florian Jann Business object explorer
US10489800B2 (en) 2013-03-12 2019-11-26 Groupon, Inc. Discovery of new business openings using web content analysis
US11756059B2 (en) 2013-03-12 2023-09-12 Groupon, Inc. Discovery of new business openings using web content analysis
US11244328B2 (en) 2013-03-12 2022-02-08 Groupon, Inc. Discovery of new business openings using web content analysis
US9773252B1 (en) * 2013-03-12 2017-09-26 Groupon, Inc. Discovery of new business openings using web content analysis
US11080765B2 (en) 2013-03-14 2021-08-03 Igor Gershteyn Method and system for data structure creation, organization and searching using basic atomic units of information
US9830620B2 (en) * 2013-03-14 2017-11-28 Igor Gershteyn Method and system for data structure creation, organization and searching using basic atomic units of information
US20150046477A1 (en) * 2013-03-14 2015-02-12 Igor Gershteyn Method and system for data structure creation, organization and searching using basic atomic units of information
US9600461B2 (en) * 2013-07-01 2017-03-21 International Business Machines Corporation Discovering relationships in tabular data
US9606978B2 (en) * 2013-07-01 2017-03-28 International Business Machines Corporation Discovering relationships in tabular data
US20150007010A1 (en) * 2013-07-01 2015-01-01 International Business Machines Corporation Discovering Relationships in Tabular Data
US20150007007A1 (en) * 2013-07-01 2015-01-01 International Business Machines Corporation Discovering relationships in tabular data
US10025791B2 (en) 2014-04-02 2018-07-17 International Business Machines Corporation Metadata-driven workflows and integration with genomic data processing systems and techniques
US9354922B2 (en) 2014-04-02 2016-05-31 International Business Machines Corporation Metadata-driven workflows and integration with genomic data processing systems and techniques
US9767101B2 (en) * 2014-06-20 2017-09-19 Google Inc. Media store with a canonical layer for content
US20150370796A1 (en) * 2014-06-20 2015-12-24 Google Inc. Media store with a canonical layer for content
CN105447609A (en) * 2014-08-29 2016-03-30 国际商业机器公司 Method, device and system for processing case management model
US20160063195A1 (en) * 2014-08-29 2016-03-03 International Business Machines Corporation Case management model processing
US10832809B2 (en) * 2014-08-29 2020-11-10 International Business Machines Corporation Case management model processing
US10453109B2 (en) * 2015-07-30 2019-10-22 Sci Limited Evaluation and training for online vehicle request and response messaging
US10915537B2 (en) * 2015-08-27 2021-02-09 International Business Machines Corporation System and a method for associating contextual structured data with unstructured documents on map-reduce
US10885042B2 (en) * 2015-08-27 2021-01-05 International Business Machines Corporation Associating contextual structured data with unstructured documents on map-reduce
US20170060915A1 (en) * 2015-08-27 2017-03-02 International Business Machines Corporation System and a method for associating contextual structured data with unstructured documents on map-reduce
US20170060992A1 (en) * 2015-08-27 2017-03-02 International Business Machines Corporation System and a method for associating contextual structured data with unstructured documents on map-reduce
US10387561B2 (en) 2015-11-29 2019-08-20 Vatbox, Ltd. System and method for obtaining reissues of electronic documents lacking required data
US10509811B2 (en) 2015-11-29 2019-12-17 Vatbox, Ltd. System and method for improved analysis of travel-indicating unstructured electronic documents
US10558880B2 (en) * 2015-11-29 2020-02-11 Vatbox, Ltd. System and method for finding evidencing electronic documents based on unstructured data
US11138372B2 (en) 2015-11-29 2021-10-05 Vatbox, Ltd. System and method for reporting based on electronic documents
US20180101745A1 (en) * 2015-11-29 2018-04-12 Vatbox, Ltd. System and method for finding evidencing electronic documents based on unstructured data
US11227023B2 (en) 2015-12-01 2022-01-18 International Business Machines Corporation Searching people, content and documents from another person's social perspective
US10592567B2 (en) * 2015-12-01 2020-03-17 International Business Machines Corporation Searching people, content and documents from another person's social perspective
US20170154115A1 (en) * 2015-12-01 2017-06-01 International Business Machines Corporation Searching people, content and documents from another person's social perspective
US20170310451A1 (en) * 2016-04-23 2017-10-26 Sugarcrm Inc. Full-duplex real-time cross-module updates of customer relationship management (crm) data in a crm data processing system
WO2018048512A1 (en) * 2016-07-31 2018-03-15 Vatbox, Ltd. Matching transaction electronic documents to evidencing electronic
US10986061B2 (en) * 2017-01-16 2021-04-20 Ercan TURFAN Knowledge-based structured communication system
US20180300296A1 (en) * 2017-04-17 2018-10-18 Microstrategy Incorporated Document similarity analysis
US10489024B2 (en) * 2017-09-12 2019-11-26 Sap Se UI rendering based on adaptive label text infrastructure
US20190079649A1 (en) * 2017-09-12 2019-03-14 Sap Se Ui rendering based on adaptive label text infrastructure
US11709854B2 (en) * 2018-01-02 2023-07-25 Bank Of America Corporation Artificial intelligence based smart data engine
US20190205636A1 (en) * 2018-01-02 2019-07-04 Bank Of America Corporation Artificial Intelligence Based Smart Data Engine
US20210256212A1 (en) * 2018-11-30 2021-08-19 Honeywell International Inc. Classifying devices of a building management system
US11797771B2 (en) * 2018-11-30 2023-10-24 Honeywell International Inc. Classifying devices from entity names based on contextual information
US11188720B2 (en) * 2019-07-18 2021-11-30 International Business Machines Corporation Computing system including virtual agent bot providing semantic topic model-based response
US20210256071A1 (en) * 2020-02-14 2021-08-19 Coupang Corp. Systems and methods for receiving and propagating efficient search updates in real time
US11803593B2 (en) * 2020-02-14 2023-10-31 Coupang Corp. Systems and methods for receiving and propagating efficient search updates in real time
US20230017384A1 (en) * 2021-07-15 2023-01-19 DryvIQ, Inc. Systems and methods for machine learning classification-based automated remediations and handling of data items

Also Published As

Publication number Publication date
US8315997B1 (en) 2012-11-20

Similar Documents

Publication Publication Date Title
US20090063470A1 (en) Document management using business objects
US8977953B1 (en) Customizing information by combining pair of annotations from at least two different documents
US7707023B2 (en) Method of finding answers to questions
US8229883B2 (en) Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases
US7065483B2 (en) Computer method and apparatus for extracting data from web pages
Borgman et al. Getty's Synoname™ and its cousins: A survey of applications of personal name‐matching algorithms
US8346795B2 (en) System and method for guiding entity-based searching
Ganti et al. Data cleaning: A practical perspective
US8417702B2 (en) Associating data records in multiple languages
EP1883026A1 (en) Reference resolution for text enrichment and normalization in mining mixed data
US20080147642A1 (en) System for discovering data artifacts in an on-line data object
US20080147578A1 (en) System for prioritizing search results retrieved in response to a computerized search query
US8856119B2 (en) Holistic disambiguation for entity name spotting
US20130325881A1 (en) Supplementing Structured Information About Entities With Information From Unstructured Data Sources
US20070027672A1 (en) Computer method and apparatus for extracting data from web pages
US20050080780A1 (en) System and method for processing a query
US20070198600A1 (en) Entity normalization via name normalization
US20080147588A1 (en) Method for discovering data artifacts in an on-line data object
US20080147641A1 (en) Method for prioritizing search results retrieved in response to a computerized search query
US20180181646A1 (en) System and method for determining identity relationships among enterprise data entities
WO2007113546A1 (en) Ranking of entities associated with stored content
CN115292450A (en) Data classification field knowledge base construction method based on information extraction
Weichselbraun et al. Consolidating heterogeneous enterprise data for named entity linking and web intelligence
McNamee et al. HLTCOE approaches to knowledge base population at TAC 2009
Ananthanarayanan et al. Rule based synonyms for entity extraction from noisy text

Legal Events

Date Code Title Description
AS Assignment

Owner name: NOGACOM LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PELED, ARIEL;SAVION, GILAD;REZNIKOV, ELAD;AND OTHERS;REEL/FRAME:021447/0875;SIGNING DATES FROM 20080824 TO 20080825

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION