US20070112748A1 - System and method for using text analytics to identify a set of related documents from a source document

Info

Publication number
US20070112748A1
Authority
US
United States
Prior art keywords
document
structured information
related documents
metadata
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US11/281,291
Other versions
US9495349B2 (en)
Inventor
Robert Angell
Stephen Boyer
James Cooper
Richard Hennessy
Tapas Kanungo
Jeffrey Kreulen
David Martin
James Rhodes
W. Spangler
Herschel Weintraub
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US11/281,291 (patent US9495349B2)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION: assignment of assignors interest (see document for details). Assignors: SPANGLER, W. SCOTT, WEINTRAUB, HERSCHEL J.R., MARTIN, DAVID C., BOYER, STEPHEN K., HENNESSY, RICHARD A., COOPER, JAMES W., KANUNGO, TAPAS, KREULEN, JEFFREY T., RHODES, JAMES J., ANGELL, ROBERT L.
Priority to CN200610110127A (patent CN100594495C)
Publication of US20070112748A1
Application granted
Publication of US9495349B2
Legal status: Active
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools

Definitions

  • the present invention relates generally to using text analytics to identify a set of documents from a source document, and more specifically relates to a system and method for using text analytics on a technical reference such as a patent, along with a MeSH database, to identify a set of related references.
  • the researcher may have an initial document, e.g., a patent, a journal article, a patient record, etc., and would like to find a superset of technical references that are related to the initial document.
  • Various methodologies are known for searching for technical references.
  • a common approach involves word searching, in which key words are entered into a database to identify references that include the key words.
  • Other approaches involve utilizing classification data. For instance, in the case of patents, related patents may be identified based on the classification and sub-classification codes that are designated to each patent.
  • investigators can examine the list of references cited in the initial document.
  • UMLS: Unified Medical Language System
  • the UMLS knowledge services can also assist in data creation and indexing publications.
  • a part of the UMLS consists of the Medical Subject Headings (MeSH) codes, which serve as the basis for building ontologies important for the classification of the scientific literature.
  • the NLM has a full-time staff who methodically index millions of scientific publications in practically all of the recognized scientific journals. This forms the basis of national resources such as MedLine (as well as other databases).
  • when the NLM indexers classify and index these journals, they use the MeSH ontology and in so doing create an extremely valuable set of metadata that describes the articles being indexed. For example, the indexers typically read the articles and make a list of all chemicals that are mentioned in the articles (i.e., the chemical file).
  • the indexers use a variety of MeSH qualifier codes to determine if the article being indexed is about chemicals, surgery, genetics, etc.
  • At the more granular level, they classify the articles via an extensive system of concept codes, which number more than 750,000. This serves as a rich source of metadata for further classifying and indexing other content.
  • the present invention addresses the above-mentioned problems, as well as others, by providing
  • the invention provides a document processing system, comprising: a textual analytics system that analyzes unstructured data contained in a source document and extracts a set of structured information about the source document; and a compare system that identifies a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • the invention provides a computer program product stored on a computer readable medium for processing a content source, comprising: program code configured for analyzing unstructured data contained in the content source and for extracting a set of structured information about the content source; and program code configured for identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • the invention provides a method of processing a source document, comprising: analyzing unstructured data contained in the source document; extracting a set of structured information about the source document; and identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • the invention provides a method for deploying an application for processing a document, comprising: providing a computer infrastructure being operable to: analyze unstructured data contained in the content source and for extracting a set of structured information about the content source; and identify a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • the invention provides computer software embodied in a propagated signal for implementing an application for processing a document, the computer software comprising instructions to cause a computer to perform the following functions: analyze unstructured data contained in the source document; extract a set of structured information about the source document; and identify a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • FIG. 1 depicts a computer system having a document processing system in accordance with an embodiment of the present invention.
  • FIG. 2 depicts a search engine for searching annotated documents in accordance with an embodiment of the present invention.
  • FIG. 1 depicts a computer system 10 having a document processing system 18 that analyzes an inputted source document 28 and generates a set of related documents 30 .
  • document processing system 18 may also generate an annotated document 32 that includes metadata 34 used to identify the set of related documents 30 .
  • the annotated document 32 may be stored in an annotated documents database 40 (i.e., with other annotated documents).
  • the set of related documents 30 comprises a list of publications that are somehow related or relevant to the inputted source document 28 .
  • source document 28 may comprise any type of document, but generally comprises “unstructured information.”
  • the generated set of related documents 30 may comprise any documents that can be identified via a metadata database 36 .
  • source document 28 may comprise a biotechnology related patent document that discloses a particular genetic sequence, and the set of related documents 30 comprises a list of biotechnology references (i.e., journal articles, etc.) that discuss the particular genetic sequence.
  • source document 28 may comprise a patient record that discloses a particular condition or disease, and the set of related documents 30 may include scientific articles relevant to the condition or disease.
  • document processing system 18 may input any type of content source that contains unstructured information.
  • Illustrative content sources may include multimedia data such as audio files, video data, images, streaming data, Web pages, etc.
  • document processing system 18 includes a textual analytics system 20 for extracting “structured information,” including key words, such as chemical names, diseases, genes, etc., from the source document 28 ; a compare system 22 for matching the structured information with metadata stored in metadata database 36 to locate the set of related documents 30 ; an aggregation and ranking system 24 for aggregating and ranking the set of related documents 30 and/or associated metadata/structured information; and an annotation system 26 for generating an annotated document 32 that includes metadata 34 .
  • Textual analytics system 20 provides a system for analyzing unstructured information in order to generate a set of structured information.
  • Textual analytics system 20 may, for instance, be implemented with the IBM™ Unstructured Information Management Architecture (UIMA).
  • Structured information may be characterized as information whose intended meaning is unambiguous and explicitly represented in the structure or format of the data.
  • the canonical example of structured information is a relational database table.
  • Unstructured information may be characterized as information whose intended meaning is only loosely implied by its form and therefore requires interpretation in order to approximate and extract its intended meaning. Examples include natural language documents, speech, audio, still images, Web pages and video. It is estimated that 80 percent of all corporate information is unstructured.
  • Unstructured Information Management (UIM) applications make use of a variety of technologies including statistical and rule-based natural language processing (NLP), information retrieval, machine learning, ontologies, and automated reasoning.
  • UIM applications may consult structured sources to help resolve the semantics of the unstructured content. For example, a database of chemical names can help in focusing the analysis of medical abstracts.
  • a UIM application generally produces structured information resources that unambiguously represent content derived from unstructured information input. These structured resources can then be made accessible through a set of application-appropriate access methods.
  • a simple example is a search index and query processor that makes documents quickly accessible by topic and ranks them according to their relevance to key concepts specified by the user.
  • a more complex example is a formal ontology and inference system that, for example, allows the user to explore the concepts, their relationships, and the logical implications contained in a collection consisting of millions of documents.
  • Textual analytics system 20 may be implemented to identify structured information about a particular technology field (e.g., life sciences) including key words, such as chemical names, diseases, genes, molecules, etc., from the source document 28 .
  • Other information, such as a list of Chemical Abstracts Service (CAS) numbers and a list of SMILES (“simplified molecular input line entry specification,” which is a specification for unambiguously describing the structure of chemical molecules using short ASCII alpha-numeric strings), may also be derived by textual analytics system 20 from the source document 28 .
  • CAS: Chemical Abstracts Service
  • SMILES: simplified molecular input line entry specification
  • Compare system 22 compares the results of textual analytics system 20 with information in metadata database 36 to identify a set of related documents 30 .
  • Metadata database 36 comprises metadata indexed from a comprehensive set of technology references, i.e., publications, such as scientific journal articles.
  • metadata database 36 comprises a database of MedLine abstracts, which include metadata comprising MeSH codes, chemical lists, CAS numbers, SMILES data, etc., for associated publications.
  • Compare system 22 thus identifies publications whose associated metadata matches the structured information obtained by textual analytics system 20 . Each such match may result in the identification of a technology reference that can be added to the set of related documents 30 .
  • Aggregation and ranking system 24 may be implemented to aggregate results and rank documents within the set of related documents 30 .
  • Annotation system 26 can be utilized to annotate the source document 28 with metadata 34 derived from both the metadata database 36 and from the textual analytics system 20 .
  • the metadata 34 in annotated document 32 may likewise be processed/ranked by aggregation and ranking system 24 .
  • an annotated patent could be generated with, e.g., MedLine metadata that includes MeSH data, indexed data associated with technical references containing chemicals in common with the source patent, etc.
  • the metadata database 36 could be loaded as a separate star schema that is part of a larger data warehouse that also contains the annotated documents database 40 .
  • the aggregation and ranking system 24 could be implemented in any manner. For instance, if multiple references within the set of related documents 30 include the same piece of metadata, those instances of the metadata could be aggregated into a single listing with an increased rank of importance. Moreover, aggregation and ranking system 24 could identify “categories” of references and/or metadata that are deemed more important than others. Furthermore, aggregation and ranking system 24 could filter references and/or metadata to exclude certain references or metadata from the results.
  • annotation system 26 may be implemented in any fashion.
  • the metadata 34 may be stored in additional fields of a document database.
  • Any type of metadata could be used within the context of the present invention to identify a set of related documents 30 and annotate a source document 28 .
  • Illustrative types of metadata include MedLine qualifier codes, chemicals, molecular structures, MeSH codes, concept codes, classifications, ontologies, etc.
  • Non-biotechnology related patents, such as software, mechanical, electrical, etc. could likewise be annotated in a similar fashion with domain specific metadata based on, e.g., existing or developed metadata ontologies and classifications.
  • FIG. 2 depicts a data mining system 42 for exploiting the annotated documents database 40 of FIG. 1 .
  • Data mining system 42 includes a search system 44 and metadata classification system 46 that allows a user to enter a metadata query 48 to generate a set of search results 50 .
  • the computer system 10 of FIG. 1 may comprise, e.g., a desktop, a laptop, a workstation, etc.
  • computer system 10 could be implemented as part of a client and/or a server.
  • Computer system 10 generally includes a processor 12 , input/output (I/O) 14 , memory 16 , and bus 17 .
  • the processor 12 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server.
  • Memory 16 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc.
  • memory 16 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.
  • I/O 14 may comprise any system for exchanging information to/from an external resource.
  • External devices/resources may comprise any known type of external device, including a monitor/display, speakers, storage, another computer system, a hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc.
  • Bus 17 provides a communication link between each of the components in the computer system 10 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc.
  • additional components such as cache memory, communication systems, system software, etc., may be incorporated into computer system 10 .
  • Access to computer system 10 may be provided over a network 36 such as the Internet, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), etc.
  • Communication could occur via a direct hardwired connection (e.g., serial port), or via an addressable connection that may utilize any combination of wireline and/or wireless transmission methods.
  • conventional network connectivity such as Token Ring, Ethernet, WiFi or other conventional communications standards could be used.
  • connectivity could be provided by conventional TCP/IP sockets-based protocol.
  • an Internet service provider could be used to establish interconnectivity.
  • communication could occur in a client-server or server-server environment.
  • a computer system 10 comprising document processing system 18 could be created, maintained and/or deployed by a service provider that offers the functions described herein for customers. That is, a service provider could offer to identify sets of related documents, provide a process for annotating documents, and/or provide an annotated documents database 40 as described above.
  • systems, functions, mechanisms, methods, engines and modules described herein can be implemented in hardware, software, or a combination of hardware and software. They may be implemented by any type of computer system or other apparatus adapted for carrying out the methods described herein.
  • a typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein.
  • a specific use computer containing specialized hardware for carrying out one or more of the functional tasks of the invention could be utilized.
  • part or all of the invention could be implemented in a distributed manner, e.g., over a network such as the Internet.
  • the present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods and functions described herein, and which—when loaded in a computer system—is able to carry out these methods and functions.
  • Terms such as computer program, software program, program, program product, software, etc., in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
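The aggregation, ranking, and filtering behaviour described above for aggregation and ranking system 24 could be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the record layout, the sample reference identifiers, the `aggregate_metadata` name, and its `exclude` parameter are all assumptions, and "rank" is taken here to mean simply the number of related references that carry a given metadata item.

```python
from collections import Counter


def aggregate_metadata(related_docs, exclude=frozenset()):
    """Collapse repeated metadata items into one listing ranked by frequency.

    related_docs maps a (hypothetical) reference id to its set of metadata items.
    Items listed in `exclude` are filtered out of the result.
    """
    counts = Counter()
    for metadata in related_docs.values():
        counts.update(item for item in metadata if item not in exclude)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data: three related references and their metadata.
related_docs = {
    "ref-A": {"caffeine", "insomnia"},
    "ref-B": {"caffeine", "adenosine"},
    "ref-C": {"caffeine", "insomnia", "humans"},
}
ranked = aggregate_metadata(related_docs, exclude={"humans"})
```

In this sketch a metadata item shared by more references rises to the top of the single aggregated listing, and the `exclude` set plays the role of the filter that drops unwanted metadata from the results.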

Abstract

A system and method for processing a document to generate a set of related documents. A system is provided that includes a textual analytics system that analyzes unstructured data contained in a source document and extracts a set of structured information about the source document; and a compare system that identifies a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.

Description

    BACKGROUND OF THE INVENTION
  • 1. Technical Field
  • The present invention relates generally to using text analytics to identify a set of documents from a source document, and more specifically relates to a system and method for using text analytics on a technical reference such as a patent, along with a MeSH database, to identify a set of related references.
  • 2. Related Art
  • Recent years have seen an explosive growth in the field of biotechnology, where discoveries can be worth hundreds of millions of dollars for the entities that own the rights to the discoveries. An ongoing challenge however is the tremendous cost of the research and development that is typically required. Given the dollar figures that are involved, companies must have a full understanding of the technology landscape for a particular biotechnology field.
  • Much of the technology landscape for a particular field can be gleaned from technical references, such as patent references and other scientific articles. From such references, one can determine the current state of the art, what technology is proprietary, what technology is public domain, etc. One of the challenges however involves quickly and efficiently locating relevant references that relate to a technological endeavor.
  • In many cases, the researcher may have an initial document, e.g., a patent, a journal article, a patient record, etc., and would like to find a superset of technical references that are related to the initial document. Various methodologies are known for searching for technical references. A common approach involves word searching, in which key words are entered into a database to identify references that include the key words. Other approaches involve utilizing classification data. For instance, in the case of patents, related patents may be identified based on the classification and sub-classification codes that are designated to each patent. In even a further approach, investigators can examine the list of references cited in the initial document.
  • While each of these techniques is useful, each is limited. Word searching is limited since different writers often refer to similar concepts using any number of different terms, which generates many useless results. Furthermore, in the case of patents, the number of patents that share the same classification/sub-classification codes can be very large, and those patents do not always include the relevant features being searched for. Conversely, the list of cited references on a technical document is typically relatively short and can only point to preexisting references, which may provide a good starting point, but is almost certainly not comprehensive in nature.
  • Accordingly, there are currently significant limitations involved in searching and analyzing technical references when trying to understand the technology landscape of a particular field of study.
  • Fortunately, non-patent literature in the biotechnology field is somewhat more user-friendly. The US National Library of Medicine (NLM) has over the years developed a scientific system called the Unified Medical Language System (UMLS) for the international harmonization of medical information and for the purpose of improving access to medical and scientific literature. The UMLS (http://umls.nlm.nih.gov/) objective is to help researchers intelligently retrieve and integrate information from a wide range of disparate electronic biomedical information sources. It can be used to overcome variations in the way similar concepts are expressed in different sources. This makes it easier for users to link information from patient record systems, bibliographic databases, factual databases, expert systems, etc.
  • The UMLS knowledge services can also assist in data creation and indexing publications. A part of the UMLS consists of the Medical Subject Headings (MeSH) codes, which serve as the basis for building ontologies important for the classification of the scientific literature. To this end, the NLM has a full-time staff who methodically index millions of scientific publications in practically all of the recognized scientific journals. This forms the basis of national resources such as MedLine (as well as other databases). When the NLM indexers classify and index these journals, they use the MeSH ontology and in so doing create an extremely valuable set of metadata that describes the articles being indexed. For example, the indexers typically read the articles and make a list of all chemicals that are mentioned in the articles (i.e., the chemical file).
  • At the highest level, the indexers use a variety of MeSH qualifier codes to determine if the article being indexed is about chemicals, surgery, genetics, etc. At the more granular level, they classify the articles via an extensive system of concept codes, which number more than 750,000. This serves as a rich source of metadata for further classifying and indexing other content.
  • Unfortunately, there is no automated mechanism that allows a user to find related technical references for an inputted document (e.g., patent document, newspaper article, patient record, etc.) that is not indexed by the NLM or other similar metadata database. Accordingly, a need exists for a system that can identify a superset of technical references for an inputted reference.
  • SUMMARY OF THE INVENTION
  • The present invention addresses the above-mentioned problems, as well as others, by providing
  • In a first aspect, the invention provides a document processing system, comprising: a textual analytics system that analyzes unstructured data contained in a source document and extracts a set of structured information about the source document; and a compare system that identifies a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • In a second aspect, the invention provides a computer program product stored on a computer readable medium for processing a content source, comprising: program code configured for analyzing unstructured data contained in the content source and for extracting a set of structured information about the content source; and program code configured for identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • In a third aspect, the invention provides a method of processing a source document, comprising: analyzing unstructured data contained in the source document; extracting a set of structured information about the source document; and identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • In a fourth aspect, the invention provides a method for deploying an application for processing a document, comprising: providing a computer infrastructure being operable to: analyze unstructured data contained in the content source and for extracting a set of structured information about the content source; and identify a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • In a fifth aspect, the invention provides computer software embodied in a propagated signal for implementing an application for processing a document, the computer software comprising instructions to cause a computer to perform the following functions: analyze unstructured data contained in the source document; extract a set of structured information about the source document; and identify a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
  • FIG. 1 depicts a computer system having a document processing system in accordance with an embodiment of the present invention.
  • FIG. 2 depicts a search engine for searching annotated documents in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • Referring now to the drawings, FIG. 1 depicts a computer system 10 having a document processing system 18 that analyzes an inputted source document 28 and generates a set of related documents 30. In addition, document processing system 18 may also generate an annotated document 32 that includes metadata 34 used to identify the set of related documents 30. The annotated document 32 may be stored in an annotated documents database 40 (i.e., with other annotated documents). The set of related documents 30 comprises a list of publications that are somehow related or relevant to the inputted source document 28.
  • It is understood that source document 28 may comprise any type of document, but generally comprises “unstructured information.” The generated set of related documents 30 may comprise any documents that can be identified via a metadata database 36. For example, in one illustrative embodiment, source document 28 may comprise a biotechnology related patent document that discloses a particular genetic sequence, and the set of related documents 30 comprises a list of biotechnology references (i.e., journal articles, etc.) that discuss the particular genetic sequence. In another embodiment, source document 28 may comprise a patient record that discloses a particular condition or disease, and the set of related documents 30 may include scientific articles relevant to the condition or disease.
  • In still a further embodiment, rather than inputting a source document 28, document processing system 18 may input any type of content source that contains unstructured information. Illustrative content sources may include multimedia data such as audio files, video data, images, streaming data, Web pages, etc.
  • To generate the set of related documents 30, document processing system 18 includes a textual analytics system 20 for extracting “structured information,” including key words, such as chemical names, diseases, genes, etc., from the source document 28; a compare system 22 for matching the structured information with metadata stored in metadata database 36 to locate the set of related documents 30; an aggregation and ranking system 24 for aggregating and ranking the set of related documents 30 and/or associated metadata/structured information; and an annotation system 26 for generating an annotated document 32 that includes metadata 34.
  • Textual analytics system 20 provides a system for analyzing unstructured information in order to generate a set of structured information. Textual analytics system 20 may for instance be implemented with the IBM™ Unstructured Information Management Architecture (UIMA). Structured information may be characterized as information whose intended meaning is unambiguous and explicitly represented in the structure or format of the data. The canonical example of structured information is a relational database table. Unstructured information may be characterized as information whose intended meaning is only loosely implied by its form and therefore requires interpretation in order to approximate and extract its intended meaning. Examples include natural language documents, speech, audio, still images, Web pages and video. It is estimated that 80 percent of all corporate information is unstructured.
  • In analyzing unstructured content, Unstructured Information Management (UIM) applications make use of a variety of technologies including statistical and rule-based natural language processing (NLP), information retrieval, machine learning, ontologies, and automated reasoning. UIM applications may consult structured sources to help resolve the semantics of the unstructured content. For example, a database of chemical names can help in focusing the analysis of medical abstracts. A UIM application generally produces structured information resources that unambiguously represent content derived from unstructured information input. These structured resources can then be made accessible through a set of application-appropriate access methods. A simple example is a search index and query processor that makes documents quickly accessible by topic and ranks them according to their relevance to key concepts specified by the user. A more complex example is a formal ontology and inference system that, for example, allows the user to explore the concepts, their relationships, and the logical implications contained in a collection consisting of millions of documents.
  • Textual analytics system 20 may be implemented to identify structured information about a particular technology field (e.g., life sciences) including key words, such as chemical names, diseases, genes, molecules, etc., from the source document 28. Other information, such as a list of chemical abstract (CAS) numbers and a list of SMILES (“simplified molecular input line entry specification,” which is a specification for unambiguously describing the structure of chemical molecules using short ASCII alpha-numeric strings) may also be derived by textual analytics system 20 from the source document 28.
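  • By way of a non-limiting illustration, the kind of extraction performed by textual analytics system 20 could be sketched in Python as follows. The keyword set, the function names, and the sample text are hypothetical and chosen only for illustration; an actual implementation (e.g., one built on UIMA) would rely on curated vocabularies and richer analysis. The sketch matches key words against a small dictionary and keeps only CAS numbers whose check digit is valid.

```python
import re

# Hypothetical keyword dictionary; a real system would consult a curated vocabulary.
KEYWORDS = {"aspirin", "caffeine", "diabetes", "insulin"}

# CAS numbers: 2 to 7 digits, a hyphen, 2 digits, a hyphen, and 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(number: str) -> bool:
    """Validate the check digit: weight the other digits 1, 2, 3, ... from the right."""
    digits = number.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def extract_structured_info(text: str) -> dict:
    """Return the key words and validated CAS numbers found in unstructured text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    cas_numbers = {m.group(0) for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group(0))}
    return {"keywords": sorted(tokens & KEYWORDS), "cas_numbers": sorted(cas_numbers)}


sample = (
    "The composition combines caffeine (CAS 58-08-2) with aspirin (50-78-2); "
    "12-34-5 is not a CAS number."
)
info = extract_structured_info(sample)
```

  • In this sketch, the string 12-34-5 is discarded because its check digit does not match, so only the two valid CAS numbers and the two dictionary key words are returned.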
  • Compare system 22 compares the results of textual analytics system 20 with information in metadata database 36 to identify a set of related documents 30. Metadata database 36 comprises metadata indexed from a comprehensive set of technology references, i.e., publications, such as scientific journal articles. In one illustrative embodiment, metadata database 36 comprises a database of MedLine abstracts, which include metadata comprising MeSH codes, chemical lists, CAS numbers, SMILES data, etc., for associated publications. Compare system 22 thus identifies publications whose associated metadata matches the structured information obtained by textual analytics system 20. Each such match may result in the identification of a technology reference that can be added to the set of related documents 30. Aggregation and ranking system 24 may be implemented to aggregate results and rank documents within the set of related documents 30.
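  • As one possible sketch of the matching performed by compare system 22, the metadata could be held in an inverted index keyed by (field, value) pairs, so that each extracted item is looked up directly. The record layout, the publication identifiers, and the placeholder codes such as "M-0001" are assumptions made for illustration and do not reflect actual MedLine data.

```python
from collections import defaultdict

# Hypothetical metadata records; identifiers and codes are placeholders.
METADATA_DB = {
    "pub-001": {"mesh": {"M-0001"}, "cas": {"58-08-2"}},
    "pub-002": {"mesh": {"M-0002"}, "cas": {"50-78-2"}},
    "pub-003": {"mesh": {"M-0001", "M-0003"}, "cas": set()},
}


def build_index(db: dict) -> dict:
    """Map each (field, value) pair to the publications that carry it."""
    index = defaultdict(set)
    for pub_id, fields in db.items():
        for field_name, values in fields.items():
            for value in values:
                index[(field_name, value)].add(pub_id)
    return index


def find_related(structured_info: dict, index: dict) -> dict:
    """Return each matching publication with the (field, value) pairs that matched."""
    matches = defaultdict(set)
    for field_name, values in structured_info.items():
        for value in values:
            for pub_id in index.get((field_name, value), ()):
                matches[pub_id].add((field_name, value))
    return dict(matches)


structured = {"cas": ["58-08-2"], "mesh": ["M-0003"]}
related = find_related(structured, build_index(METADATA_DB))
```

  • Keeping the matched pairs alongside each publication, rather than only the publication identifier, leaves the information needed for later aggregation and ranking.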
  • Annotation system 26 can be utilized to annotate the source document 28 with metadata 34 derived from both the metadata database 36 and from the textual analytics system 20. The metadata 34 in annotated document 32 may likewise be processed/ranked by aggregation and ranking system 24. In an example where source document 28 comprises a patent, an annotated patent could be generated with, e.g., MedLine metadata that includes MeSH data, indexed data associated with technical references containing chemicals in common with the source patent, etc.
  • In an illustrative embodiment, the metadata database 36 could be loaded as a separate star schema that is part of a larger data warehouse that also contains the annotated documents database 40.
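  • A minimal sketch of such a star schema, using Python's built-in sqlite3 module with an in-memory database, might look as follows. The table and column names, as well as the sample rows, are hypothetical; the description does not specify a particular schema. A central fact table links a publication dimension to a metadata term dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_publication (pub_key INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE dim_term (term_key INTEGER PRIMARY KEY, term_type TEXT, term_value TEXT);
    -- Fact table: one row per (publication, metadata term) association.
    CREATE TABLE fact_pub_term (
        pub_key INTEGER REFERENCES dim_publication (pub_key),
        term_key INTEGER REFERENCES dim_term (term_key)
    );
    """
)
conn.executemany("INSERT INTO dim_publication VALUES (?, ?)", [(1, "Article A"), (2, "Article B")])
conn.executemany(
    "INSERT INTO dim_term VALUES (?, ?, ?)",
    [(1, "cas", "58-08-2"), (2, "mesh", "M-0001")],
)
conn.executemany("INSERT INTO fact_pub_term VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])


def titles_for(term_type: str, term_value: str) -> list:
    """Return titles of publications associated with the given metadata term."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM fact_pub_term AS f
        JOIN dim_publication AS p ON p.pub_key = f.pub_key
        JOIN dim_term AS t ON t.term_key = f.term_key
        WHERE t.term_type = ? AND t.term_value = ?
        ORDER BY p.title
        """,
        (term_type, term_value),
    )
    return [row[0] for row in rows]
```

  • For example, titles_for("mesh", "M-0001") returns both sample articles, while titles_for("cas", "58-08-2") returns only the first.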
  • The aggregation and ranking system 24 could be implemented in any manner. For instance, if multiple references within the set of related documents 30 include the same piece of metadata, those instances of the metadata could be aggregated into a single listing with an increased rank of importance. Moreover, aggregation and ranking system 24 could identify “categories” of references and/or metadata that are deemed more important than others. Furthermore, aggregation and ranking system 24 could filter references and/or metadata to exclude certain references or metadata from the results.
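  • One simple way to realize the aggregation, ranking, and filtering just described is to count how many references share each piece of metadata and to sort by that count. The following Python sketch assumes a hypothetical input format (a mapping from reference identifier to its metadata items) and a hypothetical exclusion set; other ranking schemes are equally possible.

```python
from collections import Counter


def aggregate_and_rank(references: dict, excluded=frozenset()) -> list:
    """Merge repeated metadata into single entries ranked by how many references share them."""
    counts = Counter()
    for metadata in references.values():
        # Count each item once per reference and drop filtered items.
        counts.update(set(metadata) - set(excluded))
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


references = {
    "pub-001": ["caffeine", "adenosine"],
    "pub-002": ["caffeine", "aspirin"],
    "pub-003": ["caffeine", "aspirin", "review"],
}
ranked = aggregate_and_rank(references, excluded={"review"})
```

  • Here the metadata item shared by all three references is listed once, at the top of the ranking, and the excluded item does not appear in the results.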
  • Likewise, annotation system 26 may be implemented in any fashion. For instance, the metadata 34 may be stored in additional fields of a document database.
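  • As an illustration of storing metadata 34 in additional fields, the sketch below models an annotated document as a record with two extra fields: one for the structured information extracted from the source document and one for metadata gathered from the related documents. The class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedDocument:
    doc_id: str
    text: str
    # Additional fields holding the annotations.
    extracted: dict = field(default_factory=dict)
    related_metadata: dict = field(default_factory=dict)


def annotate(doc_id: str, text: str, extracted: dict, related: dict) -> AnnotatedDocument:
    """Merge the metadata of all related documents and attach it to the source document."""
    merged = {}
    for metadata in related.values():
        for key, values in metadata.items():
            merged.setdefault(key, set()).update(values)
    merged_sorted = {key: sorted(values) for key, values in merged.items()}
    return AnnotatedDocument(doc_id, text, dict(extracted), merged_sorted)


related = {
    "pub-001": {"mesh": ["M-0001"], "cas": ["58-08-2"]},
    "pub-003": {"mesh": ["M-0001", "M-0003"]},
}
annotated = annotate("patent-1", "source text", {"keywords": ["caffeine"]}, related)
```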
  • It should be understood that any type of metadata could be used within the context of the present invention to identify a set of related documents 30 and annotate a source document 28. Illustrative types of metadata include MedLine qualifier codes, chemicals, molecular structures, MeSH codes, concept codes, classifications, ontologies, etc. Non-biotechnology related patents, such as software, mechanical, electrical, etc., could likewise be annotated in a similar fashion with domain specific metadata based on, e.g., existing or developed metadata ontologies and classifications.
  • FIG. 2 depicts a data mining system 42 for exploiting the annotated documents database 40 of FIG. 1. Data mining system 42 includes a search system 44 and metadata classification system 46 that allows a user to enter a metadata query 48 to generate a set of search results 50.
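  • A metadata query of the kind handled by search system 44 could, in its simplest form, select the annotated documents whose annotations contain every queried value. The sketch below is one hypothetical reading of that step; the query format, function name, and sample annotations are assumptions, and the metadata classification system 46 is not modeled.

```python
def search(annotated_docs: dict, query: dict) -> list:
    """Return ids of annotated documents whose annotations contain every queried value."""
    results = []
    for doc_id, annotations in annotated_docs.items():
        if all(
            set(values) <= set(annotations.get(name, ()))
            for name, values in query.items()
        ):
            results.append(doc_id)
    return sorted(results)


annotated_docs = {
    "patent-1": {"mesh": ["M-0001", "M-0003"], "cas": ["58-08-2"]},
    "patent-2": {"mesh": ["M-0002"]},
}
hits = search(annotated_docs, {"mesh": ["M-0001"]})
```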
  • In general, the computer system 10 of FIG. 1 (as well as the data mining system 42 of FIG. 2) may comprise, e.g., a desktop, a laptop, a workstation, etc. Moreover, computer system 10 could be implemented as part of a client and/or a server. Computer system 10 generally includes a processor 12, input/output (I/O) 14, memory 16, and bus 17. The processor 12 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server. Memory 16 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Moreover, memory 16 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.
  • I/O 14 may comprise any system for exchanging information to/from an external resource. External devices/resources may comprise any known type of external device, including a monitor/display, speakers, storage, another computer system, a hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc. Bus 17 provides a communication link between each of the components in the computer system 10 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc. Although not shown, additional components, such as cache memory, communication systems, system software, etc., may be incorporated into computer system 10.
  • Access to computer system 10 may be provided over a network 36 such as the Internet, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), etc. Communication could occur via a direct hardwired connection (e.g., serial port), or via an addressable connection that may utilize any combination of wireline and/or wireless transmission methods. Moreover, conventional network connectivity, such as Token Ring, Ethernet, WiFi or other conventional communications standards could be used. Still yet, connectivity could be provided by conventional TCP/IP sockets-based protocol. In this instance, an Internet service provider could be used to establish interconnectivity. Further, as indicated above, communication could occur in a client-server or server-server environment.
  • It should be appreciated that the teachings of the present invention could be offered as a business method on a subscription or fee basis. For example, a computer system 10 comprising document processing system 18 could be created, maintained and/or deployed by a service provider that offers the functions described herein for customers. That is, a service provider could offer to identify sets of related documents, provide a process for annotating documents, and/or provide an annotated documents database 40 as described above.
  • It is understood that the systems, functions, mechanisms, methods, engines and modules described herein can be implemented in hardware, software, or a combination of hardware and software. They may be implemented by any type of computer system or other apparatus adapted for carrying out the methods described herein. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention could be utilized. In a further embodiment, part or all of the invention could be implemented in a distributed manner, e.g., over a network such as the Internet.
  • The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods and functions described herein, and which—when loaded in a computer system—is able to carry out these methods and functions. Terms such as computer program, software program, program, program product, software, etc., in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
  • The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.

Claims (27)

1. A document processing system, comprising:
a textual analytics system that analyzes unstructured data contained in a source document and extracts a set of structured information about the source document; and
a compare system that identifies a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
2. The document processing system of claim 1, wherein the set of structured information comprises key words associated with a technology field.
3. The document processing system of claim 1, wherein the set of structured information comprises a list of chemical abstract numbers.
4. The document processing system of claim 1, wherein the set of structured information comprises a list of SMILES (simplified molecular input line entry specification) strings.
5. The document processing system of claim 1, wherein the source document comprises a patent document and the set of related documents comprise technical references.
6. The document processing system of claim 1, wherein the source document comprises a medical record and the set of related documents comprise technical references.
7. The document processing system of claim 1, further comprising an annotation system for annotating the source document with metadata associated with the set of related documents.
8. The document processing system of claim 7, further comprising:
a database of annotated documents; and
a data mining system for mining the database of annotated documents.
9. The document processing system of claim 1, wherein the metadata is contained in a database of MedLine abstracts.
10. The document processing system of claim 1, further comprising an aggregation and ranking system for prioritizing the set of related documents.
11. A computer program product stored on a computer readable medium for processing a content source, comprising:
program code configured for analyzing unstructured data contained in the content source and for extracting a set of structured information about the content source; and
program code configured for identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
12. The computer program product of claim 11, wherein the set of structured information comprises key words associated with a technology field.
13. The computer program product of claim 11, wherein the set of structured information comprises a list of chemical abstract numbers.
14. The computer program product of claim 11, wherein the set of structured information comprises a list of SMILES (simplified molecular input line entry specification) strings.
15. The computer program product of claim 11, wherein the content source comprises a patent document and the set of related documents comprise technical references.
16. The computer program product of claim 11, wherein the content source is selected from the group consisting of: a medical record, a Web page, a multimedia input, a technical reference, and a publication.
17. The computer program product of claim 11, further comprising program code configured for annotating the content source with metadata associated with the set of related documents.
18. The computer program product of claim 17, further comprising:
program code configured for storing an annotated content source in a database of annotated documents; and
program code configured for data mining the database of annotated content sources.
19. The computer program product of claim 11, wherein the metadata is contained in a database of MedLine abstracts.
20. The computer program product of claim 11, further comprising program code configured for prioritizing the set of related documents.
21. A method of processing a source document, comprising:
analyzing unstructured data contained in the source document;
extracting a set of structured information about the source document; and
identifying a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
22. The method of claim 21, wherein the set of structured information comprises information selected from the group consisting of: key words associated with a technology field, a list of chemical abstract numbers, and a list of SMILES (simplified molecular input line entry specification) strings.
23. The method of claim 21, wherein the source document comprises a document selected from the group consisting of: a patent document, a Web page, a medical record, a technical reference, and a publication.
24. The method of claim 21, further comprising the step of annotating the source document with metadata associated with the set of related documents.
25. The method of claim 21, wherein the metadata is contained in a database of MedLine abstracts.
26. The method of claim 21, further comprising the step of prioritizing the set of related documents.
27. A method for deploying an application for processing a document, comprising:
providing a computer infrastructure being operable to:
analyze unstructured data contained in the content source and extract a set of structured information about the content source; and
identify a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
US11/281,291 2005-11-17 2005-11-17 System and method for using text analytics to identify a set of related documents from a source document Active 2030-03-02 US9495349B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/281,291 US9495349B2 (en) 2005-11-17 2005-11-17 System and method for using text analytics to identify a set of related documents from a source document
CN200610110127A CN100594495C (en) 2005-11-17 2006-07-31 System and method for using text analytics to identify a set of related documents from a source document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/281,291 US9495349B2 (en) 2005-11-17 2005-11-17 System and method for using text analytics to identify a set of related documents from a source document

Publications (2)

Publication Number Publication Date
US20070112748A1 US20070112748A1 (en) 2007-05-17
US9495349B2 US9495349B2 (en) 2016-11-15

Family

ID=38042110

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/281,291 Active 2030-03-02 US9495349B2 (en) 2005-11-17 2005-11-17 System and method for using text analytics to identify a set of related documents from a source document

Country Status (2)

Country Link
US (1) US9495349B2 (en)
CN (1) CN100594495C (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070112833A1 (en) * 2005-11-17 2007-05-17 International Business Machines Corporation System and method for annotating patents with MeSH data
US20080114724A1 (en) * 2006-11-13 2008-05-15 Exegy Incorporated Method and System for High Performance Integration, Processing and Searching of Structured and Unstructured Data Using Coprocessors
US20080288484A1 (en) * 2005-09-15 2008-11-20 Motorola, Inc. Distributed User Profile
US7526554B1 (en) 2008-06-12 2009-04-28 International Business Machines Corporation Systems and methods for reaching resource neighborhoods
US20090313255A1 (en) * 2008-06-12 2009-12-17 International Business Machines Corporation Systems and methods for reaching resource neighborhoods
US20100064012A1 (en) * 2008-09-08 2010-03-11 International Business Machines Corporation Method, system and apparatus to automatically add senders of email to a contact list
US20110055811A1 (en) * 2009-09-02 2011-03-03 International Business Machines Corporation Discovery, Analysis, and Visualization of Dependencies
US20120150852A1 (en) * 2010-12-10 2012-06-14 Paul Sheedy Text analysis to identify relevant entities
US8326819B2 (en) 2006-11-13 2012-12-04 Exegy Incorporated Method and system for high performance data metatagging and data indexing using coprocessors
US8374986B2 (en) 2008-05-15 2013-02-12 Exegy Incorporated Method and system for accelerated stream processing
US8495064B2 (en) 2011-09-08 2013-07-23 Microsoft Corporation Management of metadata for life cycle assessment data
WO2014047051A1 (en) * 2012-09-21 2014-03-27 Atigeo Llc Methods and automated systems that assign medical codes to electronic medical records
WO2014093935A1 (en) * 2012-12-16 2014-06-19 Cloud 9 Llc Vital text analytics system for the enhancement of requirements engineering documents and other documents
CN103970792A (en) * 2013-02-04 2014-08-06 中国银联股份有限公司 Index-based file comparison method and device
US20150286697A1 (en) * 2014-04-08 2015-10-08 International Business Machines Corporation Analyzing a query and provisioning data to analytics
US9298453B2 (en) 2012-07-03 2016-03-29 Microsoft Technology Licensing, Llc Source code analytics platform using program analysis and information retrieval
US9633093B2 (en) 2012-10-23 2017-04-25 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US9633097B2 (en) 2012-10-23 2017-04-25 Ip Reservoir, Llc Method and apparatus for record pivoting to accelerate processing of data fields
US9760592B2 (en) 2014-02-20 2017-09-12 International Business Machines Corporation Metrics management and monitoring system for service transition and delivery management
US20180150651A1 (en) * 2010-12-22 2018-05-31 Koninklijke Philips N.V. Creating an access control policy based on consumer privacy preferences
US10146845B2 (en) 2012-10-23 2018-12-04 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US10891419B2 (en) 2017-10-27 2021-01-12 International Business Machines Corporation Displaying electronic text-based messages according to their typographic features
US10902013B2 (en) 2014-04-23 2021-01-26 Ip Reservoir, Llc Method and apparatus for accelerated record layout detection
US10942943B2 (en) 2015-10-29 2021-03-09 Ip Reservoir, Llc Dynamic field data translation to support high performance stream data processing
US10949607B2 (en) 2018-12-10 2021-03-16 International Business Machines Corporation Automated document filtration with normalized annotation for document searching and access
US10977292B2 (en) 2019-01-15 2021-04-13 International Business Machines Corporation Processing documents in content repositories to generate personalized treatment guidelines
US11061913B2 (en) 2018-11-30 2021-07-13 International Business Machines Corporation Automated document filtration and priority scoring for document searching and access
US11068490B2 (en) 2019-01-04 2021-07-20 International Business Machines Corporation Automated document filtration with machine learning of annotations for document searching and access
US11074262B2 (en) 2018-11-30 2021-07-27 International Business Machines Corporation Automated document filtration and prioritization for document searching and access
US11721441B2 (en) 2019-01-15 2023-08-08 Merative Us L.P. Determining drug effectiveness ranking for a patient using machine learning

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8201085B2 (en) * 2007-06-21 2012-06-12 Thomson Reuters Global Resources Method and system for validating references
CN100461183C (en) * 2007-07-10 2009-02-11 北京大学 Metadata automatic extraction method based on multiple rule in network search
CN101599011B (en) * 2008-06-05 2016-11-16 天津书生投资有限公司 DPS and method
CN102955773B (en) 2011-08-31 2015-12-02 国际商业机器公司 For identifying the method and system of chemical name in Chinese document
US9792276B2 (en) * 2013-12-13 2017-10-17 International Business Machines Corporation Content availability for natural language processing tasks
CN105022733B (en) * 2014-04-18 2018-03-23 中科鼎富(北京)科技发展有限公司 DINFO OEC text analyzings method for digging and equipment
US10318622B2 (en) 2016-03-30 2019-06-11 International Business Machines Corporation Weighted annotation evaluation

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8095581B2 (en) 1999-02-05 2012-01-10 Gregory A Stobbs Computer-implemented patent portfolio analysis method and apparatus
US6385611B1 (en) 1999-05-07 2002-05-07 Carlos Cardona System and method for database retrieval, indexing and statistical analysis
JP4217033B2 (en) 2001-07-11 2009-01-28 ローム アンド ハース カンパニー Data processing system
KR100436356B1 (en) 2001-08-01 2004-06-18 (주) 위즈도메인 A method for analyzing and providing inter-citation relationship between patents related to a subject patent
EP1547009A1 (en) 2002-09-20 2005-06-29 Board Of Regents The University Of Texas System Computer program products, systems and methods for information discovery and relational analyses
TW200407736A (en) 2002-11-08 2004-05-16 Hon Hai Prec Ind Co Ltd System and method for classifying patents and displaying patent classification
US8694504B2 (en) 2003-03-05 2014-04-08 Spore, Inc. Methods and systems for technology analysis and mapping
US20040186833A1 (en) 2003-03-19 2004-09-23 The United States Of America As Represented By The Secretary Of The Army Requirements -based knowledge discovery for technology management
US7146361B2 (en) * 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
TWI273446B (en) 2003-09-30 2007-02-11 Hon Hai Prec Ind Co Ltd System and method for classifying patents and displaying patent classification
CN1609859A (en) * 2004-11-26 2005-04-27 孙斌 Search result clustering method

Patent Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4642762A (en) * 1984-05-25 1987-02-10 American Chemical Society Storage and retrieval of generic chemical structure representations
US5950192A (en) * 1994-08-10 1999-09-07 Oxford Molecular Group, Inc. Relational database mangement system for chemical structure storage, searching and retrieval
US6304869B1 (en) * 1994-08-10 2001-10-16 Oxford Molecular Group, Inc. Relational database management system for chemical structure storage, searching and retrieval
US6098034A (en) * 1996-03-18 2000-08-01 Expert Ease Development, Ltd. Method for standardizing phrasing in a document
US5794236A (en) * 1996-05-29 1998-08-11 Lexis-Nexis Computer-based system for classifying documents into a hierarchy and linking the classifications to the hierarchy
US6823301B1 (en) * 1997-03-04 2004-11-23 Hiroshi Ishikura Language analysis using a reading point
US6038560A (en) * 1997-05-21 2000-03-14 Oracle Corporation Concept knowledge base search and retrieval system
US6389436B1 (en) * 1997-12-15 2002-05-14 International Business Machines Corporation Enhanced hypertext categorization using hyperlinks
US6289342B1 (en) * 1998-01-05 2001-09-11 Nec Research Institute, Inc. Autonomous citation indexing and literature browsing using citation context
US6286018B1 (en) * 1998-03-18 2001-09-04 Xerox Corporation Method and apparatus for finding a set of documents relevant to a focus set using citation analysis and spreading activation techniques
US6038574A (en) * 1998-03-18 2000-03-14 Xerox Corporation Method and apparatus for clustering a collection of linked documents using co-citation analysis
US6604114B1 (en) * 1998-12-04 2003-08-05 Technology Enabling Company, Llc Systems and methods for organizing data
US7054754B1 (en) * 1999-02-12 2006-05-30 Cambridgesoft Corporation Method, system, and software for deriving chemical structural information
US7065514B2 (en) * 1999-05-05 2006-06-20 West Publishing Company Document-classification system, method and software
US7197697B1 (en) * 1999-06-15 2007-03-27 Fujitsu Limited Apparatus for retrieving information using reference reason of document
US6963830B1 (en) * 1999-07-19 2005-11-08 Fujitsu Limited Apparatus and method for generating a summary according to hierarchical structure of topic
US6879990B1 (en) * 2000-04-28 2005-04-12 Institute For Scientific Information, Inc. System for identifying potential licensees of a source patent portfolio
US7003517B1 (en) * 2000-05-24 2006-02-21 Inetprofit, Inc. Web-based system and method for archiving and searching participant-based internet text sources for customer lead data
US20020062302A1 (en) * 2000-08-09 2002-05-23 Oosta Gary Martin Methods for document indexing and analysis
US20020169755A1 (en) * 2001-05-09 2002-11-14 Framroze Bomi Patel System and method for the storage, searching, and retrieval of chemical names in a relational database
US20040205448A1 (en) * 2001-08-13 2004-10-14 Grefenstette Gregory T. Meta-document management system with document identifiers
US6732090B2 (en) * 2001-08-13 2004-05-04 Xerox Corporation Meta-document management system with user definable personalities
US20040088332A1 (en) * 2001-08-28 2004-05-06 Knowledge Management Objects, Llc Computer assisted and/or implemented process and system for annotating and/or linking documents and data, optionally in an intellectual property management system
US20050108001A1 (en) * 2001-11-15 2005-05-19 Aarskog Brit H. Method and apparatus for textual exploration discovery
US20040117405A1 (en) * 2002-08-26 2004-06-17 Gordon Short Relating media to information in a workflow system
US20040172378A1 (en) * 2002-11-15 2004-09-02 Shanahan James G. Method and apparatus for document filtering using ensemble filters
US20050131025A1 (en) * 2003-05-19 2005-06-16 Matier William L. Amelioration of cataracts, macular degeneration and other ophthalmic diseases
US20050060305A1 (en) * 2003-09-16 2005-03-17 Pfizer Inc. System and method for the computer-assisted identification of drugs and indications
US20050160107A1 (en) * 2003-12-29 2005-07-21 Ping Liang Advanced search, file system, and intelligent assistant agent
US20070208719A1 (en) * 2004-03-18 2007-09-06 Bao Tran Systems and methods for analyzing semantic documents over a network
US20050234952A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Content propagation for enhanced document retrieval
US20050246316A1 (en) * 2004-04-30 2005-11-03 Lawson Alexander J Method and software for extracting chemical data
US20060095298A1 (en) * 2004-10-29 2006-05-04 Bina Robert B Method for horizontal integration and research of information of medical records utilizing HIPPA compliant internet protocols, workflow management and static/dynamic processing of information
US20070112833A1 (en) * 2005-11-17 2007-05-17 International Business Machines Corporation System and method for annotating patents with MeSH data

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9338249B2 (en) * 2005-09-15 2016-05-10 Google Technology Holdings, Inc. Distributed user profile
US20080288484A1 (en) * 2005-09-15 2008-11-20 Motorola, Inc. Distributed User Profile
US20070112833A1 (en) * 2005-11-17 2007-05-17 International Business Machines Corporation System and method for annotating patents with MeSH data
US8326819B2 (en) 2006-11-13 2012-12-04 Exegy Incorporated Method and system for high performance data metatagging and data indexing using coprocessors
US9396222B2 (en) 2006-11-13 2016-07-19 Ip Reservoir, Llc Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US7660793B2 (en) * 2006-11-13 2010-02-09 Exegy Incorporated Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US20080114724A1 (en) * 2006-11-13 2008-05-15 Exegy Incorporated Method and System for High Performance Integration, Processing and Searching of Structured and Unstructured Data Using Coprocessors
US8880501B2 (en) 2006-11-13 2014-11-04 Ip Reservoir, Llc Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US8156101B2 (en) 2006-11-13 2012-04-10 Exegy Incorporated Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US10191974B2 (en) 2006-11-13 2019-01-29 Ip Reservoir, Llc Method and system for high performance integration, processing and searching of structured and unstructured data
US9323794B2 (en) 2006-11-13 2016-04-26 Ip Reservoir, Llc Method and system for high performance pattern indexing
US11449538B2 (en) 2006-11-13 2022-09-20 Ip Reservoir, Llc Method and system for high performance integration, processing and searching of structured and unstructured data
US10411734B2 (en) 2008-05-15 2019-09-10 Ip Reservoir, Llc Method and system for accelerated stream processing
US9547824B2 (en) 2008-05-15 2017-01-17 Ip Reservoir, Llc Method and apparatus for accelerated data quality checking
US11677417B2 (en) 2008-05-15 2023-06-13 Ip Reservoir, Llc Method and system for accelerated stream processing
US8374986B2 (en) 2008-05-15 2013-02-12 Exegy Incorporated Method and system for accelerated stream processing
US10158377B2 (en) 2008-05-15 2018-12-18 Ip Reservoir, Llc Method and system for accelerated stream processing
US10965317B2 (en) 2008-05-15 2021-03-30 Ip Reservoir, Llc Method and system for accelerated stream processing
US7526554B1 (en) 2008-06-12 2009-04-28 International Business Machines Corporation Systems and methods for reaching resource neighborhoods
US8515994B2 (en) 2008-06-12 2013-08-20 International Business Machines Corporation Reaching resource neighborhoods
US20090313255A1 (en) * 2008-06-12 2009-12-17 International Business Machines Corporation Systems and methods for reaching resource neighborhoods
US20100064012A1 (en) * 2008-09-08 2010-03-11 International Business Machines Corporation Method, system and apparatus to automatically add senders of email to a contact list
US20110055811A1 (en) * 2009-09-02 2011-03-03 International Business Machines Corporation Discovery, Analysis, and Visualization of Dependencies
US8713521B2 (en) * 2009-09-02 2014-04-29 International Business Machines Corporation Discovery, analysis, and visualization of dependencies
US8407215B2 (en) * 2010-12-10 2013-03-26 Sap Ag Text analysis to identify relevant entities
US20120150852A1 (en) * 2010-12-10 2012-06-14 Paul Sheedy Text analysis to identify relevant entities
US20180150651A1 (en) * 2010-12-22 2018-05-31 Koninklijke Philips N.V. Creating an access control policy based on consumer privacy preferences
US8495064B2 (en) 2011-09-08 2013-07-23 Microsoft Corporation Management of metadata for life cycle assessment data
US9298453B2 (en) 2012-07-03 2016-03-29 Microsoft Technology Licensing, Llc Source code analytics platform using program analysis and information retrieval
WO2014047051A1 (en) * 2012-09-21 2014-03-27 Atigeo Llc Methods and automated systems that assign medical codes to electronic medical records
US9633093B2 (en) 2012-10-23 2017-04-25 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US10949442B2 (en) 2012-10-23 2021-03-16 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US10102260B2 (en) 2012-10-23 2018-10-16 Ip Reservoir, Llc Method and apparatus for accelerated data translation using record layout detection
US10133802B2 (en) 2012-10-23 2018-11-20 Ip Reservoir, Llc Method and apparatus for accelerated record layout detection
US10146845B2 (en) 2012-10-23 2018-12-04 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US11789965B2 (en) 2012-10-23 2023-10-17 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US9633097B2 (en) 2012-10-23 2017-04-25 Ip Reservoir, Llc Method and apparatus for record pivoting to accelerate processing of data fields
US10621192B2 (en) 2012-10-23 2020-04-14 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US9678949B2 (en) 2012-12-16 2017-06-13 Cloud 9 Llc Vital text analytics system for the enhancement of requirements engineering documents and other documents
WO2014093935A1 (en) * 2012-12-16 2014-06-19 Cloud 9 Llc Vital text analytics system for the enhancement of requirements engineering documents and other documents
CN103970792A (en) * 2013-02-04 2014-08-06 中国银联股份有限公司 Index-based file comparison method and device
US9760592B2 (en) 2014-02-20 2017-09-12 International Business Machines Corporation Metrics management and monitoring system for service transition and delivery management
US10255364B2 (en) * 2014-04-08 2019-04-09 International Business Machines Corporation Analyzing a query and provisioning data to analytics
US20150286697A1 (en) * 2014-04-08 2015-10-08 International Business Machines Corporation Analyzing a query and provisioning data to analytics
US9633115B2 (en) * 2014-04-08 2017-04-25 International Business Machines Corporation Analyzing a query and provisioning data to analytics
US10902013B2 (en) 2014-04-23 2021-01-26 Ip Reservoir, Llc Method and apparatus for accelerated record layout detection
US10942943B2 (en) 2015-10-29 2021-03-09 Ip Reservoir, Llc Dynamic field data translation to support high performance stream data processing
US11526531B2 (en) 2015-10-29 2022-12-13 Ip Reservoir, Llc Dynamic field data translation to support high performance stream data processing
US10891419B2 (en) 2017-10-27 2021-01-12 International Business Machines Corporation Displaying electronic text-based messages according to their typographic features
US11061913B2 (en) 2018-11-30 2021-07-13 International Business Machines Corporation Automated document filtration and priority scoring for document searching and access
US11074262B2 (en) 2018-11-30 2021-07-27 International Business Machines Corporation Automated document filtration and prioritization for document searching and access
US10949607B2 (en) 2018-12-10 2021-03-16 International Business Machines Corporation Automated document filtration with normalized annotation for document searching and access
US11068490B2 (en) 2019-01-04 2021-07-20 International Business Machines Corporation Automated document filtration with machine learning of annotations for document searching and access
US10977292B2 (en) 2019-01-15 2021-04-13 International Business Machines Corporation Processing documents in content repositories to generate personalized treatment guidelines
US11721441B2 (en) 2019-01-15 2023-08-08 Merative Us L.P. Determining drug effectiveness ranking for a patient using machine learning

Also Published As

Publication number Publication date
CN100594495C (en) 2010-03-17
US9495349B2 (en) 2016-11-15
CN1967535A (en) 2007-05-23

Similar Documents

Publication Publication Date Title
US9495349B2 (en) System and method for using text analytics to identify a set of related documents from a source document
Tolle et al. Comparing noun phrasing techniques for use with medical digital library tools
Tseng et al. Text mining techniques for patent analysis
US7991733B2 (en) Data structure, system and method for knowledge navigation and discovery
US20100174675A1 (en) Data Structure, System and Method for Knowledge Navigation and Discovery
US20090217179A1 (en) System and method for knowledge navigation and discovery utilizing a graphical user interface
Smalheiser et al. Anne O'Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results
US20070112833A1 (en) System and method for annotating patents with MeSH data
Rybchak et al. Analysis of methods and means of text mining
De Maio et al. Biomedical data integration and ontology-driven multi-facets visualization
Sathya et al. A review on text mining techniques
Tsatsaronis et al. A Maximum-Entropy approach for accurate document annotation in the biomedical domain
Leroy et al. Genescene: biomedical text and data mining
Benz et al. Query logs as folksonomies
Natarajan Role of text mining in information extraction and information management
Ferrod et al. Disclosing citation meanings for augmented research retrieval and exploration
Yeganova et al. A Field Sensor: computing the composition and intent of PubMed queries
Smalheiser et al. Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database
Hung et al. OGIR: an ontology‐based grid information retrieval framework
Selvalakshmi et al. Semantic Conceptual Relational Similarity Based Web Document Clustering for Efficient Information Retrieval Using Semantic Ontology.
Zhang et al. A Content-Based Dataset Recommendation System for Biomedical Datasets
Chun et al. Semantic annotation and search for deep web services
Jae-Woo A model for information retrieval agent system based on keywords distribution
da Silva et al. Agile semantic annotation of scientific texts at the biomedical scenario
Pawar et al. Analysis of Machine Learning Algorithms for Retrieval of Ontological Knowledge from Unstructured Text

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION,NEW YO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ANGELL, ROBERT L.;BOYER, STEPHEN K.;COOPER, JAMES W.;AND OTHERS;SIGNING DATES FROM 20051017 TO 20051127;REEL/FRAME:017129/0657

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4