US20090019032A1 - Method and a system for semantic relation extraction - Google Patents

Method and a system for semantic relation extraction

Info

Publication number
US20090019032A1
Authority
US
United States
Prior art keywords
relation
feature
entities
semantic
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/979,534
Inventor
Markus Bundschus
Mathaeus DeJori
Martin Stetter
Volker Tresp
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG
Assigned to SIEMENS AKTIENGESELLSCHAFT reassignment SIEMENS AKTIENGESELLSCHAFT ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BUNDSCHUS, MARKUS, DEJORI, MATHAEUS, STETTER, MARTIN, TRESP, VOLKER
Publication of US20090019032A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass

Abstract

The invention provides a method for semantic relation extraction, wherein, on the basis of an annotated training corpus having tokens with associated relational labels each indicating a relation between the respective token and a selectable key entity, semantic relations between said key entity and other entities are directly extracted from unstructured text using a probabilistic extraction model.

Description

    BACKGROUND OF THE INVENTION
  • The invention relates to a method and a system for semantic relation extraction in particular from biomedical data.
  • The rapid growth of published literature in many fields of technology such as the biomedical domain renders automated information extraction tools indispensable for researchers to make use of this immense source of knowledge.
  • The past decade has seen an unprecedented increase of biomedical data in published literature. Progress in computational and biomedical methods has increased the pace of biomedical research. High-throughput experiments, such as microarrays, produce large quantities of high-quality data, which in turn leads to an increase in new findings and results. This development has caused an explosion of scientific literature published in this technical field. The overwhelming amount of textual information makes it necessary to use automated text information extraction tools to make efficient use of the enormous amount of knowledge contained in biomedical literature stored in data bases. Text mining applications transfer unstructured information, such as unstructured text, into structured form. Some text mining applications can only identify named entities. Possible entities in the biomedical field are genes, diseases, drugs, compounds, proteins etc. More important than identifying entities in an unstructured information base is identifying associations and relations between these entities. Relation extraction (RE) is the finding of associations and roles between entities in an unstructured information base such as text phrases. These text phrases are usually, but not necessarily, formed by a sentence.
  • The conventional semantic relation extraction methods comprise two consecutive steps. In a first step the entities are identified by means of a named entity recognition (NER). In a second step for each pair of entities a relation type is predicted.
  • FIG. 1 shows a flow-chart for explaining a conventional method for semantic relation extraction. In a preprocessing phase features for evaluating text information are defined and an annotated training corpus is generated. The features for evaluating the unstructured text information can be predefined character strings being typical for a certain entity, such as “CADH”. Another example for a feature might be whether a number can be found in the text. In the preprocessing phase an annotated training corpus is generated by experts in the respective technical field. The training corpus can be formed by sentences annotated by the experts.
  • FIG. 2 shows a table as an example of an annotated training corpus used by a conventional extraction method according to the state of the art. In the given example the training corpus consists of only two sentences, i.e. “we found that TP53 is a lung cancer gene” and “smoking is bad for your lungs”. In real systems, the training corpus consists of a plurality of sentences or a plurality of documents or abstracts. Both sentences of the annotated training corpus consist of several words or tokens which are labeled by the experts according to a predefined classification scheme. It can be seen from FIG. 2 that most tokens of the annotated training corpus are labeled as common words c. However, some tokens such as “TP53”, “lung” and “cancer” are labeled differently. The token “TP53” is labeled as a “gene”. The neighboring tokens “lung” and “cancer” are both labeled as a disease d. Note that in the table of FIG. 2 the word “lung” in the context of sentence 1 is labeled as a disease d because the next word is “cancer”, whereas “lungs” in sentence 2 of the training corpus is labeled as a common word c.
  • After the feature definition and the generation of the annotated training corpus in the preprocessing phase, a feature set is provided for the annotated training corpus and weights are calculated on the basis of a feature label distribution in a training phase.
  • In a further step an input query is input by a user to extract a semantic relation. A possible example is the sentence “Inactivating TP53 mutations were found in 55% of lethal metastatic pancreatic neoplasms”. The input query is tokenized into a sequence of tokens.
  • The table of FIG. 3 shows a token sequence consisting of twelve tokens x1 to x12 generated on the basis of the query input by the user. It can be seen from the flow-chart of FIG. 1 that in a conventional method for semantic relation extraction entity detection is performed after tokenization of the query. By means of a Viterbi algorithm the most likely label sequence is calculated. FIG. 3 shows the most likely label sequence for the given example. In the given example two entities are detected, i.e. one gene G and one disease D. Note that the labels y9, y10, y11, y12 are recognized to represent one disease D.
  • After completion of the entity detection a second step for relation extraction is performed in the conventional method as shown in the flow-chart of FIG. 1. The relation extraction is for example rule-based.
  • FIG. 4 shows a rule-based relation extraction performed by the conventional method. A possible way for a rule-based relation extraction according to the state of the art as shown in FIG. 4 is for the algorithm to check whether the tokens xi, which are labeled as common words c include keywords which are indicative for a corresponding relation. In the given example the token x3 “mutations” forms a common word c, but the token “mutations” is also an indicator for a particular relation, i.e. in this case genetic variation. After the rule-based relation extraction, the extracted relation is indicated to the user as shown in FIG. 5. The user is informed that there is a relation “genetic variation” between the primary entity “gene TP53” and a second entity, i.e. a disease “lethal metastatic pancreatic neoplasms”.
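  • The keyword check of the conventional rule-based step can be sketched as follows. This is only an illustration: the keyword table `RELATION_KEYWORDS` and the function name are hypothetical, and the source names only “mutations” as an indicator for the relation “genetic variation”.

```python
# Hypothetical keyword table; the source only names "mutations" as an
# indicator for the relation "genetic variation".
RELATION_KEYWORDS = {"genetic variation": {"mutation", "mutations"}}


def rule_based_relation(tokens, labels):
    """Return the first relation whose keyword occurs among common-word tokens."""
    for token, label in zip(tokens, labels):
        if label != "c":  # only tokens labeled as common words are checked
            continue
        for relation, keywords in RELATION_KEYWORDS.items():
            if token.lower() in keywords:
                return relation
    return None


tokens = ("Inactivating TP53 mutations were found in 55% of "
          "lethal metastatic pancreatic neoplasms").split()
# Label sequence as produced by the entity detection step (G = gene, D = disease).
labels = ["c", "G", "c", "c", "c", "c", "c", "c", "D", "D", "D", "D"]
print(rule_based_relation(tokens, labels))
```

  • Because the rule only looks at common-word tokens, any mistake in the preceding entity labels is passed on unchanged to the reported relation.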
  • As can be seen from the given example, relation extraction in conventional methods is performed in a two-step manner, i.e. first the participating entities are identified and then the relations between the entities are extracted. All pairs of entities are enumerated for a given text phrase and for each pair a prediction is made whether there is a relation or not.
  • However, this conventional method for relation extraction as shown in the flow-chart of FIG. 1 has several disadvantages. During calculation of the most likely label sequence by means of a Viterbi algorithm it can occur that the extracted entities are not labeled correctly. The conventional method is very sensitive to errors made during named entity recognition (NER). A disease mislabeled as another entity in the NER phase cannot be taken into account in a gene-disease relation classification phase. As another example, if the tokens x9 to x12 shown in the table of FIG. 3, i.e. “lethal”, “metastatic”, “pancreatic”, “neoplasms”, are mislabeled as genes (G), the error is carried along through the rule-based relation extraction, so that the user receives as an output a genetic variation relation between a gene TP53 and a gene “lethal metastatic pancreatic neoplasms”. A further possible disadvantage of the conventional method for extracting relations is that for training one needs to process all pairs of entities within sentences, which results in a lower number of positive examples and, thus, lower accuracy.
  • It is an object of the present invention to provide a method and a system for overcoming the disadvantages of the conventional method for semantic relation extraction as shown in FIG. 1.
  • BRIEF SUMMARY OF THE INVENTION
  • The invention provides a method and a system for semantic relation extraction on the basis of an annotated training corpus having tokens with associated relation labels each indicating a relation between the respective token and a selectable key entity wherein semantic relations between the key entity and other entities are directly extracted from unstructured text using a probabilistic extraction model.
  • In an embodiment of the system according to the present invention the probabilistic extraction model is a conditional random field (CRF).
  • BRIEF DESCRIPTION OF THE FIGURES
  • FIG. 1 shows a flow-chart of a conventional method for semantic relation extraction according to the state of the art;
  • FIG. 2 shows a table of an example for an annotated training corpus as used by the conventional method for semantic relation extraction shown in the flow-chart of FIG. 1;
  • FIG. 3 is a table of a calculated most likely label sequence of a tokenized input query as an intermediate result of the conventional method for semantic relation extraction shown in the flow-chart of FIG. 1;
  • FIG. 4 illustrates a rule-based relation extraction step as employed by a conventional method for semantic relation extraction as shown in the flow-chart of FIG. 1;
  • FIG. 5 shows the output of a conventional method for semantic relation extraction according to the state of the art for the exemplary input query of FIG. 3 and the exemplary annotated training corpus indicated in FIG. 2;
  • FIG. 6 shows a block diagram of a possible embodiment of a system for semantic relation extraction according to the present invention;
  • FIG. 7 shows a flow-chart of a possible embodiment of the method for semantic relation extraction according to the present invention;
  • FIG. 8 shows a simple flow-chart illustrating the calculation of weighting factors as employed by an embodiment of the method for semantic relation extraction according to the present invention;
  • FIG. 9 shows a simple flow-chart illustrating the tokenization of an input query as employed by an embodiment of the method for semantic relation extraction according to the present invention;
  • FIG. 10 shows a simple flow-chart indicating the extraction of relations of a key entity as employed by an embodiment of the method for semantic relation extraction according to the present invention;
  • FIG. 11 shows an example of an annotated training corpus and a query for illustrating the functionality of an embodiment of the method for semantic relation extraction according to the present invention;
  • FIG. 12 shows a table illustrating the functionality of a method and a system for semantic relation extraction according to the present invention;
  • FIG. 13 shows a table indicating a calculated most-likely label sequence for a tokenized exemplary query as shown in FIG. 11;
  • FIG. 14 shows an exemplary output of a result of the method for semantic relation extraction according to an embodiment of the present invention for the given example of FIG. 11.
  • DETAILED DESCRIPTION OF THE INVENTION
  • FIG. 6 shows a block diagram of a possible embodiment of a semantic relation extraction system 1. It can be seen from FIG. 6 that unstructured text comprising a plurality of documents is stored in a data base 2. The data base 2 is connected either directly or via a network to processing means 3. In other embodiments the processing means 3 are connected to a plurality of different data bases each having a plurality of unstructured documents. In a memory 4, an annotated training corpus is stored. The annotated training corpus comprises a plurality of tokens each having an associated relational label indicating a relation between the respective token and a selectable key entity. An example of an annotated training corpus used by the system according to the present invention is shown in FIG. 11. The processing means 3 can be formed by any processor. The processing means 3 is connected to input means 5 and output means 6. The user can input a query, for instance an input query sentence, by means of the input means 5. For example, the input means 5 can be formed by a keyboard. The output means 6 can be formed by a display. The processing means 3 extracts semantic relations between a key entity and other entities from the unstructured text in the data base 2 on the basis of the annotated training corpus stored in the memory 4. Semantic relations extracted by the processing means 3 can be stored by the processing means 3 in a structured relational database 7.
  • FIG. 7 shows a flow-chart of a possible embodiment of the method for semantic relation extraction according to the present invention.
  • In a preprocessing phase a feature definition is performed in step S1 and the training corpus is generated in step S2. An example for an annotated training corpus generated in step S2 is shown in FIG. 11.
  • During a training phase consisting of steps S3 and S4 as shown in FIG. 7 a feature set for the annotated training corpus is provided and weights are calculated on the basis of a feature-label distribution.
  • The features used by the method according to the present invention comprise a set of standard recognition features and additional relation recognition features. The standard recognition features can comprise orthographic features, word shape features, n-gram features, dictionary features or context features.
  • The biomedical entities often yield some orthographic characteristics. In many cases, biomedical entities consist of capitalized letters, include some numbers or are composed of combinations of both. Accordingly, orthographic features can help to distinguish various types of biomedical entities. Another recognition feature is a word shape feature.
  • Some words belonging to the same class of entities have the same word shape. For instance, for disease abbreviations it is common that only normal letters and no numbers appear in the token, whereas for genes/proteins the co-occurrence of numbers and letters is typical.
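  • A minimal sketch of orthographic and word shape features follows. The feature names and the shape encoding (uppercase letters as "A", lowercase letters as "a", digits as "0", with repeated symbols collapsed) are assumptions chosen for illustration; the source does not fix a particular encoding.

```python
def orthographic_features(token):
    """Binary orthographic cues: capitalization, digits, or both."""
    has_upper = any(ch.isupper() for ch in token)
    has_lower = any(ch.islower() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return {
        "all_caps": has_upper and not has_lower and not has_digit,
        "has_digit": has_digit,
        "caps_and_digits": has_upper and has_digit and not has_lower,
    }


def word_shape(token):
    """Collapse a token to its shape, e.g. 'TP53' -> 'A0' (assumed encoding)."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "A"
        elif ch.islower():
            symbol = "a"
        elif ch.isdigit():
            symbol = "0"
        else:
            symbol = ch  # keep punctuation such as '-' as it is
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


for tok in ["TP53", "COX-2", "COPD", "cancer"]:
    print(tok, word_shape(tok), orthographic_features(tok))
```

  • With this encoding a gene symbol such as “TP53” and a purely alphabetic abbreviation such as “COPD” fall into different shape classes, which is the distinction the paragraph above describes.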
  • As a further recognition feature the method according to the present invention uses character n-gram features for 2≤n≤4. This recognition feature helps to recognize informative sub-strings like “ASE” or “HOMEO”, especially for words not seen in training.
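  • Character n-gram extraction for 2≤n≤4 can be sketched as below. Uppercasing the token before extraction is an assumption made here so that a sub-string such as “ASE” is found in a lowercase word; the function name is illustrative.

```python
def char_ngrams(token, n_min=2, n_max=4):
    """Return the set of character n-grams of a token for n_min <= n <= n_max."""
    token = token.upper()  # assumption: n-grams are matched case-insensitively
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }


print(sorted(char_ngrams("TP53")))
print("ASE" in char_ngrams("kinase"))
```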
  • A further group of recognition features are dictionary features. For example, a disease dictionary can be used and is constructed by taking all names and synonyms of concepts covered by the disease branch (C) of the MeSH ontology. Furthermore, as a possible embodiment keyword dictionaries are used for different relation types such as altered expression, genetic variation, regulatory modification and unrelated. For example, a genetic variation dictionary can contain words like “mutation” and “polymorphism”. A dictionary feature is on, if the token matches with at least one keyword in the corresponding dictionary. Note that the presence of a certain keyword in a sentence is indicative, but not imperative for a specific relation. This is handled by the method according to the present invention because of its probabilistic nature.
  • A further group of recognition features are context features. These context features consider the properties of preceding or following tokens for a current token xi in order to determine its category. Context features are important for several reasons. One reason is the case of nested entities, such as “breast cancer 2 protein is expressed . . . ”. In this text phrase one does not want to extract a disease entity. Thus, when determining the correct label y for the token “breast”, it is important that one of the following-word features will be “protein”, indicating that “breast” refers to a gene/protein entity and not to a disease. In a possible embodiment a window size is set to three. Context features are not only important in case of nested entities but also for relation extraction.
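  • Context features can be sketched as below. Reading “window size three” as three tokens on each side of the current token is an assumption, and the feature names `prev_k` and `next_k` are illustrative.

```python
def context_features(tokens, i, window=3):
    """Collect lowercased neighbors of tokens[i] within `window` positions."""
    feats = {}
    for offset in range(1, window + 1):
        if i - offset >= 0:
            feats[f"prev_{offset}"] = tokens[i - offset].lower()
        if i + offset < len(tokens):
            feats[f"next_{offset}"] = tokens[i + offset].lower()
    return feats


tokens = "breast cancer 2 protein is expressed".split()
# For the token "breast" the word "protein" shows up as a neighbor feature.
print(context_features(tokens, 0))
```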
  • In the method and system according to the present invention besides the recognition features further relation recognition features are provided. These additional relation recognition features comprise for example a dictionary window feature, a key entity neighborhood feature, a start window feature and a negation feature.
  • As a dictionary window feature, for each of the relation type dictionaries mentioned above, i.e. the altered expression dictionary, the genetic variation dictionary, the regulatory modification dictionary and the unrelated dictionary, a feature is defined to be on if at least one keyword from the corresponding dictionary matches a word in a window of size N, i.e. between $-\frac{N}{2}$ and $+\frac{N}{2}$ tokens away from the current token. In an embodiment N=20.
  • Furthermore, as a key entity neighborhood feature, for each of the relation type dictionaries a feature is defined to be on if at least one keyword matches a word in a window of size M, i.e. between $-\frac{M}{2}$ and $+\frac{M}{2}$ tokens away from the key entity token. In a possible embodiment M=6.
  • As a start window feature for each of the relation type dictionaries it is defined that the feature is on if at least one keyword matches a word in the first L tokens of a sentence. In a possible embodiment L=3. With this feature the fact is addressed that for many sentences important properties of a gene-disease-relation are mentioned at the beginning of a sentence.
  • A negation feature is defined such that this feature is on, if none of the three above-mentioned relation recognition features matches a dictionary keyword.
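  • The four relation recognition features can be sketched for a single relation type dictionary as follows. The dictionary contents are partly assumed (the source lists “mutation” and “polymorphism”; the plural “mutations” is added here so the example sentence matches), and evaluating the negation feature per dictionary is a simplification for illustration.

```python
# "mutation" and "polymorphism" come from the text; "mutations" is an added assumption.
GENETIC_VARIATION = {"mutation", "mutations", "polymorphism"}


def _keyword_in(tokens, lo, hi, keywords):
    """True if any token in tokens[lo:hi] (clipped at 0) is a dictionary keyword."""
    return any(t.lower() in keywords for t in tokens[max(lo, 0):hi])


def relation_features(tokens, i, key_idx, keywords, N=20, M=6, L=3):
    """Relation features for token i, given the position of the key entity."""
    window = _keyword_in(tokens, i - N // 2, i + N // 2 + 1, keywords)
    neighborhood = _keyword_in(tokens, key_idx - M // 2, key_idx + M // 2 + 1, keywords)
    start = _keyword_in(tokens, 0, L, keywords)
    return {
        "dict_window": window,             # keyword within N/2 tokens of token i
        "key_neighborhood": neighborhood,  # keyword within M/2 tokens of the key entity
        "start_window": start,             # keyword among the first L tokens
        "negation": not (window or neighborhood or start),
    }


sentence = ("Inactivating TP53 mutations were found in 55% of "
            "lethal metastatic pancreatic neoplasms").split()
print(relation_features(sentence, i=8, key_idx=1, keywords=GENETIC_VARIATION))
```

  • In a full feature set this function would be evaluated once per relation type dictionary, giving one group of binary features per relation type.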
  • In an embodiment relation type features are based solely on dictionary information. In alternative embodiments, further information is integrated as relation type features such as word shape or n-gram features.
  • In step S3 of the flow-chart of FIG. 7 a set of different features is provided for the annotated training corpus. For each feature of the feature set a corresponding weight λ is calculated by means of a maximum likelihood algorithm on the basis of a feature-label distribution as shown in the flow-chart of FIG. 8. Accordingly, for each feature f a corresponding weighting factor λ is calculated as shown in the table of FIG. 12. A conditional random field CRF is defined as an undirected graphical model represented by a graph with vertices representing random variables and edges representing conditional independence assumptions. The most common graph is a graph which obeys a first order Markov property for each random variable yi. This means that each pair of neighboring label variables yi and yi+1 is associated in the graph G. Then y is said to be a linear chain CRF.
  • A conditional probability p of a label or state sequence for a given input sequence is defined as:
  • $$p(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$
  • wherein $Z_x$ is a normalization factor, $f_k(y_{i-1}, y_i, x, i)$ is an arbitrary feature function and $\lambda_k$ is a calculated weight for a feature function ranging between −∞ and +∞.
  • Each feature function fk specifies an association between a token x at a certain position and a label y for that position. Therefore, with each feature f one can express some characteristics of an empirical distribution of training data that should also be true for a model distribution.
  • The corresponding feature weight λk specifies whether the association should be favored or disfavored. Higher values of λ indicate that their corresponding label transitions are more likely. In general, a weight λ for each feature f is high if the feature f tends to be on for the correct labeling. The weight λ is negative if the feature tends to be off for the correct labeling and should be around zero if it is uninformative. The weights λ are learned in a possible embodiment from labeled training data of the training corpus by a maximum likelihood estimation (MLE) algorithm.
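  • The normalization factor $Z_x$ is the sum over all possible state or label sequences $s \in S^N$, where N is the length of the input sequence and $y_{i-1}$, $y_i$ inside the sum denote the labels of the respective candidate sequence s:
  • $$Z_x = \sum_{s \in S^N} \exp\left( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$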
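  • The two formulas can be checked numerically on a toy example by enumerating all label sequences. This brute-force sketch is for illustration only and is feasible only for very short inputs; the label names, the `START` placeholder used for the label preceding the first token, the two feature functions and the weights are all assumptions and are not learned values. Positions are 0-based in the code, whereas the formulas count from 1.

```python
import itertools
import math


def score(labels, x, feature_fns, weights, start="START"):
    """Sum of lambda_k * f_k(y_{i-1}, y_i, x, i) over all positions and features."""
    total, prev = 0.0, start
    for i, y in enumerate(labels):
        total += sum(w * f(prev, y, x, i) for f, w in zip(feature_fns, weights))
        prev = y
    return total


def crf_probability(labels, x, label_set, feature_fns, weights):
    """p(y | x) with the normalization Z_x computed by full enumeration."""
    z_x = sum(
        math.exp(score(s, x, feature_fns, weights))
        for s in itertools.product(label_set, repeat=len(x))
    )
    return math.exp(score(labels, x, feature_fns, weights)) / z_x


label_set = ("KE", "GVD", "O")
x = ["TP53", "mutations", "neoplasms"]
feature_fns = [
    lambda prev, y, x, i: 1.0 if y == "KE" and x[i] == "TP53" else 0.0,
    lambda prev, y, x, i: 1.0 if prev == "O" and y == "GVD" else 0.0,
]
weights = [2.0, 0.5]  # hand-picked for the example, not estimated from data

print(crf_probability(("KE", "O", "GVD"), x, label_set, feature_fns, weights))
```

  • Summing `crf_probability` over all label sequences of the same length gives 1, which is the purpose of the normalization factor.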
  • After the training phase the user can input a query via the keyboard 5 to perform a semantic relation extraction in the extraction phase as shown in FIG. 7. In a step S5 the user inputs the query Q. The query Q can consist of a sentence, i.e. a sequence of words. The query Q comprises a key entity. As can be seen from the example in FIG. 11, the annotated training corpus employed by the method and system according to the present invention has tokens labeled by the expert as key entities; for instance, the token “TP53” is labeled as a key entity. The user inputs for example a query Q such as “inactivating TP53 mutations were found in 55% of lethal metastatic pancreatic neoplasms” in step S5.
  • In a further step S6 the query Q is tokenized, i.e. a token sequence x1, x2, . . . xm is generated as illustrated by FIG. 9. FIG. 13 shows a table with the generated token sequence consisting of twelve tokens x1 to x12 for the given query example.
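  • The source does not specify the tokenizer. As a minimal sketch, assuming simple whitespace tokenization, the example query yields the twelve tokens x1 to x12:

```python
def tokenize(query):
    """Split a query into tokens on whitespace (assumed tokenization rule)."""
    return query.split()


query = ("Inactivating TP53 mutations were found in 55% of "
         "lethal metastatic pancreatic neoplasms")
tokens = tokenize(query)
for position, token in enumerate(tokens, start=1):
    print(f"x{position}: {token}")
```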
  • As can be seen from the table in FIG. 11 in the annotated training corpus as used by the method according to the present invention, some tokens x such as “lung” and “cancer” are labeled with a relation such as “genetic variation disease GVD”. By comparing the annotated training corpus as used by the method according to the present invention as shown in FIG. 11 with the annotated training corpus used by the conventional method for semantic relation as shown in FIG. 2 it becomes evident that some tokens x such as “lung” or “cancer” in the annotated training corpus according to the present invention are not only labeled as a disease d but a relation of this token x to the key entity KE is also encoded or labeled. In the given example the encoded relation of the tokens “lung” and “cancer” to the key entity, i.e. TP53, is “genetic variation disease” (GVD).
  • In a step S7 the token sequence of the input query Q is labeled by means of a Viterbi algorithm to find a most likely label sequence as shown in FIG. 10.
  • FIG. 13 shows a most likely label sequence generated by means of a Viterbi algorithm for the token sequence of the given example. By comparing FIG. 3 with FIG. 13 it becomes evident that with the method according to the present invention the semantic relations of the key entity KE (in this case TP53) to other entities are extracted directly in step S7, i.e. in one single step. On the display 6 the user can see directly the relation between the key entity TP53 and secondary entities. In the given example the user is informed that there is a genetic variation as a relation between the key entity TP53 and the secondary entity “lethal metastatic pancreatic neoplasms”.
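  • A compact sketch of Viterbi decoding for a linear chain model with feature functions of the form f(y_prev, y, x, i) is given below. The label names "KE", "GVD" and "O", the `START` placeholder, the toy lookup table `ENTITY`, the feature functions and the weights are assumptions for illustration; in the method described here the weights would come from the training phase.

```python
def viterbi(x, label_set, feature_fns, weights, start="START"):
    """Return the highest-scoring label sequence for a non-empty token list x."""
    def local(prev, y, i):
        return sum(w * f(prev, y, x, i) for f, w in zip(feature_fns, weights))

    # best[y] = (score of the best path ending in label y, that path)
    best = {y: (local(start, y, 0), [y]) for y in label_set}
    for i in range(1, len(x)):
        step = {}
        for y in label_set:
            p = max(label_set, key=lambda q: best[q][0] + local(q, y, i))
            step[y] = (best[p][0] + local(p, y, i), best[p][1] + [y])
        best = step
    return max(best.values(), key=lambda entry: entry[0])[1]


# Toy model: a lookup table stands in for learned token-label associations.
ENTITY = {"TP53": "KE", "neoplasms": "GVD"}
feature_fns = [
    lambda prev, y, x, i: 1.0 if ENTITY.get(x[i]) == y else 0.0,
    lambda prev, y, x, i: 1.0 if y == "O" and x[i] not in ENTITY else 0.0,
]
weights = [2.0, 1.0]
x = ["TP53", "mutations", "in", "neoplasms"]
print(viterbi(x, ("KE", "GVD", "O"), feature_fns, weights))
```

  • The decoded labels already carry the relation to the key entity, so no separate relation extraction step follows the decoding.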
  • In the present invention the investigated text phrase refers to a key entity KE such as “TP53” so that all other entities in the text phrase state a kind of relation to the key entity KE.
  • For example, a biographical text usually gives information about an entity such as “Tony Blair” and all other entities in the text are involved in a certain relation with the entity (for example his family). Thus, with the present invention it is possible to predict a kind of relation holding between the key entity KE and all other secondary entities. With the method and system according to the present invention relation extraction is treated as a sequence labeling task. Accordingly, with the present invention a named entity recognition NER and a relation extraction step are merged together.
  • Accordingly, with the method and system according to the present invention the entities' label y encodes a relation to the key entity KE and there is no initial labeling of the named entities.
  • Gene RIF-sentences represent a similar style of text in the biomedical domain as biographical text. Gene RIF-sentences describe the function of a gene/protein, the key entity KE, as a concise phrase. As a consequence, gene RIF-sentences are an adequate source for transferring relation extraction to a sequence labeling problem.
  • For example, the following gene RIF sentence is linked to a gene COX-2:
  • “COX-2 expression is significantly more common in endometrical adenocarcinoma and ovarian serous cystadenocarcinoma, but not in cervical squamous carcinoma, compared with normal tissue.”
  • This sentence states three disease relations with COX-2 (the key entity), namely two altered expression relations (expression of COX-2 relates to endometrical adenocarcinoma and ovarian serous cystadenocarcinoma) and one unrelated relation (cervical squamous carcinoma).
  • Relation extraction RE is treated by the method according to the present invention as a tagging task such as NER or part of speech (POS) tagging. Accordingly, for each secondary entity the method of the present invention predicts the type of relation it has to the key entity KE. Each word in a sentence is regarded as a token x. Each token x is associated with a tag or label y which indicates the type of the token x. In the given example sentence about COX-2, the label “unrelated” is assigned to the tokens “cervical”, “squamous”, “carcinoma”, as they are evidently not related to the key entity gene, whereas the tokens “endometrical”, “adenocarcinoma”, “ovarian”, “serous”, “cystadenocarcinoma” are each labeled as a disease related to the altered expression behaviour of the gene, thus “altered expression”. These are the words representing diseases in the sentence. The other tokens x are labeled as not forming part of an entity. Two random variables X and Y are used to denote any input token sequences with associated label sequences. In the method according to the present invention a correct label sequence y1, y2, . . . , yn is assigned to the given token sequence x1, x2, . . . , xn.
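  • Once every token carries a relation label, the relations can be read off by grouping neighboring tokens with the same label. The sketch below assumes an outside label "O" for tokens that are not part of an entity; the labels "KE" and "GVD" follow the example of FIG. 11. Two directly adjacent entities with the same label would be merged by this simple grouping.

```python
def extract_relations(tokens, labels, outside="O"):
    """Group consecutive tokens sharing a non-outside label into (phrase, label) pairs."""
    relations, current_tokens, current_label = [], [], None
    for token, label in zip(tokens, labels):
        if label == current_label and label != outside:
            current_tokens.append(token)  # the current entity continues
            continue
        if current_tokens:
            relations.append((" ".join(current_tokens), current_label))
        if label != outside:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        relations.append((" ".join(current_tokens), current_label))
    return relations


tokens = ("Inactivating TP53 mutations were found in 55% of "
          "lethal metastatic pancreatic neoplasms").split()
labels = ["O", "KE", "O", "O", "O", "O", "O", "O", "GVD", "GVD", "GVD", "GVD"]
print(extract_relations(tokens, labels))
```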
  • The method of relation extraction according to the present invention is based on a one-step probabilistic extraction model, such as a linear chain conditional random field CRF. For example, the method according to the present invention extracts relations between genes and diseases from Gene RIF (Gene Reference Into Function) sentences. Gene RIFs are sentences which refer to a particular gene in the Entrez gene data base and describe its function in a concise phrase. The semantic relations extracted by the method and system according to the present invention can comprise different relations such as “altered expression”, “other genetic variation”, “regulatory modification”, “a general relation” or “an existing relation” between two entities. For example, gene-disease-relations can be categorized based on whether a gene is causing a disease state, is a predisposition factor or is just associated with the disease. In an embodiment of the method according to the present invention, the gene-disease-relation categories are based on the observed state of a gene or protein, e.g. transcription level or mutation, associated with the disease state. In addition, a class for sentences reporting evidence of no association between a gene state and a disease and a neutral class for sentences giving no specific observed state are provided.
  • The “altered expression” level of a gene/protein is reported to be associated with a certain disease or state of a disease. For example “low expression of BRCA-1 was associated with colorectal cancer”.
  • As a further semantic relation, the “genetic variation” relates to a mutational event which is reported to be related with a disease. For example, “Inactivating TP53 mutations were found in 55% of lethal metastatic pancreatic neoplasms”.
  • A further semantic relation “regulatory modification” states a modification of the gene/protein through methylation or phosphorylation. For example “e-cadherin and P16INK4A are commonly methylated in non-small cell lung cancer”.
  • The semantic relation “any” is given when relation between a gene and a disease is reported without any further information regarding the gene's state. For example: “e-cadherin has a role in preventing peritoneal dissemination in gastric cancer”.
  • As a further semantic relation, the relation “unrelated” indicates that a sentence provides evidence for independence between a gene and a certain disease. For example “variations in TP53NBAX alleles are unrelated to the development of pemphigus foliaceus”. The method and system according to the present invention has in comparison to conventional methods a high recall, precision and f-score value.
  • On a manually annotated data set of gene RIFs, the recall, precision and f-score of the method and system according to the present invention are evaluated. The recall and precision depend on the true positives TP, false negatives FN and false positives FP as follows:
  • $$\text{Recall} = \frac{TP}{TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$
  • A true positive TP is a label sequence for a certain entity which exactly matches the label sequence for this entity from the manually annotated standard. For example, in the sentence “BRCA2 is mutated in stage II breast cancer” a human annotator labels “stage II breast cancer” as a disease related via genetic variation. Under the assumption that the method and system according to the present invention only recognizes “breast cancer” as a disease entity and categorizes the relation to the gene “BRCA2” as a “genetic variation”, the system gets assigned a false negative (FN) for not recognizing the whole sequence as well as one false positive (FP). In general, since this is a hard matching criterion, in many situations a more lenient criterion of correctness can be used.
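  • The exact-match counting and the resulting scores can be sketched as follows. Representing each entity as a (phrase, relation) pair and computing the f-score as the harmonic mean of precision and recall are assumptions; the source gives formulas only for recall and precision.

```python
def exact_match_counts(gold, predicted):
    """Count TP, FP, FN under exact matching of (phrase, relation) pairs."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# The BRCA2 example: a partial match counts as one FN plus one FP.
gold = [("stage II breast cancer", "genetic variation")]
predicted = [("breast cancer", "genetic variation")]
print(exact_match_counts(gold, predicted))
```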
  • Table 1 shows text corpus statistics for an annotated data set of 5,469 gene RIFs.
  • TABLE 1
              Any    Unrelated   Altered expression   Genetic variation   Regulatory modification   All
    Corpus    1396   369         1750                 1695                186                       5369
  • Table 2 shows the results of the relation extraction (RE) as performed using the method and system according to the present invention.
  • TABLE 2
                              Recall   Precision   F-score
    Any                       69.94    79.20       74.28
    Unrelated                 56.01    66.93       60.09
    Altered expression        73.89    74.92       74.40
    Genetic variation         75.99    78.06       77.01
    Regulatory modification   61.13    70.50       65.48
    Overall                   71.54    76.31       73.84
  • Table 2 lists accuracy measures for each of the predefined relation types. For the “any”, “altered expression” and “genetic variation” relations, the method and system according to the present invention exceed an F-measure of 74. Averaged over all relation types, the method and system according to the present invention achieve an overall F-measure of 73.84 for the given data set.
  • Table 3 shows a comparison of different methods of semantic relation extraction. The first two models are based on a conventional two-step approach according to the state of the art, consisting of an NER-step and a successive RE-step. In the first baseline model (dictionary plus rule-based), the NER-step is done via a dictionary-based longest-matching approach, while in the CRF plus rule-based model the NER-step is tackled via a disease NER CRF.
  • TABLE 3
                              Recall   Precision   F-score
    Dictionary + rule-based   43.31    42.98       43.10
    CRF + rule-based          67.62    71.88       69.68
    Relation CRF              71.54    76.31       73.84
  • As can be seen from Table 3, the method and system according to the present invention clearly outperform the two conventional baseline approaches. The difference between the prior-art two-step approach, with a disease CRF tagger plus additional successive rules for RE, and the method according to the present invention is 4.16 F-measure points. This result indicates that the unified CRF used by the method according to the present invention is able to learn additional patterns from the empirical distribution which are important for inferring the type of relation holding between gene and disease pairs.
  • In a possible embodiment, the method and system according to the present invention allow the identification of semantic gene-disease relations based on a probabilistic extraction model. As can be seen from Table 3, the overall performance of the method and system according to the present invention is better than that of conventional methods employing a two-step approach.
  • Although the method and system according to the present invention are discussed mostly with respect to biomedical data, it is emphasized that the method and system according to the present invention can be used for semantic relation extraction from any kind of unstructured text.
  • Further, the method and system according to the present invention can be used for semantic relation extraction from any unstructured text written in any language and any alphabet. The method and system according to the present invention allow entities and their relations to be detected at the same time. The method and system according to the present invention have a higher performance, i.e. sensitivity and F-score, than conventional methods. The method and system according to the present invention allow not only the detection of a relation but also the characterization of its nature, as far as it is mentioned in the unstructured text.
  • In a possible embodiment, the method according to the present invention is performed by a computer program on a computer. In a possible embodiment, this computer program comprises instructions to perform the method and is stored on a data carrier.

Claims (20)

1. A method for semantic relation extraction comprising: extracting, directly on the basis of an annotated training corpus having tokens with associated relational labels each indicating a relation between the respective token and a selectable key entity, semantic relations between said key entity and other entities from unstructured text using a probabilistic extraction model.
2. The method according to claim 1,
wherein the probabilistic extraction model is a conditional random field.
3. The method according to claim 1,
wherein weighting factors (λ) for each feature are calculated on the basis of a feature label distribution of said annotated training corpus by means of a maximum likelihood algorithm.
4. The method according to claim 1, wherein a query comprising said key entity is input by a user.
5. The method according to claim 4, wherein the input query is tokenized to generate a token sequence.
6. The method according to claim 5, wherein a most likely label sequence is calculated for the generated token sequence by means of a Viterbi algorithm using said calculated weighting factors.
7. The method according to claim 6, wherein a conditional probability (P) of the label sequence is calculated as follows:
$$p(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$
wherein $Z_x$ is a normalization factor,
$f_k(y_{i-1}, y_i, x, i)$ is an arbitrary feature function, and $\lambda_k$ is a calculated weight factor for a feature function ranging between $-\infty$ and $+\infty$.
8. The method according to claim 7, wherein the normalization factor Zx is calculated as follows:
$$Z_x = \sum_{s \in S^N} \exp\left( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$$
wherein N is the length of the input sequence.
9. The method according to claim 1, wherein the semantic relations are formed by biomedical relations.
10. The method according to claim 9, wherein the biomedical relations are
an altered expression,
a genetic variation,
a regulatory modification,
a general relation, and
a non-existing relation between two entities.
11. The method according to claim 1, wherein a set of recognition features is provided.
12. The method according to claim 11, wherein the set of recognition features comprises:
orthographic features,
word shape features,
n-gram features,
dictionary features, and
context features.
13. The method according to claim 1, wherein a set of relation recognition features is provided.
14. The method according to claim 13, wherein the set of relation recognition features comprises:
a dictionary window feature,
a key entity neighbourhood feature,
a start window feature, and
a negation feature.
15. The method according to claim 1, wherein the entities are formed by biomedical entities.
16. The method according to claim 15, wherein the entities comprise genes, diseases, drugs, compounds and proteins.
17. A computer program for performing the method for semantic relation extraction according to claim 1.
18. A data carrier for storing instructions of a computer program which performs the method for semantic relation extraction according to claim 1.
19. A semantic relation extraction system comprising:
(a) means for storing unstructured text;
(b) means for storing an annotated training corpus having tokens with associated relational labels each indicating a relation between the respective token and a selectable key entity; and
(c) means for extracting semantic relations between the key entity and other entities from said unstructured text on the basis of said training corpus using a probabilistic extraction model.
20. The semantic relation extraction system according to claim 19, wherein said probabilistic extraction model is a conditional random field.
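The conditional probability of claims 7 and 8 and the Viterbi decoding of claim 6 can be sketched for a toy linear-chain model as follows. The label set, the three feature functions, their weights, the start label and the small disease dictionary are all hypothetical and chosen only for illustration; they are not the features or weights of the claimed method. The normalization factor is computed by brute-force enumeration of all label sequences, which is feasible only for very short, non-empty inputs.

```python
import itertools
import math

LABELS = ["O", "GENETIC_VARIATION"]   # hypothetical label set S
START = "<s>"                          # hypothetical label used as y_0
DISEASE_WORDS = {"breast", "cancer"}   # hypothetical dictionary


def f_disease_word(y_prev, y, x, i):
    # Fires when a dictionary word is labeled with the relation label.
    return 1.0 if y == "GENETIC_VARIATION" and x[i].lower() in DISEASE_WORDS else 0.0


def f_other_word(y_prev, y, x, i):
    # Fires when a non-dictionary word is labeled as outside ("O").
    return 1.0 if y == "O" and x[i].lower() not in DISEASE_WORDS else 0.0


def f_continue(y_prev, y, x, i):
    # Fires when the relation label continues from the previous token.
    return 1.0 if y_prev == y == "GENETIC_VARIATION" else 0.0


# (lambda_k, f_k) pairs; the weights are illustrative, not learned.
FEATURES = [(2.0, f_disease_word), (1.5, f_other_word), (0.5, f_continue)]


def local_score(y_prev, y, x, i):
    """Inner sum over k: sum_k lambda_k * f_k(y_{i-1}, y_i, x, i)."""
    return sum(w * f(y_prev, y, x, i) for w, f in FEATURES)


def sequence_score(y, x):
    """Outer sum over positions i for a complete label sequence y."""
    total, prev = 0.0, START
    for i, label in enumerate(y):
        total += local_score(prev, label, x, i)
        prev = label
    return total


def partition(x):
    """Z_x: sum of exp(score) over every label sequence of length N."""
    return sum(math.exp(sequence_score(s, x))
               for s in itertools.product(LABELS, repeat=len(x)))


def conditional_probability(y, x):
    """p(y | x) = exp(score(y, x)) / Z_x."""
    return math.exp(sequence_score(y, x)) / partition(x)


def viterbi(x):
    """Most likely label sequence by dynamic programming over positions."""
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (local_score(START, lab, x, 0), [lab]) for lab in LABELS}
    for i in range(1, len(x)):
        best = {
            lab: max(((s + local_score(prev, lab, x, i), path + [lab])
                      for prev, (s, path) in best.items()),
                     key=lambda t: t[0])
            for lab in LABELS
        }
    return max(best.values(), key=lambda t: t[0])[1]


x = ["BRCA2", "is", "mutated", "in", "breast", "cancer"]
best_path = viterbi(x)
print(best_path, conditional_probability(best_path, x))
```

Because the score decomposes over adjacent label pairs, the Viterbi recursion finds the highest-scoring sequence without enumerating all sequences, whereas the brute-force normalization shown here grows exponentially with the input length and serves only to make the formula concrete.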
US11/979,534 2007-07-13 2007-11-05 Method and a system for semantic relation extraction Abandoned US20090019032A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07013828 2007-07-13
EPEP07013828 2007-07-13

Publications (1)

Publication Number Publication Date
US20090019032A1 (en) 2009-01-15

Family

ID=40253985

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/979,534 Abandoned US20090019032A1 (en) 2007-07-13 2007-11-05 Method and a system for semantic relation extraction

Country Status (1)

Country Link
US (1) US20090019032A1 (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963894A (en) * 1994-06-24 1999-10-05 Microsoft Corporation Method and system for bootstrapping statistical processing into a rule-based natural language parser
US6070134A (en) * 1997-07-31 2000-05-30 Microsoft Corporation Identifying salient semantic relation paths between two words
US7194406B2 (en) * 2000-06-22 2007-03-20 Hapax Limited Method and system for information extraction
US20030217335A1 (en) * 2002-05-17 2003-11-20 Verity, Inc. System and method for automatically discovering a hierarchy of concepts from a corpus of documents
US20040093331A1 (en) * 2002-09-20 2004-05-13 Board Of Regents, University Of Texas System Computer program products, systems and methods for information discovery and relational analyses
US20060294037A1 (en) * 2003-08-06 2006-12-28 Microsoft Corporation Cost-benefit approach to automatically composing answers to questions by extracting information from large unstructured corpora
US20070067280A1 (en) * 2003-12-31 2007-03-22 Agency For Science, Technology And Research System for recognising and classifying named entities
US20060053171A1 (en) * 2004-09-03 2006-03-09 Biowisdom Limited System and method for curating one or more multi-relational ontologies
US20060053099A1 (en) * 2004-09-03 2006-03-09 Biowisdom Limited System and method for capturing knowledge for integration into one or more multi-relational ontologies
US20060053151A1 (en) * 2004-09-03 2006-03-09 Bio Wisdom Limited Multi-relational ontology structure
US20060053098A1 (en) * 2004-09-03 2006-03-09 Bio Wisdom Limited System and method for creating customized ontologies
US7505989B2 (en) * 2004-09-03 2009-03-17 Biowisdom Limited System and method for creating customized ontologies
US8280719B2 (en) * 2005-05-05 2012-10-02 Ramp, Inc. Methods and systems relating to information extraction
US20070016863A1 (en) * 2005-07-08 2007-01-18 Yan Qu Method and apparatus for extracting and structuring domain terms
US20070219776A1 (en) * 2006-03-14 2007-09-20 Microsoft Corporation Language usage classifier
US20090192954A1 (en) * 2006-03-15 2009-07-30 Araicom Research Llc Semantic Relationship Extraction, Text Categorization and Hypothesis Generation
US20080010274A1 (en) * 2006-06-21 2008-01-10 Information Extraction Systems, Inc. Semantic exploration and discovery
US7558778B2 (en) * 2006-06-21 2009-07-07 Information Extraction Systems, Inc. Semantic exploration and discovery
US7899666B2 (en) * 2007-05-04 2011-03-01 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text
US20110270604A1 (en) * 2010-04-28 2011-11-03 Nec Laboratories America, Inc. Systems and methods for semi-supervised relationship extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Extracting Relations from Unstructured Text, Ryan McDonald, April 15, 2005 *

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7970808B2 (en) * 2008-05-05 2011-06-28 Microsoft Corporation Leveraging cross-document context to label entity
US20090282012A1 (en) * 2008-05-05 2009-11-12 Microsoft Corporation Leveraging cross-document context to label entity
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
US8918348B2 (en) 2010-04-09 2014-12-23 Microsoft Corporation Web-scale entity relationship extraction
US20110251984A1 (en) * 2010-04-09 2011-10-13 Microsoft Corporation Web-scale entity relationship extraction
US8504490B2 (en) * 2010-04-09 2013-08-06 Microsoft Corporation Web-scale entity relationship extraction that extracts pattern(s) based on an extracted tuple
US9317569B2 (en) 2010-04-09 2016-04-19 Microsoft Technology Licensing, Llc Displaying search results with edges/entity relationships in regions/quadrants on a display device
US8290968B2 (en) 2010-06-28 2012-10-16 International Business Machines Corporation Hint services for feature/entity extraction and classification
US8849732B2 (en) 2010-09-28 2014-09-30 Siemens Aktiengesellschaft Adaptive remote maintenance of rolling stocks
US8566273B2 (en) * 2010-12-15 2013-10-22 Siemens Aktiengesellschaft Method, system, and computer program for information retrieval in semantic networks
US20120158639A1 (en) * 2010-12-15 2012-06-21 Joshua Lamar Moore Method, system, and computer program for information retrieval in semantic networks
US20120226715A1 (en) * 2011-03-04 2012-09-06 Microsoft Corporation Extensible surface for consuming information extraction services
US9064004B2 (en) * 2011-03-04 2015-06-23 Microsoft Technology Licensing, Llc Extensible surface for consuming information extraction services
US20120253793A1 (en) * 2011-04-01 2012-10-04 Rima Ghannam System for natural language understanding
US9710458B2 (en) * 2011-04-01 2017-07-18 Rima Ghannam System for natural language understanding
US9110883B2 (en) * 2011-04-01 2015-08-18 Rima Ghannam System for natural language understanding
US20160041967A1 (en) * 2011-04-01 2016-02-11 Rima Ghannam System for Natural Language Understanding
US20130086059A1 (en) * 2011-10-03 2013-04-04 Nuance Communications, Inc. Method for Discovering Key Entities and Concepts in Data
US20130246046A1 (en) * 2012-03-16 2013-09-19 International Business Machines Corporation Relation topic construction and its application in semantic relation extraction
US9037452B2 (en) * 2012-03-16 2015-05-19 Afrl/Rij Relation topic construction and its application in semantic relation extraction
WO2015080561A1 (en) 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
US10643145B2 (en) 2013-11-27 2020-05-05 Micro Focus Llc Relationship extraction
WO2015077942A1 (en) * 2013-11-27 2015-06-04 Hewlett-Packard Development Company, L.P. Relationship extraction
WO2016010245A1 (en) * 2014-07-14 2016-01-21 Samsung Electronics Co., Ltd. Method and system for robust tagging of named entities in the presence of source or translation errors
US10073673B2 (en) 2014-07-14 2018-09-11 Samsung Electronics Co., Ltd. Method and system for robust tagging of named entities in the presence of source or translation errors
US20160085971A1 (en) * 2014-09-22 2016-03-24 Infosys Limited System and method for tokenization of data for privacy
US9953171B2 (en) * 2014-09-22 2018-04-24 Infosys Limited System and method for tokenization of data for privacy
US20160148116A1 (en) * 2014-11-21 2016-05-26 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
US20160148096A1 (en) * 2014-11-21 2016-05-26 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
US9785887B2 (en) * 2014-11-21 2017-10-10 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
US9792549B2 (en) * 2014-11-21 2017-10-17 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
US9824083B2 (en) * 2015-07-07 2017-11-21 Rima Ghannam System for natural language understanding
US20170011023A1 (en) * 2015-07-07 2017-01-12 Rima Ghannam System for Natural Language Understanding
WO2018005203A1 (en) * 2016-06-28 2018-01-04 Microsoft Technology Licensing, Llc Leveraging information available in a corpus for data parsing and predicting
US10200397B2 (en) 2016-06-28 2019-02-05 Microsoft Technology Licensing, Llc Robust matching for identity screening
CN109416705A (en) * 2016-06-28 2019-03-01 微软技术许可有限责任公司 It parses and predicts for data using information available in corpus
US10311092B2 (en) 2016-06-28 2019-06-04 Microsoft Technology Licensing, Llc Leveraging corporal data for data parsing and predicting
CN107992597A (en) * 2017-12-13 2018-05-04 国网山东省电力公司电力科学研究院 A kind of text structure method towards electric network fault case
US10394955B2 (en) 2017-12-21 2019-08-27 International Business Machines Corporation Relation extraction from a corpus using an information retrieval based procedure
CN110931084A (en) * 2018-08-31 2020-03-27 国际商业机器公司 Extraction and normalization of mutant genes from unstructured text for cognitive search and analysis
US11170031B2 (en) * 2018-08-31 2021-11-09 International Business Machines Corporation Extraction and normalization of mutant genes from unstructured text for cognitive search and analytics
US20200175020A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Automated document filtration and prioritization for document searching and access
US20200175021A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Automated document filtration and priority scoring for document searching and access
US11074262B2 (en) 2018-11-30 2021-07-27 International Business Machines Corporation Automated document filtration and prioritization for document searching and access
US11061913B2 (en) 2018-11-30 2021-07-13 International Business Machines Corporation Automated document filtration and priority scoring for document searching and access
US10949607B2 (en) 2018-12-10 2021-03-16 International Business Machines Corporation Automated document filtration with normalized annotation for document searching and access
CN109791570A (en) * 2018-12-13 2019-05-21 香港应用科技研究院有限公司 Efficiently and accurately name entity recognition method and device
WO2020118741A1 (en) * 2018-12-13 2020-06-18 Hong Kong Applied Science and Technology Research Institute Company Limited Efficient and accurate named entity recognition method and apparatus
US11068490B2 (en) 2019-01-04 2021-07-20 International Business Machines Corporation Automated document filtration with machine learning of annotations for document searching and access
US20200218719A1 (en) * 2019-01-04 2020-07-09 International Business Machines Corporation Automated document filtration with machine learning of annotations for document searching and access
US11721441B2 (en) 2019-01-15 2023-08-08 Merative Us L.P. Determining drug effectiveness ranking for a patient using machine learning
US10977292B2 (en) 2019-01-15 2021-04-13 International Business Machines Corporation Processing documents in content repositories to generate personalized treatment guidelines
CN110134772A (en) * 2019-04-18 2019-08-16 五邑大学 Medical text Relation extraction method based on pre-training model and fine tuning technology
CN110348015A (en) * 2019-07-12 2019-10-18 北京百奥知信息科技有限公司 A kind of method of entity in automatic marking medicine text
CN110781683A (en) * 2019-11-04 2020-02-11 河海大学 Entity relation joint extraction method
WO2022072346A1 (en) * 2020-09-29 2022-04-07 Xcures, Inc. Automated individualized recommendations for medical treatment
CN113032523A (en) * 2021-03-22 2021-06-25 平安科技(深圳)有限公司 Extraction method and device of triple information, electronic equipment and storage medium
WO2022198747A1 (en) * 2021-03-22 2022-09-29 平安科技(深圳)有限公司 Triplet information extraction method and apparatus, electronic device and storage medium
CN114490928A (en) * 2021-12-31 2022-05-13 广州探迹科技有限公司 Implementation method, system, computer equipment and storage medium of semantic search

Legal Events

Date Code Title Description
AS Assignment

Owner name: SIEMENS AKTIENGESELLSCHAFT, GERMANY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BUNDSCHUS, MARKUS;DEJORI, MATHAEUS;STETTER, MARTIN;AND OTHERS;REEL/FRAME:020148/0856

Effective date: 20071022

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION