US20090070327A1 - Method for automatically generating regular expressions for relaxed matching of text patterns - Google Patents

Method for automatically generating regular expressions for relaxed matching of text patterns

Info

Publication number
US20090070327A1
Authority
US
United States
Prior art keywords
regular expression
automatically
token list
operator
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/850,987
Inventor
Alexander Stephan Loeser
Sriram Raghavan
Shivakumar Vaithyanathan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US11/850,987
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: VAITHYANATHAN, SHIVAKUMAR, LOESER, ALEXANDER STEPHAN, RAGHAVEN, SRIRAM
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION CORRECTIVE ASSIGNMENT TO CORRECT THE THE LAST NAME OF THE SECOND ASSIGNOR PREVIOUSLY RECORDED ON REEL 019791 FRAME 0695. ASSIGNOR(S) HEREBY CONFIRMS THE CHANGE INCORRECT SPELLING OF LAST NAME RAGSHAVEN TO CORRECTLY READ -- RAGSHAVAN--. Assignors: VAITHYANATHAN, SHIVAKUMAR, LOESER, ALEXANDER STEPHAN, RAGSHAVAN, SRIRAM
Priority to US12/125,290 (US8484238B2)
Publication of US20090070327A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion

Definitions

  • the present invention relates to a method and system for automatically generating regular expressions for relaxed matching of text patterns.
  • One category of information extraction employs query expansion and other query processing techniques in search engines.
  • Conventional query expansion techniques generate an expanded output query from an original query, where the expanded output query includes additional words obtained from a synonym dictionary.
  • the results of the expanded output query are documents that contain either the keywords of the original query or the additional words from the synonym dictionary.
  • a natural language dictionary e.g., standard English dictionary
  • the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc.
  • known query processing techniques that employ stemming and stop word removal decrease precision in information retrieval results.
  • Another category of information extraction is rule-based and utilizes regular expressions.
  • the present invention provides a computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
  • the present invention provides a technique for automatically generating regular expressions for a relaxed matching of text patterns. Further, the present invention provides a generic, extensible, and widely applicable rule-based framework in which the automatic generation of regular expressions is based on the creation and updating of rules without requiring the writing and maintenance of complex and customized software programs.
  • FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B , in accordance with embodiments of the present invention.
  • FIG. 4A depicts an algorithm to apply a REPLACE_WORD rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
  • FIG. 4B depicts an exemplary tokenized phrase generated via the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 4C depicts an exemplary set of tokens resulting from executing the algorithm of FIG. 4A to apply the REPLACE_WORD rule included in the rule set of FIG. 3 to an escaped version of the phrase of FIG. 4B , in accordance with embodiments of the present invention.
  • FIG. 4D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 4C via the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 5A depicts an algorithm to apply a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
  • FIG. 5B depicts an exemplary set of tokens resulting from applying the rule of FIG. 5A via the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 5C depicts an exemplary set of tokens resulting from applying the rule of FIG. 4A to the set of tokens of FIG. 5B via the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 5D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 5C via the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 6 is a table of entities and relationships used in experiments for determining recall and precision of regular expressions generated by the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIGS. 7A-7D are tables of results of four sets of experiments organized according to the table of FIG. 6 , where the experiments are for determining recall and precision of regular expressions generated by the process of FIG. 2 , in accordance with embodiments of the present invention.
  • FIG. 8 is a block diagram of a computing unit that includes a relaxed regular expression generator of the system of FIG. 1A or FIG. 1B , in accordance with embodiments of the present invention.
  • a text pattern of interest for this example is the phrase “can be reached at”.
  • a rule-based IE system identifies occurrences of the form “<Person> can be reached at <Phone>” and generates the corresponding pairs of related Persons and Phones.
  • the phrase “can be reached at” may occur with several variations: extra punctuation, multiple spaces or new lines between words, arbitrary capitalization, colloquial abbreviations for words (e.g., “reached” abbreviated as “rchd”).
  • Such variation in text is particularly true for informal communication mediums such as email where the formatting and style of the text is not strictly controlled.
  • a regular expression is used to account for the original input phrase “can be reached at” as well as the multiple variations.
  • the present invention addresses this problem by providing a generic and extensible rule-based framework for automatically generating a regular expression from a given input phrase (i.e., a plain text pattern) provided by a user.
  • the input phrase is provided in a natural, human language (e.g., a user's native English).
  • the regular expression output by the present invention improves the recall (i.e., increases the set of occurrences of the input phrase and its variations that are identified in the text) with little or no decrease in precision (i.e., without increasing the identification of spurious instances in the text).
  • relaxation is the method of the present invention that converts a plain text pattern to an output regular expression that matches the original plain text pattern and that matches other strings that are variations of the original plain text pattern.
  • the overall algorithm whose execution provides relaxation is referred to herein as the relaxed regular expression generator.
  • the relaxation disclosed herein includes syntactic relaxation and semantic relaxation. Syntactic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on primarily syntactic aspects of the original plain text pattern such as punctuation and whitespace between words (i.e., matching to patterns that have different punctuation and/or whitespace while having the same words and the same meaning as the original plain text pattern). Semantic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on a modification of the words of the original plain text pattern while retaining the meaning of the original plain text pattern.
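As a rough illustration of syntactic relaxation, the sketch below hand-writes a relaxed pattern for the phrase "can be reached at" in Python. The pattern, the case-insensitive flag and the sample strings are assumptions chosen for illustration; they are not output of the disclosed generator.

```python
import re

# Hand-written relaxed pattern (an assumption, not generator output):
# words may be separated by any run of non-word characters, and case is ignored.
RELAXED = re.compile(r"\bcan\W+be\W+reached\W+at\b", re.IGNORECASE)

samples = [
    "can be reached at",      # the original phrase
    "can  be\nreached at",    # extra spaces and a new line between words
    "Can be, reached at",     # arbitrary capitalization and extra punctuation
    "can be found at",        # different words, so it should not match
]

# One boolean per sample: does the relaxed pattern occur in it?
matches = [bool(RELAXED.search(s)) for s in samples]
print(matches)
```

A pattern like this covers the syntactic variations only. Semantic variations such as the abbreviation "rchd" would need an additional word-replacement rule.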
  • FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • First system 100 includes a user input phrase 102 , a relaxed regular expression generator 104 , a relaxation rule file 106 and an output regular expression 108 .
  • User input phrase 102 is input into relaxed regular expression generator 104 as a phrase expressed in a natural, human language (e.g., a native English phrase).
  • Relaxed regular expression generator 104 obtains relaxation rules from relaxation rule file 106 and applies the obtained rules to user input phrase 102 to automatically generate regular expression 108 as output.
  • the relaxation rules in file 106 are predefined manually by, for example, an administrator of system 100 .
  • relaxed regular expression generator 104 is also referred to simply as regular expression generator 104 or generator 104 .
  • Relaxation rules included in relaxation rule file 106 are also referred to herein simply as rules. The functionalities of the components of system 100 are described in more detail below relative to FIG. 2 .
  • system 100 includes an information extraction system (not shown) that includes an annotator generator (not shown).
  • the annotator generator is coupled to relaxed regular expression generator 104 .
  • generator 104 receives as input an annotator rule expressed in a natural, human language and outputs an annotator rule as regular expression 108 .
  • the output regular expression is a relaxed regular expression in that it matches the original input annotator rule as well as variations of the annotator rule.
  • the annotator generator then uses output regular expression 108 to generate an annotator that facilitates information extraction.
  • FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • Second system 120 implements another embodiment of the present invention and includes user input phrase 102 , relaxed regular expression generator 104 , a software-based rule learning component 122 , one or more output relaxation rules 124 and an output regular expression 108 .
  • user input phrase 102 is expressed in a natural, human language and is input into generator 104 .
  • rule learning component 122 automatically learns one or more relaxation rules and outputs one or more rules 124 , which are then obtained by generator 104 and applied by generator 104 to user input phrase 102 to generate regular expression 108 .
  • In step 204, regular expression generator 104 (see FIG. 1A and FIG. 1B) receives user input phrase 102 and determines whether user input phrase 102 is already a regular expression or is a plain text pattern.
  • If step 204 determines that phrase 102 is a plain text pattern, processing continues with step 206.
  • In step 206, generator 104 (see FIG. 1A and FIG. 1B) detects word boundaries in phrase 102 and tokenizes the plain text pattern that comprises phrase 102 to generate a set of input tokens.
  • In step 208, generator 104 (see FIG. 1A and FIG. 1B) maps each of the aforementioned input tokens to a specific, internal representation for the system (e.g., system 100 of FIG. 1A) to produce a token list (i.e., a sequence of tokens).
  • In step 210, generator 104 (see FIG. 1A and FIG. 1B) replaces regular expression special characters in each entry of the token list produced in step 208 with escaped characters to generate a transformed token list (i.e., a tokenized and escaped phrase).
  • Java® regular expression characters in a token list produced in step 208 are replaced with escaped characters.
  • In step 212, generator 104 (see FIG. 1A and FIG. 1B) applies one or more rules from the predefined rule set loaded in step 202 to the token list generated in step 210 in an order specified in relaxation rule file 106.
  • The application of the one or more rules in step 212 generates a modified token list (a.k.a. a tokenized and modified phrase) that is a transformed version of input phrase 102.
  • Step 212 includes applying the modification operator to the token list generated in step 210 or to an intermediate token list generated during the execution of step 210.
  • In step 214, generator 104 converts the modified token list generated in step 212 into a string, which represents output regular expression 108 (see FIG. 1A and FIG. 1B).
  • After step 214, the regular expression generation process ends at step 216.
  • Returning to step 204, if generator 104 (see FIG. 1A and FIG. 1B) determines that input phrase 102 is already a regular expression, then the above-described processing of steps 206, 208, 210, 212 and 214 is not performed, the input is passed to the output unchanged, and the regular expression generation process ends at step 216.
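The flow of steps 204 through 214 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Marker` enum, the function names and the `is_regex` heuristic are hypothetical (the text does not say how step 204 decides), and placing a BOUNDARY marker at each end of the token list is also assumed.

```python
import re
from enum import Enum


class Marker(Enum):
    """Hypothetical internal tokens; each value is the regex emitted in step 214."""
    DELIM = r"\W+"     # whitespace between words
    BOUNDARY = r"\b"   # word boundary at either end of the phrase


def is_regex(phrase):
    """Assumed heuristic for step 204: treat common metacharacters as a sign of a regex."""
    return bool(re.search(r"[\\^$|*+?()\[\]{}]", phrase))


def tokenize(phrase):
    """Steps 206-208: split on whitespace and build the internal token list."""
    tokens = [Marker.BOUNDARY]
    for i, word in enumerate(phrase.split()):
        if i > 0:
            tokens.append(Marker.DELIM)
        tokens.append(word)
    tokens.append(Marker.BOUNDARY)
    return tokens


def escape(tokens):
    """Step 210: escape regular expression special characters in each word token."""
    return [t if isinstance(t, Marker) else re.escape(t) for t in tokens]


def generate(phrase, rules=()):
    """Return the input unchanged if it is a regex, else a relaxed regex string."""
    if is_regex(phrase):                  # step 204
        return phrase
    tokens = escape(tokenize(phrase))     # steps 206-210
    for rule in rules:                    # step 212: rules run in the given order
        tokens = rule(tokens)
    # Step 214: turn markers into regex fragments and join everything.
    return "".join(t.value if isinstance(t, Marker) else t for t in tokens)


print(generate("can be reached at"))  # \bcan\W+be\W+reached\W+at\b
```

Each rule is modeled here as a plain function from token list to token list, which keeps the ordering logic of step 212 separate from the individual operators.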
  • input phrase 102 is:
  • generator 104 recognizes that the input phrase is a regular expression and returns the input phrase unchanged as output 108 (see FIG. 1A and FIG. 1B ).
  • input phrase 102 is the following phrase:
  • generator 104 (see FIG. 1A and FIG. 1B ) outputs the following relaxed regular expression as the result of performing the transformations of steps 206 , 208 , 210 , 212 and 214 :
  • Section 5 presented below describes experiments that demonstrate that utilizing the process of FIG. 2 to generate such relaxed regular expressions results in significantly higher recall and similar precision when compared to the input plain text pattern.
  • This section includes a sample rule set and algorithms for applying rules in the sample rule set.
  • Relaxation rules are defined in a special file 106 (see FIG. 1A and FIG. 1B ), which is loaded when the regular expression generator 104 (see FIG. 1A and FIG. 1B ) is started.
  • the rules are composed using a predefined set of modification operators. While the framework for relaxation disclosed herein is generic and can be customized by any number of modification operators, this section restricts its attention to three basic operators: WHITESPACE, REPLACE_WORD and SPLIT_AT_CHARACTER.
  • FIG. 3 depicts an example of a rule set 300 that is included in relaxation rule file 106 (see FIG. 1A and FIG. 1B ).
  • Rule set 300 includes four rules that are expressed in a simple Extensible Markup Language (XML) format and that include the aforementioned basic operators. Note that in rule set 300, each rule has an attribute <stackposition> that controls the order in which the rules must be applied.
  • the operators included in the rules of rule set 300 are briefly described below:
  • WHITESPACE: This operator replaces whitespace that has been identified as a token delimiter with the replacement regular expression defined in the attribute <replacement>.
  • REPLACE_WORD: This operator replaces a sequence of one or more tokens with a replacement regular expression.
  • the tokens “did not” are replaced by a regular expression that matches either the phrase did\s+not or the phrase didn't.
  • a token consisting of a single colon character (i.e., “:”) is replaced with a regular expression that allows for arbitrary whitespace before and after the colon.
  • SPLIT_AT_CHARACTER: This operator allows a particular token to be split into two tokens based on the presence of a particular character. In the example of FIG. 3, the SPLIT_AT_CHARACTER operator splits a token based on the presence of the colon character.
  • a reference to a WHITESPACE rule, a REPLACE_WORD rule or a SPLIT_AT_CHARACTER rule indicates a rule from a rule set, where the rule includes the aforementioned WHITESPACE, REPLACE_WORD or SPLIT_AT_CHARACTER operator, respectively.
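To make the role of the stackposition attribute concrete, the following sketch loads a four-rule set from XML and orders it. FIG. 3 is not reproduced in this text, so the element name `rule`, the attribute names `operator`, `search` and `character`, the stackposition values, the replacement patterns, and the convention that lower positions are applied first are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file; only the operator names and the existence of
# "stackposition" and "replacement" attributes come from the description.
RULES_XML = r"""
<rules>
  <rule operator="WHITESPACE" stackposition="4" replacement="\W+"/>
  <rule operator="REPLACE_WORD" stackposition="2" search="did not"
        replacement="((did\s+not)|(didn't))"/>
  <rule operator="SPLIT_AT_CHARACTER" stackposition="1" character=":"/>
  <rule operator="REPLACE_WORD" stackposition="3" search=":"
        replacement="\s*:\s*"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the rule elements and sort them by their stackposition attribute."""
    root = ET.fromstring(xml_text)
    rules = [dict(rule.attrib) for rule in root.iter("rule")]
    return sorted(rules, key=lambda r: int(r["stackposition"]))


ordered = [r["operator"] for r in load_rules(RULES_XML)]
print(ordered)
```

In this assumed ordering the split on the colon runs before the colon replacement, which matches the sequence the description gives for the SPLIT_AT_CHARACTER example.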
  • FIG. 4A depicts an algorithm 400 whose execution applies a REPLACE_WORD rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
  • Algorithm 400 takes as input three parameters: (1) a search phrase, which is did not in this example; (2) the replacement regular expression which replaces the search phrase (e.g., ((did\s+not)
  • the input to algorithm 400 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
  • Algorithm 400 produces an output list of tokens which includes the replacements made by using the aforementioned replacement regular expression to replace any occurrence of the search phrase.
  • all offsets (i.e., ordered from their left to right occurrences) at which the search phrase matches the tokenized input are determined (see line 1 of algorithm 400).
  • an empty list of tokens is initialized (see line 2 of algorithm 400 ) to eventually hold the set of modified tokens.
  • all tokens before the offset are copied to the output token set (see line 7 of algorithm 400 ).
  • the token for the replacement regular expression is added (see line 8 of algorithm 400 ).
  • the tokens from the last replacement tokens are added until the end of the input list is reached (see line 11 of algorithm 400 ).
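The steps described for algorithm 400 can be sketched in Python as below. FIG. 4A itself is not available in this text, so this is a reconstruction from the prose only. The function name, the use of plain strings for the DELIM and BOUNDARY markers, and the full replacement pattern (which is truncated in the text and is assumed here from the statement that it matches either did\s+not or didn't) are illustrative assumptions.

```python
def replace_word(tokens, search, replacement):
    """Replace every occurrence of the token sequence `search` with one token."""
    n = len(search)
    if n == 0:
        return list(tokens)

    # Collect left-to-right, non-overlapping offsets where the search phrase matches.
    offsets = []
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == search:
            offsets.append(i)
            i += n
        else:
            i += 1

    output = []   # starts empty and eventually holds the modified tokens
    last = 0      # index just past the previous replacement
    for offset in offsets:
        output.extend(tokens[last:offset])   # copy tokens before the offset
        output.append(replacement)           # add the replacement regex token
        last = offset + n
    output.extend(tokens[last:])             # copy the rest of the input list
    return output


tokenized_input = ["BOUNDARY", "I", "DELIM", "did", "DELIM", "not",
                   "DELIM", "call", "BOUNDARY"]
result = replace_word(tokenized_input,
                      ["did", "DELIM", "not"],
                      r"((did\s+not)|(didn't))")
print(result)
```

Because the search phrase is compared as a token sequence, the delimiter between "did" and "not" is consumed by the replacement, so the replacement pattern has to express the inner whitespace itself.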
  • the input phrase I did not call is transformed initially into a tokenized representation that is illustrated in FIG. 4B as a tokenized phrase 420 .
  • tokenized phrase 420 is escaped in step 210 of FIG. 2
  • the resulting tokenized and escaped phrase is stored in tokenizedInput, the list of input tokens that is input into algorithm 400 (see FIG. 4A ).
  • algorithm 400 applies the REPLACE_WORD rule of sample rule set 300 (see FIG. 3) to replace all occurrences of the search phrase did not in tokenizedInput by the replacement regular expression ((did\s+not)
  • FIG. 4C depicts an exemplary set of tokens 440 that result from executing algorithm 400 (see FIG. 4A ) to apply the REPLACE_WORD rule of rule set 300 (see FIG. 3 ) to tokenized phrase 420 (see FIG. 4B ).
  • the set of tokens 440 is generated by performing step 212 of FIG. 2 .
  • step 214 (see FIG. 2) generates a conversion of the set of tokens 440 by replacing each DELIM token with the WHITESPACE token defined in rule set 300 (see FIG. 3) (i.e., \W+) and by replacing each BOUNDARY token with \b (i.e., the regular expression syntax for denoting word boundaries).
  • the result of the aforementioned replacements in step 214 is an output regular expression 460 depicted in FIG. 4D .
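The conversion in step 214 amounts to a substitution followed by a join, as the sketch below shows for the phrase I did not call. The token layout is an assumption about what FIG. 4C contains, the replacement pattern is assumed in full because the text truncates it, and the resulting string is therefore not claimed to be identical to the expression in FIG. 4D.

```python
import re

# Assumed token list after the REPLACE_WORD rule has been applied.
tokens = ["BOUNDARY", "I", "DELIM", r"((did\s+not)|(didn't))",
          "DELIM", "call", "BOUNDARY"]

# Step 214 substitutions named in the description.
SUBSTITUTIONS = {"DELIM": r"\W+", "BOUNDARY": r"\b"}


def to_regex(token_list):
    """Replace marker tokens with regex fragments and join into one string."""
    return "".join(SUBSTITUTIONS.get(t, t) for t in token_list)


regex = to_regex(tokens)
print(regex)  # \bI\W+((did\s+not)|(didn't))\W+call\b

for text in ["I did not call", "I didn't call", "I did   not\ncall", "I did call"]:
    print(repr(text), bool(re.search(regex, text)))
```

The first three sample strings match and the last does not, which illustrates the intended gain in recall without accepting a phrase with a different meaning.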
  • FIG. 5A depicts an algorithm 500 whose execution applies a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
  • Algorithm 500 takes as input a list of input tokens (i.e., tokenizedInput in algorithm 500), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2.
  • the input to algorithm 500 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
  • Algorithm 500 applies the SPLIT_AT_CHARACTER rule, which splits up a token based on the presence of a colon character. For example, consider the following input phrase to algorithm 500 :
  • Executing algorithm 500 in step 212 applies the SPLIT_AT_CHARACTER rule of rule set 300 (see FIG. 3 ) to the token list shown above.
  • the application of the SPLIT_AT_CHARACTER rule splits on the colon included in the token list shown above and generates a token list 520 shown in FIG. 5B.
  • the second REPLACE_WORD rule of rule set 300 is applied to generate a token list 540 shown in FIG. 5C . That is, the REPLACE_WORD rule in FIG. 3 that includes the colon as the search phrase is applied to generate token list 540 .
  • Token list 540 is the result of executing algorithm 400 of FIG. 4A in step 212 (see FIG. 2 ).
  • step 214 converts token list 540 into a regular expression by replacing the BOUNDARY tokens with \b (i.e., the regular expression syntax for denoting word boundaries).
  • the result of the conversion in step 214 is an output regular expression 560 depicted in FIG. 5D .
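A possible reading of the SPLIT_AT_CHARACTER operator is sketched below. FIG. 5A and the example input phrase are not reproduced in this text, so the function name, the sample tokens and the decision to emit the split character as its own token (so that a later REPLACE_WORD rule can target it) are assumptions.

```python
def split_at_character(tokens, char=":"):
    """Split each token at `char`, keeping `char` as a separate token."""
    output = []
    for token in tokens:
        parts = token.split(char)
        for i, part in enumerate(parts):
            if i > 0:
                output.append(char)   # the split character becomes its own token
            if part:
                output.append(part)   # skip empty pieces around the character
    return output


# Hypothetical tokenized input; "Phone:" carries a trailing colon.
tokenized_input = ["BOUNDARY", "Phone:", "DELIM", "555", "BOUNDARY"]
print(split_at_character(tokenized_input))
```

Tokens that do not contain the character, including the marker tokens, pass through unchanged, and a token that is already a lone colon stays a single token.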
  • FIG. 6 is a table 600 of entities and relationships selected for the experiments in this section.
  • the entities and relationships of table 600 are selected from the Enron email dataset. A constant window of 30 characters was used for each selected relationship, as indicated by the values in the #chars column of table 600 .
  • Precision determines the number of matched annotations against the number of correct annotations.
  • Correct entity type: The entities must match the correct type. For example, “I can be reached at” is not counted as a correct match if the requested entity is a Person and not the Author of the email. As another example, “Paul can be reached at his fax number 5223” is not counted as a correct match since the requested entity is not a phone number.
  • FIG. 7A is a table 700 of results of the investigation of the person . . . phone number relationship.
  • an annotator for a person . . . phone number relationship relates the phone number and a verb.
  • this person . . . phone number relationship is modeled using multiple different handcrafted expressions based on the following native English phrases: “can be reached at”, “can be contacted at”, “a call at”, “#”, “number is”, and “at”. All of the aforementioned native English phrases express the relationship “give me the phone number of a person”, and therefore are handled as a single semantic relationship.
  • the high precision and recall of this set of experiments shown in table 700 is mainly due to the influence of the “strong” pattern of the phrase “at”.
  • an entity recognizer is a known component that recognizes entities (e.g., persons, phone numbers, organizations, etc.) for an information extraction task.
  • An entity recognizer may be a component (not shown) of a system that includes relaxed regular expression generator 104 (see FIG. 1A and FIG. 1B ).
  • FIG. 7B is a table 720 of results of the investigation of the person . . . person relationship.
  • the reason for the high precision of the handcrafted regular expression is the usage of the right regular expression line limiter $ and the definition of selected optional words before (e.g., research and executive) and after (e.g., to and is) the noun assistant.
  • detecting semantically relevant words before and after the native English input is far beyond the scope of a pure syntactic regular expression generator.
  • improving the performance of the entity recognizer will enhance the precision of the generated regular expressions significantly.
  • FIG. 7C is a table 740 of results of the investigation of the person . . . organization relationship.
  • the reason for the low recall of the handcrafted regular expression is the line boundary tokens ^ and $, in particular for the phrase working with.
  • the regular expression generator is improved by including an option to switch this line boundary functionality off or on.
  • the regular expression generator is improved by including an option that allows a user to define how many words are ignored before and after the native English input.
  • FIG. 7D is a table 760 of results of the investigation of the organization . . . organization relationship.
  • FIG. 8 is a block diagram of a computing unit 800 that includes a relaxed regular expression generator 104 of the system of FIG. 1A or FIG. 1B and that implements the process of FIG. 2 , in accordance with embodiments of the present invention.
  • Computing unit 800 generally comprises a central processing unit (CPU) 802 , a memory 804 , an input/output (I/O) interface 806 and a bus 808 , and is coupled to I/O devices 810 and a storage unit 812 .
  • CPU 802 performs computation and control functions of computing unit 800 .
  • CPU 802 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations (e.g., on a client and server).
  • Memory 804 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc.
  • Cache memory elements of memory 804 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
  • Storage unit 812 is, for example, a magnetic disk drive or an optical disk drive that stores data including relaxation rule file 106 .
  • memory 804 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 804 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
  • I/O interface 806 comprises any system for exchanging information to or from an external source.
  • I/O devices 810 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc.
  • Bus 808 provides a communication link between each of the components in computing unit 800 , and may comprise any type of transmission link, including electrical, optical, wireless, etc.
  • I/O interface 806 also allows computing unit 800 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 812 ).
  • the auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk).
  • Computing unit 800 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
  • Memory 804 includes program code for relaxed regular expression generator 104. Further, memory 804 may include other systems not shown in FIG. 8, such as an operating system (e.g., Linux) that runs on CPU 802 and provides control of various components within and/or connected to computing unit 800.
  • the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
  • the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code 104 for use by or in connection with a computing system 800 or any instruction execution system to provide and facilitate the capabilities of the present invention.
  • a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
  • Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 804 , ROM, a rigid magnetic disk and an optical disk.
  • Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • any of the components of the present invention can be deployed, managed, serviced, etc. by a service provider that offers to deploy or integrate computing infrastructure with respect to the method of automatically generating regular expressions for relaxed matching of text patterns.
  • the present invention discloses a process for supporting computer infrastructure, comprising integrating, hosting, maintaining and deploying computer-readable code into a computing system (e.g., computing unit 800 ), wherein the code in combination with the computing unit is capable of performing a method of automatically generating regular expressions for relaxed matching of text patterns.
  • the invention provides a business method that performs the process steps of the invention on a subscription, advertising and/or fee basis. That is, a service provider, such as a Solution Integrator, can offer to create, maintain, support, etc. a method of automatically generating regular expressions for relaxed matching of text patterns. In this case, the service provider can create, maintain, support, etc. a computer infrastructure that performs the process steps of the invention for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement, and/or the service provider can receive payment from the sale of advertising content to one or more third parties.

Abstract

A method for automatically generating regular expressions for relaxed matching of text patterns. A received input phrase expressed in a natural language is determined to be a plain text pattern. The plain text pattern is automatically tokenized, thereby generating a first token list. Rules loaded from a predefined rule set are automatically applied to the first token list in an order specified by the predefined rule set to automatically modify a token list by applying a replace word, split-at-character or whitespace operator. The modified token list is automatically converted into a regular expression that matches the plain text pattern and one or more variations of the plain text pattern. A utilization of the regular expression for an information extraction facilitates a recall and a precision of the information extraction.

Description

    FIELD OF THE INVENTION
  • The present invention relates to a method and system for automatically generating regular expressions for relaxed matching of text patterns.
  • BACKGROUND OF THE INVENTION
  • One category of information extraction employs query expansion and other query processing techniques in search engines. Conventional query expansion techniques generate an expanded output query from an original query, where the expanded output query includes additional words obtained from a synonym dictionary. The results of the expanded output query are documents that contain either the keywords of the original query or the additional words from the synonym dictionary. Being based on a natural language dictionary (e.g., standard English dictionary), the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc. Further, known query processing techniques that employ stemming and stop word removal decrease precision in information retrieval results. Another category of information extraction is rule-based and utilizes regular expressions. Conventional tools (e.g., Expresso offered by Ultrapico) in this second category allow a programmer to generate a regular expression using a graphical user interface and to check the syntax of a generated regular expression. These known regular expression generation tools are hampered by restricted usability because their users are required to have knowledge of the formulation and usage of syntactic constructs in regular expressions. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.
  • SUMMARY OF THE INVENTION
  • The present invention provides a computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
  • receiving, by a computing system, an input phrase expressed in a natural language;
  • determining, by the computing system, that the input phrase is a plain text pattern;
  • automatically tokenizing, by the computing system, the plain text pattern, wherein the automatically tokenizing includes automatically generating a first token list;
  • automatically applying, by the computing system, one or more rules to the first token list, wherein the automatically applying includes automatically modifying the first token list and automatically generating a modified token list in response to the automatically modifying the first token list; and
  • automatically converting, by the computing system, the modified token list into a regular expression, wherein the regular expression matches the plain text pattern and one or more variations of the plain text pattern.
  • A system and computer program product corresponding to the above-summarized method are also described and claimed herein.
  • Advantageously, the present invention provides a technique for automatically generating regular expressions for a relaxed matching of text patterns. Further, the present invention provides a generic, extensible, and widely applicable rule-based framework in which the automatic generation of regular expressions is based on the creation and updating of rules without requiring the writing and maintenance of complex and customized software programs.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
  • FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
  • FIG. 3 depicts an example of a rule set included in the relaxation rule file of the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
  • FIG. 4A depicts an algorithm to apply a REPLACE_WORD rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention.
  • FIG. 4B depicts an exemplary tokenized phrase generated via the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 4C depicts an exemplary set of tokens resulting from executing the algorithm of FIG. 4A to apply the REPLACE_WORD rule included in the rule set of FIG. 3 to an escaped version of the phrase of FIG. 4B, in accordance with embodiments of the present invention.
  • FIG. 4D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 4C via the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 5A depicts an algorithm to apply a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention.
  • FIG. 5B depicts an exemplary set of tokens resulting from applying the rule of FIG. 5A via the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 5C depicts an exemplary set of tokens resulting from applying the rule of FIG. 4A to the set of tokens of FIG. 5B via the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 5D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 5C via the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 6 is a table of entities and relationships used in experiments for determining recall and precision of regular expressions generated by the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIGS. 7A-7D are tables of results of four sets of experiments organized according to the table of FIG. 6, where the experiments are for determining recall and precision of regular expressions generated by the process of FIG. 2, in accordance with embodiments of the present invention.
  • FIG. 8 is a block diagram of a computing unit that includes a relaxed regular expression generator of the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • 1 Overview
  • The goal of information extraction (IE) is to extract structured information from unstructured text (a.k.a. plain text) (e.g., documents, files, emails, web pages, etc.). In rule-based IE, rules are written that describe textual patterns of interest, which are to be extracted from unstructured text. Regular expressions are used for expressing such textual patterns of interest. As used herein, a regular expression is defined as a compact representation that describes a set of strings without listing all the elements of the set. A regular expression matches each of the strings in the set.
  • For example, consider the information extraction task of identifying text patterns that associate a person with his or her phone number. A text pattern of interest for this example is the phrase “can be reached at”. Using such a pattern, a rule-based IE system identifies occurrences of the form “<Person> can be reached at <Phone>” and generates the corresponding pairs of related Persons and Phones. In free-form text, however, the phrase “can be reached at” may occur with several variations: extra punctuation, multiple spaces or new lines between words, arbitrary capitalization, colloquial abbreviations for words (e.g., “reached” abbreviated as “rchd”). Such variation in text is particularly true for informal communication mediums such as email where the formatting and style of the text is not strictly controlled. A regular expression is used to account for the original input phrase “can be reached at” as well as the multiple variations.
  • The task of creating a regular expression that matches not only an original input phrase like “can be reached at” in the example presented above, but also its variations, is beyond the knowledge of the average untrained user of an information extraction system. The present invention addresses this problem by providing a generic and extensible rule-based framework for automatically generating a regular expression from a given input phrase (i.e., a plain text pattern) provided by a user. The input phrase is provided in a natural, human language (e.g., a user's native English). The regular expression output by the present invention improves the recall (i.e., increases the set of occurrences of the input phrase and its variations that are identified in the text) with little or no decrease in precision (i.e., without increasing the identification of spurious instances in the text).
  • As used herein, relaxation is the method of the present invention that converts a plain text pattern to an output regular expression that matches the original plain text pattern and that matches other strings that are variations of the original plain text pattern. The overall algorithm whose execution provides relaxation is referred to herein as the relaxed regular expression generator. The relaxation disclosed herein includes syntactic relaxation and semantic relaxation. Syntactic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on primarily syntactic aspects of the original plain text pattern such as punctuation and whitespace between words (i.e., matching to patterns that have different punctuation and/or whitespace while having the same words and the same meaning as the original plain text pattern). Semantic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on a modification of the words of the original plain text pattern while retaining the meaning of the original plain text pattern.
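  • The difference between syntactic and semantic relaxation can be illustrated with a minimal sketch. The two patterns below are hand-written for illustration and are assumptions, not output of the generator; the abbreviation rchd is taken from the overview above.

```python
import re

# Syntactic relaxation: same words, but any run of non-word characters
# (spaces, new lines, punctuation) may separate them.
syntactic = re.compile(r"\bcan\b\W+\bbe\b\W+\breached\b\W+\bat\b")

# Semantic relaxation: a word may additionally be replaced by a variant
# with the same meaning (here the colloquial abbreviation "rchd").
semantic = re.compile(r"\bcan\b\W+\bbe\b\W+\b(reached|rchd)\b\W+\bat\b")

texts = [
    "He can be reached at 555-1234",
    "He can  be\nreached, at 555-1234",
    "He can be rchd at 555-1234",
]
for text in texts:
    print(repr(text), bool(syntactic.search(text)), bool(semantic.search(text)))
```

  • In this sketch the syntactic pattern accepts the first two strings only, while the semantic pattern also accepts the abbreviated form.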
  • 2 Regular Expression Generation System
  • FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention. First system 100 includes a user input phrase 102, a relaxed regular expression generator 104, a relaxation rule file 106 and an output regular expression 108. User input phrase 102 is input into relaxed regular expression generator 104 as a phrase expressed in a natural, human language (e.g., a native English phrase). Relaxed regular expression generator 104 obtains relaxation rules from relaxation rule file 106 and applies the obtained rules to user input phrase 102 to automatically generate regular expression 108 as output. The relaxation rules in file 106 are predefined manually by, for example, an administrator of system 100. Hereinafter, relaxed regular expression generator 104 is also referred to simply as regular expression generator 104 or generator 104. Relaxation rules included in relaxation rule file 106 are also referred to herein simply as rules. The functionalities of the components of system 100 are described in more detail below relative to FIG. 2.
  • In one embodiment, system 100 includes an information extraction system (not shown) that includes an annotator generator (not shown). The annotator generator is coupled to relaxed regular expression generator 104. In this embodiment, generator 104 receives as input an annotator rule expressed in a natural, human language and outputs an annotator rule as regular expression 108. The output regular expression is a relaxed regular expression in that it matches the original input annotator rule as well as variations of the annotator rule. The annotator generator then uses output regular expression 108 to generate an annotator that facilitates information extraction.
  • In another embodiment, system 100 includes a search engine (not shown) that is coupled to relaxed regular expression generator 104. In this embodiment, generator 104 receives as input a search query expressed in a natural, human language and outputs a query as regular expression 108. The output regular expression 108 is a relaxed regular expression in that it matches the input search query as well as variations of the search query. The search engine then uses output regular expression 108 to generate results (e.g., documents) of a search that uses the input search query and its variations.
  • FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention. Second system 120 implements another embodiment of the present invention and includes user input phrase 102, relaxed regular expression generator 104, a software-based rule learning component 122, one or more output relaxation rules 124 and an output regular expression 108. Again, user input phrase 102 is expressed in a natural, human language and is input into generator 104. In this embodiment, rule learning component 122 automatically learns one or more relaxation rules and outputs one or more rules 124, which are then obtained by generator 104 and applied by generator 104 to user input phrase 102 to generate regular expression 108.
  • Similar to the embodiment described above relative to system 100 (see FIG. 1A), system 120 may include an information extraction system (not shown) that includes an annotator generator (not shown) coupled to relaxed regular expression generator 104. The functionality of the information extraction system, the annotator generator and generator 104 is the same as described above relative to system 100 (see FIG. 1A). In another embodiment similar to an embodiment described above relative to system 100 (see FIG. 1A), system 120 may include a search engine coupled to generator 104. The functionality of the search engine and generator 104 is the same as described above relative to system 100 (see FIG. 1A).
  • 3 Regular Expression Generation Process
  • FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention. The regular expression generation process starts at step 200. In step 202 regular expression generator 104 (see FIG. 1A and FIG. 1B) loads and parses a predefined rule set from relaxation rule file 106. The predefined rule set loaded in step 202 includes rules that include predefined modification operators (e.g., replacement and splitting operators). Each predefined modification operator may be employed by one or more rules in the rule set. Each rule in the predefined rule set that employs a predefined modification operator specifies one or more attributes or one or more parameters used when applying the modification operator to a token list.
  • In step 204, regular expression generator 104 (see FIG. 1A and FIG. 1B) receives user input phrase 102 and determines whether user input phrase 102 is already a regular expression or whether phrase 102 is a plain text pattern.
  • If step 204 determines that phrase 102 is a plain text pattern, then in step 206 generator 104 (see FIG. 1A and FIG. 1B) detects word boundaries in phrase 102 and tokenizes the plain text pattern that comprises phrase 102 to generate a set of input tokens. In step 208, generator 104 (see FIG. 1A and FIG. 1B) maps each of the aforementioned input tokens to a specific, internal representation for the system (e.g., system 100 of FIG. 1A) to produce a token list (i.e., a sequence of tokens).
  • In step 210, generator 104 (see FIG. 1A and FIG. 1B) replaces regular expression special characters in each entry of the token list produced in step 208 with escaped characters to generate a transformed token list (i.e., a tokenized and escaped phrase). For example, in step 210, Java® regular expression characters in a token list produced in step 208 are replaced with escaped characters.
  • In step 212, generator 104 (see FIG. 1A and FIG. 1B) applies one or more rules from the predefined rule set loaded in step 202 to the token list generated in step 210 in an order specified in relaxation rule file 106. The application of the one or more rules in step 212 generates a modified token list (a.k.a. a tokenized and modified phrase) that is a transformed version of input phrase 102. For any applied rule that includes a modification operator, step 212 includes applying the modification operator to the token list generated in step 210 or to an intermediate token list generated during the execution of step 212.
  • In step 214, generator 104 converts the modified token list generated in step 212 into a string, which represents output regular expression 108 (see FIG. 1A and FIG. 1B). Following step 214, the regular expression generation process ends at step 216.
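  • Steps 206 through 214 can be sketched as a small pipeline. This is a minimal illustration, not the claimed implementation: the function names, the tuple representation of text tokens, the placement of one BOUNDARY token on each side of every word, and the default \W+ replacement for delimiters are all assumptions chosen to be consistent with the token names used in Section 4.

```python
import re

BOUNDARY, DELIM = "<BOUNDARY>", "<DELIM>"  # hypothetical marker tokens

def tokenize(phrase):
    """Steps 206-208: split on whitespace and build an internal token list."""
    tokens = []
    for i, word in enumerate(phrase.split()):
        if i > 0:
            tokens.append(DELIM)
        tokens.extend([BOUNDARY, ("TXT", word), BOUNDARY])
    return tokens

def escape(tokens):
    """Step 210: escape regular expression special characters in text tokens."""
    return [("TXT", re.escape(t[1])) if isinstance(t, tuple) else t
            for t in tokens]

def to_regex(tokens, whitespace=r"\W+"):
    """Step 214: convert the token list into a regular expression string."""
    parts = []
    for t in tokens:
        if t == BOUNDARY:
            parts.append(r"\b")
        elif t == DELIM:
            parts.append(whitespace)
        else:
            parts.append(t[1])
    return "".join(parts)

def generate(phrase, rules=()):
    tokens = escape(tokenize(phrase))
    for rule in rules:  # step 212: apply the rules in the given order
        tokens = rule(tokens)
    return to_regex(tokens)

print(generate("meet at"))  # \bmeet\b\W+\bat\b
```

  • Each rule in this sketch is simply a function from a token list to a token list, which keeps the ordering of rule application explicit.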
  • Returning to step 204, if generator 104 (see FIG. 1A and FIG. 1B) determines that input phrase 102 is already a regular expression, then the above-described processing of steps 206, 208, 210, 212 and 214 is not performed, the input is passed to the output unchanged, and the regular expression generation process ends at step 216. For example, given that input phrase 102 is:

  • meet\s+(\w+\s+){0,5}<RoomNumber>
  • generator 104 (see FIG. 1A and FIG. 1B) recognizes that the input phrase is a regular expression and returns the input phrase unchanged as output 108 (see FIG. 1A and FIG. 1B).
  • If, however, input phrase 102 is the following phrase:

  • meet at <RoomNumber>
  • then generator 104 (see FIG. 1A and FIG. 1B) outputs the following relaxed regular expression as the result of performing the transformations of steps 206, 208, 210, 212 and 214:

  • \bmeet\b\W+\bat\b
  • which matches any string in which meet and at are adjacent words separated by one or more non-word characters, such as arbitrary whitespace. Section 5 presented below describes experiments that demonstrate that utilizing the process of FIG. 2 to generate such relaxed regular expressions results in significantly higher recall and similar precision when compared to the input plain text pattern.
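  • The behavior of the relaxed regular expression shown above can be checked directly. The sample strings below are illustrative assumptions; Python's re module is used here, and its handling of \b and \W for these ASCII inputs agrees with the Java regular expressions mentioned in step 210.

```python
import re

relaxed = re.compile(r"\bmeet\b\W+\bat\b")

samples = [
    "meet at 3pm",      # original phrase
    "meet   at 3pm",    # multiple spaces
    "meet\nat 3pm",     # new line between the words
    "meet, at 3pm",     # punctuation between the words
    "meeting at 3pm",   # different word: no match
    "meet us at 3pm",   # words not adjacent: no match
]
results = {text: bool(relaxed.search(text)) for text in samples}
for text, matched in results.items():
    print(repr(text), matched)
```

  • Because \W+ accepts any run of non-word characters, punctuation between the two words is accepted as well as whitespace.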
  • 4 Examples
  • This section includes a sample rule set and algorithms for applying rules in the sample rule set.
  • 4.1 Relaxation Rules
  • Relaxation rules are defined in a special file 106 (see FIG. 1A and FIG. 1B), which is loaded when the regular expression generator 104 (see FIG. 1A and FIG. 1B) is started. The rules are composed using a predefined set of modification operators. While the framework for relaxation disclosed herein is generic and can be customized by any number of modification operators, this section restricts its attention to three basic operators: WHITESPACE, REPLACE_WORD and SPLIT_AT_CHARACTER. FIG. 3 depicts an example of a rule set 300 that is included in relaxation rule file 106 (see FIG. 1A and FIG. 1B). Rule set 300 includes four rules that are expressed in a simple Extensible Markup Language (XML) format and that include the aforementioned basic operators. Note that in rule set 300, each rule has an attribute <stackposition> that controls the order in which the rules must be applied. The operators included in the rules of rule set 300 are briefly described below:
  • WHITESPACE: This operator replaces whitespace which has been identified as token delimiters with the replacement regular expression defined in the attribute <replacement>.
  • REPLACE_WORD: This operator replaces a sequence of one or more tokens with a replacement regular expression. In the example shown in FIG. 3, the tokens “did not” are replaced by a regular expression that matches either the phrase did\s+not or the phrase didn't. Similarly, a token consisting of a single colon character (i.e., “:”) is replaced with a regular expression that allows for arbitrary whitespace before and after the colon.
  • SPLIT_AT_CHARACTER: This operator allows a particular token to be split into two tokens based on the presence of a particular character. In the example of FIG. 3, the SPLIT_AT_CHARACTER operator splits a token based on the presence of the colon character.
  • Hereinafter, a reference to a WHITESPACE rule, a REPLACE_WORD rule or a SPLIT_AT_CHARACTER rule indicates a rule from a rule set, where the rule includes the aforementioned WHITESPACE, REPLACE_WORD or SPLIT_AT_CHARACTER operator, respectively.
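  • A rule set of this kind can be modeled in memory as a list of records ordered by stack position. The sketch below is an assumption for illustration: FIG. 3 is not reproduced here, so the field names other than stackposition and replacement, the concrete stack position values, and the colon replacement \s*:\s* are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    operator: str          # WHITESPACE, REPLACE_WORD or SPLIT_AT_CHARACTER
    stackposition: int     # controls the order of application
    search: str = ""       # phrase or character the rule looks for (hypothetical field)
    replacement: str = ""  # replacement regular expression

# Four rules mirroring the operators described above; positions are made up.
RULES = [
    Rule("REPLACE_WORD", 2, search=":", replacement=r"\s*:\s*"),
    Rule("WHITESPACE", 3, replacement=r"\W+"),
    Rule("SPLIT_AT_CHARACTER", 0, search=":"),
    Rule("REPLACE_WORD", 1, search="did not",
         replacement=r"((did\s+not)|(didn\'t))"),
]

def in_application_order(rules):
    """Return the rules sorted by their stack position."""
    return sorted(rules, key=lambda rule: rule.stackposition)

for rule in in_application_order(RULES):
    print(rule.stackposition, rule.operator, repr(rule.search), rule.replacement)
```

  • Sorting by stack position ensures, for example, that a token is split at the colon before the colon replacement rule is applied.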
  • 4.2 Analyzing and Applying Rules
  • FIG. 4A depicts an algorithm 400 whose execution applies a REPLACE_WORD rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention. Algorithm 400 takes as input three parameters: (1) a search phrase, which is did not in this example; (2) the replacement regular expression which replaces the search phrase (e.g., ((did\s+not)|(didn\'t)) is the replacement regular expression that replaces did not); and (3) a list of input tokens (i.e., tokenizedInput in algorithm 400), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2. For example, the input to algorithm 400 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
  • Algorithm 400 produces an output list of tokens which includes the replacements made by using the aforementioned replacement regular expression to replace any occurrence of the search phrase.
  • During an initialization phase, all offsets at which the search phrase matches the tokenized input are determined, ordered by their left-to-right occurrence (see line 1 of algorithm 400). Furthermore, an empty list of tokens is initialized (see line 2 of algorithm 400) to eventually hold the set of modified tokens. After the initialization, for each offset, all tokens before the offset are copied to the output token set (see line 7 of algorithm 400). Next, the token for the replacement regular expression is added (see line 8 of algorithm 400). Finally, after all offsets have been considered, the tokens that follow the last replacement are added until the end of the input list is reached (see line 11 of algorithm 400).
  • In the example of Section 4, the input phrase I did not call is transformed initially into a tokenized representation that is illustrated in FIG. 4B as a tokenized phrase 420. After tokenized phrase 420 is escaped in step 210 of FIG. 2, the resulting tokenized and escaped phrase is stored in tokenizedInput, the list of input tokens that is input into algorithm 400 (see FIG. 4A). Then algorithm 400 applies the REPLACE_WORD rule of sample rule set 300 (see FIG. 3) to replace all occurrences of the search phrase did not in tokenizedInput by the replacement regular expression ((did\s+not)|(didn\'t)).
  • FIG. 4C depicts an exemplary set of tokens 440 that result from executing algorithm 400 (see FIG. 4A) to apply the REPLACE_WORD rule of rule set 300 (see FIG. 3) to tokenized phrase 420 (see FIG. 4B). The set of tokens 440 is generated by performing step 212 of FIG. 2. Following the generation of the set of tokens 440, step 214 (see FIG. 2) generates a conversion of the set of tokens 440 by replacing each DELIM token with the WHITESPACE token defined in rule set 300 (see FIG. 3) (i.e., \W+) and by replacing each BOUNDARY token with \b (i.e., the regular expression syntax for denoting word boundaries). The result of the aforementioned replacements in step 214 (see FIG. 2) is an output regular expression 460 depicted in FIG. 4D.
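  • The REPLACE_WORD procedure described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of FIG. 4A: the token layout (one BOUNDARY token on each side of every word), the way the search phrase is matched as a sub-list, and the decision to drop the boundaries around the replacement are assumptions. The sample phrase contains no regular expression special characters, so the escaping of step 210 is omitted.

```python
BOUNDARY, DELIM = "<BOUNDARY>", "<DELIM>"  # hypothetical marker tokens

def tokenize(phrase):
    tokens = []
    for i, word in enumerate(phrase.split()):
        if i:
            tokens.append(DELIM)
        tokens += [BOUNDARY, word, BOUNDARY]
    return tokens

def replace_word(search_phrase, replacement, tokenized_input):
    needle = tokenize(search_phrase)
    n = len(needle)
    # Initialization: collect match offsets from left to right, no overlaps.
    offsets, i = [], 0
    while i + n <= len(tokenized_input):
        if tokenized_input[i:i + n] == needle:
            offsets.append(i)
            i += n
        else:
            i += 1
    output, last = [], 0
    for offset in offsets:
        output.extend(tokenized_input[last:offset])  # tokens before the match
        output.append(replacement)                   # the replacement token
        last = offset + n
    output.extend(tokenized_input[last:])            # tokens after the last match
    return output

def to_regex(tokens, whitespace=r"\W+"):
    mapping = {BOUNDARY: r"\b", DELIM: whitespace}
    return "".join(mapping.get(token, token) for token in tokens)

tokens = replace_word("did not", r"((did\s+not)|(didn\'t))",
                      tokenize("I did not call"))
print(to_regex(tokens))  # \bI\b\W+((did\s+not)|(didn\'t))\W+\bcall\b
```

  • Under these assumptions the resulting expression matches both I did not call and I didn't call; whether it is character-for-character identical to the expression of FIG. 4D depends on details of the figures that are not reproduced in the text.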
  • FIG. 5A depicts an algorithm 500 whose execution applies a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention. Algorithm 500 takes as input a list of input tokens (i.e., tokenizedInput in algorithm 500), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2. For example, the input to algorithm 500 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already. Algorithm 500 applies the SPLIT_AT_CHARACTER rule, which splits up a token based on the presence of a colon character. For example, consider the following input phrase to algorithm 500:

  • phonenumber: 123-4567-890
  • which is represented as the following token list following step 210 of FIG. 2:

  • <BOUNDARY> <TXT>phonenumber:123-4567-890<TXT> <BOUNDARY>
  • Executing algorithm 500 in step 212 (see FIG. 2) applies the SPLIT_AT_CHARACTER rule of rule set 300 (see FIG. 3) to the token list shown above. The application of the SPLIT_AT_CHARACTER rule splits on the colon included in the token list shown above and generates a token list 520 shown in FIG. 5B.
  • Following the application of the SPLIT_AT_CHARACTER rule, the second REPLACE_WORD rule of rule set 300 (see FIG. 3) is applied to generate a token list 540 shown in FIG. 5C. That is, the REPLACE_WORD rule in FIG. 3 that includes the colon as the search phrase is applied to generate token list 540. Token list 540 is the result of executing algorithm 400 of FIG. 4A in step 212 (see FIG. 2).
  • Following the generation of token list 540, step 214 (see FIG. 2) converts token list 540 into a regular expression by replacing the BOUNDARY tokens with \b (i.e., the regular expression syntax for denoting word boundaries). The result of the conversion in step 214 (see FIG. 2) is an output regular expression 560 depicted in FIG. 5D.
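  • The split-then-replace sequence described above can be sketched as follows. FIGS. 5A through 5D are not reproduced in the text, so the details are assumptions: the split is made at the first occurrence of the character, the character becomes a token of its own surrounded by BOUNDARY tokens, and the colon replacement is taken to be \s*:\s*. The text token is written in already-escaped form, as it would be after step 210.

```python
import re

BOUNDARY = "<BOUNDARY>"  # hypothetical marker token

def split_at_character(tokens, char):
    """Split each text token at the first occurrence of char."""
    output = []
    for token in tokens:
        if token == BOUNDARY or token == char or char not in token:
            output.append(token)
            continue
        head, tail = token.split(char, 1)
        # Assumes char lies strictly inside the token (non-empty head and tail).
        output += [head, BOUNDARY, char, BOUNDARY, tail]
    return output

def replace_token(tokens, search, replacement):
    """Single-token case of REPLACE_WORD."""
    return [replacement if token == search else token for token in tokens]

def to_regex(tokens):
    return "".join(r"\b" if token == BOUNDARY else token for token in tokens)

tokens = [BOUNDARY, r"phonenumber:123\-4567\-890", BOUNDARY]
tokens = split_at_character(tokens, ":")
tokens = replace_token(tokens, ":", r"\s*:\s*")
pattern = to_regex(tokens)
print(pattern)  # \bphonenumber\b\s*:\s*\b123\-4567\-890\b
```

  • With these assumptions the generated expression tolerates whitespace on either side of the colon, for example phonenumber : 123-4567-890.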
  • 5 Experiments
  • This section describes experiments for determining recall and precision of regular expressions generated by the process of FIG. 2. Experiments in this section are based on the Enron email dataset, which was collected and prepared by the CALO Project led by SRI International of Menlo Park, Calif.
  • 5.1 Experimental Setup
  • FIG. 6 is a table 600 of entities and relationships selected for the experiments in this section. The entities and relationships of table 600 are selected from the Enron email dataset. A constant window of 30 characters was used for each selected relationship, as indicated by the values in the #chars column of table 600.
  • 5.2 Evaluation Measures
  • The following metrics are used in this section to measure the efficiency and effectiveness of the selected relationships in table 600:
  • Precision: the number of correct annotations relative to the number of matched annotations.
  • Recall: the number of relevant annotations found relative to the number of all possible relevant annotations.
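  • The two metrics can be computed as sketched below, assuming the standard information-retrieval definitions (precision as correct matches over all matches, recall as correct matches over all relevant items). The counts in the example are invented for illustration and are not taken from the experiments.

```python
def precision(true_positives, false_positives):
    """Fraction of matched annotations that are correct."""
    matched = true_positives + false_positives
    return true_positives / matched if matched else 0.0

def recall(true_positives, false_negatives):
    """Fraction of all relevant annotations that were matched."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0

# Hypothetical counts: 8 correct matches, 2 spurious matches, 8 missed.
print(precision(8, 2))  # 0.8
print(recall(8, 8))     # 0.5
```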
  • Each generated annotation is manually evaluated using the following constraints:
  • Sentence boundaries: Both entities and the relationship must be within the same sentence. Thus, examples like the following are not counted:

  • . . . Peter Meyer. He can be reached at 56666.
  • Correct entity type: The entities must match the correct type. For example, I can be reached at is not counted as a correct match if the requested entity is a Person and not the Author of the email. As another example, Paul can be reached at his fax number 5223 is not counted as a correct match since the requested entity is not a phone number.
  • 5.3 Experimental Results
  • Four sets of experiments were conducted regarding the recall and precision of the generated regular expressions in contrast to handcrafted regular expressions.
  • 5.3.1 Person . . . Phone Number
  • In the first set of experiments, the relationship between a person and phone number is investigated and is hereinafter referred to as the person . . . phone number relationship. FIG. 7A is a table 700 of results of the investigation of the person . . . phone number relationship. Typically, an annotator for a person . . . phone number relationship relates the phone number and a verb. Currently, this person . . . phone number relationship is modeled using multiple different handcrafted expressions based on the following native English phrases: can be reached at, can be contacted at, a call at, #, number is, and at. All of the aforementioned native English phrases express the relationship give me the phone number of a person, and therefore are handled as a single semantic relationship. The high precision and recall of this set of experiments shown in table 700 is mainly due to the influence of the “strong” pattern of the phrase “at”.
  • Improvement potential for the regular expression generator: In the experiment regarding the person . . . phone number relationship, the main source of false positives is sentence boundaries. A careful sentence boundary detection combined with a co-reference resolution could help to improve the precision. All handcrafted regular expressions use the line limiters ^ and $. These operators lower the recall significantly, while increasing the precision only slightly. In one embodiment, the regular expression generator interface is improved by allowing the user to turn off or turn on this sentence boundary detection feature. Another reason for the loss in precision is the poor performance of an entity recognizer, which influences the precision of the generated regular expressions indirectly. As used herein, an entity recognizer is a known component that recognizes entities (e.g., persons, phone numbers, organizations, etc.) for an information extraction task. An entity recognizer may be a component (not shown) of a system that includes relaxed regular expression generator 104 (see FIG. 1A and FIG. 1B).
  • 5.3.2 Person . . . Person
  • In the second set of experiments, the relationship expressing that one person works for another person is investigated and is hereinafter referred to as the person . . . person relationship. To express the person . . . person relationship, versions of the phrase works for and the noun assistant were used in the second set of experiments. FIG. 7B is a table 720 of results of the investigation of the person . . . person relationship.
  • Improvement potential for the regular expression generator: The reason for the high precision of the handcrafted regular expression is the usage of the right regular expression line limiter $ and the definition of selected optional words before (e.g., research and executive) and after (e.g., to and is) the noun assistant. However, detecting semantically relevant words before and after the native English input is far beyond the scope of a pure syntactic regular expression generator. Again, improving the performance of the entity recognizer will enhance the precision of the generated regular expressions significantly.
  • 5.3.3 Person . . . Organization
  • In the third set of experiments, the relationship expressing the semantics that a person works for a particular organization is investigated and is hereinafter referred to as the person . . . organization relationship. To express the person . . . organization relationship, the following variants of the verb work and the prepositions with and for were used: works for, working for, work with, and working with. FIG. 7C is a table 740 of results of the investigation of the person . . . organization relationship.
  • Improvement potential for the regular expression generator: The reason for the low recall of the handcrafted regular expression is the line boundary tokens ^ and $, in particular for the phrase working with. In one embodiment, the regular expression generator is improved by including an option to switch this line boundary functionality off or on. In another embodiment, the regular expression generator is improved by including an option that allows a user to define how many words are ignored before and after the native English input.
  • 5.3.4 Organization . . . Organization
  • In the fourth set of experiments, the relationship expressing the semantics that an organization has been merged with or has been acquired by another organization is investigated and is hereinafter referred to as the organization . . . organization relationship. To express the organization . . . organization relationship, the following variants were used: agreed to buy, merged with, acquisition of, acquired, and acquires. FIG. 7D is a table 760 of results of the investigation of the organization . . . organization relationship.
  • Improvement potential for the regular expression generator: Again, this experiment shows that the main value of a handcrafted regular expression is the careful disjunctive combination of relevant verbs for a particular relationship (e.g., the combination of the verbs merge and acquire). An ideal generated regular expression is a disjunctive expression consisting of relevant variants for merge and acquire (e.g., merge OR merged OR acquire OR acquired).
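  • As an illustration of such a disjunctive expression, the following Python sketch joins a list of verb variants into a single alternation. The helper name disjunction is hypothetical, and the list of variants is supplied by hand here; how the variants would be derived automatically is not addressed by this sketch.

```python
import re


def disjunction(variants):
    """Hypothetical helper: build one alternation from a list of variants."""
    # Longer variants first (ties broken alphabetically) for a stable pattern.
    ordered = sorted(set(variants), key=lambda v: (-len(v), v))
    return r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"


pattern = re.compile(
    disjunction(["merge", "merged", "acquire", "acquired"]), re.IGNORECASE
)

print(pattern.search("Company A merged with Company B").group(0))  # merged
print(pattern.search("Company A acquired Company B").group(0))     # acquired
```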
  • 6 Conclusions
  • The experiments described above in Section 5 show that generated regular expressions based on native English user input can replace handcrafted regular expressions for derived annotators in Avatar. Generated regular expressions are a powerful concept and, in terms of recall and precision, perform similarly to handcrafted regular expressions. However, for some of the experiments described above, false positives were observed which lower precision and recall. To overcome these shortcomings, the following conclusions for the Avatar implementation are derived:
  • 1. The usage of line boundaries, such as ^ and $, enhances the precision slightly, but lowers the recall drastically. Therefore, the regular expression generator does not consider line boundaries.
  • 2. Regular expressions matching entities across sentences are a minor source of false positives in one of the experiments. To overcome this problem, only text matches within the boundaries of one sentence are considered. However, a few matches may be missed using this approach; further investigation is needed to capture entities that match across sentences.
  • 3. Another major source of false positives is entities incorrectly identified by the entity recognizer, which is not part of the regular expression generator. The base annotator for entity recognition has been improved so these false positives will no longer appear.
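  • Conclusions 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the Avatar implementation: the placeholders [PER] and [ORG] stand in for entity annotations, the function name matches_within_sentences is hypothetical, and the sentence splitter is deliberately naive (it splits after ., ! or ? followed by whitespace and would mishandle abbreviations).

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def matches_within_sentences(text, pattern):
    """Return matches of `pattern`, searching each sentence separately."""
    hits = []
    for sentence in SENTENCE_SPLIT.split(text):
        hits.extend(m.group(0) for m in pattern.finditer(sentence))
    return hits


# No ^ or $ anchors are used (conclusion 1).
relation = re.compile(r"\[PER\].*?works for.*?\[ORG\]", re.DOTALL)

text = "[PER] left early. Nobody works for free at [ORG]."

# Searching the whole text yields a match that spans two sentences.
print(relation.search(text) is not None)          # True
# Restricting matches to one sentence removes it (conclusion 2).
print(matches_within_sentences(text, relation))   # []
```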
  • 7 Computing System
  • FIG. 8 is a block diagram of a computing unit 800 that includes a relaxed regular expression generator 104 of the system of FIG. 1A or FIG. 1B and that implements the process of FIG. 2, in accordance with embodiments of the present invention. Computing unit 800 generally comprises a central processing unit (CPU) 802, a memory 804, an input/output (I/O) interface 806 and a bus 808, and is coupled to I/O devices 810 and a storage unit 812. CPU 802 performs computation and control functions of computing unit 800. CPU 802 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations (e.g., on a client and server).
  • Memory 804 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Cache memory elements of memory 804 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Storage unit 812 is, for example, a magnetic disk drive or an optical disk drive that stores data including relaxation rule file 106. Moreover, similar to CPU 802, memory 804 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 804 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
  • I/O interface 806 comprises any system for exchanging information to or from an external source. I/O devices 810 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc. Bus 808 provides a communication link between each of the components in computing unit 800, and may comprise any type of transmission link, including electrical, optical, wireless, etc.
  • I/O interface 806 also allows computing unit 800 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 812). The auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk). Computing unit 800 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
  • Memory 804 includes program code for relaxed regular expression generator 104. Further, memory 804 may include other systems not shown in FIG. 8, such as an operating system (e.g., Linux) that runs on CPU 802 and provides control of various components within and/or connected to computing unit 800.
  • The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code 104 for use by or in connection with a computing system 800 or any instruction execution system to provide and facilitate the capabilities of the present invention. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 804, ROM, a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • Any of the components of the present invention can be deployed, managed, serviced, etc. by a service provider that offers to deploy or integrate computing infrastructure with respect to the method of automatically generating regular expressions for relaxed matching of text patterns. Thus, the present invention discloses a process for supporting computer infrastructure, comprising integrating, hosting, maintaining and deploying computer-readable code into a computing system (e.g., computing unit 800), wherein the code in combination with the computing unit is capable of performing a method of automatically generating regular expressions for relaxed matching of text patterns.
  • In another embodiment, the invention provides a business method that performs the process steps of the invention on a subscription, advertising and/or fee basis. That is, a service provider, such as a Solution Integrator, can offer to create, maintain, support, etc. a method of automatically generating regular expressions for relaxed matching of text patterns. In this case, the service provider can create, maintain, support, etc. a computer infrastructure that performs the process steps of the invention for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement, and/or the service provider can receive payment from the sale of advertising content to one or more third parties.
  • The flow diagrams depicted herein are provided by way of example. There may be variations to these diagrams or the steps (or operations) described herein without departing from the spirit of the invention. For instance, in certain cases, the steps may be performed in differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the present invention as recited in the appended claims.
  • While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.

Claims (2)

1. A computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
loading, by a computing system, a predefined set of rules from a rule file in a repository coupled to said computing system, wherein each rule of said predefined set of rules is expressed in an Extensible Markup Language (XML) format;
receiving, by a computing system, an input phrase expressed in a natural language;
determining, by said computing system, that said input phrase is a plain text pattern, wherein said determining that said input phrase is said plain text pattern includes determining that said input phrase is not a regular expression;
automatically tokenizing, by said computing system, said plain text pattern, wherein said automatically tokenizing includes automatically generating a first token list;
automatically applying, by said computing system, one or more rules to said first token list, wherein said automatically applying includes applying said one or more rules in an order specified by said predefined set of rules, automatically modifying said first token list and automatically generating a modified token list in response to said automatically modifying said first token list, wherein said one or more rules are included in said predefined set of rules, wherein said automatically modifying said first token list includes applying a predefined modification operator to said first token list, wherein said predefined modification operator is included in a rule of said one or more rules, wherein said predefined modification operator is an operator selected from the group consisting of a replace word operator, a split-at-character operator, and a whitespace operator, wherein said automatically modifying said first token list further includes:
replacing a sequence of one or more tokens in said first token list with a replacement regular expression specified by said rule if said predefined modification operator is said replace word operator,
detecting a character specified by said rule and splitting a token of said first token list into two tokens in response to said detecting said character if said predefined modification operator is said split-at-character operator, and
replacing whitespace in said first token list with a replacement regular expression specified by said rule if said predefined modification operator is said whitespace operator; and
automatically converting, by said computing system, said modified token list into a regular expression, wherein said regular expression matches said plain text pattern and one or more variations of said plain text pattern.
2-20. (canceled)
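
The steps recited in claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the XML element and attribute names, the operator spellings, the function names, and the plain-text check are all hypothetical. The replace word operator is simplified to single tokens, and the treatment of the split character as an optional separator is an assumption, since the claim only states that the token is split in two.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file; element and attribute names are assumptions.
RULES_XML = r"""
<rules>
  <rule operator="split-at-character" char="-"/>
  <rule operator="replace-word" match="works" replacement="work(?:s|ed|ing)?"/>
  <rule operator="whitespace" replacement="\s+"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the rules, preserving file order (the order of application)."""
    return [dict(rule.attrib) for rule in ET.fromstring(xml_text).iter("rule")]


def is_plain_text(phrase):
    """Crude stand-in: input without regex metacharacters is plain text."""
    return re.search(r"[\\^$|?*+()\[\]{}]", phrase) is None


def tokenize(phrase):
    """First token list: ('space', run) for whitespace, ('text', word) otherwise."""
    return [("space" if part.isspace() else "text", part)
            for part in re.findall(r"\s+|\S+", phrase)]


def apply_rule(tokens, rule):
    """Apply one modification operator and return the modified token list."""
    op = rule["operator"]
    out = []
    for kind, value in tokens:
        if op == "replace-word" and kind == "text" and value.lower() == rule["match"].lower():
            out.append(("regex", rule["replacement"]))
        elif op == "split-at-character" and kind == "text" and rule["char"] in value:
            left, right = value.split(rule["char"], 1)
            # Assumption: the separator becomes optional between the halves.
            separator = f"[{re.escape(rule['char'])}\\s]?"
            out += [("text", left), ("regex", separator), ("text", right)]
        elif op == "whitespace" and kind == "space":
            out.append(("regex", rule["replacement"]))
        else:
            out.append((kind, value))
    return out


def to_regex(tokens):
    """Convert the modified token list into one regular expression string."""
    return "".join(v if kind == "regex" else re.escape(v) for kind, v in tokens)


def generate(phrase, rules):
    if not is_plain_text(phrase):
        return phrase  # treated as an existing regular expression
    tokens = tokenize(phrase)
    for rule in rules:
        tokens = apply_rule(tokens, rule)
    return to_regex(tokens)


rules = load_rules(RULES_XML)
print(generate("works for", rules))  # work(?:s|ed|ing)?\s+for
```

With these hypothetical rules, the generated expression matches the input phrase works for as well as variations such as working for and worked for, and generate("e-mail address", rules) yields an expression that also accepts email address.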

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/850,987 US20090070327A1 (en) 2007-09-06 2007-09-06 Method for automatically generating regular expressions for relaxed matching of text patterns
US12/125,290 US8484238B2 (en) 2007-09-06 2008-05-22 Automatically generating regular expressions for relaxed matching of text patterns

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/850,987 US20090070327A1 (en) 2007-09-06 2007-09-06 Method for automatically generating regular expressions for relaxed matching of text patterns

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/125,290 Continuation US8484238B2 (en) 2007-09-06 2008-05-22 Automatically generating regular expressions for relaxed matching of text patterns

Publications (1)

Publication Number Publication Date
US20090070327A1 true US20090070327A1 (en) 2009-03-12

Family

ID=40432984

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/850,987 Abandoned US20090070327A1 (en) 2007-09-06 2007-09-06 Method for automatically generating regular expressions for relaxed matching of text patterns
US12/125,290 Expired - Fee Related US8484238B2 (en) 2007-09-06 2008-05-22 Automatically generating regular expressions for relaxed matching of text patterns

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/125,290 Expired - Fee Related US8484238B2 (en) 2007-09-06 2008-05-22 Automatically generating regular expressions for relaxed matching of text patterns

Country Status (1)

Country Link
US (2) US20090070327A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050132034A1 (en) * 2003-12-10 2005-06-16 Iglesia Erik D.L. Rule parser
US20050132198A1 (en) * 2003-12-10 2005-06-16 Ahuja Ratinder P.S. Document de-registration
US20100174718A1 (en) * 2009-01-05 2010-07-08 International Business Machines Corporation Indexing for Regular Expressions in Text-Centric Applications
US20100191732A1 (en) * 2004-08-23 2010-07-29 Rick Lowe Database for a capture system
US20100312764A1 (en) * 2005-10-04 2010-12-09 West Services Inc. Feature engineering and user behavior analysis
US20110004599A1 (en) * 2005-08-31 2011-01-06 Mcafee, Inc. A system and method for word indexing in a capture system and querying thereof
US20110093258A1 (en) * 2009-10-15 2011-04-21 2167959 Ontario Inc. System and method for text cleaning
US20110149959A1 (en) * 2005-08-12 2011-06-23 Mcafee, Inc., A Delaware Corporation High speed packet capture
US20110167212A1 (en) * 2004-08-24 2011-07-07 Mcafee, Inc., A Delaware Corporation File system for a capture system
US20110197284A1 (en) * 2006-05-22 2011-08-11 Mcafee, Inc., A Delaware Corporation Attributes of captured objects in a capture system
US20110208861A1 (en) * 2004-06-23 2011-08-25 Mcafee, Inc. Object classification in a capture system
US20120114119A1 (en) * 2010-11-04 2012-05-10 Ratinder Paul Singh Ahuja System and method for protecting specified data combinations
US8504537B2 (en) 2006-03-24 2013-08-06 Mcafee, Inc. Signature distribution in a document registration system
US8601537B2 (en) 2008-07-10 2013-12-03 Mcafee, Inc. System and method for data mining and security policy management
US20140059078A1 (en) * 2012-08-27 2014-02-27 Microsoft Corporation Semantic query language
US8667121B2 (en) 2009-03-25 2014-03-04 Mcafee, Inc. System and method for managing data and policies
US8700561B2 (en) 2011-12-27 2014-04-15 Mcafee, Inc. System and method for providing data protection workflows in a network environment
US8706709B2 (en) 2009-01-15 2014-04-22 Mcafee, Inc. System and method for intelligent term grouping
US8762386B2 (en) 2003-12-10 2014-06-24 Mcafee, Inc. Method and apparatus for data capture and analysis system
US8850591B2 (en) 2009-01-13 2014-09-30 Mcafee, Inc. System and method for concept building
US8918359B2 (en) 2009-03-25 2014-12-23 Mcafee, Inc. System and method for data mining and security policy management
US9195937B2 (en) 2009-02-25 2015-11-24 Mcafee, Inc. System and method for intelligent state management
US9235639B2 (en) 2013-03-28 2016-01-12 Hewlett Packard Enterprise Development Lp Filter regular expression
US9253154B2 (en) 2008-08-12 2016-02-02 Mcafee, Inc. Configuration management for a capture/registration system
CN106407168A (en) * 2016-09-06 2017-02-15 首都师范大学 Automatic generation method for practical writing
WO2019241428A1 (en) * 2018-06-13 2019-12-19 Oracle International Corporation User interface for regular expression generation
US11321525B2 (en) 2019-08-23 2022-05-03 Micro Focus Llc Generation of markup-language script representing identity management rule from natural language-based rule script defining identity management rule
US11354305B2 (en) 2018-06-13 2022-06-07 Oracle International Corporation User interface commands for regular expression generation
US11494558B2 (en) 2020-01-06 2022-11-08 Netiq Corporation Conversion of script with rule elements to a natural language format
US11580166B2 (en) 2018-06-13 2023-02-14 Oracle International Corporation Regular expression generation using span highlighting alignment
US11941018B2 (en) 2018-06-13 2024-03-26 Oracle International Corporation Regular expression generation for negative example using context

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7730011B1 (en) 2005-10-19 2010-06-01 Mcafee, Inc. Attributes of captured objects in a capture system
US8812459B2 (en) * 2009-04-01 2014-08-19 Touchstone Systems, Inc. Method and system for text interpretation and normalization
US11423029B1 (en) * 2010-11-09 2022-08-23 Google Llc Index-side stem-based variant generation
US9317499B2 (en) 2013-04-11 2016-04-19 International Business Machines Corporation Optimizing generation of a regular expression
US9898467B1 (en) * 2013-09-24 2018-02-20 Amazon Technologies, Inc. System for data normalization
US9471875B2 (en) * 2013-12-31 2016-10-18 International Business Machines Corporation Using ontologies to comprehend regular expressions
CN105868166B (en) * 2015-01-22 2020-01-17 阿里巴巴集团控股有限公司 Regular expression generation method and system
US9916296B2 (en) 2015-09-24 2018-03-13 International Business Machines Corporation Expanding entity and relationship patterns to a collection of document annotators using run traces
WO2017131774A1 (en) * 2016-01-29 2017-08-03 AppDynamics, Inc. Log event summarization for distributed server system
US9767094B1 (en) 2016-07-07 2017-09-19 International Business Machines Corporation User interface for supplementing an answer key of a question answering system using semantically equivalent variants of natural language expressions
US9910848B2 (en) 2016-07-07 2018-03-06 International Business Machines Corporation Generating semantic variants of natural language expressions using type-specific templates
US9928235B2 (en) 2016-07-07 2018-03-27 International Business Machines Corporation Type-specific rule-based generation of semantic variants of natural language expression
US10474750B1 (en) * 2017-03-08 2019-11-12 Amazon Technologies, Inc. Multiple information classes parsing and execution
CN110928793B (en) * 2019-11-28 2023-07-28 Oppo广东移动通信有限公司 Regular expression detection method and device and computer readable storage medium
US11520831B2 (en) * 2020-06-09 2022-12-06 Servicenow, Inc. Accuracy metric for regular expression

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040225999A1 (en) * 2003-05-06 2004-11-11 Andrew Nuss Grammer for regular expressions
US20060020937A1 (en) * 2004-07-21 2006-01-26 Softricity, Inc. System and method for extraction and creation of application meta-information within a software application repository

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6754650B2 (en) * 2001-05-08 2004-06-22 International Business Machines Corporation System and method for regular expression matching using index
US6842796B2 (en) * 2001-07-03 2005-01-11 International Business Machines Corporation Information extraction from documents with regular expression matching
US7502788B2 (en) * 2005-11-08 2009-03-10 International Business Machines Corporation Method for retrieving constant values using regular expressions

Cited By (72)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9092471B2 (en) 2003-12-10 2015-07-28 Mcafee, Inc. Rule parser
US20050132198A1 (en) * 2003-12-10 2005-06-16 Ahuja Ratinder P.S. Document de-registration
US8762386B2 (en) 2003-12-10 2014-06-24 Mcafee, Inc. Method and apparatus for data capture and analysis system
US20050132034A1 (en) * 2003-12-10 2005-06-16 Iglesia Erik D.L. Rule parser
US8656039B2 (en) 2003-12-10 2014-02-18 Mcafee, Inc. Rule parser
US8548170B2 (en) 2003-12-10 2013-10-01 Mcafee, Inc. Document de-registration
US9374225B2 (en) 2003-12-10 2016-06-21 Mcafee, Inc. Document de-registration
US20110208861A1 (en) * 2004-06-23 2011-08-25 Mcafee, Inc. Object classification in a capture system
US20100191732A1 (en) * 2004-08-23 2010-07-29 Rick Lowe Database for a capture system
US8560534B2 (en) 2004-08-23 2013-10-15 Mcafee, Inc. Database for a capture system
US20110167212A1 (en) * 2004-08-24 2011-07-07 Mcafee, Inc., A Delaware Corporation File system for a capture system
US8707008B2 (en) 2004-08-24 2014-04-22 Mcafee, Inc. File system for a capture system
US20110149959A1 (en) * 2005-08-12 2011-06-23 Mcafee, Inc., A Delaware Corporation High speed packet capture
US8730955B2 (en) 2005-08-12 2014-05-20 Mcafee, Inc. High speed packet capture
US20110004599A1 (en) * 2005-08-31 2011-01-06 Mcafee, Inc. A system and method for word indexing in a capture system and querying thereof
US8554774B2 (en) 2005-08-31 2013-10-08 Mcafee, Inc. System and method for word indexing in a capture system and querying thereof
US9552420B2 (en) * 2005-10-04 2017-01-24 Thomson Reuters Global Resources Feature engineering and user behavior analysis
US20100312764A1 (en) * 2005-10-04 2010-12-09 West Services Inc. Feature engineering and user behavior analysis
US10387462B2 (en) 2005-10-04 2019-08-20 Thomson Reuters Global Resources Unlimited Company Feature engineering and user behavior analysis
US8504537B2 (en) 2006-03-24 2013-08-06 Mcafee, Inc. Signature distribution in a document registration system
US9094338B2 (en) 2006-05-22 2015-07-28 Mcafee, Inc. Attributes of captured objects in a capture system
US8683035B2 (en) 2006-05-22 2014-03-25 Mcafee, Inc. Attributes of captured objects in a capture system
US20110197284A1 (en) * 2006-05-22 2011-08-11 Mcafee, Inc., A Delaware Corporation Attributes of captured objects in a capture system
US8601537B2 (en) 2008-07-10 2013-12-03 Mcafee, Inc. System and method for data mining and security policy management
US8635706B2 (en) 2008-07-10 2014-01-21 Mcafee, Inc. System and method for data mining and security policy management
US9253154B2 (en) 2008-08-12 2016-02-02 Mcafee, Inc. Configuration management for a capture/registration system
US10367786B2 (en) 2008-08-12 2019-07-30 Mcafee, Llc Configuration management for a capture/registration system
US20100174718A1 (en) * 2009-01-05 2010-07-08 International Business Machines Corporation Indexing for Regular Expressions in Text-Centric Applications
US8548979B2 (en) * 2009-01-05 2013-10-01 International Business Machines Corporation Indexing for regular expressions in text-centric applications
US8266135B2 (en) * 2009-01-05 2012-09-11 International Business Machines Corporation Indexing for regular expressions in text-centric applications
US8850591B2 (en) 2009-01-13 2014-09-30 Mcafee, Inc. System and method for concept building
US8706709B2 (en) 2009-01-15 2014-04-22 Mcafee, Inc. System and method for intelligent term grouping
US9602548B2 (en) 2009-02-25 2017-03-21 Mcafee, Inc. System and method for intelligent state management
US9195937B2 (en) 2009-02-25 2015-11-24 Mcafee, Inc. System and method for intelligent state management
US9313232B2 (en) 2009-03-25 2016-04-12 Mcafee, Inc. System and method for data mining and security policy management
US8918359B2 (en) 2009-03-25 2014-12-23 Mcafee, Inc. System and method for data mining and security policy management
US8667121B2 (en) 2009-03-25 2014-03-04 Mcafee, Inc. System and method for managing data and policies
US20110093258A1 (en) * 2009-10-15 2011-04-21 2167959 Ontario Inc. System and method for text cleaning
US8868469B2 (en) 2009-10-15 2014-10-21 Rogers Communications Inc. System and method for phrase identification
US8380492B2 (en) 2009-10-15 2013-02-19 Rogers Communications Inc. System and method for text cleaning by classifying sentences using numerically represented features
US20110093414A1 (en) * 2009-10-15 2011-04-21 2167959 Ontario Inc. System and method for phrase identification
US11316848B2 (en) * 2010-11-04 2022-04-26 Mcafee, Llc System and method for protecting specified data combinations
US10313337B2 (en) 2010-11-04 2019-06-04 Mcafee, Llc System and method for protecting specified data combinations
US20120114119A1 (en) * 2010-11-04 2012-05-10 Ratinder Paul Singh Ahuja System and method for protecting specified data combinations
US8806615B2 (en) * 2010-11-04 2014-08-12 Mcafee, Inc. System and method for protecting specified data combinations
US20150067810A1 (en) * 2010-11-04 2015-03-05 Ratinder Paul Singh Ahuja System and method for protecting specified data combinations
US10666646B2 (en) 2010-11-04 2020-05-26 Mcafee, Llc System and method for protecting specified data combinations
US9794254B2 (en) * 2010-11-04 2017-10-17 Mcafee, Inc. System and method for protecting specified data combinations
US9430564B2 (en) 2011-12-27 2016-08-30 Mcafee, Inc. System and method for providing data protection workflows in a network environment
US8700561B2 (en) 2011-12-27 2014-04-15 Mcafee, Inc. System and method for providing data protection workflows in a network environment
US20170220673A1 (en) * 2012-08-27 2017-08-03 Microsoft Technology Licensing, Llc Semantic query language
US10579656B2 (en) * 2012-08-27 2020-03-03 Microsoft Technology Licensing, Llc Semantic query language
US20140059078A1 (en) * 2012-08-27 2014-02-27 Microsoft Corporation Semantic query language
US9659082B2 (en) * 2012-08-27 2017-05-23 Microsoft Technology Licensing, Llc Semantic query language
US9235639B2 (en) 2013-03-28 2016-01-12 Hewlett Packard Enterprise Development Lp Filter regular expression
CN106407168A (en) * 2016-09-06 2017-02-15 首都师范大学 Automatic generation method for practical writing
US11941018B2 (en) 2018-06-13 2024-03-26 Oracle International Corporation Regular expression generation for negative example using context
US11354305B2 (en) 2018-06-13 2022-06-07 Oracle International Corporation User interface commands for regular expression generation
WO2019241428A1 (en) * 2018-06-13 2019-12-19 Oracle International Corporation User interface for regular expression generation
CN112236763A (en) * 2018-06-13 2021-01-15 甲骨文国际公司 Regular expression generation using longest common subsequence algorithm on regular expression code
US11263247B2 (en) * 2018-06-13 2022-03-01 Oracle International Corporation Regular expression generation using longest common subsequence algorithm on spans
US11269934B2 (en) 2018-06-13 2022-03-08 Oracle International Corporation Regular expression generation using combinatoric longest common subsequence algorithms
WO2019241425A1 (en) * 2018-06-13 2019-12-19 Oracle International Corporation Regular expression generation based on positive and negative pattern matching examples
WO2019241416A1 (en) * 2018-06-13 2019-12-19 Oracle International Corporation Regular expression generation using longest common subsequence algorithm on regular expression codes
US11321368B2 (en) 2018-06-13 2022-05-03 Oracle International Corporation Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes
US11347779B2 (en) 2018-06-13 2022-05-31 Oracle International Corporation User interface for regular expression generation
WO2019241422A1 (en) * 2018-06-13 2019-12-19 Oracle International Corporation Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes
US11797582B2 (en) 2018-06-13 2023-10-24 Oracle International Corporation Regular expression generation based on positive and negative pattern matching examples
US11580166B2 (en) 2018-06-13 2023-02-14 Oracle International Corporation Regular expression generation using span highlighting alignment
US11755630B2 (en) 2018-06-13 2023-09-12 Oracle International Corporation Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes
US11321525B2 (en) 2019-08-23 2022-05-03 Micro Focus Llc Generation of markup-language script representing identity management rule from natural language-based rule script defining identity management rule
US11494558B2 (en) 2020-01-06 2022-11-08 Netiq Corporation Conversion of script with rule elements to a natural language format

Also Published As

Publication number Publication date
US20090070328A1 (en) 2009-03-12
US8484238B2 (en) 2013-07-09

Similar Documents

Publication Publication Date Title
US8484238B2 (en) Automatically generating regular expressions for relaxed matching of text patterns
US11113304B2 (en) Techniques for creating computer generated notes
Padró et al. Freeling 3.0: Towards wider multilinguality
Howard et al. Automatically mining software-based, semantically-similar words from comment-code mappings
KR101139903B1 (en) Semantic processor for recognition of Whole-Part relations in natural language documents
US9846692B2 (en) Method and system for machine-based extraction and interpretation of textual information
US9678941B2 (en) Domain-specific computational lexicon formation
US20160292153A1 (en) Identification of examples in documents
TW201314476A (en) Automated self-service user support based on ontology
EP2162833A1 (en) A method, system and computer program for intelligent text annotation
Hill Integrating natural language and program structure information to improve software search and exploration
US10606903B2 (en) Multi-dimensional query based extraction of polarity-aware content
AU2018250372A1 (en) Method to construct content based on a content repository
Nguyen et al. Vietnamese treebank construction and entropy-based error detection
Fatima et al. STEMUR: An automated word conflation algorithm for the Urdu language
Deeptimahanti et al. An innovative approach for generating static UML models from natural language requirements
JP3139658B2 (en) Document display method
JP2997469B2 (en) Natural language understanding method and information retrieval device
WO2020026229A2 (en) Proposition identification in natural language and usage thereof
CN112836477B (en) Method and device for generating code annotation document, electronic equipment and storage medium
Love Benchmarking the performance of Two Automated Term-extraction systems: LOGOS and ATAO
WO2001024053A2 (en) System and method for automatic context creation for electronic documents
Plant et al. A natural language help system shell through functional programming
JP2001034630A (en) System and method for document base retrieval
Pasquale Automatic generation of a navigation tree for conversational web browsing

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LOESER, ALEXANDER STEPHAN;RAGHAVEN, SRIRAM;VAITHYANATHAN, SHIVAKUMAR;REEL/FRAME:019791/0695;SIGNING DATES FROM 20070829 TO 20070831

AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE THE LAST NAME OF THE SECOND ASSIGNOR PREVIOUSLY RECORDED ON REEL 019791 FRAME 0695;ASSIGNORS:LOESER, ALEXANDER STEPHAN;RAGSHAVAN, SRIRAM;VAITHYANATHAN, SHIVAKUMAR;REEL/FRAME:019928/0045;SIGNING DATES FROM 20070829 TO 20070831

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION