WO2016114433A1 - Unstructured data processing system and method

Info

Publication number: WO2016114433A1
Application number: PCT/KR2015/000498
Authority: WO
Inventors: 이경일; 김아로; 김선호
Original assignee: 주식회사 솔트룩스
Priority date: 2015-01-16
Filing date: 2015-01-16
Publication date: 2016-07-21

Abstract

An unstructured data processing system and method are disclosed. The unstructured data processing system according to an exemplary embodiment of the present invention can comprise: a pattern providing unit for providing a pattern of unstructured data on the basis of an unstructured data format; a rule providing unit for making at least one item correspond to the pattern and providing a rule including a correspondence relationship of the item and the pattern; and a rule execution engine for generating an attribute by applying the rule to the unstructured data.

Description

Unstructured Data Processing Systems and Methods

The technical idea of the present invention relates to a system and method for processing unstructured data, and more particularly, to a system and method for extracting features from unstructured data.

The present invention is derived from research conducted by Saltlux Co., Ltd. as part of the SW Computing Industry Source Technology Development Project (SW) of the Ministry of Science, ICT and Future Planning. [Research period: 2014.05.01 ~ 2015.02.28, Specialized research management organization: Information and communication technology research promotion center, Project title: WiseKB: Development of self-learning knowledge base and reasoning technology based on big data understanding, Assignment number: 10044494]

A knowledge base can be constructed by classifying collected data according to a lexical system and storing it in a database. The data collected to build a knowledge base can come from a variety of sources. For example, it may be data collected online through the Internet, such as news, scholarly information, and dictionaries, or data obtained from another pre-built knowledge base (e.g., an expertise base). The data may also be collected offline or input directly by a user. In addition, the collected data can vary widely in format. For example, it may be text-based data, image-based data, or voice- and video-based data. Accordingly, extracting the necessary information from various kinds of data and managing the extracted information may be very important in building a knowledge base.

The technical idea of the present invention provides an unstructured data processing system and method for effectively extracting features from unstructured data.

In order to achieve the above object, an unstructured data processing system according to an aspect of the present invention may include a data interface unit for receiving unstructured data from the outside, and a characteristic information generation unit including a feature extraction unit for extracting a characteristic of the unstructured data and a characteristic relationship setting unit for generating characteristic information by setting relationship information for the characteristic. The feature extraction unit may include a pattern providing unit for providing a pattern of the unstructured data based on a format of the unstructured data, a rule providing unit for making at least one item correspond to the pattern and providing a rule including a correspondence between the item and the pattern, and a rule execution engine for generating the characteristic by applying the rule to the unstructured data.

According to an exemplary embodiment of the present invention, the unstructured data processing system may further include a data storage unit including a pattern storage unit for storing a plurality of patterns and a rule storage unit for storing a plurality of rules, and a user interface configured to receive an input signal from a user and to provide an output signal. The pattern providing unit may generate a pattern based on the input signal and store the pattern in the pattern storage unit, and the rule providing unit may generate a rule based on the input signal and store the rule in the rule storage unit.

According to an exemplary embodiment of the present invention, the pattern providing unit may include a pattern recommendation unit for providing at least one recommendation pattern selected from a plurality of patterns stored in the pattern storage unit based on a format of the unstructured data, a pattern definition unit that determines a pattern corresponding to the unstructured data based on the input signal and the recommendation pattern, and a pattern execution engine that extracts data included in an information area from the unstructured data based on the pattern defined by the pattern definition unit.

According to an exemplary embodiment of the present invention, the pattern recommendation unit may select the recommendation pattern based on the type or source of the knowledge data.

According to an exemplary embodiment of the present invention, the pattern definition unit may identify at least one information area by analyzing the format of the unstructured data, and may set the information area as part of the pattern or exclude it from the pattern based on the input signal and/or the recommendation pattern.

According to an exemplary embodiment of the present invention, the pattern definition unit may group a plurality of information areas having the same format.

According to an exemplary embodiment of the present invention, the rule providing unit may include a rule recommendation unit for providing at least one recommendation rule selected from a plurality of rules stored in the rule storage unit based on data extracted from the information area of the unstructured data according to the pattern, and a rule definition unit for defining a rule corresponding to the unstructured data based on the input signal and/or the recommendation rule.

According to an exemplary embodiment of the present invention, the rule recommendation unit may select the recommendation rule further based on the type or source of the knowledge data.

According to an exemplary embodiment of the present invention, the rule definition unit may identify an item corresponding to the information area by analyzing the extracted data, and may map the information area to the item.

According to an exemplary embodiment of the present invention, the rule definition unit may store a plurality of candidate items, update the candidate items based on the input signal, and map the information area to one of the candidate items.

According to an exemplary embodiment of the present invention, the data storage unit may further include a knowledge data storage unit for storing knowledge data, and the unstructured data processing system may further include a knowledge data manager configured to convert the characteristic information into knowledge data based on external knowledge data received from the data interface unit and the knowledge data stored in the knowledge data storage unit, and to verify the converted knowledge data.

According to an exemplary embodiment of the present invention, the characteristic information generation unit may further include a feature extraction management unit that classifies the unstructured data according to data type and generates a control signal to change the extraction method according to the corresponding data type. The pattern providing unit may analyze the format of the unstructured data based on the control signal.

According to the unstructured data processing system and method according to the technical idea of the present invention, information included in unstructured data can be effectively extracted by using patterns and rules.

In addition, according to the unstructured data processing system and method according to the technical idea of the present invention, a plurality of patterns and rules are maintained and a pattern and a rule suitable for the received unstructured data are recommended, so that valid information can be extracted automatically from the unstructured data.

FIG. 1 is a block diagram illustrating an unstructured data processing system according to an exemplary embodiment of the present invention.

FIG. 2 is a block diagram illustrating an implementation of the feature extraction unit of FIG. 1 according to an exemplary embodiment of the present invention.

FIGS. 3 to 5 are diagrams for describing an operation of the feature extraction unit of FIG. 1. FIG. 4 is a diagram illustrating an example of a rule.

FIG. 6 is a block diagram illustrating an implementation of the pattern provider of FIG. 2 according to an exemplary embodiment of the present invention.

FIG. 7 is a block diagram illustrating an implementation of the rule provider of FIG. 2 according to an exemplary embodiment of the present invention.

FIG. 8 is a flowchart schematically illustrating a method of processing unstructured data according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The embodiments are provided to explain the present invention more completely to those skilled in the art. As the inventive concept allows for various changes and numerous embodiments, particular embodiments will be illustrated in the drawings and described in detail in the written description. However, this is not intended to limit the present invention to the specific disclosed forms, and it should be understood that all modifications, equivalents, and substitutes included in the spirit and scope of the present invention are covered. In describing each drawing, like reference numerals are used for like elements. In the accompanying drawings, the dimensions of structures may be enlarged or reduced relative to their actual size for clarity.

The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, the terms "comprise" and "have" are intended to indicate the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and do not exclude the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be construed as having meanings consistent with their meanings in the context of the related art, and shall not be construed in an idealized or excessively formal sense unless expressly so defined herein.

FIG. 1 is a block diagram illustrating an unstructured data processing system 10 according to an exemplary embodiment of the present invention. As shown in FIG. 1, the unstructured data processing system 10 may include a data interface unit 100, a characteristic information generator 200, a user interface 300, and a data storage unit 400. The unstructured data processing system 10 shown in FIG. 1 may be referred to as a knowledge base construction system. In the following, each of the components may be a hardware block or a software block. For example, the components may be independent hardware blocks that communicate with each other, or may be software blocks that are executed on one processor.

The data interface unit 100 may receive data from a data pool external to the unstructured data processing system 10. The data pool may refer to any place where data can be generated, retained, and distributed, such as the Internet, a database, cloud sourcing, or a social network. In addition, the data pool may include data provided directly to the unstructured data processing system 10 by the public or by individuals.

The data interface unit 100 may receive unstructured data (also called informal data) or knowledge data from the data pool. Unstructured data is data that does not have a fixed form, in contrast to structured data (formal data), whose contents are stored in corresponding fields. For example, a database or a spreadsheet may be structured data, while text documents, audio data, and image data may be unstructured data. Data such as XML or HTML, which is not stored in fixed fields but includes metadata or a schema, may be classified as semi-structured data; note, however, that this description may proceed on the premise that such data is a kind of unstructured data. Unstructured data may be generated, retained, and distributed through cloud sourcing or social networks, among the examples of data pools described above.

The above-described structured data or unstructured data may be referred to as data before processing, and secondary data obtained by processing it meaningfully may be referred to as information. The knowledge data received by the data interface unit 100 may be meta-information about how to use the information. For example, wind speed, wind direction, and humidity obtained by observing the climate may correspond to data, and the weather predicted by modeling the data may correspond to information. A conclusion reached through trial and error and analysis of accumulated information, for example, that the driving accident rate increases when it snows, may correspond to knowledge data. Hereinafter, in order to distinguish between knowledge data input from the outside and knowledge data generated and managed by the unstructured data processing system 10, the former is referred to as external knowledge data and the latter as internal knowledge data. The data interface unit 100 may receive external knowledge data from Wiki, DBpedia, FreeBase, or the like.

As such, the data interface unit 100 may automatically receive unstructured data or external knowledge data from the outside through a search engine. In addition, the data interface unit 100 may receive the unstructured data or the external knowledge data from the data pool in response to a request of the characteristic information generation unit 200 or the knowledge data management unit 500, or a request generated by another functional block of the unstructured data processing system 10.

The user interface 300 may exchange signals with an external user of the unstructured data processing system 10. For example, the user may input an input signal for setting a method of analyzing the unstructured data through the user interface 300. In addition, the user interface 300 may provide an output signal indicating a result of analyzing the unstructured data to the user.

Although the data interface unit 100 and the user interface 300 are shown as independent components in the example of FIG. 1, this is only an example, and it will be understood that the technical spirit of the present invention is not limited thereto. For example, when the unstructured data processing system 10 receives data through the Internet and also exchanges signals with a user through the Internet, the unstructured data processing system 10 may exchange information with the outside through a single interface unit.

The characteristic information generator 200 extracts a characteristic of the input unstructured data, sets relationship information for the characteristic, and generates characteristic information of the unstructured data. The characteristic information generator 200 may include a characteristic extraction manager 220, a characteristic extractor 240, and a characteristic relationship setter 260.

The feature extraction management unit 220 may classify the unstructured data according to the data type, and generate a control signal to change the method of extracting the feature according to the corresponding data type. For example, when the unstructured data is text-based data, the feature extraction manager 220 may generate a control signal to extract a feature based on the frequency of words included in the unstructured data. Alternatively, when the data type of the unstructured data is audio or video, the feature extraction manager 220 may generate a control signal to extract the feature based on the frequency spectrum of the unstructured data.

The feature extractor 240 may extract a feature from the unstructured data in response to the control signal. For example, the feature extractor 240 may extract words having a high frequency of occurrence as features. Alternatively, the feature extractor 240 may define an object in the image according to the analysis result of the frequency spectrum. In this case, the feature extractor 240 may extract an object such as an eye, a nose, and a mouth as a feature from the face image. The feature extractor 240 may include a module (not shown) for converting a format from the frequency spectrum into an object.
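The interplay between the control signal and word-frequency extraction can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the string-valued control signal, and the tokenization rule are all assumptions made for the example.

```python
from collections import Counter
import re


def make_control_signal(data_type: str) -> str:
    """Hypothetical control signal: names the extraction method for a data type."""
    if data_type == "text":
        return "word_frequency"
    if data_type in ("audio", "video"):
        return "frequency_spectrum"
    raise ValueError(f"unsupported data type: {data_type}")


def extract_frequent_words(text: str, top_n: int = 3) -> list[str]:
    """Return the top_n most frequent words as characteristics."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(top_n)]


def extract_characteristics(data: str, data_type: str) -> list[str]:
    """Choose the extraction method from the control signal, then apply it."""
    signal = make_control_signal(data_type)
    if signal == "word_frequency":
        return extract_frequent_words(data)
    # Spectrum-based extraction is outside the scope of this sketch.
    raise NotImplementedError(signal)
```

For the text "phone phone phone tv tv radio", the sketch yields the words ordered by descending frequency.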

The characteristic relationship setting unit 260 may set relationship information on the characteristic by assigning semantic information to the characteristic extracted from the characteristic extracting unit 240. For example, the characteristic relationship setting unit 260 may assign semantic information to a word having a high frequency by tagging the entity name using a lexical dictionary. Furthermore, the characteristic relationship setting unit 260 may analyze the association relationship between at least two semantic information on the characteristic and give newly set or generated semantic information to the characteristic. For example, when the word included in the text is a mobile phone, a home appliance, or the like, the characteristic relationship setting unit 260 may assign meaning information of electronic products to these characteristics. In this case, the characteristic relationship setting unit 260 may perform the above analysis by using internal knowledge data stored in the knowledge data storage unit 420 of the data storage unit 400.
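As a rough illustration of assigning semantic information with a lexical dictionary, the sketch below tags each extracted word and then looks for a category shared by two or more words. The dictionary contents and function names are hypothetical, chosen only to mirror the mobile phone and home appliance example above.

```python
# Hypothetical lexical dictionary: word -> semantic category.
LEXICON = {
    "mobile phone": "electronic product",
    "home appliance": "electronic product",
    "sofa": "furniture",
}


def assign_semantics(characteristics):
    """Attach semantic information to each characteristic (None if unknown)."""
    return {c: LEXICON.get(c) for c in characteristics}


def shared_semantics(tagged):
    """Return the categories shared by at least two characteristics."""
    counts = {}
    for category in tagged.values():
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return sorted(c for c, n in counts.items() if n >= 2)
```

In a fuller system the dictionary lookup would presumably be backed by the internal knowledge data in the knowledge data storage unit 420; a plain in-memory mapping stands in for it here.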

The characteristic information generated as described above is transmitted to the knowledge data manager 500. The knowledge data management unit 500 converts the characteristic information received from the characteristic information generation unit 200 into internal knowledge data based on the knowledge data received from the data interface unit 100, and verifies the converted internal knowledge data by a heterogeneous verification method. To this end, the knowledge data management unit 500 may include a knowledge data conversion unit (not shown) and a conversion verification unit (not shown).

The knowledge data converter may convert the characteristic information into structured data using semantic technology. Semantic technology refers to intelligent technology that defines a language and rules a computer can understand, so that computers can communicate with each other in the way that people read a screen and understand its meaning. Semantic technology aims to express the relationships and meanings between objects belonging to an environment in the form of an ontology that can be processed by a machine, that is, a computer, and to have them processed automatically by the machine. An ontology is a model that abstracts and shares what people think about things; it formally and explicitly defines the types of concepts and the constraints on their use. In computer science, an ontology is a data model that represents a specific domain and is defined as structured data describing concepts and the relationships between them. An ontology is a tool for implementing semantic technology: it is used to connect data semantically, and it allows human concepts of things to be processed by a computer in the form of a database.

In the field of semantic technology, the triple is used as a means of expressing a relationship. A triple expresses a concept in the form of a subject, a predicate, and an object. Each of the subject, predicate, and object can be expressed as a Uniform Resource Identifier (URI) in XML. Currently, the standard languages for describing semantic web ontologies include RDF and OWL, proposed by the W3C, and Topic Maps, standardized by ISO.
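A triple can be represented very simply in code. The sketch below is a minimal illustration using only the Python standard library; the `http://example.org/` namespace and the helper names are placeholders invented for this example, and the output line follows the N-Triples layout of RDF (each term in angle brackets, terminated by a period).

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


BASE = "http://example.org/"  # placeholder namespace, not from the source


def uri(name: str) -> str:
    """Build an illustrative URI for a local name."""
    return BASE + name.replace(" ", "_")


def make_triple(subject: str, predicate: str, obj: str) -> Triple:
    """Express a concept as (subject, predicate, object), each as a URI."""
    return Triple(uri(subject), uri(predicate), uri(obj))


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as an N-Triples style line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."
```

For instance, `make_triple("A", "marriedTo", "B")` captures the statement that A is married to B as three URIs.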

The knowledge data converting unit may use the external knowledge data in converting the characteristic information into internal knowledge data in triple form. For example, the knowledge data converter may use external knowledge data to form internal knowledge data by connecting the subject, predicate, and object included in the characteristic information, or to connect additional objects. For example, using knowledge data about the wedding of A and B in Wikipedia, the knowledge data converting unit may convert the characteristic information of person A, person B, and marriage, such as "A married B," into internal knowledge data such as "A and B were married at the Hyatt Hotel on August 10, 2013." Since this example is only illustrative, it may not strictly follow the classification of data, information, and knowledge described above.

The knowledge data converting unit may assign weights to the characteristic information (or the characteristic) or to the external knowledge data in converting the characteristic information into internal knowledge data in triple form. For example, for characteristic information implying semantic information about characteristics such as furniture and household appliances included in an arbitrary text, the weight for household appliances may be set higher than that for furniture in consideration of other characteristics included in the text, and internal knowledge data related to newlyweds may be generated. Alternatively, the knowledge data conversion unit may assign a higher weight to the characteristic information of person A, person B, and marriage than to external knowledge data of person A, person C, love affair, and unmarried, so that internal knowledge data indicating that A is married is generated based on the characteristic information, contrary to the external knowledge data indicating that A is unmarried. In this case, the knowledge data converting unit may first generate pending internal knowledge data, such as "unconfirmed," for the marital status of A, and then generate the final internal knowledge data for the marital status of A on the basis of accumulated characteristic information or external knowledge data.
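One way to picture weighting with a pending state is the sketch below. The source does not specify how weights are combined or when a pending value becomes final, so the summing of weights, the `margin` threshold, and the function name are all assumptions made purely for illustration.

```python
def resolve(evidence, margin=1.0):
    """Pick a value from weighted evidence, or stay pending.

    evidence: list of (value, weight) pairs gathered from characteristic
    information and external knowledge data.
    margin: assumed threshold by which the leading value must beat the
    runner-up before it is treated as final.
    """
    totals = {}
    for value, weight in evidence:
        totals[value] = totals.get(value, 0.0) + weight
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return "unconfirmed"
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return "unconfirmed"
```

With a single piece of evidence on each side the result stays "unconfirmed"; once further evidence accumulates for one value and the gap reaches the margin, that value is returned.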

The conversion verification unit may verify the internal knowledge data (temporary internal knowledge data) generated from the knowledge data conversion unit by using a heterogeneous verification method and process the verified internal knowledge data. The internal knowledge data verified by the conversion verification unit is stored in the knowledge data storage unit 420 of the data storage unit 400.

FIG. 2 is a block diagram illustrating an implementation of the feature extraction unit 240 of FIG. 1 in accordance with an exemplary embodiment of the present invention. FIGS. 3 to 5 are diagrams for describing an operation of the feature extractor 240 of FIG. 1. Specifically, FIG. 3 is a diagram illustrating an example of a pattern, FIG. 4 is a diagram illustrating an example of a rule, and FIG. 5 is a diagram illustrating a characteristic generated from unstructured data by executing a rule.

As described above, the feature extractor 240 may extract a feature of the unstructured data. As shown in FIG. 2, the feature extractor 240 may include a pattern provider 242, a rule provider 244, and a rule execution engine 246. In addition, as shown in FIG. 2, the data storage unit 400 may further include a pattern storage unit 440 and a rule storage unit 460 as well as the knowledge data storage unit 420. Hereinafter, an embodiment in which the unstructured data received from the data interface unit 100 is a text-based document will be described, but this is only an example. It will be appreciated that, as described above, the unstructured data processing system according to the exemplary embodiment of the present invention can be applied to various types of unstructured data.

According to an exemplary embodiment of the present invention, the pattern provider 242 may provide a pattern corresponding to the unstructured data based on the format of the unstructured data. The pattern may serve as a reference used to extract an information area included in the unstructured data, and the information area may refer to an area of the unstructured data that contains useful information. For example, in an HTML document, which is a type of unstructured data, the information area may be a text area. The information area may be extracted from the unstructured data according to the pattern. The pattern used to extract the information area may be defined by the user through the user interface 300, or may be selected from a plurality of patterns stored in the pattern storage 440 of the data storage 400.

Referring to FIG. 3, a pattern may be used to extract text areas from an HTML document of a social network, which is a kind of unstructured data. As shown on the left side of FIG. 3, the HTML document of the social network may include a plurality of text areas separated from each other, and the pattern may distinguish a total of seven text areas by keywords included in a class. As shown on the right side of FIG. 3, the information areas, i.e., the text areas, extracted from the HTML document of the social network using the pattern may each include text. As such, the pattern provider 242 may extract the information area from the unstructured data by providing a pattern corresponding to the unstructured data based on the format of the unstructured data. The operation of the pattern provider 242 will be described in detail later with reference to FIG. 6.
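Extracting text areas by class keyword can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: it treats the pattern as a plain list of class keywords, which is an assumption, and it assumes well-formed markup in which every opened element is closed (unclosed void tags such as a bare `<br>` would upset the simple stack used here).

```python
from html.parser import HTMLParser


class PatternExtractor(HTMLParser):
    """Collect the text of elements whose class contains a pattern keyword."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = keywords
        self.areas = {}    # keyword -> extracted text
        self._stack = []   # matched keyword (or None) for each open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = next((k for k in self.keywords if k in classes), None)
        self._stack.append(match)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that matched a keyword.
        for keyword in reversed(self._stack):
            if keyword is not None:
                self.areas[keyword] = self.areas.get(keyword, "") + data
                break


def extract_areas(html, keywords):
    """Apply a pattern (list of class keywords) to an HTML string."""
    parser = PatternExtractor(keywords)
    parser.feed(html)
    parser.close()
    return parser.areas
```

Text that sits outside every matched element, such as advertisement markup with unrelated classes, is simply ignored by this sketch.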

Meanwhile, when the unstructured data is voice- or video-based data according to an exemplary embodiment of the present invention, data related to a specific reference value may be extracted from the unstructured data as an information area, and the pattern may determine that reference value. For example, in the case of voice-based data, a pattern may be used to extract sound at or above a specific dB level, or sound containing a specific frequency.
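For voice-based data, a level threshold acting as the pattern might look like the sketch below. The source only mentions "a specific dB or more"; measuring the RMS level of fixed-size frames in dB relative to full scale, with samples normalized to the range [-1.0, 1.0], is an assumption made for this example, as are the function names.

```python
import math


def level_db(samples):
    """RMS level of a frame in dB relative to full scale (assumed measure)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def extract_loud_frames(samples, frame_size, threshold_db):
    """Return (start_index, level) for each frame at or above the threshold."""
    areas = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        level = level_db(samples[start:start + frame_size])
        if level >= threshold_db:
            areas.append((start, level))
    return areas
```

Here the threshold plays the role of the reference value determined by the pattern, and each returned frame corresponds to an information area.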

According to an exemplary embodiment of the present invention, the rule provider 244 may match at least one item with a pattern provided by the pattern provider 242, and provide a rule including a correspondence between the item and the pattern. That is, the rule may include at least one item, and the item may correspond to a pattern provided by the pattern provider 242. In addition, the rule may determine, for each information area, a manner of extracting only the necessary data from the information area extracted by the pattern. The rule may be defined by the user through the user interface 300, or may be selected from a plurality of rules stored in the rule storage 460 of the data storage 400.

In the example shown in FIG. 4, a rule may make a plurality of items correspond to the pattern of FIG. 3. That is, as shown in the left column of FIG. 4, the rule includes seven items, 'fullname', 'username', 'time', 'tweet-text', 'reply', 'retweet', and 'favorite', and each item may correspond to the information area (or text area) of FIG. 3 that includes the item as a keyword. In addition, as shown in the right column of FIG. 4, the rule may determine a method of processing the data in the information area so as to extract only the necessary data from the data included in the information area. For example, the rule may specify that the entire text included in the text area is extracted for the 'fullname' item, while only numbers are extracted for the 'retweet' item.

In accordance with an exemplary embodiment of the present invention, the rule execution engine 246 may generate the characteristic of the unstructured data by applying the rule to the unstructured data. That is, as shown in FIG. 5, the rule execution engine 246 may generate a characteristic by executing the rule of FIG. 4 and matching each item with a value (i.e., the result of processing the data contained in the information area). In the example shown in FIG. 5, the characteristic of the HTML document of the social network may include seven items and the values corresponding to those items. As described with reference to FIG. 1, the characteristic of the unstructured data generated by the rule execution engine 246 may be used by the characteristic relationship setting unit 260 to generate the characteristic information, and the characteristic information may be converted into knowledge data by the knowledge data management unit 500.
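The rule of FIG. 4 and its execution can be sketched as a mapping from items to processing functions. The source states the processing method only for 'fullname' (whole text) and 'retweet' (numbers only); the methods assigned to the other five items below, like the function names, are assumptions for illustration.

```python
import re


def whole_text(text):
    """Keep the entire text of the information area."""
    return text.strip()


def digits_only(text):
    """Keep only the first run of digits in the information area."""
    match = re.search(r"\d+", text)
    return match.group() if match else ""


# Item -> processing method. Only 'fullname' and 'retweet' follow the
# description; the remaining assignments are assumed.
RULE = {
    "fullname": whole_text,
    "username": whole_text,
    "time": whole_text,
    "tweet-text": whole_text,
    "reply": digits_only,
    "retweet": digits_only,
    "favorite": digits_only,
}


def execute_rule(rule, areas):
    """Match each item with the processed value of its information area."""
    return {
        item: process(areas[item])
        for item, process in rule.items()
        if item in areas
    }
```

The resulting dictionary of item and value pairs corresponds to the characteristic illustrated in FIG. 5; information areas without a matching item are left out.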

As such, the unstructured data processing system 10 according to an exemplary embodiment of the present invention may extract a characteristic from the unstructured data by using a pattern that defines the useful information areas of the unstructured data based on its format, and a rule that maps each information area to an item and processes the data of the information area. Accordingly, the unstructured data can be effectively analyzed, and the characteristics of the unstructured data used for generating the knowledge data can be effectively extracted.

FIG. 6 is a block diagram illustrating an implementation of the pattern provider 242 of FIG. 2, according to an exemplary embodiment of the present invention. As illustrated in FIG. 6, the pattern providing unit 242 may include a pattern recommending unit 242_2, a pattern defining unit 242_4, and a pattern execution engine 242_6.

The pattern recommendation unit 242_2 may provide a recommendation pattern determined to be suitable for the unstructured data. The pattern recommendation unit 242_2 may receive unstructured data from the data interface unit 100 and access the pattern storage unit 440. The pattern recommendation unit 242_2 may select at least one of the plurality of patterns stored in the pattern storage unit 440 based on the format of the unstructured data received from the data interface unit 100, and provide the selected at least one pattern to the pattern definition unit 242_4 as a recommendation pattern.

According to an exemplary embodiment of the present invention, the pattern recommendation unit 242_2 may select a recommendation pattern based on the type and/or source of the unstructured data. For example, when the unstructured data received from the data interface unit 100 is an HTML document, the pattern recommendation unit 242_2 may analyze the source of the HTML document, for example, its domain information. When the domain information corresponds to a service providing a social network, the pattern recommendation unit 242_2 may select the pattern shown in FIG. 3 from among the plurality of patterns stored in the pattern storage unit 440 and provide it to the pattern definition unit 242_4 as the recommendation pattern.
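Recommending a pattern from the source domain can be sketched as below. The stored patterns, their `domains` field, and the `example.com` host names are hypothetical stand-ins for the contents of the pattern storage unit 440; only `urllib.parse.urlparse` from the standard library is used to read the host name.

```python
from urllib.parse import urlparse

# Hypothetical pattern store: each pattern lists the domains it applies to.
PATTERN_STORE = [
    {
        "name": "social-network",
        "domains": {"social.example.com"},
        "keywords": ["fullname", "username", "time", "tweet-text",
                     "reply", "retweet", "favorite"],
    },
    {
        "name": "news",
        "domains": {"news.example.com"},
        "keywords": ["headline", "byline", "body"],
    },
]


def recommend_patterns(source_url, store=PATTERN_STORE):
    """Select stored patterns whose domain matches the document's source."""
    host = (urlparse(source_url).hostname or "").lower()
    return [
        pattern for pattern in store
        if any(host == d or host.endswith("." + d) for d in pattern["domains"])
    ]
```

A source with no matching domain yields an empty list, in which case the pattern would have to be defined from the user's input signal alone.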

The pattern definition unit 242_4 may determine a pattern to be applied to the unstructured data. That is, the pattern corresponding to the unstructured data may be determined based on the input signal received from the user through the user interface 300 and / or the recommendation pattern received from the pattern recommendation unit 242_2. For example, the pattern definition unit 242_4 may identify at least one information area included in the unstructured data by analyzing the format of the unstructured data. For example, the pattern definition unit 242_4 may identify a plurality of text areas in the HTML document. The pattern definition unit 242_4 may exclude some of the plurality of information areas according to the recommendation pattern received from the pattern recommendation unit 242_2 based on the input signal, or set an additional information area in the recommendation pattern. For example, the pattern definition unit 242_4 may define a pattern so that an unnecessary information area included in the unstructured data, for example, an information area including advertisement information, is not extracted based on a user input signal. Accordingly, a new pattern may be defined, and the pattern definition unit 242_4 may store the new pattern in the pattern storage unit 420.

The pattern definition unit 242_4 may group a plurality of information areas having the same format. For example, a plurality of search results returned by a search engine, or a plurality of user replies on a social network, may appear within a single piece of unstructured data and share the same format. The pattern definition unit 242_4 may group or hierarchize such information areas having the same format.
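The grouping described above might, for instance, be sketched as follows. The function `group_by_format` is hypothetical, and the sketch assumes that each information area has already been reduced to a pair of a format key (for example, an HTML class) and its text.

```python
from collections import defaultdict


def group_by_format(areas):
    """Group the text of information areas that share the same format key."""
    groups = defaultdict(list)
    for fmt, text in areas:
        groups[fmt].append(text)
    return dict(groups)
```

Applied to several replies under one post, all areas keyed "reply" would end up in one group, so that a single pattern entry can cover every reply.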

The pattern execution engine 242_6 may generate a result of applying the pattern to the unstructured data. That is, the pattern execution engine 242_6 may extract the data of the information area from the unstructured data based on the pattern defined by the pattern definition unit 242_4. The pattern execution engine 242_6 may provide the extracted data to the user through the user interface unit 300, and an input signal that the user feeds back through the user interface unit 300 in response to the provided data may be reflected in the pattern defined by the pattern definition unit 242_4. Accordingly, the user may set the pattern while checking the result of applying the pattern to the unstructured data. In addition, the data extracted by the pattern execution engine 242_6 may be provided to the rule provider 244.

FIG. 7 is a block diagram illustrating an implementation of the rule provider 244 of FIG. 2, in accordance with an exemplary embodiment of the present invention. As shown in FIG. 7, the rule provider 244 may include a rule recommendation unit 244_2 and a rule definition unit 244_4.

The rule recommendation unit 244_2 may provide a recommendation rule determined to be suitable for the unstructured data. The rule recommendation unit 244_2 may receive the pattern and the extracted data from the pattern execution engine 242_6 of the pattern provider 242, and may access the rule storage unit 460. Based on the pattern and the extracted data, the rule recommendation unit 244_2 may select at least one of the plurality of rules stored in the rule storage unit 460 and provide the selected rule to the rule definition unit 244_4 as the recommendation rule. For example, as shown in FIGS. 3 and 4, the rule recommendation unit 244_2 may select at least one of the plurality of rules stored in the rule storage unit 460 based on a feature of each information area, the feature being inferred from a keyword included in a class or from a combination of a plurality of keywords. That is, the rule recommendation unit 244_2 may determine that the unstructured data is an HTML document of a social network based on the seven keyword combinations shown in FIG. 3, and may select at least one of the plurality of rules stored in the rule storage unit 460 accordingly. According to an exemplary embodiment of the present invention, the pattern recommendation unit 242_2 may select a recommendation pattern based on the type and/or source of the unstructured data.
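Purely as an illustrative sketch, a keyword-based selection of recommendation rules might look as follows. The rule names, the keyword sets, the overlap threshold `min_overlap`, and the function `recommend_rules` are all hypothetical; in particular, the keywords below are not the seven keyword combinations of FIG. 3.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for the rule storage unit:
# rule name -> class keywords that suggest the rule is applicable.
RULE_STORE: Dict[str, Set[str]] = {
    "social_network": {"post", "reply", "time", "author"},
    "news_article": {"headline", "byline", "time"},
}


def recommend_rules(class_names: Iterable[str], min_overlap: int = 2) -> List[str]:
    """Rank stored rules by how many of their keywords occur in the classes."""
    tokens: Set[str] = set()
    for cls in class_names:
        # Split compound class names such as "post-content" into keywords.
        tokens.update(t for t in re.split(r"[\s_\-]+", cls.lower()) if t)
    scored = [(len(tokens & keywords), name) for name, keywords in RULE_STORE.items()]
    # Highest overlap first; ties are broken alphabetically by rule name.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score >= min_overlap]
```

A simple overlap count is used here only to keep the sketch short; any other measure of how well the extracted keywords match a stored rule could be substituted.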

The rule definition unit 244_4 may determine a rule to be applied to the unstructured data. That is, the rule corresponding to the unstructured data may be determined based on the input signal received from the user through the user interface unit 300 and/or the recommendation rule received from the rule recommendation unit 244_2. For example, the rule definition unit 244_4 may identify an item corresponding to an information area by analyzing the information contained in the information area extracted from the unstructured data (e.g., by analyzing the keywords included in the class in FIG. 3, or by analyzing whether the format of the text indicates a date or a number), and may associate the information area with that item.

According to an exemplary embodiment of the present invention, the rule definition unit 244_4 may store a plurality of candidate items, and may update the candidate items based on an input signal received from the user through the user interface unit 300. The information area may be made to correspond to one of the candidate items. For example, in the example of FIG. 4, although "October 31" corresponds to a "time" item, when the candidate items stored in the rule definition unit 244_4 include "date", the rule definition unit 244_4 may map a text area of the unstructured data whose class contains "time" to the "date" item.
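The item identification described in the two preceding paragraphs might be sketched, under stated assumptions, as follows. The candidate items, the alias table mapping the class keyword "time" to the candidate item "date", the regular expressions, and the function `map_area_to_item` are hypothetical and chosen only to mirror the "October 31" example.

```python
import re

# Hypothetical candidate items held by the rule definition unit.
CANDIDATE_ITEMS = {"date", "author", "content", "count"}
# Hypothetical mapping from a class keyword to a candidate item.
ALIASES = {"time": "date"}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(r"^(?:" + MONTHS + r")\s+\d{1,2}$")
NUMBER_RE = re.compile(r"^\d+$")


def map_area_to_item(class_name, text):
    """Return the candidate item for an information area, or None."""
    # First try the keywords contained in the class name.
    for token in re.split(r"[\s_\-]+", class_name.lower()):
        item = ALIASES.get(token, token)
        if item in CANDIDATE_ITEMS:
            return item
    # Otherwise fall back to the format of the text itself.
    if DATE_RE.match(text):
        return "date"
    if NUMBER_RE.match(text):
        return "count"
    return None
```

Updating the candidate items from a user input signal would, in this sketch, amount to adding or removing entries in `CANDIDATE_ITEMS` and `ALIASES`.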

As such, the unstructured data processing system 10 may store a plurality of patterns and a plurality of rules, and may provide a recommendation pattern and a recommendation rule determined to be suitable for the unstructured data. In addition, by providing an interface through which a user can define patterns and rules, patterns and rules suitable for the unstructured data can be defined, and as a result features can be effectively extracted from the unstructured data.

FIG. 8 is a flowchart schematically illustrating a method 20 for processing unstructured data according to an exemplary embodiment of the present invention. As shown in FIG. 8, the unstructured data processing method 20 according to an exemplary embodiment of the present invention may include receiving unstructured data (S10). Referring to FIG. 1, the data interface unit 100 may receive unstructured data from a data pool.

The unstructured data processing method 20 may include defining a pattern based on an input signal and/or a recommendation pattern (S20). Referring to FIG. 6, the pattern definition unit 242_4 may define a pattern corresponding to the unstructured data based on an input signal received from the user through the user interface unit 300 and/or a recommendation pattern received from the pattern recommendation unit 242_2. The unstructured data processing method 20 may then include storing and executing the defined pattern (S30). Referring to FIG. 6, the pattern definition unit 242_4 may store the defined pattern in the pattern storage unit 440, and the pattern execution engine 242_6 may extract the data of the information area by executing the defined pattern.

The unstructured data processing method 20 may include defining a rule based on an input signal and/or a recommendation rule (S40). Referring to FIG. 7, the rule definition unit 244_4 may define a rule corresponding to the unstructured data based on an input signal received from the user through the user interface unit 300 and/or a recommendation rule received from the rule recommendation unit 244_2. The unstructured data processing method 20 may then include storing and executing the defined rule (S50). Referring to FIG. 7, the rule definition unit 244_4 may store the defined rule in the rule storage unit 460, and the rule execution engine 246 may extract the feature of the unstructured data by executing the defined rule.
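For illustration only, steps S10 to S50 might be strung together as in the following sketch. The function `process`, its parameters, and the use of plain lists as the pattern and rule storage units are assumptions; the sketch further assumes that the received unstructured data has already been split into information areas, each given as a pair of a format key and its text.

```python
def process(areas, excluded, item_aliases, pattern_store, rule_store):
    """Run a simplified version of steps S10 to S50 and return the features."""
    # S10: `areas` stands in for the received unstructured data.

    # S20: define the pattern, dropping areas the user asked to exclude.
    unique_formats = dict.fromkeys(fmt for fmt, _ in areas)
    pattern = [fmt for fmt in unique_formats if fmt not in excluded]

    # S30: store the pattern and execute it to extract the area data.
    pattern_store.append(pattern)
    extracted = [(fmt, text) for fmt, text in areas if fmt in pattern]

    # S40: define the rule as a correspondence between areas and items.
    rule = {fmt: item_aliases.get(fmt, fmt) for fmt in pattern}

    # S50: store the rule and execute it to generate the features.
    rule_store.append(rule)
    features = {}
    for fmt, text in extracted:
        features.setdefault(rule[fmt], []).append(text)
    return features
```

The recommendation steps are omitted here for brevity; in this sketch the `excluded` and `item_aliases` arguments play the role of the user's input signal.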

As described above, exemplary embodiments have been disclosed in the drawings and the specification. Although the embodiments have been described using specific terms, those terms are used only to describe the technical spirit of the present invention and are not intended to limit the meaning or the scope of the present invention as set forth in the claims. Therefore, those skilled in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be defined by the technical spirit of the appended claims.

Claims

An unstructured data processing system comprising:

a data interface unit for receiving unstructured data from the outside; and

a feature information generation unit including a feature extraction unit for extracting a feature of the unstructured data and a feature relationship setting unit for generating feature information by setting relationship information with respect to the feature,

wherein the feature extraction unit comprises:

a pattern provider for providing a pattern of the unstructured data based on a format of the unstructured data;

a rule provider for making at least one item correspond to the pattern and providing a rule including a correspondence relationship between the item and the pattern; and

a rule execution engine for generating the feature by applying the rule to the unstructured data.
The system of claim 1,

wherein the unstructured data processing system further comprises:

a data storage unit including a pattern storage unit storing a plurality of patterns and a rule storage unit storing a plurality of rules; and

a user interface unit for receiving an input signal from a user and providing an output signal to the user,

wherein the pattern provider generates a pattern based on the input signal and stores the pattern in the pattern storage unit, and

wherein the rule provider generates a rule based on the input signal and stores the rule in the rule storage unit.
The system of claim 2,

wherein the pattern provider comprises:

a pattern recommendation unit for providing at least one recommendation pattern selected from the plurality of patterns stored in the pattern storage unit based on a format of the unstructured data;

a pattern definition unit for determining a pattern corresponding to the unstructured data based on the input signal and/or the recommendation pattern; and

a pattern execution engine for extracting data contained in an information area from the unstructured data based on the pattern defined by the pattern definition unit.
The system of claim 3,

wherein the pattern recommendation unit selects the recommendation pattern based on the type or source of the unstructured data.
The system of claim 3,

wherein the pattern definition unit identifies at least one information area by analyzing a format of the unstructured data, and includes the information area in the pattern or excludes the information area from the pattern based on the input signal and/or the recommendation pattern.
The system of claim 5,

wherein the pattern definition unit groups a plurality of information areas having the same format.
The system of claim 2,

wherein the rule provider comprises:

a rule recommendation unit for providing at least one recommendation rule selected from the plurality of rules stored in the rule storage unit based on data extracted from an information area of the unstructured data according to the pattern; and

a rule definition unit for defining a rule corresponding to the unstructured data based on the input signal and/or the recommendation rule.
The system of claim 7,

wherein the rule recommendation unit selects the recommendation rule further based on the type or source of the unstructured data.
The system of claim 7,

wherein the rule definition unit identifies an item corresponding to the information area by analyzing the extracted data, and maps the information area to the item.
The system of claim 7,

wherein the rule definition unit stores a plurality of candidate items, updates the candidate items based on the input signal, and maps the information area to one of the candidate items.
The system of claim 2,

wherein the data storage unit further includes a knowledge data storage unit for storing knowledge data, and

wherein the unstructured data processing system further includes a knowledge data management unit for converting the feature information into knowledge data and verifying the converted knowledge data based on external knowledge data received from the data interface unit and the knowledge data stored in the knowledge data storage unit.
The system of claim 1,

wherein the feature information generation unit further includes a feature extraction manager configured to classify the unstructured data according to data type and to generate a control signal that varies the extraction method according to the corresponding data type, and

wherein the pattern provider analyzes the format of the unstructured data based on the control signal.