US20090282066A1 - Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables - Google Patents

Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables

Info

Publication number
US20090282066A1
US20090282066A1 (Application US12/177,712)
Authority
US
United States
Prior art keywords: data, semantic, physical, project, name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/177,712
Inventor
John Russell
Steven FRECHETTE
Brian DELLERT
George MCCULLOCH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qliktech International AB
Original Assignee
Expressor Software
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Expressor Software filed Critical Expressor Software
Priority to US12/177,712
Assigned to EXPRESSOR SOFTWARE reassignment EXPRESSOR SOFTWARE ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: DELLERT, BRIAN R, FRECHETTE, STEVEN R, MCCULLOCH, GEORGE P, RUSSELL, JOHN D
Priority to PCT/US2009/043267 (WO2009140154A1)
Priority to CA2723933A (CA2723933C)
Priority to EP09747239.3A (EP2300913A4)
Publication of US20090282066A1
Assigned to QLIKTECH INTERNATIONAL AB reassignment QLIKTECH INTERNATIONAL AB ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: EXPRESSOR SOFTWARE CORPORATION
Status: Abandoned

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/06 - Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 - Integrating or interfacing systems involving database management systems
    • G06F 16/258 - Data format conversion from or to a database
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/71 - Version control; Configuration management

Definitions

  • the present invention relates to data integration applications, and, more specifically, to abstracting data used in a data integration application by using semantic names.
  • Data integration applications must reconcile data from several potentially incompatible storage systems, convert these data into a unified format, and load the new data into the target database. These are complicated tasks, and they require careful planning and detailed knowledge of the structure of the source databases. Errors in data integration are common, difficult to diagnose, and expensive to fix.
  • This invention provides methods and systems for developing data integration applications with reusable semantic identifiers to represent application data sources and variables.
  • a method includes receiving a set of physical data identifiers that specify fields of physical data sources; storing in a database a set of semantic names for use in defining data integration applications; defining, in terms of the stored semantic names, a data integration application comprising functional rules to extract, transform, and store data; and executing these rules by replacing each of the semantic names with data from the specified field of the physical data source.
  • the method further includes automatically converting the input data values from one datatype to another as required by the functional operators.
  • the method further includes providing a set of suggested semantic names and associating one of the suggested semantic names with a field of a physical data source.
  • FIG. 1 is a dataflow diagram that illustrates the operation of an example application, according to certain embodiments of the invention.
  • FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.
  • FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention.
  • FIG. 4 is a UML state diagram that depicts the relationships among the various project stages, according to certain embodiments of the invention.
  • FIG. 4.1 is a flowchart that depicts the various stages of project development, according to certain embodiments of the invention.
  • FIG. 5 is a relationship diagram that depicts the components of the semantic model, according to certain embodiments of the invention.
  • FIG. 6 is a relationship diagram that depicts the structure of a semantic data integration function within an application, according to certain embodiments of the invention.
  • FIG. 7 is a relationship diagram that depicts an output-oriented rule definition, according to certain embodiments of the invention.
  • FIG. 8 is a relationship diagram that depicts the use of output-oriented rules in a function, according to certain embodiments of the invention.
  • FIG. 9 is a relationship diagram that depicts the preferred embodiment of function-level synthetic debugging and testing for semantic data integration.
  • FIG. 10 is a diagram that illustrates the separation of development roles in a semantic data integration project, according to certain embodiments of the invention.
  • FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.
  • FIG. 12 is a data flow relationship diagram that illustrates the flow of data within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.
  • FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.
  • FIG. 14 is a modified UML collaboration diagram that illustrates the process of distributed shared memory replication when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.
  • FIG. 15 is a data flow relationship diagram that illustrates the flow of data in the data integration engine when a sample data integration application is run in a distributed environment, according to certain embodiments of the invention.
  • FIG. 16 is a diagram that depicts the various components of a data integration engine, according to certain embodiments of the invention.
  • Preferred embodiments of the present invention provide semantic systems and methods for developing, deploying, running, maintaining, and analyzing data integration applications and environments.
  • Those data integration applications that are relevant to the techniques described herein are broadly described by the class of applications concerned with the movement and transformation of data between systems and commonly represented by, but not limited to: data warehousing or ETL (extract-transform-load) applications, data profiling and data quality applications, and data migration applications that are concerned with moving data from old to new systems.
  • a semantic development model enables an application to be partially or fully developed without knowledge of the physical data identities (locations, structures, names, types, etc.) being integrated. Physical identities are present in the system but they are abstracted with semantic identities. There are several advantages to this approach: changes to physical data locations or structures do not automatically prevent the application developer from accomplishing real work; a high or intimate level of knowledge of the data being integrated is not required; business rules and other application logic developed using semantic identities can easily be reused and tested from application to application regardless of the physicality of the underlying data structures; and the costs of data mapping exercises can be significantly reduced over time as the system learns about fields that are semantically equivalent.
  • a data integration application developed using the techniques described herein is preferably stored in a common repository or database.
  • This database includes a semantic metadata model that correlates physical locations and datatypes, representing the source and target data, with semantic identities.
  • the database also includes representations of business rules that are defined using semantic identities instead of physical identities.
  • the business rules and the semantic model are stored and maintained separately. Thus, application developers do not need to know the physical locations or datatypes of the source data in order to implement data transformation functions.
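  • As an illustration of this separation, the following minimal Python sketch (hypothetical class and method names, not the patent's implementation) keeps semantic bindings and business rules in separate stores, both keyed by semantic name:

      # Hypothetical sketch: a repository that stores semantic bindings and
      # business rules separately, so rules never reference physical fields.
      class SemanticRepository:
          def __init__(self):
              self.bindings = {}   # semantic name -> (physical location, physical datatype)
              self.rules = {}      # semantic name -> list of rule callables

          def bind(self, semantic_name, physical_location, physical_datatype):
              self.bindings[semantic_name] = (physical_location, physical_datatype)

          def add_rule(self, semantic_name, rule):
              self.rules.setdefault(semantic_name, []).append(rule)

      # A rule is written against the semantic name only; the physical column
      # name and datatype can change without touching the rule.
      repo = SemanticRepository()
      repo.bind("last_name", "CUSTOMER.lastName", "VARCHAR(32)")
      repo.add_rule("last_name", lambda value: value.title())
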
  • the repository is preferably maintained and updated using a hybrid versioning system for data integration projects.
  • This system provides version control for project artifacts, and also provides a fine-grained locking mechanism that controls the ability to edit and execute a project in various ways according to the project's current stage in the development process.
  • the hybrid versioning system also interfaces with a relational database, which can be used to efficiently calculate and report project metrics.
  • the system's data integration engine executes data integration applications using a parallel, distributed architecture. Parallelism is achieved where possible by leveraging multiple redundant data sources and distributing the execution of the application across multiple hosts.
  • the techniques disclosed herein are scalable to execution environments that comprise a large number of hosts.
  • FIG. 16 is a diagram that depicts the various components of a data integration system, according to certain embodiments of the invention.
  • the functional logic of the data integration is performed by a host computer [ 1601 ], which contains volatile memory [ 1602 ], a persistent storage device such as a hard drive [ 1608 ], a processor [ 1603 ], and a network interface [ 1604 ].
  • the computer can interact with databases [ 1605 , 1606 ].
  • the computer extracts data from some of these databases, transforms it according to programmatic data transformation rules, and loads the transformed data into other databases.
  • Although FIG. 16 illustrates a system in which the computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface.
  • the data transformation rules may be executed on a single host, as shown in FIG. 16 , or they may be distributed across multiple hosts.
  • the host computer shown in FIG. 16 may also serve as a development workstation.
  • Development workstations are preferably connected to a graphical display device [ 1607 ], and to input devices such as a mouse [ 1609 ], and a keyboard [ 1610 ].
  • One preferred embodiment of the present invention includes a graphical development environment that displays a data integration application as a diagram, in which the data transformation rules are represented by shapes and the flow of data between rules is represented by arrows.
  • This visual interface allows developers to create and manipulate data integration applications at a more intuitive level than, for example, a text-based interface.
  • the techniques described herein may also be applied to non-graphical development environments.
  • FIG. 1 is a dataflow diagram that illustrates the operation of an example data integration application that will be referenced in following sections.
  • An application organizes the execution of a set of functions, which perform individual units of work, and the flow of data between those functions.
  • the sample application [ 101 ] has three functions, represented by boxes, and data flow between those functions, represented by arrows.
  • the Read-Data function [ 102 ] reads monthly transactional bank account data from a VSAM file [ 105 ] and outputs that data for use as input in the next function.
  • the Transform-Data function [ 103 ] receives its input from the Read-Data function. Its transformation logic aggregates those bank account transactions to compute end-of-month status for each account, and outputs the end-of-month status for use as input to the next function.
  • the Write-Data function [ 104 ] receives the end-of-month status from the Transform-Data function and writes that data to a flat RDBMS table [ 106 ] which will be used to produce monthly snapshot reports for each bank account.
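  • To make the data flow concrete, the following minimal Python sketch mirrors the three functions of the sample application; the function names, field names, and in-memory lists are illustrative assumptions rather than part of the described system:

      # Illustrative pipeline mirroring the sample application [101]: read monthly
      # transactions, aggregate to end-of-month status, write snapshot rows.
      def read_data(source_records):
          # Read-Data [102]: yield monthly transactional records from the source.
          for record in source_records:
              yield record

      def transform_data(transactions):
          # Transform-Data [103]: aggregate transactions into per-account status.
          status = {}
          for t in transactions:
              account = t["account_number"]
              status[account] = status.get(account, 0) + t["transaction_amount"]
          return status

      def write_data(status, target_table):
          # Write-Data [104]: append one snapshot row per account to the target.
          for account, balance in status.items():
              target_table.append({"account_number": account, "end_of_month_balance": balance})

      snapshots = []
      write_data(transform_data(read_data([
          {"account_number": "A1", "transaction_amount": 100.0},
          {"account_number": "A1", "transaction_amount": -25.0},
      ])), snapshots)
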
  • Development of the sample application [ 101 ] begins when a project is created for managing the application's development and deployment.
  • a semantic model, separate from the project, is used to store and maintain the association between physical identities (i.e., the physical locations and datatypes of the project's source data) and semantic identities. If no semantic model has been created for the relevant data, a new semantic model is initialized. If a semantic model for the project's source data has already been created (e.g., by a prior project, or through ongoing maintenance), then the new project may use the existing semantic model; thus, it is not necessary to create a new semantic model for each new project.
  • project-specific artifacts may be created. These artifacts, discussed in more detail below, are tested and checked-in to the repository.
  • the project entity also contains an identifier that represents the current stage of project development. At each stage of the project the application is executed by the data integration engine in a stage-specific environment. Eventually the application is considered complete and the project, and all applications contained within the project, are moved into production.
  • FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.
  • the system repository [ 201 ] is a database used by the system's tools and engine. It is centrally deployed in order to capture and share system objects across applications, and to provide visibility into data integration projects, data usage, application performance, and various metrics.
  • the repository consists of three high-level subsystems: a relational database, a source control system, and business logic to implement functionality such as creating a project, publishing, staging, etc.
  • the database and source control subsystems are provided using conventional third party technologies.
  • the business logic is implemented with a J2EE application but could easily be .NET or some other web-application technology.
  • the various system tools (semantic maintenance tool [ 204 ], project maintenance tool [ 205 ], and development tool [ 206 ]) connect to these repository subsystems directly as required.
  • the primary contents of the repository include: the semantic model [ 202 ] which captures metadata that describes the contextual or semantic identities in an enterprise, the actual or physical identities in an enterprise, and various relationships between the semantic identities and physical identities, and projects [ 203 ] which are system objects that group related artifacts necessary for defining and deploying a data integration application
  • the repository is manipulated by system tools including: the semantic maintenance tool [ 204 ], which maintains the semantic model, the project maintenance tool [ 205 ], which maintains projects and associated data and generates reports to various levels of detail across the system, the development tool [ 206 ] which is used to develop data integration applications, and the integration engine [ 207 ], which executes applications using a parallel, distributed system, computes runtime statistics, and stores these statistics in the repository. Additional description of these components and how they interact is included below.
  • FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention.
  • a relationship diagram is a modified UML class diagram that conveys the relationships between objects or components. The object or component is labeled in a rectangular box and a relationship to another object or component is represented with a labeled arrow from one box to the other. The relationship reads from arrow begin to arrow end (the end of the line with the actual arrow). Like UML class diagrams, these relationship diagrams allow for containment to be expressed with an arrow or by placing the child object visually within the parent object. In some cases the rectangle for an object or component is dashed indicating that it is not an actual object but it is really a conceptual group (like an abstract class) for the objects shown therein.
  • a project [ 203 . 1 ] as depicted is one of the many projects [ 203 ] shown in FIG. 2 .
  • a project is a system object that is used to organize and manage the development and deployment of data integration applications through various stages.
  • a project's stage [ 301 ] specifies the current state of the project within the development and deployment process. The various stages that may be associated with a project are described in more detail below.
  • a project's measures [ 308 ] are metrics or statistics relevant to the project that are collected after the project is created.
  • a project's artifacts [ 302 ] define the project's applications and supporting metadata. These are preferably captured as XML files that are versioned using the standard source control functionality implemented by the development tool. Project artifacts are accumulated after inception and include: drawings [ 303 ] which are visual descriptions of the functions, transformations, and data flow for one or more applications; data access definitions [ 309 ] which individually describe a set of parallel access paths (expressed as URIs) to physical data resources; semantic records [ 304 ] which primarily describe the data structures for one or more applications; documentation [ 305 ] for the project and its artifacts, and other artifacts [ 306 ] that may be created during the life of the project.
  • the project model [ 307 ] is a relational model that represents the project, its artifacts, and other data and metadata for the project.
  • the project model and project measures provide a basis for introspection and analysis across all projects in the system.
  • a project's stage also controls where the project can be run.
  • individual machines where the engine can run are designated to allow execution only for a specific stage.
  • For example, host machine SYSTEST288 may be designated as a system testing machine. Any instance of the system's engine that is deployed on SYSTEST288 will only allow projects in the “system testing” stage [ 301 . 2 ] to run. This additional level of control is compatible with how IT departments prefer to isolate business processes by hardware.
  • the simple application described above [ 101 ] might be developed as part of a new project implemented by the IT department of a financial institution that wishes to gather and analyze additional monthly status for individual bank accounts.
  • the project [ 203 . 1 ] would be created by a project manager using the project maintenance tool [ 205 ] and the project would begin in the development stage [ 301 . 1 ] (described below).
  • Preliminary project artifacts [ 302 ] such as semantic records [ 304 ] (described below) would then be defined and added to the project by a data architect or equivalent.
  • a developer would then use these artifacts to create drawings [ 303 ] which define the transformation logic of the application [ 101 ].
  • the project would move through various stages (see FIG. 4 ) until it is finally placed into production.
  • the project measures [ 308 ] would allow the project manager and others to analyze the project using relational reporting and analysis techniques in order to improve the company's data integration and business processes.
  • the development tool [ 206 ] is conventional, and similar in layout and purpose to many other existing graphical programming tools that may be used for defining workflow, process flow, or data integrations. Examples of such tools include Microsoft BizTalk Orchestrator, Vignette Business Integration Studio, and FileNet Process Designer, among others.
  • the primary workspace consists of a palette of symbols corresponding to various functions that may be performed by the engine [ 207 ], and a canvas area for creating a drawing. Prior to creating drawings for a project, the user is given permission to work on that project by another user of the project maintenance tool [ 205 ], typically a project manager. These permissions are stored in the repository [ 201 ].
  • the developer is allowed to “check out” a snapshot of the artifacts for any project for which the user has permission (as defined in the repository).
  • the project artifacts must include any semantic records [ 305 ] and data access configurations [ 309 ] that the developer will need to build the drawing; these requisite artifacts were previously defined by another user, typically a data architect, using the project maintenance tool [ 205 ].
  • the user drags functions from the palette to the canvas area.
  • the user would drag three different functions from the palette: one to read data from a file (necessary for the Read-Data function [ 102 ]), one to transform data (necessary for the Transform-Data function [ 103 ]), and one to write data to a table (necessary for the Write-Data function [ 104 ]).
  • the user would then visually “connect” the functions in the drawing according to the direction of the data flow for the sample application.
  • Each function has properties that must be configured to define its specific behavior for the engine.
  • the user will then edit these properties with standard property editor user interfaces.
  • the properties specified by the user for the Read-Data function include the name of its output semantic record [ 305 . 2 ] which specifies the data being read from the file, and the name of a data access configuration [ 309 ] which specifies one or more parallel access paths (expressed as URIs) to the file.
  • the properties specified by the user for the Write-Data function include the name of its input semantic record [ 305 . 1 ] which specifies the data being written to the table, and the name of a data access definition [ 309 ] which specifies one or more parallel access paths (expressed as URIs) to the table.
  • Because the user connected the Read-Data function to the Transform-Data function, the input semantic record [ 305 . 1 ] for the Transform-Data function [ 103 ] is automatically derived from the output semantic record of the Read-Data function [ 102 ]; and because the user connected the Write-Data function to the Transform-Data function, the output semantic record [ 305 . 2 ] for the Transform-Data function [ 103 ] is automatically derived from the input semantic record of the Write-Data function [ 104 ].
  • the user will further configure the Transform-Data function in the drawing by specifying its transformation logic in a transformation editor.
  • the semantic identities of the input semantic record and output semantic record are presented to the user in this editor. In the transformation editor, the user provides logic that specifies how output values in the semantic record are calculated.
  • the development tool will validate the drawing according to the rules of the engine and present warnings and errors to the user so that the user may correct the drawing.
  • the user has the ability to synthetically debug ( FIG. 9 ) the Transform-Data function from within the development tool.
  • the user may also execute the drawing from the development tool; in this scenario the execution may be performed by a local instance of the engine which is installed with the development tool, or on a remote instance of the engine which has been installed in an environment configured to support such testing.
  • the machine hosting the engine requires that any client access technologies relied on by the data access configurations [ 309 ] for each function in the drawing already be configured on the same machine; for example, in order to write to a table, the correct database drivers must be configured on the machine whose engine will be using those drivers to perform that operation.
  • the developer may “check-in” the drawing to the repository. This process is conventional in terms of workflow and implementation; the user may provide a comment for the change and a new version of the new drawing will be added to the source control system in the repository.
  • the project artifacts (which are maintained via traditional source control mechanisms as described), the project staging controls (described below), and the project model (which models those sources in a relational database) are together maintained using a hybrid version control system, comprising both a standard version control system and a relational database.
  • version control systems have made it possible to record individual changes made to versioned artifacts, but do not allow for the analysis of these changes using standard relational database query techniques. Using pure relational database systems, however, it is extremely difficult to provide traditional version control functionality.
  • a traditional source control system does not inherently control access to system sources based on the development life-cycle stage of the project; such systems must rely on externally defined and enforced business practices to control access.
  • the hybrid version control system disclosed herein allows for both traditional artifact versioning/source control and relational data modeling of the same artifacts.
  • the hybrid version control system also provides built-in support for controlling access to project sources according to the current stage of the project.
  • FIG. 4 is a UML state diagram that depicts the relationships among the various project stages [ 301 ], according to a preferred embodiment.
  • the staging model provides control for moving a project through a development life-cycle.
  • a project may be in at most one of the following deployment stages (analogous to states in a state transition diagram): development [ 301 . 1 ], system testing [ 301 . 2 ], integration testing [ 301 . 3 ], user acceptance testing [ 301 . 4 ], readiness testing [ 301 . 5 ], or production [ 301 . 6 ].
  • the first superstate signifies whether a project is unlocked [ 403 ], which means that changes to project artifacts are allowed, or locked [ 404 ] which means that changes are not allowed.
  • the second superstate signifies whether a project is unpublished [ 401 ], which means that the project model has not been refreshed from the most recent changes to project artifacts, or published [ 402 ] which means that the project model is fully representative of the current project artifacts.
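  • The staging state machine can be sketched as follows (stage names follow FIG. 4; the class, property, and method names are assumptions made for illustration):

      from enum import Enum

      class Stage(Enum):
          DEVELOPMENT = 1
          SYSTEM_TESTING = 2
          INTEGRATION_TESTING = 3
          USER_ACCEPTANCE_TESTING = 4
          READINESS_TESTING = 5
          PRODUCTION = 6

      class Project:
          def __init__(self):
              self.stage = Stage.DEVELOPMENT
              self.published = False              # published/unpublished superstate

          @property
          def locked(self):
              # Projects are editable (unlocked) only while in development.
              return self.stage is not Stage.DEVELOPMENT

          def publish(self):
              self.published = True               # project model refreshed from artifacts

          def stage_forward(self, new_stage):
              # Only a published, development-stage project may be staged forward.
              if self.stage is Stage.DEVELOPMENT and self.published:
                  self.stage = new_stage

          def stage_back_to_development(self):
              self.stage = Stage.DEVELOPMENT      # unlocks the project for changes

          def edit_artifact(self):
              if self.locked:
                  raise PermissionError("artifacts cannot be modified outside development")
              self.published = False              # any change leaves the project unpublished
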
  • a project is created, published, and staged using the project maintenance tool [ 205 ].
  • Individual artifacts and changes to them are stored as separate versions in the repository's source-control system using the system's tools such as the project maintenance tool [ 205 ] and the development tool [ 206 ].
  • User permissions related to project development may be implemented using any user authentication/control databases, such as LDAP and ActiveDirectory.
  • when a project is published, the project artifacts are moved from the source control system to the relational database using conventional serialization methods and systems. The newly published version does not replace the older published version of the project, but is stored as a separate publication. Thus, queries executed against the database may gather information and statistics about multiple publications.
  • once project artifacts are subsequently modified, the project will again be in an unpublished state until it is explicitly published again. From a published superstate, a project in the development stage may be staged forward to any post-development stage, including production [ 301 . 6 ]. After being staged out of development, the project is in a locked superstate such that artifacts cannot be modified until the project is staged back to development.
  • the project for the sample application [ 101 ] may be published and moved to a system testing stage [ 301 . 2 ]. While in this stage, various system tests are performed on the application and changes to the project's artifacts are prohibited. If system testing is successful, the project may be moved to an integration testing stage [ 301 . 3 ]. While in this stage, one of the tests uncovers an issue that must be addressed by a slight change to the configuration of the Write-Data [ 104 ] function in the drawing for the application. The project is moved back to the development stage [ 301 . 1 ] so that a developer can make this change.
  • the project is published again and moved back to the integration testing stage [ 301 . 3 ] for re-test.
  • the application might then pass testing at this stage and each subsequent stage until it is finally put into production [ 301 . 6 ].
  • when a project is published, the project model [ 307 ] and project measures [ 308 ] are updated. Both the project model and the measures are maintained as a relational model in the repository. This enables project managers, data architects, decision makers, and other system users to query and analyze individual projects, or all projects collectively, in interesting ways. For example, a project manager may quickly learn in which projects a developer has used a particular semantic record (which may be known to be broken), the cumulative usage of a certain table across projects, or which output rules for a certain semantic identity are used most. This type of inquiry and analysis is possible because of the publish functionality in the repository.
  • Some project metrics may use information from the source control system as well as the repository. Because a source file may be checked-out and checked-in multiple times between publications, only the source control system contains information about these intermediate file-versions.
  • FIG. 4.1 is a flowchart that depicts the separation of roles across the various stages of project development, according to certain embodiments of the invention.
  • the project manager [ 4101 ] creates a project called “foo” [ 4102 ] in the source control system, and assigns users [ 4103 ] to it.
  • the data architect [ 4104 ] then checks out the project [ 4105 ] and creates or modifies the semantic records and data access definitions that will be used by the project “foo” [ 4112 ] (these are discussed in more detail below).
  • the developer [ 4113 ] checks out the project [ 4106 ] and creates and modifies the project's drawings [ 4107 ] in the source repository, which specify the data transformation, extraction, and load rules used by the project and determine how data flows among these rules.
  • the developer checks the project in [ 4108 ].
  • the project manager [ 4101 ] publishes the project [ 4109 ], which moves the project artifacts into the relational database [ 4110 ]. After the project has been published, it may be moved into the “staging” phase [ 4111 ].
  • eventually, the project state will be set to “production,” the final phase of the project development process.
  • FIG. 5 is a relationship diagram (as described above) that depicts the components of the semantic model [ 202 ] in the repository [ 201 ], according to a preferred embodiment.
  • the semantic identity [ 501 ] is metadata that represents the abstract concept or meaning for a single business object that may be used in an enterprise; for example, an employee's last name. Additional properties of the semantic identity pertaining to its semantic type, subject area, and composition are also captured in the semantic model.
  • the output rule [ 701 ] defines the business logic for calculating a value for the semantic identity within a data integration application.
  • a semantic identity may have multiple output rules. The output rule and its usage is described in more detail in a later section.
  • the physical identity [ 502 ] is metadata that captures the external (physical) name of a specific business object (e.g., a database column).
  • the physical datatype [ 504 ] captures the external (physical) datatype of the associated physical identity (e.g., “20 character ASCII string”).
  • the semantic datatype [ 505 ] is associated with the semantic identity and specifies the datatype of the data referenced by the semantic identity, as used internally by the data integration application.
  • the physical datatype is used by the engine when it is reading or writing actual physical data.
  • the semantic datatype is used by the engine when processing transformation logic in the application (described later).
  • the semantic binding [ 503 ] associates a physical identity with a particular semantic identity.
  • Many physical identities and their physical attributes may be associated with the same semantic identity. For example, fields from various physical data locations such as lastName with a physical datatype of CHAR(30) in one RDBMS table, last_name with a physical datatype of VARCHAR(32) in another RDBMS table, and LST_NM with a physical datatype of PICX(20) in a COBOL copybook, may all be physical instantiations of the semantic identity last_name, which could be universally associated with a semantic datatype string.
  • a semantic record [ 304 ] describes the layout of a physical data structure such as an employee table. Each field in the table would be described with a semantic binding that captures the actual column name (the physical identity) and the semantic identity. Other metadata specific to each field in the employee table, such as data type information, would also be described for each field in the semantic record.
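  • The relationships in FIG. 5 might be captured with data structures along the following lines (a Python sketch; the class and field names are assumptions, not the repository's actual schema):

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class SemanticIdentity:            # [501] e.g. "last_name"
          name: str
          semantic_datatype: str         # [505] e.g. "string"

      @dataclass
      class PhysicalIdentity:            # [502] e.g. column "lastName"
          name: str
          physical_datatype: str         # [504] e.g. "VARCHAR(32)"

      @dataclass
      class SemanticBinding:             # [503] ties one physical field to a semantic identity
          physical: PhysicalIdentity
          semantic: SemanticIdentity

      @dataclass
      class SemanticRecord:              # [304] describes the layout of one physical structure
          name: str
          bindings: List[SemanticBinding] = field(default_factory=list)
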
  • a user would create and maintain the semantic model as follows.
  • the user would first locate the actual metadata for the physical data that must be represented. As an example, using the sample application, this would be the metadata for the VSAM file being read and the metadata for the RDBMS table being written. The names and types of each field or column would be preserved as physical identities and physical datatypes.
  • a rationalization process using conventional string matching techniques and statistical methods, is then performed by the tool that takes each physical identity, decomposes it, analyzes it, and suggests zero or more semantic identities. The user makes the final decision as to which semantic identity most applies to each physical identity. When an existing semantic identity does not apply, the user may define a new one and its semantic datatype. The physical identity, semantic identity, semantic binding, and other metadata gathered during the rationalization process, are saved in the repository.
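  • One plausible sketch of the rationalization step uses simple string similarity to rank candidate semantic identities; the patent specifies only conventional string matching and statistical methods, so the matcher and threshold below are assumptions:

      from difflib import SequenceMatcher

      def suggest_semantic_identities(physical_name, known_semantic_names, threshold=0.6):
          """Decompose a physical identity and rank plausible semantic identities."""
          # Normalize: LST_NM, lastName, LAST-NAME all reduce to a comparable form.
          normalized = physical_name.replace("-", "_").replace(" ", "_").lower()
          scored = []
          for semantic in known_semantic_names:
              score = SequenceMatcher(None, normalized, semantic.lower()).ratio()
              if score >= threshold:
                  scored.append((score, semantic))
          return [name for _, name in sorted(scored, reverse=True)]

      # The user confirms or overrides the suggestion before the binding is saved.
      print(suggest_semantic_identities("LAST-NAME",
                                        ["last_name", "first_name", "account_number"]))
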
  • FIG. 6 is a relationship diagram (as described above) that depicts the structure of a semantic data integration function within an application (such as the sample application described above) according to a preferred embodiment.
  • a function [ 601 ] performs an individual body of work within an application.
  • the function in FIG. 6 is a generic representation of any particular function in the present data integration system and could represent any of the functions [ 102 ], [ 103 ], or [ 104 ] in the sample application [ 101 ].
  • the function may have the following types of input: input data [ 603 ], which is an actual input data value that the function will consume when it runs in the engine, an input semantic identity [ 501 . 1 ] is a semantic identity [ 501 ] from the semantic model ( FIG. 5 ) that identifies an individual piece of data in a record that will be input to the function, and an input semantic record [ 305 . 1 ] is a semantic record [ 305 ] from the semantic model ( FIG. 5 ) that describes the exact structure and format of a data record that will be input to the function.
  • the function may have the following types of output: output data [ 604 ], which is an actual output data value that the function will produce when it runs in the engine, an output semantic identity [ 501 . 2 ] is a semantic identity [ 501 ] from the semantic model ( FIG. 5 ) that identifies an individual piece of data in a record that will be output from the function, and an output semantic record [ 305 . 2 ] is a semantic record [ 305 ] from the semantic model ( FIG. 5 ) that describes the exact structure and format of a data record that will be output from the function.
  • a data access definition [ 309 ] will also be associated with the function.
  • When the purpose of the function is to read or write data from or to a physical data source, the data access definition will specify one or more URIs for accessing the physical data being read or written, each of which constitutes a parallel processing path (or channel) for the operation.
  • When the function is an internal operation whose job is to manipulate data that has already been read (prior to writing), the data access definition identifies the particular channels that are relevant to the functions it is connected to.
  • the function may also have transformation logic [ 609 ] which may be used to calculate the output values for the function.
  • a semantic function is able to correlate input data to output data using the semantic identities. For example, if the input semantic record [ 305 . 1 ] includes a field with semantic identity last_name [ 501 . 1 ] whose actual source is from a column named lastName [ 502 ] and if the output semantic record [ 305 . 2 ] includes a field with semantic identity last_name [ 501 . 1 ] whose actual data source is a field named lstNm in a file [ 502 ], provided that the semantic model captures these relationships, the function will know that the two fields are semantically equivalent because they share the same semantic identity last_name, and thus can move the correct input data [ 603 ] to the correct output data [ 604 ] with little or no additional specification.
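  • A sketch of that correlation: values are moved from input to output wherever the two records' bindings share a semantic identity (the helper function and dictionaries below are illustrative assumptions):

      def move_by_semantic_identity(input_row, input_bindings, output_bindings):
          """Copy values whose input and output fields share a semantic identity.

          input_bindings / output_bindings map a physical field name (e.g. "lastName"
          or "lstNm") to its semantic identity (e.g. "last_name").
          """
          by_semantic = {sem: input_row[phys] for phys, sem in input_bindings.items()}
          return {phys: by_semantic[sem]
                  for phys, sem in output_bindings.items() if sem in by_semantic}

      row = {"lastName": "SMITH"}
      print(move_by_semantic_identity(row,
                                      {"lastName": "last_name"},
                                      {"lstNm": "last_name"}))   # {'lstNm': 'SMITH'}
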
  • the output semantic record [ 305 . 2 ] for the Read-Data function [ 102 ] may include a semantic binding [ 503 ] that binds the output semantic identity [ 501 . 2 ] last_name to a physical field named LST_NM in the data being read from the VSAM file [ 105 ].
  • the input semantic record [ 305 . 1 ] for the Transform-Data function [ 103 ] may include the same semantic binding.
  • the data coming from the VSAM file on the mainframe stores all last names in upper case; ex: SMITH.
  • the transformation logic [ 602 ] in the Transform-Data function [ 103 ], which is a semantic function [ 601 ] like all functions in an application for the present invention, may be written to convert the input data [ 603 ] for input semantic identity [ 501 . 1 ] named last_name to title case; ex: Smith.
  • the developer only needs to know the semantic name last_name, and does not require any knowledge about the associated physical identity or the attributes of the VSAM source where the data is physically located. For example, suppose that in a different application in a different project, the physical identity for last name data pulled from a mainframe was called NAME_LAST. As part of that effort, the semantic model was updated and a new semantic binding that associated NAME_LAST with the last_name semantic identity was created. The same transformation logic responsible for converting last_name to title case could be used, because the transformation uses the semantic identity last_name that is common to both physical identities, LST_NM and NAME_LAST.
  • a user would use the semantic maintenance tool to import the physical identities specified in Tables 1 and 2, in order to rationalize these physical identities to semantic identities, as described above (if the repository already contains semantic records corresponding to these two data tables, then it would not be necessary to import these physical identities again; for present purposes we assume that they are being imported for the first time).
  • the semantic maintenance tool will suggest corresponding semantic identities. The user can affirm or override these suggestions.
  • the result of this rationalization is a set of semantic bindings, each mapping a (physical identity, semantic identity) pair.
  • the user may associate the title-case rule (as described above) with the semantic identity last_name.
  • This rule along with any other rules created by the user and associated with semantic identities, are stored in the repository.
  • semantic record SR 1 corresponding to the VSAM file, would contain the following:
  • Semantic Record for VSAM file (SR1):

        Phys. Ident.   Phys. Datatype   Semantic Ident.       Semantic Datatype
        ACC-NO         PICX(20)         account_number        string
        TRANS-TYPE     PICX(1)          transaction_type      string
        TRANS-AMT      9(12)V9(2)       transaction_amount    number
        LAST-NAME      PICX(20)         last_name             string
        FIRST-NAME     PICX(20)         first_name            string
        ...            ...              ...                   ...
  • the semantic record SR 2 corresponding to the RDBMS table, would contain the following:
  • the user would specify metadata that identifies the location of the VSAM file from which the data must be read, and associate the previously-defined semantic record SR 1 with the function as the function's output semantic record.
  • the developer would first connect the output of the Read-Data function to the input of the Transform-Data function, preferably via the graphical development tool, which represents functions and the connections between them using a graphical representation.
  • the developer would configure the output of the Transform-Data function to include the semantic identities listed in the semantic record SR 2 .
  • When specifying semantic identities in the output of Transform-Data, the user will be presented with a menu of rules stored in the repository that operate on those identities (allowing the user to select only valid, predefined rules). In this case, suppose that when the user specifies last_name, the user selects the title-case rule (as defined above) from the rules menu.
  • the developer would connect the output of Transform-Data to the input of the Write-Data function, and specify the location of the RDBMS table to which the data must be written.
  • the task of connecting two functions can be performed visually, using the graphical development tool. Throughout the process of configuring the application rules, the development tool never reveals the physical identities or datatypes of the source and target data to the user; this information is encapsulated in the semantic records SR 1 and SR 2 , which are opaque to the application developer.
  • the rule may have input which is characterized as follows: input data [ 603 ], which comprises one or more actual input data values that the function will consume when it runs; and an input semantic identity [ 501 . 1 ], which is a semantic identity [ 501 ] from the semantic model ( FIG. 5 ) that identifies an individual piece of data that will be input to the rule (an input parameter).
  • the rule [ 701 ] is defined to calculate a value for an output data field [ 604 ] with a given semantic identity [ 501 . 1 ]. All input data [ 603 ] required by the rule is identified using semantic identities [ 501 ].
  • Output-oriented rules are created and maintained using the semantic maintenance tool [ 204 ] ( FIG. 2 ). These rules can be developed independently from the application or function, tested (described in more detail below), stored and indexed by semantic identity in the repository [ 201 ] ( FIG. 2 ), and then used in the transformation logic of a function.
  • the semantic data integration system optimizes the definition and employment of rules by semantically orienting them explicitly to output calculation, as described above.
  • the transformation logic associated with converting a last name to title case could be captured as a reusable output-oriented rule for the last_name semantic identity. This rule could be used in the sample application, as well as other applications in the same or different projects.
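  • Such an output-oriented rule might be sketched as follows, stored and indexed by the semantic identity whose value it calculates; the registry and decorator below are illustrative assumptions, not the system's actual rule API:

      # Hypothetical rule registry indexed by output semantic identity.
      RULES = {}

      def output_rule(semantic_identity):
          def register(func):
              RULES.setdefault(semantic_identity, []).append(func)
              return func
          return register

      @output_rule("last_name")
      def title_case_last_name(last_name):
          """Reusable output-oriented rule: convert any last-name value to title case."""
          return last_name.title()

      # Any function, in any project, whose output includes the semantic identity
      # last_name may reference this rule, regardless of the physical identity
      # (lastName, LST_NM, NAME_LAST, ...) that the data actually comes from.
      print(RULES["last_name"][0]("SMITH"))   # Smith
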
  • FIG. 8 is a relationship diagram (as described above) that extends FIG. 6 with the output-oriented rule concept of FIG. 7 to depict the preferred embodiment of output-oriented rule employment in a function [ 601 ].
  • Transformation logic is configured for the function to perform various transformations or manipulations on the input data [ 603 ] in order to produce the correct output data [ 604 ].
  • Rules refer to input data and output data using semantic identities [ 501 . 1 ] and [ 501 . 2 ] respectively.
  • a rule used in a function may do something trivial, such as changing the case of a last name, or something more complex, such as calculating a weighted average account balance.
  • rules are specified using a standard programming language that has been extended to include primitives that operate on typical database fields.
  • When the Transform-Data [ 103 ] function is being configured by a developer, the calculations for its individual output fields are defined.
  • the predefined output-oriented rule for title casing last name may be referenced and used to define the calculation for that field in the function.
  • An example of an alternative embodiment of this process allows for a new output-oriented rule to be defined at the same time that the transformation logic for Transform-Data is being configured. In this case the pre-existing title casing rule might not already exist and the developer might add it and save it to the repository for general use.
  • semantic identities account_number and next_account_number might have semantic datatypes of string, and would therefore be represented internally as sequences of characters.
  • a developer might treat account_number as an integer, as illustrated in Table 6, where next_account_number is defined as account_number+1.
  • the system will recognize that “+” is an operator that applies to integers, convert account_number to an integer and perform the requested calculation. Finally, it will convert the result to a string, since the semantic datatype of next_account_number is string.
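  • The implied conversion could be sketched like this (a simplification; the patent does not specify the conversion mechanics, and the function below is a hypothetical stand-in for the engine's behavior):

      def add_with_semantic_types(value, increment, output_semantic_datatype="string"):
          """account_number arrives as a string; the '+' operator implies integers."""
          result = int(value) + increment            # convert string -> integer for "+"
          if output_semantic_datatype == "string":
              return str(result)                     # convert back to the output's semantic datatype
          return result

      # next_account_number = account_number + 1
      print(add_with_semantic_types("1000417", 1))   # "1000418"
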
  • FIG. 9 is a relationship diagram (as described above) that extends FIG. 6 to depict the preferred embodiment of function-level synthetic debugging and testing for semantic data integration.
  • a function [ 601 ] and its transformation logic [ 602 ] may be tested using test data [ 901 ].
  • Test data for each input semantic identity [ 501 . 1 ] may come from a variety of sources including: derived test data [ 902 ], which is automatically derived from the input semantic record by a generator function [ 905 ]; specified test data [ 903 ], which is manually specified by the user [ 906 ]; and existing test data [ 904 ], which is retrieved from the repository [ 201 ].
  • the system preserves data security by not exposing actual business data values within the development tool or while a data integration application is being executed by the engine.
  • synthetic debugging and testing is employed at the function level.
  • the ability to provide synthetic test data also allows for offline development in situations when the actual data sources might not be available.
  • test data values can come from multiple sources.
  • a test data generator function [ 905 ] can use information derived from the input semantic record [ 305 . 1 ] to synthetically generate test values [ 902 ], the user [ 906 ] may manually specify the test data values [ 903 ], or existing test data values [ 904 ] that are cataloged in the repository [ 201 ] by semantic identity may be used.
  • the user may choose to store test data values for each semantic identity back to the repository for future debugging and testing.
  • a developer may want to synthetically debug the Transform-Data function [ 103 ], in particular the logic described above that changes the case of last_name.
  • the developer may first try to re-use existing test data [ 904 ] from the repository. If no test data for last_name is found, the developer may try to generate test data [ 902 ].
  • the test data generator [ 905 ] may generate test data [ 902 ] that looks like this: ‘AaBbCcDdEeFfGgHhIiJj’. Upon testing this data with the function, the output correctly produces ‘Aabbccddeeffgghhiijj’.
  • the developer specifies his own test data [ 903 ]: ‘sT. jOHn’. Upon testing this data with the function, the output correctly produces ‘St. John’.
  • the developer saves this new test data to the repository so that it may be re-used the next time a developer needs test data for last_name. This is accomplished by, e.g., associating the new test data value with the associated semantic identity in a relational database table.
  • the ability to specify custom test data values allows the developer to ensure that a function responds appropriately to certain problematic input values that might not be generated by the random test data generator (e.g. integers that include random letters, negative account numbers, etc.).
  • These custom values are associated with a semantic identity (e.g., last_name), so once entered, they can automatically be reused as test data for any function, in any project, that uses the same semantic identity.
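  • The way test values are resolved and saved back for reuse might be sketched as follows; the ordering (repository values, then generated values, then user-specified values) follows the example above, and all names are assumptions:

      import random
      import string

      TEST_DATA = {}   # repository [201] of saved test values, keyed by semantic identity

      def generate_test_value(semantic_datatype, length=20):
          # Derived test data [902]: synthesize a value from the semantic datatype.
          if semantic_datatype == "string":
              return "".join(random.choice(string.ascii_letters) for _ in range(length))
          return random.randint(0, 10 ** length - 1)

      def test_values_for(semantic_identity, semantic_datatype, user_values=()):
          # Existing test data [904] is reused first; otherwise a value is
          # generated [902]; user-specified values [903] are always added.
          values = list(TEST_DATA.get(semantic_identity, []))
          if not values:
              values.append(generate_test_value(semantic_datatype))
          values.extend(user_values)
          TEST_DATA[semantic_identity] = list(values)   # save back for any future project
          return values

      for value in test_values_for("last_name", "string", user_values=["sT. jOHn"]):
          print(value.title())   # exercise the title-case transformation with each value
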
  • FIG. 10 combines a relationship diagram (as described above) and UML use-case diagram to depict the high-level separation of semantic data integration user activities, according to a preferred embodiment. Activities performed by system users can be classified either as enterprise maintenance or as application development.
  • Enterprise maintenance [ 1001 ] is performed with the semantic maintenance tool [ 204 ] and the project maintenance tool [ 205 ], and has two basic subcategories.
  • Semantic maintenance deals with the maintenance of the semantic model [ 202 ] including semantic identities [ 501 ] and physical identities [ 502 ] ( FIG. 5 ), and output-oriented semantic rules [ 701 ] ( FIG. 7 ).
  • Project maintenance is concerned with the maintenance of the project state and architecture-level objects such as semantic records [ 305 ] ( FIG. 5 ) and data access definition [ 309 ] ( FIG. 3 ) which may be defined at the project level or across the enterprise when reusability is possible.
  • Application development [ 1002 ] is performed with the development tool [ 206 ] ( FIG. 2 ) and is concerned with the development of data integration drawings [ 303 ] within or across projects.
  • Application development involves many of the objects that fall within the purview of enterprise maintenance, such as semantic identities and output-oriented rules. However, physical identities are never referenced in the context of application development.
  • the developer would then use that semantic record when creating the drawing [ 303 ] that describes the actual application [ 101 ]. At no point does the developer need to know anything about the physical nature of the VSAM file structure including the physical identities of its data. The developer can work strictly with semantic identities to define the data integration application.
  • FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when the example application [ 101 ] is executed on a single host, according to a preferred embodiment.
  • a control flow relationship diagram is a hybrid UML class/activity diagram that conveys the directionality of communication or contact between objects or components. Solid arrows indicate synchronous communication or contact and broken arrows indicate asynchronous communication or contact.
  • the data integration engine [ 207 ] has a single parent process [ 1102 ] which is a top-level operating system process whose task is to execute a data integration application defined by artifacts [ 302 ] within a specific project [ 203 . 1 ] ( FIG. 3 ).
  • the data integration engine uses these artifacts (e.g., the application drawing) to set up, initialize, and perform the data integration.
  • Distributed shared memory [ 1101 ] is a structured area of operating system memory that is used by the parent process [ 1102 ] and the child processes [ 1103 . 1 , 1103 . 2 , 1103 . 3 ] running on that host. Each of these child processes is responsible for performing a single function within the application.
  • child process A [ 1103 . 1 ] executes the Read-Data function [ 102 ]
  • child process B [ 1103 . 2 ] executes the Transform-Data function [ 103 ]
  • child process C [ 1103 . 3 ] executes the Write-Data function [ 104 ].
  • Worker threads [ 1104 . 1 - 1104 . 9 ] subdivide the processing for each child process.
  • When the parent process [ 1102 ] starts, it analyzes the application drawing [ 303 ] ( FIG. 2 ) and related metadata in other project artifacts to initialize and run the application.
  • the parent process creates and initializes a shared properties file (not shown) with control flow characteristics for the child processes and threads.
  • the parent process also creates and initializes the distributed shared memory [ 1101 ], a section of which is specifically created for and assigned to each child process and thread.
  • Each child process writes information about its execution status to its assigned portion of the distributed shared memory, and this is used by the parent process [ 1102 ] to provide updates about the execution status of the data integration engine.
  • the parent process will create each child process [ 1103 . 1 , 1103 . 2 , 1103 . 3 ], synchronously or asynchronously depending on the nature of the function.
  • One child process is created for each function in the drawing (e.g. Read-Data, Write-Data, and Transform-Data, in the example application [ 101 ]).
  • the engine runs each function in parallel so that one function does not need to complete in order for the next to begin.
  • Upon creation, each child process will read characteristics relevant to its execution from the shared properties file. These characteristics include information about how many threads should be running simultaneously to maximize parallelism. For example, if the shared properties file indicates that there are three different physical data sources for the data read by child process A [ 1103 . 1 ], then child process A will spawn three worker threads [ 1104 . 1 , 1104 . 2 , 1104 . 3 ], each of which loads its data from a different source.
  • Because child process A is reading data from three sources using three different threads, it has three outputs.
  • Child process B, which transforms the data read by child process A, has three sources of input.
  • Child process B accordingly spawns three worker threads [ 1104 . 4 , 1104 . 5 , 1104 . 6 ], each thread reading the data output by one of the worker threads spawned by child process A.
  • Child process C, which writes the output of child process B to the specified target, spawns three threads, each of which corresponds to a thread spawned by child process B.
  • This thread system allows the data integration engine to take advantage of the parallelism made possible by multiple data sources.
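  • By way of illustration only, the following Python sketch mirrors the process/thread structure described above: one child process per function in the drawing, and one worker thread per data source within each child. The function names, the shape of the drawing structure, and the per-source work are hypothetical placeholders, not the engine's actual interfaces.

      import multiprocessing as mp
      import threading

      def worker(function_name, source):
          # Each worker thread handles one data source or one upstream channel.
          print(f"{function_name}: worker processing {source}")

      def child_process(function_name, sources):
          # One child process per function; work is subdivided across one
          # worker thread per source to maximize parallelism.
          threads = [threading.Thread(target=worker, args=(function_name, s))
                     for s in sources]
          for t in threads:
              t.start()
          for t in threads:
              t.join()

      def parent_process(drawing):
          # The parent analyzes the drawing and creates one child per function;
          # all children run in parallel rather than one after another.
          children = [mp.Process(target=child_process, args=(name, sources))
                      for name, sources in drawing]
          for c in children:
              c.start()
          for c in children:
              c.join()

      if __name__ == "__main__":
          # Hypothetical drawing: three functions, each with three parallel paths.
          drawing = [("Read-Data", ["uri-1", "uri-2", "uri-3"]),
                     ("Transform-Data", ["channel-1", "channel-2", "channel-3"]),
                     ("Write-Data", ["channel-4", "channel-5", "channel-6"])]
          parent_process(drawing)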
  • When a function involves reading from or writing to a data source, the data integration engine examines the application drawing to determine the type of the data source involved. Based on the type of the data source, the appropriate interface methods are selected, and the data is read or written accordingly.
  • Control flow is asynchronous and non-locking between parent process, child processes, and threads. This is achieved by combining an update-then-signal asynchronous protocol for all communication (except for communication between threads, described above) and signaler-exclusive distributed shared memory segmentation.
  • Under the update-then-signal protocol, when a parent or child process needs to communicate with its child process or thread, respectively, it may update the distributed shared memory of the child and then asynchronously signal the child. When the child handles the signal, it will read its updated distributed shared memory (if necessary) and react. Communication in the other direction is the same.
  • When a thread or child process needs to communicate with its parent (child process or parent process, respectively), it may first update its distributed shared memory and then asynchronously signal the parent. When the parent handles the signal, it will read the updated distributed shared memory (if necessary) and react.
  • the distributed shared memory areas used for communication are exclusively written by the signaler, ensuring that two processes never attempt to access the same memory simultaneously.
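  • By way of illustration only, the following sketch shows an update-then-signal exchange in which each child exclusively writes its own segment of a shared array and then signals the parent asynchronously through a queue; the segment layout and status values are assumptions, not the engine's actual memory format.

      import multiprocessing as mp

      SEGMENT_SIZE = 4  # hypothetical layout: rows read, rows written, errors, done flag

      def child(index, shared, signals):
          seg = index * SEGMENT_SIZE
          # Update: only this child ever writes to its own segment.
          shared[seg + 0] = 100   # rows read (hypothetical value)
          shared[seg + 3] = 1     # done flag
          # Signal: asynchronously tell the parent which segment changed.
          signals.put(index)

      if __name__ == "__main__":
          num_children = 3
          # lock=False is safe because every segment has exactly one writer.
          shared = mp.Array('i', num_children * SEGMENT_SIZE, lock=False)
          signals = mp.Queue()
          procs = [mp.Process(target=child, args=(i, shared, signals))
                   for i in range(num_children)]
          for p in procs:
              p.start()
          for _ in range(num_children):
              i = signals.get()              # handle the signal...
              seg = i * SEGMENT_SIZE         # ...then read the updated segment
              print(f"child {i} status:", shared[seg:seg + SEGMENT_SIZE])
          for p in procs:
              p.join()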
  • FIG. 12 is a data flow relationship diagram that extends FIG. 11 to depict the flow of data within the data integration engine [ 207 ] when the example application is executed on a single host.
  • a data flow relationship diagram is an extension of the control flow relationship diagram (as described above) whose purpose is to convey the directionality and flow of data between objects or components. Relevant, previously described, control flow may be shown as muted or grayed, while the objects pertinent to the data flow within that control flow will be prominent or black.
  • The data flow is captured with an arrow drawn from the source of the data (no arrowhead) to the target of the data (arrowhead); the arrow is optionally labeled with the resource responsible for the data flow.
  • Channels [ 1201 . 1 - 1201 . 6 ] are resources that are used for passing data from a worker thread for one function to a worker thread for another.
  • the role of child process A [ 1103 . 1 ] is to read data (see the Read-Data function [ 102 ] in FIG. 1 )
  • the role of child process B [ 1103 . 2 ] is to produce new data by applying transformation logic to that data (see the Transform-Data function [ 103 ] in FIG. 1 )
  • the role of child process C [ 1103 . 3 ] is to write the data produced by child process B (see the Write-Data function [ 104 ] in FIG. 1 ).
  • data flows through a dedicated channel from a worker thread spawned by one child process to a worker thread spawned by another child process.
  • Channels are implemented directly or indirectly through any means of interprocess communication, e.g., named pipes, sockets, RMI-IIOP, RPC, SOAP, OOB, and MPI.
  • each child process subdivides its work using parallel worker threads.
  • child process B [ 1103 . 2 ] does not wait for child process A [ 1103 . 1 ] to read all of the data before it begins; it can start transforming data received from child process A as soon as child process A outputs any data.
  • child process C [ 1103 . 3 ] does not wait for child process B to transform all of the data before it begins; it can start writing data received from child process B as soon as child process B outputs any data.
  • Each semantic record is associated with a list of Uniform Resource Identifiers (URIs) that point to the relevant data.
  • These URIs might point to redundant copies of identical data or to data sources containing different data, but all of the indicated data sources must conform to the semantic record format that is specified in the file.
  • Generally, each URI in the list will be unique, allowing the engine to leverage parallelism by reading data simultaneously from several different locations. However, this is not a requirement, and if desired, two or more identical URIs can be listed.
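  • By way of illustration only, a parsed data access definition might be represented as follows; the field names and URIs are hypothetical, and the text describes the actual artifact as an XML file. The point is simply that the degree of read parallelism follows from the length of the URI list.

      # Hypothetical parsed form of a data access definition.
      data_access_definition = {
          "semantic_record": "monthly_account_transactions",
          "uris": [
              "file:///data/accounts/part1.vsam",
              "file:///data/accounts/part2.vsam",
              "file:///data/accounts/part3.vsam",
          ],
      }

      # One worker thread per listed URI.
      num_worker_threads = len(data_access_definition["uris"])
      print(num_worker_threads)  # -> 3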
  • Channel data flow in the sample application is structured as follows: each worker thread on child process A will read data in parallel from a data source specified by one of the listed URIs. As each thread [ 1104 . 1 , 1104 . 2 , 1104 . 3 ] spawned by child process A reads data, it makes that data available as output from child process A to be used as input for child process B [ 1103 . 2 ] by moving the data through a dedicated channel.
  • channel A 1 B 1 [ 1201 . 1 ] is a resource that is defined to pass data from thread A 1 [ 1104 . 1 ] on child process A to thread B 1 [ 1104 . 4 ] on child process B, channel A 2 B 2 [ 1201 . 2 ] passes data from thread A 2 [ 1104 . 2 ] to thread B 2 [ 1104 . 5 ], and channel A 3 B 3 [ 1201 . 3 ] passes data from thread A 3 [ 1104 . 3 ] to thread B 3 [ 1104 . 6 ].
  • When each worker thread is spawned, it receives information that can be used to identify an input channel and an output channel, using a predetermined channel identification scheme. The thread connects to both of these channels, reads data from the input channel, and writes output data to the output channel. Thus, each thread connects on startup to the appropriate data channels.
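  • By way of illustration only, the following sketch shows one possible channel identification scheme of the kind described above: the channel between thread N of one function and thread N of the next is named from both function letters and the thread index (e.g. A1B1), and each worker connects to its input and output channels on startup. The naming convention, the queue-based channel registry, and the end-of-data marker are assumptions.

      import queue

      def channel_name(upstream, downstream, index):
          return f"{upstream}{index}{downstream}{index}"   # e.g. A1B1, A2B2, ...

      # A registry of channels, one per (upstream thread, downstream thread) pair.
      channels = {channel_name("A", "B", i): queue.Queue() for i in (1, 2, 3)}
      channels.update({channel_name("B", "C", i): queue.Queue() for i in (1, 2, 3)})

      def transform_worker(index):
          # Run by a Transform-Data worker thread: derive and connect to the
          # input and output channels, then stream records from one to the other.
          input_channel = channels[channel_name("A", "B", index)]
          output_channel = channels[channel_name("B", "C", index)]
          while True:
              record = input_channel.get()
              if record is None:               # hypothetical end-of-data marker
                  output_channel.put(None)
                  break
              output_channel.put({**record, "last_name": record["last_name"].title()})

      # Usage: feed one record and an end marker through channel A1B1.
      channels["A1B1"].put({"last_name": "SMITH"})
      channels["A1B1"].put(None)
      transform_worker(1)
      print(channels["B1C1"].get())   # -> {'last_name': 'Smith'}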
  • FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when the sample application is executed in a distributed environment comprising three hosts [ 1301 . 1 , 1301 . 2 , 1301 . 3 ], according to a preferred embodiment.
  • UML collaboration diagrams are used to convey the order of messages passed between object or components.
  • existing control flow relationships may also be depicted in gray in order to preserve useful context.
  • the primary difference between this scenario and that depicted in FIG. 11 is that processing will be distributed across 3 hosts [ 1301 . 1 , 1301 . 2 , 1301 . 3 ] in a networked environment.
  • the first function is specified to run on host A [ 1301 . 1 ]
  • the second function is specified to run on host B [ 1301 . 2 ]
  • the third function is specified to run on host C [ 1301 . 3 ].
  • the application is started by executing the data integration engine [ 207 . 1 ] on host A, the “master” host.
  • the master host reads the application drawing to determine which application functions will be executed on the master host.
  • Each function is associated with a list of URIs, each of which represents a host on which the function can be executed.
  • The application developer selects one of the listed hosts; the selection is recorded in the application drawing, and the corresponding host is used by the data integration engine to execute the function. If no host is specified, the function will execute by default on the same host as the previous function, if possible.
  • Binary data is passed between hosts using any standardized protocol and byte-order.
  • In a preferred embodiment, network byte-order is used to transfer binary data between hosts and to temporarily store data on execution hosts.
  • When a host operates on the data, the data is automatically converted to machine-native byte-order for the operation, and converted back to the standardized byte-order (e.g., network byte-order) afterwards.
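  • By way of illustration only, the following sketch shows the byte-order handling described above using Python's struct module: values are kept in a standardized network (big-endian) byte-order for transfer and temporary storage, converted for the local operation, and converted back afterwards. The value and operation are hypothetical.

      import struct

      native_value = 123456789

      # Standardize for transfer between hosts ('!' = network/big-endian order).
      wire_bytes = struct.pack('!I', native_value)

      # Convert for a local operation on the execution host.
      working_value = struct.unpack('!I', wire_bytes)[0]
      result = working_value + 1

      # Convert back to the standardized byte-order afterwards.
      wire_bytes_out = struct.pack('!I', result)
      print(wire_bytes_out.hex())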
  • the host A parent process [ 1102 . 1 ] determines that only the Read-Data function will run as a child process on host A.
  • the host A parent process creates a full structure for distributed shared memory [ 1101 . 1 ] but only the sections relevant to child processes that need to run on host A will be initialized, in this case the single child process for the first function.
  • a child process [ 1103 . 1 ] for the Read-Data function is then started on host A in the manner described above. Note that the host A parent process [ 1102 . 1 ], distributed shared memory [ 1101 . 1 ], and child process A [ 1103 . 1 ] are analogous to the parent process [ 1102 ], distributed shared memory [ 1101 ], and child process A [ 1103 . 1 ] described above in FIG. 11 and FIG. 12 .
  • the host A parent process then starts a new engine parent process [ 1102 . 2 ] on host B, passing input indicating that functions already reserved for host A should be ignored.
  • the host B parent process ignores the Read-Data function since it is marked for host A and determines that only the Transform-Data function should run as a child process on host B. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.
  • the host B parent process creates a full structure for distributed shared memory [ 1101 . 2 ] but only the sections relevant to child processes that need to run on host B will be initialized (in this case, the child process for the Transform-Data function).
  • a child process for the Transform-Data function is then started on host B in the manner described above.
  • the host B parent process [ 1102 . 2 ], distributed shared memory [ 1101 . 2 ], and child process B [ 1103 . 2 ] are analogous to the parent process [ 1102 ], distributed shared memory [ 1101 ], and child process B [ 1103 . 2 ] described above in FIG. 11 and FIG. 12 .
  • the host B parent process then starts a new engine parent process [ 1102 . 3 ] on host C, passing input indicating that functions already reserved for hosts A and B should be ignored.
  • the host C parent process ignores the Read-Data and Transform-Data functions because they have been reserved for the other hosts, and determines that only the Write-Data function should run as a child process on host C. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.
  • the host C parent process creates a full structure for distributed shared memory [ 1101 . 3 ] but only the sections relevant to child processes that need to run on host C will be initialized (in this case, the child process for the Write-Data function). A child process for the Write-Data function is then started on host C in the manner described above. Note that the host C parent process [ 1102 . 3 ], distributed shared memory [ 1101 . 3 ], and child process C [ 1103 . 3 ] are analogous to the parent process [ 1102 ], distributed shared memory [ 1101 ], and child process C [ 1103 . 3 ] described above in FIG. 11 and FIG. 12 .
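  • By way of illustration only, the cascading startup described in FIG. 13 can be summarized as follows; the host names, the assignment structure, and the in-process simulation (in place of actually launching engines on remote hosts) are assumptions.

      def start_parent(host, hosts, assignments, reserved=frozenset()):
          # Claim the functions assigned to this host, ignoring any functions
          # already reserved by upstream hosts.
          local = [f for f, h in assignments.items() if h == host and f not in reserved]
          print(f"{host}: starting child processes for {local}")
          # Start the engine parent process on the next host, if any remain,
          # passing along the updated reserved set.
          remaining = hosts[hosts.index(host) + 1:]
          if remaining:
              start_parent(remaining[0], hosts, assignments, reserved | set(local))

      if __name__ == "__main__":
          hosts = ["hostA", "hostB", "hostC"]
          assignments = {"Read-Data": "hostA",
                         "Transform-Data": "hostB",
                         "Write-Data": "hostC"}
          start_parent("hostA", hosts, assignments)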
  • FIG. 14 is a modified UML collaboration diagram (as described above) that extends FIG. 13 to illustrate the process of distributed shared memory replication when the sample application is executed in a distributed environment comprising three hosts.
  • Each child process is responsible for updating its portion of the distributed shared memory structure at regular intervals.
  • The replication process begins on host C [ 1301 . 3 ] when its update interval arrives.
  • The host C parent process [ 1102 . 3 ] writes [ 1401 ] its control data as output [ 1402 ].
  • When the update interval for host B is reached, the host B parent process [ 1102 . 2 ] will read [ 1403 ] the output from the host C parent process, and update [ 1404 ] its distributed shared memory [ 1101 . 2 ] with the control data it read as output from host C.
  • A cumulative update of control data, including control data from host B and host C, is then written [ 1405 ] as output [ 1406 ] from the host B parent process.
  • When the update interval for host A is reached, the host A parent process [ 1102 . 1 ] will read [ 1407 ] the output from the host B parent process (which it started), and update [ 1408 ] its own host A distributed shared memory [ 1101 . 1 ] with the control data it read as output from host B.
  • the parent process on the master host periodically reads the contents of the distributed shared memory to obtain information related to each of the child processes. This process occurs at regular intervals, and is timed according to a user-configurable parameter.
  • the information read from the distributed shared memory is used to monitor the progress of the child processes and to provide status updates.
  • Using a distributed shared memory structure to provide status updates allows the child processes to process data in an uninterrupted fashion, without pausing periodically to send status messages. Essentially, this creates a system by which data are transferred from child process to child process “in-band” while status messages and updates are transferred “out-of-band,” separate from the flow of data.
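  • By way of illustration only, one replication cycle of the out-of-band control data described above might proceed as follows; the control-data fields and the explicit function calls (in place of timed update intervals on separate hosts) are assumptions.

      local_status = {
          "hostC": {"Write-Data": {"rows": 0}},
          "hostB": {"Transform-Data": {"rows": 0}},
          "hostA": {"Read-Data": {"rows": 0}},
      }
      published_output = {}   # what each host's parent process last wrote as output

      def replicate(host, downstream_host=None):
          # Merge the downstream host's published control data into this host's
          # view, then publish a cumulative update for the host upstream.
          cumulative = dict(local_status[host])
          if downstream_host is not None:
              cumulative.update(published_output.get(downstream_host, {}))
          published_output[host] = cumulative
          return cumulative

      # Host C publishes, host B merges and publishes, and the master (host A)
      # ends up with status for every child process in the application.
      replicate("hostC")
      replicate("hostB", downstream_host="hostC")
      master_view = replicate("hostA", downstream_host="hostB")
      print(sorted(master_view))   # -> ['Read-Data', 'Transform-Data', 'Write-Data']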
  • FIG. 15 is a data flow relationship diagram (as described above) that merges FIG. 12 and FIG. 13 to illustrate the flow of data in the data integration engine when the sample application is run in a distributed environment comprising three hosts.
  • the channel data flow method being employed is identical to the single-host/single-instance method described above ( FIG. 12 ), except that the channels now operate to pass data between threads running on different hosts.
  • the communication channels that pass data between threads across hosts can be implemented using any inter-host communication means, including sockets, RMI-IIOP, RPC, SOAP, OOB, and MPI.

Abstract

A method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables. Methods include receiving a set of physical data identifiers that identify physical data fields, associating semantic names with these fields, and executing rules expressed in terms of these semantic names.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit under 35 U.S.C. §119(e) of the following application, the contents of which are incorporated by reference herein:
      • U.S. Provisional Application No. 61/052,548, entitled Method and System for Structuring and Performing Data Integration and Conversion Using a Semantic Model, filed on May 12, 2008.
  • This application is related to the following applications, filed on an even date herewith:
      • U.S. patent application Ser. No. [TBA], entitled Method and System for Debugging Data Integration Applications with Reusable Synthetic Data Values;
      • U.S. patent application Ser. No. [TBA], entitled Method and System for Managing the Development of Data Integration Projects to Facilitate Project Development and Analysis Thereof;
      • U.S. patent application Ser. No. [TBA], entitled Method and System for Developing Data Integration Applications with Reusable Functional Rules that are Managed According to their Output Variables;
      • U.S. patent application Ser. No. [TBA], entitled Method and System for Executing a Data Integration Application Using Executable Units that Operate Independently of Each Other.
    BACKGROUND
  • 1. Field of the Invention
  • The present invention relates to data integration applications, and, more specifically, to abstracting data used in a data integration application by using semantic names.
  • 2. Discussion of Related Art
  • When a database system is upgraded or replaced, the existing data must be transferred to the new system. This process, called data migration, is becoming increasingly expensive as database systems become larger and more complex. Planning and executing a data migration consumes valuable resources and can often result in considerable downtime. Also, mistakes in data migration can lead to data corruption, which is not an acceptable risk for institutions that handle sensitive data.
  • These difficulties are compounded when it is necessary to combine data from several different data storage systems, a process known as data integration. Data integration applications must reconcile data from several potentially incompatible storage systems, convert these data into a unified format, and load the new data into the target database. These are complicated tasks, and they require careful planning and detailed knowledge of the structure of the source databases. Errors in data integration are common, difficult to diagnose, and expensive to fix.
  • In the past, data integration applications have typically been developed for a specific database upgrade or merger task, and they become useless after this task is complete. This ad hoc approach makes it impossible to reuse program code, substantially increasing the cost of development. Also, it tends to produce applications that are written from scratch and not comprehensively tested, increasing the likelihood of data corruption.
  • In light of these problems, there exists a need for an improved method of developing database applications that minimizes the costs and risks associated with data migration and data integration.
  • SUMMARY OF THE INVENTION
  • This invention provides methods and systems for developing data integration applications with reusable semantic identifiers to represent application data sources and variables.
  • Under one aspect of the invention, a method is presented that includes receiving a set of physical data identifiers that specify fields of physical data sources, storing in a database a set of semantic names for use in defining data integration applications, defining, in terms of the received semantic names, a data integration application comprising functional rules to extract, transform, and store data, and executing these rules by replacing each of the semantic names with data from the specified field of the physical data source.
  • Under another aspect of the invention, the method further includes automatically converting the input data values from one datatype to another as required by the functional operators.
  • Under another aspect of the invention, the method further includes providing a set of suggested semantic names and associating one of the suggested semantic names with a field of a physical data source.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
  • FIG. 1 is a dataflow diagram that illustrates the operation of an example application, according to certain embodiments of the invention.
  • FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.
  • FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention.
  • FIG. 4 is a UML state diagram that depicts the relationships among the various project stages, according to certain embodiments of the invention.
  • FIG. 4.1 is a flowchart that depicts the various stages of project development, according to certain embodiments of the invention.
  • FIG. 5 is a relationship diagram that depicts the components of the semantic model, according to certain embodiments of the invention.
  • FIG. 6 is a relationship diagram that depicts the structure of a semantic data integration function within an application, according to certain embodiments of the invention.
  • FIG. 7 is a relationship diagram that depicts an output-oriented rule definition, according to certain embodiments of the invention.
  • FIG. 8 is a relationship diagram that depicts the use of output-oriented rules in a function, according to certain embodiments of the invention.
  • FIG. 9 is a relationship diagram that depicts the preferred embodiment of function-level synthetic debugging and testing for semantic data integration.
  • FIG. 10 is a diagram that illustrates the separation of development roles in a semantic data integration project, according to certain embodiments of the invention.
  • FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.
  • FIG. 12 is a data flow relationship diagram that illustrates the flow of data within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.
  • FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.
  • FIG. 14 is a modified UML collaboration diagram that illustrates the process of distributed shared memory replication when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.
  • FIG. 15 is a data flow relationship diagram that illustrates the flow of data in the data integration engine when a sample data integration application is run in a distributed environment, according to certain embodiments of the invention.
  • FIG. 16 is a diagram that depicts the various components of a data integration engine, according to certain embodiments of the invention.
  • DETAILED DESCRIPTION
  • I. Introduction
  • Preferred embodiments of the present invention provide semantic systems and methods for developing, deploying, running, maintaining, and analyzing data integration applications and environments.
  • Those data integration applications that are relevant to the techniques described herein are broadly described by the class of applications concerned with the movement and transformation of data between systems and commonly represented by, but not limited to: data warehousing or ETL (extract-transform-load) applications, data profiling and data quality applications, and data migration applications that are concerned with moving data from old to new systems.
  • Data integration applications developed and maintained using these techniques are developed using a semantic model. At its core, a semantic development model enables an application to be partially or fully developed without knowledge of the physical data identities (locations, structures, names, types, etc.) being integrated. Physical identities are present in the system but they are abstracted with semantic identities. There are several advantages to this approach: changes to physical data locations or structures do not automatically prevent the application developer from accomplishing real work; a high or intimate level of knowledge of the data being integrated is not required; business rules and other application logic developed using semantic identities can easily be reused and tested from application to application regardless of the physicality of the underlying data structures; and the costs of data mapping exercises can be significantly reduced over time as the system learns about fields that are semantically equivalent.
  • A data integration application developed using the techniques described herein is preferably stored in a common repository or database. This database includes a semantic metadata model that correlates physical locations and datatypes, representing the source and target data, with semantic identities. The database also includes representations of business rules that are defined using semantic identities instead of physical identities. The business rules and the semantic model are stored and maintained separately. Thus, application developers do not need to know the physical locations or datatypes of the source data in order to implement data transformation functions.
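  • By way of illustration only, the separation between the semantic model and the business rules might be captured with structures such as the following; the class names, fields, and rule registry are assumptions rather than the repository's actual schema.

      from dataclasses import dataclass

      @dataclass
      class SemanticIdentity:
          name: str                  # e.g. "last_name"
          semantic_datatype: str     # e.g. "string"

      @dataclass
      class PhysicalIdentity:
          name: str                  # e.g. "lastName"
          physical_datatype: str     # e.g. "VARCHAR(32)"
          location: str              # table or file in which the field lives

      @dataclass
      class SemanticBinding:
          semantic: SemanticIdentity
          physical: PhysicalIdentity

      # Business rules are stored separately and refer only to semantic names.
      rules = {"last_name": lambda value: value.title()}

      bindings = [
          SemanticBinding(SemanticIdentity("last_name", "string"),
                          PhysicalIdentity("lastName", "VARCHAR(32)", "rdbms:accounts")),
          SemanticBinding(SemanticIdentity("last_name", "string"),
                          PhysicalIdentity("LST_NM", "PICX(20)", "vsam:accounts")),
      ]

      # Two physical fields bound to one semantic identity; one shared rule.
      print(sum(1 for b in bindings if b.semantic.name == "last_name"))  # -> 2
      print(rules["last_name"]("SMITH"))                                 # -> Smith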
  • The repository is preferably maintained and updated using a hybrid versioning system for data integration projects. This system provides version control for project artifacts, and also provides a fine-grained locking mechanism that controls the ability to edit and execute a project in various ways according to the project's current stage in the development process. The hybrid versioning system also interfaces with a relational database, which can be used to efficiently calculate and report project metrics.
  • The system's data integration engine executes data integration applications using a parallel, distributed architecture. Parallelism is achieved where possible by leveraging multiple redundant data sources and distributing the execution of the application across multiple hosts. The techniques disclosed herein are scalable to execution environments that comprise a large number of hosts.
  • FIG. 16 is a diagram that depicts the various components of a data integration system, according to certain embodiments of the invention. The functional logic of the data integration is performed by a host computer [1601], which contains volatile memory [1602], a persistent storage device such as a hard drive [1608], a processor [1603], and a network interface [1604]. Using the network interface, the computer can interact with databases [1605, 1606]. During the execution of the data integration application, the computer extracts data from some of these databases, transforms it according to programmatic data transformation rules, and loads the transformed data into other databases. Though FIG. 16 illustrates a system in which the computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface. The data transformation rules may be executed on a single host, as shown in FIG. 16, or they may be distributed across multiple hosts.
  • The host computer shown in FIG. 16 may also serve as a development workstation. Development workstations are preferably connected to a graphical display device [1607], and to input devices such as a mouse [1609], and a keyboard [1610]. One preferred embodiment of the present invention includes a graphical development environment that displays a data integration application as a diagram, in which the data transformation rules are represented by shapes and the flow of data between rules is represented by arrows. This visual interface allows developers to create and manipulate data integration applications at a more intuitive level than, for example, a text-based interface. However, the techniques described herein may also be applied to non-graphical development environments.
  • Each of these features is discussed in more detail in the sections that follow.
  • II. Project Model
  • FIG. 1 is a dataflow diagram that illustrates the operation of an example data integration application that will be referenced in following sections. An application organizes the execution of a set of functions, which perform individual units of work, and the flow of data between those functions. The sample application [101] has three functions, represented by boxes, and data flow between those functions, represented by arrows.
  • In this example, the Read-Data function [102] reads monthly transactional bank account data from a VSAM file [105] and outputs that data for use as input in the next function. The Transform-Data function [103] receives its input from the Read-Data function. Its transformation logic aggregates those bank account transactions to compute end-of-month status for each account, and outputs the end-of-month status for use as input to the next function. Finally, the Write-Data function [104] receives the end-of-month status from the Transform-Data function and writes that data to a flat RDBMS table [106] which will be used to produce monthly snapshot reports for each bank account.
  • Development of the sample application [101] begins when a project is created for managing the application's development and deployment. Also, a semantic model, separate from the project, is used to store and maintain the association between physical identities (i.e. the physical locations and datatypes of the project's source data) and semantic identities. If no semantic models have been created for the relevant data, a new semantic model is initialized. If a semantic model for the project's source data had already been created (e.g. by a prior project, or through ongoing maintenance) then the new project may use the existing semantic model; thus, it is not necessary to create a new semantic model for each new project.
  • After the creation of the project, project-specific artifacts may be created. These artifacts, discussed in more detail below, are tested and checked-in to the repository. The project entity also contains an identifier that represents the current stage of project development. At each stage of the project the application is executed by the data integration engine in a stage-specific environment. Eventually the application is considered complete and the project, and all applications contained within the project, are moved into production.
  • FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.
  • The system repository [201] is a database used by the system's tools and engine. It is centrally deployed in order to capture and share system objects across applications, and to provide visibility into data integration projects, data usage, application performance, and various metrics. The repository consists of three high-level subsystems: a relational database, a source control system, and business logic to implement functionality such as creating a project, publishing, staging, etc. The database and source control subsystems are provided using conventional third party technologies. The business logic is implemented with a J2EE application but could easily be .NET or some other web-application technology. The various system tools (semantic maintenance tool [204], project maintenance tool [205], and development tool [206]) connect to these repository subsystems directly as required.
  • The primary contents of the repository include: the semantic model [202], which captures metadata that describes the contextual or semantic identities in an enterprise, the actual or physical identities in an enterprise, and various relationships between the semantic identities and physical identities; and projects [203], which are system objects that group related artifacts necessary for defining and deploying a data integration application.
  • The repository is manipulated by system tools including: the semantic maintenance tool [204], which maintains the semantic model, the project maintenance tool [205], which maintains projects and associated data and generates reports to various levels of detail across the system, the development tool [206] which is used to develop data integration applications, and the integration engine [207], which executes applications using a parallel, distributed system, computes runtime statistics, and stores these statistics in the repository. Additional description of these components and how they interact is included below.
  • FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention. A relationship diagram is a modified UML class diagram that conveys the relationships between objects or components. The object or component is labeled in a rectangular box and a relationship to another object or component is represented with a labeled arrow from one box to the other. The relationship reads from arrow begin to arrow end (the end of the line with the actual arrow). Like UML class diagrams, these relationship diagrams allow for containment to be expressed with an arrow or by placing the child object visually within the parent object. In some cases the rectangle for an object or component is dashed indicating that it is not an actual object but it is really a conceptual group (like an abstract class) for the objects shown therein.
  • A project [203.1] as depicted is one of the many projects [203] shown in FIG. 2. A project is a system object that is used to organize and manage the development and deployment of data integration applications through various stages. A project's stage [301] specifies the current state of the project within the development and deployment process. The various stages that may be associated with a project are described in more detail below. A project's measures [308] are metrics or statistics relevant to the project that are collected after the project is created.
  • A project's artifacts [302] define the project's applications and supporting metadata. These are preferably captured as XML files that are versioned using the standard source control functionality implemented by the development tool. Project artifacts are accumulated after inception and include: drawings [303] which are visual descriptions of the functions, transformations, and data flow for one or more applications; data access definitions [309] which individually describe a set of parallel access paths (expressed as URIs) to physical data resources; semantic records [304] which primarily describe the data structures for one or more applications; documentation [305] for the project and its artifacts, and other artifacts [306] that may be created during the life of the project.
  • The project model [307] is a relational model that represents the project, its artifacts, and other data and metadata for the project. The project model and project measures provide a basis for introspection and analysis across all projects in the system.
  • A project's stage also controls where the project can be run. In an environment where this system is deployed, individual machines where the engine can run are designated to allow execution only for a specific stage. For example, host machine SYSTEST288 may be designated as a system testing machine. Any instance of the system's engine that is deployed on SYSTEST288 will only allow projects in the “system testing” stage [301.2] to run. This additional level of control is compatible with how IT departments prefer to isolate business processes by hardware.
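  • By way of illustration only, an engine instance designated for a single stage might enforce this restriction as follows; the names and error handling are hypothetical.

      ENGINE_ALLOWED_STAGE = "system testing"   # designated when the engine is deployed

      def run_project(project_name, project_stage):
          if project_stage != ENGINE_ALLOWED_STAGE:
              raise PermissionError(
                  f"this host only executes '{ENGINE_ALLOWED_STAGE}' projects; "
                  f"'{project_name}' is in stage '{project_stage}'")
          print(f"running {project_name}")

      run_project("monthly-account-status", "system testing")    # allowed
      # run_project("monthly-account-status", "development")     # would raise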
  • For example, the simple application described above [101] might be developed as part of a new project implemented by the IT department of a financial institution that wishes to gather and analyze additional monthly status for individual bank accounts. The project [203.1] would be created by a project manager using the project maintenance tool [205] and the project would begin in the development stage [301.1] (described below). Preliminary project artifacts [302] such as semantic records [304] (described below) would then be defined and added to the project by a data architect or equivalent. A developer would then use these artifacts to create drawings [303] which define the transformation logic of the application [101]. As the application is developed and tested, the project would move through various stages (see FIG. 4) until it is finally placed into production. The project measures [308] would allow the project manager and others to analyze the project using relational reporting and analysis techniques in order to improve the company's data integration and business processes.
  • The development tool [206] is conventional, and similar in layout and purpose to many other existing graphical programming tools that may be used for defining workflow, process flow, or data integrations. Examples of such tools include Microsoft BizTalk Orchestrator, Vignette Business Integration Studio, and FileNet Process Designer, among others. The primary workspace consists of a palette of symbols corresponding to various functions that may be performed by the engine [207], and a canvas area for creating a drawing. Prior to creating drawings for a project, the user is given permission to work on that project by another user, typically a project manager, of the project maintenance tool [205]. These permissions are stored in the repository [201].
  • From within the development tool, which is installed on the local computer of the developer using the tool, the developer is allowed to “check out” a snapshot of the artifacts for any project for which the user has permission (as defined in the repository). The project artifacts must include any semantic records [305] and data access configurations [309] that the developer will need to build the drawing; these requisite artifacts were previously defined by another user, typically a data architect, using the project maintenance tool [205].
  • Within the development tool, the user creates a drawing. Using our sample application for descriptive purposes, this process may work like this:
  • After project checkout (defined above), the user drags functions from the palette to the canvas area. In the case of our sample application, the user would drag 3 different functions from the palette: one to read data from a file (necessary for the Read-Data function [102]), one to transform data (necessary for the Transform-Data function [103]), and one to write data to a table (necessary for the Write-Data function [104]). The user would then visually “connect” the functions in the drawing according to the direction of the data flow for the sample application. Each function has properties that must be configured to define its specific behavior for the engine.
  • The user will then edit these properties with standard property editor user interfaces. The properties specified by the user for the Read-Data function include the name of its output semantic record [305.2] which specifies the data being read from the file, and the name of a data access configuration [309] which specifies one or more parallel access paths (expressed as URIs) to the file. The properties specified by the user for the Write-Data function include the name of its input semantic record [305.1] which specifies the data being written to the table, and the name of a data access definition [309] which specifies one or more parallel access paths (expressed as URIs) to the table.
  • Because the user connected the Read-Data function to the Transform-Data function, the input semantic record [305.1] for the Transform-Data function [103] is automatically derived from the output semantic record of the Read-Data function [102]. Similarly, because the user connected the Write-Data function to the Transform-Data function, the output semantic record [305.2] for the Transform-Data function [103] is automatically derived from the input semantic record of the Write-Data function [104]. The user will further configure the Transform-Data function in the drawing by specifying its transformation logic in a transformation editor. The semantic identities of the input semantic record and output semantic record are presented to the user in this editor. In the transformation editor, the user provides logic that specifies how output values in the semantic record are calculated. When the values are a direct move from input to output, a simple statement such as “output=input” can be used to automatically move data from input to output for any like-named semantic identities. When more specific rules are needed for an output field, they can be specified directly in the logic, for example:
      • output.full_name=string.concatenate(input.first_name, " ", input.last_name)
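  • By way of illustration only, the following sketch shows the effect of such logic: like-named semantic identities are moved from input to output automatically (the “output=input” case), while full_name is produced by an explicit rule. The dictionary-based record representation and field values are assumptions.

      def transform(input_record, output_fields):
          output = {}
          # "output = input": move values for any like-named semantic identities.
          for name in output_fields:
              if name in input_record:
                  output[name] = input_record[name]
          # Explicit rule for an output field, written against semantic names only.
          output["full_name"] = f"{input_record['first_name']} {input_record['last_name']}"
          return output

      record = {"first_name": "Jane", "last_name": "Smith", "account_number": "12345"}
      print(transform(record, ["account_number", "full_name"]))
      # -> {'account_number': '12345', 'full_name': 'Jane Smith'}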
  • As the user builds the drawing and configures each function, the development tool will validate the drawing according to the rules of the engine and present warnings and errors to the user so that the user may correct the drawing. The user has the ability to synthetically debug (FIG. 9) the Transform-Data function from within the development tool. The user may also execute the drawing from the development tool; in this scenario the execution may be performed by a local instance of the engine which is installed with the development tool, or on a remote instance of the engine which has been installed in an environment configured to support such testing. In either case, the machine hosting the engine requires that any client access technologies relied on by the data access configurations [309] for each function in the drawing already be configured on the same machine; for example, in order to write to a table, the correct database drivers must be configured on the machine whose engine will be using those drivers to perform that operation. At any time during this development process, the developer may “check-in” the drawing to the repository. This process is conventional in terms of workflow and implementation; the user may provide a comment for the change and a new version of the new drawing will be added to the source control system in the repository.
  • III. Hybrid Version Control System
  • The project artifacts, which are maintained via traditional source control mechanisms as described, the project staging controls (described below), and the project model which models those sources in a relational database, are maintained using a hybrid version control system, comprising both a standard version control system and a relational database. Traditionally, version control systems have made it possible to record individual changes made to versioned artifacts, but do not allow for the analysis of these changes using standard relational database query techniques. Using pure relational database systems, however, it is extremely difficult to provide traditional version control functionality. Additionally, a traditional source control system does not inherently control access to system sources based on the development life-cycle stage of the project; such systems must rely on externally defined and enforced business practices to control access. The hybrid version control system disclosed herein allows for both traditional artifact versioning/source control and relational data modeling of the same artifacts. The hybrid version control system also provides built-in support for controlling access to project sources according to the current stage of the project.
  • FIG. 4 is a UML state diagram that depicts the relationships among the various project stages [301], according to a preferred embodiment. The staging model provides control for moving a project through a development life-cycle.
  • At any given time, a project may be in at most one of the following deployment stages (analogous to states in a state transition diagram): development [301.1], system testing [301.2], integration testing [301.3], user acceptance testing [301.4], readiness testing [301.5], or production [301.6].
  • Each one of these stages has two superstates. The first superstate signifies whether a project is unlocked [403], which means that changes to project artifacts are allowed, or locked [404] which means that changes are not allowed. The second superstate signifies whether a project is unpublished [401], which means that the project model has not been refreshed from the most recent changes to project artifacts, or published [402] which means that the project model is fully representative of the current project artifacts.
  • In the preferred embodiment, a project is created, published, and staged using the project maintenance tool [205]. Individual artifacts and changes to them are stored as separate versions in the repository's source-control system using the system's tools such as the project maintenance tool [205] and the development tool [206]. User permissions related to project development may be implemented using any user authentication/control databases, such as LDAP and ActiveDirectory.
  • After a project is created, it is unpublished [401] and in the development stage [301.1]. Artifacts may only be added, modified, or removed from source control when the project is in the development stage which also implies that the project is unlocked [403]. When a project is “published,” all of the information stored about the project in the version control system, including, e.g., new versions of drawings and functions, checkin/checkout log entries and times, etc., is moved into a relational database, the contents of which can be queried using traditional relational techniques. After a project is published it will be in a published state such that the repository's relational model of the project has been updated from all current project artifacts in source-control, making the project available for post-development deployment staging.
  • The project artifacts are moved from the source control system to the relational database using conventional serialization methods and systems. When the project is published to the database, the new publication does not replace the older published version of the project, but is stored as a separate publication. Thus, queries executed against the database may gather information and statistics about multiple publications.
  • If changes are again made to project artifacts while in the development stage, the project will again be in an unpublished state until being explicitly published again. From a published superstate a project in the development stage may be staged forward to any post-development stage including production [301.6]. After being staged out of development, the project is in a locked superstate such that artifacts cannot be modified until the project is staged back to development.
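  • By way of illustration only, the staging rules described above can be modeled as a small state machine; the class, method names, and error messages are assumptions.

      STAGES = ["development", "system testing", "integration testing",
                "user acceptance testing", "readiness testing", "production"]

      class Project:
          def __init__(self):
              self.stage = "development"
              self.published = False

          @property
          def locked(self):
              # Artifacts may only change while the project is in development.
              return self.stage != "development"

          def modify_artifact(self):
              if self.locked:
                  raise RuntimeError("project is locked; stage back to development")
              self.published = False    # changes invalidate the last publication

          def publish(self):
              # Refresh the relational project model from the current artifacts.
              self.published = True

          def stage_to(self, target):
              if target not in STAGES:
                  raise ValueError(f"unknown stage: {target}")
              if target != "development" and not self.published:
                  raise RuntimeError("publish before staging out of development")
              self.stage = target

      p = Project()
      p.modify_artifact()
      p.publish()
      p.stage_to("system testing")      # project is now locked
      p.stage_to("development")         # stage back to make a change
      p.modify_artifact()
      p.publish()
      p.stage_to("integration testing")
      print(p.stage, p.locked, p.published)   # -> integration testing True True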
  • As an example, after development [301.1] is complete, the project for the sample application [101] may be published and moved to a system testing stage [301.2]. While in this stage, various system tests are performed on the application and changes to the project's artifacts are prohibited. If system testing is successful, the project may be moved to an integration testing stage [301.3]. While in this stage, one of the tests uncovers an issue that must be addressed by a slight change to the configuration of the Write-Data [104] function in the drawing for the application. The project is moved back to the development stage [301.1] so that a developer can make this change. After the change is tested by the developer and checked-in, the project is published again and moved back to the integration testing stage [301.3] for re-test. The application might then pass testing at this stage and each subsequent stage until it is finally put into production [301.6].
  • Each time the artifacts are published, the project model [307] and project measures [308] are updated. Both the project model and measures are maintained as a relational model in the repository. This enables project managers, data architects, decision makers, and other system users to query and analyze the project and projects in interesting ways. For example, a project manager may quickly learn in which projects a developer user has used a particular semantic record (which may be known to be broken); or cumulative usage across projects of a certain table; or which output rules for a certain semantic identity are used most. This type of inquiry and analysis is possible because of the publish functionality in the repository.
  • Some project metrics may use information from the source control system as well as the repository. Because a source file may be checked-out and checked-in multiple times between publications, only the source control system contains information about these intermediate file-versions.
  • FIG. 4.1 is a flowchart that depicts the separation of roles across the various stages of project development, according to certain embodiments of the invention. The project manager [4101] creates a project called “foo” [4102] in the source control system, and assigns users [4103] to it. The data architect [4104] then checks out the project [4105] and creates or modifies the semantic records and data access definitions that will be used by the project “foo” [4112] (these are discussed in more detail below). When this step is complete, the developer [4113] checks out the project [4106] and creates and modifies the project's drawings [4107] in the source repository, which specify the data transformation, extraction, and load rules used by the project and determine how data flows among these rules. When complete, the developer checks the project in [4108]. At this point, the project manager [4101] publishes the project [4109], which moves the project artifacts into the relational database [4110]. After the project has been published, it may be moved into the “staging” phase [4111]. Eventually, the project state will be set to “production,” the final phase of the project development process.
  • IV. Semantic Model
  • FIG. 5 is a relationship diagram (as described above) that depicts the components of the semantic model [202] in the repository [201], according to a preferred embodiment. The semantic identity [501] is metadata that represents the abstract concept or meaning for a single business object that may be used in an enterprise; for example, an employee's last name. Additional properties of the semantic identity pertaining to its semantic type, subject area, and composition are also captured in the semantic model.
  • The output rule [701] defines the business logic for calculating a value for the semantic identity within a data integration application. A semantic identity may have multiple output rules. The output rule and its usage is described in more detail in a later section.
  • The physical identity [502] is metadata that captures the external (physical) name of a specific business object (e.g., a database column). The physical datatype [504] captures the external (physical) datatype of the associated physical identity (e.g., “20 character ASCII string”). The semantic datatype [505] is associated with the semantic identity and specifies the datatype of the data referenced by the semantic identity, as used internally by the data integration application. The physical datatype is used by the engine when it is reading or writing actual physical data. The semantic datatype is used by the engine when processing transformation logic in the application (described later).
  • The semantic binding [503] associates a physical identity with a particular semantic identity. Many physical identities and their physical attributes may be associated with the same semantic identity. For example, fields from various physical data locations such as lastName with a physical datatype of CHAR(30) in one RDBMS table, last_name with a physical datatype of VARCHAR(32) in another RDBMS table, and LST NM with a physical datatype of PICX (20) in a COBOL copybook, may all be physical instantiations of the semantic identity last_name, which could be universally associated with a semantic datatype string.
  • A semantic record [304] describes the layout of a physical data structure such as an employee table. Each field in the table would be described with a semantic binding that captures the actual column name (the physical identity) and the semantic identity. Other metadata specific to each field in the employee table, such as data type information, would also be described for each field in the semantic record.
  • Using the semantic maintenance and project maintenance tools, a user would create and maintain the semantic model as follows. The user would first locate the actual metadata for the physical data that must be represented. As an example, using the sample application, this would be the metadata for the VSAM file being read and the metadata for the RDBMS table being written. The names and types of each field or column would be preserved as physical identities and physical datatypes. A rationalization process is then performed by the tool: using conventional string matching techniques and statistical methods, the tool takes each physical identity, decomposes it, analyzes it, and suggests zero or more semantic identities. The user makes the final decision as to which semantic identity most applies to each physical identity. When an existing semantic identity does not apply, the user may define a new one and its semantic datatype. The physical identity, semantic identity, semantic binding, and other metadata gathered during the rationalization process are saved in the repository.
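  • By way of illustration only, the suggestion step of such a rationalization process might look like the following; Python's difflib is used here as a stand-in for the string matching and statistical methods mentioned above, and the list of known semantic identities is hypothetical.

      import difflib
      import re

      known_semantic_identities = ["last_name", "first_name", "account_number",
                                   "account_balance", "transaction_type"]

      def decompose(physical_identity):
          # Split on underscores, hyphens, spaces, and camelCase boundaries.
          spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", physical_identity)
          return re.split(r"[-_\s]+", spaced.lower())

      def suggest(physical_identity, limit=3):
          candidate = "_".join(decompose(physical_identity))
          return difflib.get_close_matches(candidate, known_semantic_identities,
                                           n=limit, cutoff=0.7)

      print(suggest("lastName"))   # -> ['last_name', 'first_name'] (user affirms last_name)
      print(suggest("LST NM"))     # -> ['last_name']
      print(suggest("ACC-NO"))     # -> [] (no close match; the user chooses account_number)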
  • The components of the semantic model are described in more detail below.
  • FIG. 6 is a relationship diagram (as described above) that depicts the structure of a semantic data integration function within an application (such as the sample application described above) according to a preferred embodiment. A function [601] performs an individual body of work within an application. The function in FIG. 6 is a generic representation of any particular function in the present data integration system and could represent any of the functions [102], [103], or [104] in the sample application [101].
  • Depending on the type of function, the function may have the following types of input: input data [603], which is an actual input data value that the function will consume when it runs in the engine; an input semantic identity [501.1], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data in a record that will be input to the function; and an input semantic record [305.1], which is a semantic record [305] from the semantic model (FIG. 5) that describes the exact structure and format of a data record that will be input to the function.
  • Depending on the type of function, the function may have the following types of output: output data [604], which is an actual output data value that the function will produce when it runs in the engine; an output semantic identity [501.2], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data in a record that will be output from the function; and an output semantic record [305.2], which is a semantic record [305] from the semantic model (FIG. 5) that describes the exact structure and format of a data record that will be output from the function.
  • A data access definition [309] will also be associated with the function. When the purpose of the function is to read or write data from or to a physical data source, the data access definition will specify one or more URIs for accessing the physical data being read or written, each of which constitutes a parallel processing path (or channel) for the operation. When the function is an internal operation whose job is to manipulate data that has already been read (prior to writing), the data access definition identifies the particular channels that are relevant to the functions it is connected to.
  • Depending on the type of function, the function may also have transformation logic [609] which may be used to calculate the output values for the function.
  • A semantic function is able to correlate input data to output data using the semantic identities. For example, if the input semantic record [305.1] includes a field with semantic identity last_name [501.1] whose actual source is from a column named lastName [502] and if the output semantic record [305.2] includes a field with semantic identity last_name [501.1] whose actual data source is a field named lstNm in a file [502], provided that the semantic model captures these relationships, the function will know that the two fields are semantically equivalent because they share the same semantic identity last_name, and thus can move the correct input data [603] to the correct output data [604] with little or no additional specification.
  • Using our sample application as an example, the output semantic record [305.2] for the Read-Data function [102] may include a semantic binding [503] that binds the output semantic identity [501.2] last_name to a physical field named LST_NM in the data being read from the VSAM file [105]. The input semantic record [305.1] for the Transform-Data function [103] may include the same semantic binding. The data coming from the VSAM file on the mainframe stores all last names in upper case; ex: SMITH. The transformation logic [602] in the Transform-Data function [103], which is a semantic function [601] like all functions in an application for the present invention, may be written to convert the input data [603] for input semantic identity [501.1] named last_name to title case; ex: Smith.
  • In writing this transformation logic, the developer only needs to know the semantic name last_name, and does not require any knowledge about the associated physical identity or the attributes of the VSAM source where the data is physically located. For example, suppose that in a different application in a different project, the physical identity for last name data pulled from a mainframe were called NAME_LAST. As part of that effort, the semantic model would be updated and a new semantic binding created that associates NAME_LAST with the last_name semantic identity. The same transformation logic responsible for converting last_name to title case could then be used, because the transformation refers to the semantic identity last_name that is common to both physical identities, LST_NM and NAME_LAST.
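  • As an illustrative sketch of why the transformation survives a change of physical identity, assume that rules are stored keyed by semantic identity and that the physical-to-semantic binding is looked up separately; the names and structures below are assumptions, not the system's actual storage format.
    # Hypothetical rule store: transformation logic is keyed by semantic identity,
    # never by physical identity.
    rules = {"last_name": lambda value: value.title()}

    # Two different projects bind different physical identities to last_name.
    bindings_project_a = {"LST_NM": "last_name"}
    bindings_project_b = {"NAME_LAST": "last_name"}

    def transform(record, bindings):
        out = {}
        for phys, value in record.items():
            sem = bindings.get(phys)
            rule = rules.get(sem)
            out[phys] = rule(value) if rule else value  # default: pass through
        return out

    print(transform({"LST_NM": "SMITH"}, bindings_project_a))      # {'LST_NM': 'Smith'}
    print(transform({"NAME_LAST": "O'HARA"}, bindings_project_b))  # {'NAME_LAST': "O'Hara"}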
  • As a more complete example, suppose the VSAM file read by the example application has the following physical description:
  • TABLE 1
    VSAM Metadata
    Physical Identity      Physical Datatype
    ACC-NO                 PIC X(20)
    TRANS-TYPE             PIC X(1)
    TRANS-AMT              PIC 9(12)V9(2)
    LAST-NAME              PIC X(20)
    FIRST-NAME             PIC X(20)
    . . .                  . . .
  • Suppose further that the RDBMS table written by the example application has the following physical description:
  • TABLE 2
    RDBMS Metadata
    Physical Identity Physical Datatype
    accId VARCHAR(32)
    accBal NUMERIC(10, 2)
    lastName VARCHAR(32)
    firstName VARCHAR(32)
    . . . . . .
  • Outside of the context of the application project, a user would use the semantic maintenance tool to import the physical identities specified in Tables 1 and 2, in order to rationalize these physical identities to semantic identities, as described above (if the repository already contains semantic records corresponding to these two data tables, then it would not be necessary to import these physical identities again; for present purposes we assume that they are being imported for the first time). At this point, for each of these physical identities, the semantic maintenance tool will suggest corresponding semantic identities. The user can affirm or override these suggestions.
  • When this process is completed, the result is a mapping of (physical identity, semantic identity) pairs. Suppose, for the purposes of the present example, that this mapping is specified as follows:
  • TABLE 3
    Mapping from Physical to Semantic Identities
    Physical Identity Semantic Identity
    ACC-NO account_number
    accId account_number
    accBal account_balance
    TRANS-TYPE transaction_type
    TRANS-AMT transaction_amount
    LAST-NAME last_name
    lastName last_name
    FIRST-NAME first_name
    firstName first_name
    . . . . . .
  • At this point, the user may associate the title-case rule (as described above) with the semantic identity last_name. This rule, along with any other rules created by the user and associated with semantic identities, are stored in the repository.
  • The user may now create semantic records corresponding to both the VSAM file and the RDBMS data sources, within the context of a specific project. These semantic records combine the physical metadata contained in Tables 1 and 2 with the semantic bindings in Table 3. For example, the semantic record SR1, corresponding to the VSAM file, would contain the following:
  • TABLE 4
    Semantic Record for VSAM file (SR1)
    Phys. Ident.     Phys. Datatype     Semantic Ident.       Semantic Datatype
    ACC-NO           PIC X(20)          account_number        string
    TRANS-TYPE       PIC X(1)           transaction_type      string
    TRANS-AMT        PIC 9(12)V9(2)     transaction_amount    number
    LAST-NAME        PIC X(20)          last_name             string
    FIRST-NAME       PIC X(20)          first_name            string
    . . .            . . .              . . .                 . . .

    And the semantic record SR2, corresponding to the RDBMS table, would contain the following:
  • TABLE 5
    Semantic Record for RDBMS table (SR2)
    Phys. Ident. Phys. Datatype Semantic Ident. Semantic Datatype
    accId VARCHAR(32) account_number string
    accBal NUMERIC(10, 2) account_balance number
    lastName VARCHAR(32) last_name string
    firstName VARCHAR(32) first_name string
    . . . . . . . . . . . .
  • These semantic records are saved in the repository as part of the project corresponding to the sample application. In the same project, a user would use the development tool to create a visual drawing for the application that references these semantic records. To configure the Read-Data function, the user would specify metadata that identifies the location of the VSAM file from which the data must be read, and associate the previously-defined semantic record SR1 with the function as the function's output semantic record.
  • To configure the Transform-Data function, the developer would first connect the output of the Read-Data function to the input of the Transform-Data function, preferably via the graphical development tool, which represents functions and the connections between them using a graphical representation. Next, the developer would configure the output of the Transform-Data function to include the semantic identities listed in the semantic record SR2. When specifying semantic identities in the output of Transform-Data, the user will be presented with a menu of rules stored in the repository that operate on those identities (allowing the user to select only valid, predefined rules). In this case, suppose that when the user specifies last_name, the user selects the title-case rule (as defined above) from the rules menu.
  • Finally, the developer would connect the output of Transform-Data to the input of the Write-Data function, and specify the location of the RDBMS table to which the data must be written. As detailed above, the task of connecting two functions can be performed visually, using the graphical development tool. Throughout the process of configuring the application rules, the development tool never reveals the physical identities or datatypes of the source and target data to the user; this information is encapsulated in the semantic records SR1 and SR2, which are opaque to the application developer.
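  • The configuration steps above might be captured, purely as an illustrative sketch, in a small declarative structure; the field names, URIs, and helper class are hypothetical stand-ins for the artifacts the graphical development tool would actually produce.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class FunctionConfig:
        name: str
        input_record: Optional[str] = None    # name of a semantic record, e.g. "SR1"
        output_record: Optional[str] = None
        rules: dict = field(default_factory=dict)  # output semantic identity -> rule name
        location: Optional[str] = None        # URI of the physical source or target

    # Hypothetical drawing for the sample application: only semantic names appear;
    # physical identities stay hidden inside SR1 and SR2.
    drawing = [
        FunctionConfig("Read-Data", output_record="SR1",
                       location="vsam://mainframe/ACCOUNTS"),
        FunctionConfig("Transform-Data", input_record="SR1", output_record="SR2",
                       rules={"last_name": "title_case"}),
        FunctionConfig("Write-Data", input_record="SR2",
                       location="rdbms://warehouse/ACC_INFO"),
    ]

    connections = [("Read-Data", "Transform-Data"), ("Transform-Data", "Write-Data")]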
  • V. Output-Oriented Rules
  • FIG. 7 is a relationship diagram (as described above) that depicts an output-oriented rule definition, according to a preferred embodiment. The rule [701] contains the logic and instructions to perform a calculation. Output data [604] is the actual data value that the rule will calculate and produce when it runs. An output semantic identity [501.2] is a semantic identity [501] from the semantic model (FIG. 5) that identifies the output data.
  • Depending on the type of rule, the rule may have input, which is characterized as follows: input data [603], which comprises one or more actual input data values that the rule will consume when it runs; and an input semantic identity [501.1], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data that will be input to the rule (an input parameter).
  • The rule [701] is defined to calculate a value for an output data field [604] with a given output semantic identity [501.2]. All input data [603] required by the rule is identified using semantic identities [501].
  • There may be an arbitrary number of rules associated with a given semantic identity. Using the semantic maintenance tool [204] (FIG. 2), these rules can be developed independently from the application or function, tested (described in more detail below), stored and indexed by semantic identity in the repository [201] (FIG. 2), and then used in the transformation logic of a function.
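  • One way to picture this indexing of rules by semantic identity is a simple lookup table, sketched below; the storage layer, function names, and record shape are assumptions made only for illustration.
    # Hypothetical repository index: output semantic identity -> list of rule records.
    rule_repository = {}

    def register_rule(output_identity, name, logic, input_identities=()):
        # Rules are developed and tested independently of any function, then
        # stored and indexed by the semantic identity of the value they produce.
        rule_repository.setdefault(output_identity, []).append({
            "name": name,
            "inputs": tuple(input_identities),
            "logic": logic,
        })

    def rules_for(output_identity):
        # The development tool can present this list as the menu of valid,
        # predefined rules for a given output semantic identity.
        return rule_repository.get(output_identity, [])

    register_rule("last_name", "title_case", lambda last_name: last_name.title(),
                  input_identities=("last_name",))
    print([r["name"] for r in rules_for("last_name")])  # ['title_case']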
  • Traditional data integration processes and systems lack the ability to semantically reconcile fields in different systems that are being integrated. In such processes hundreds, if not thousands, of business rules are documented for the purpose of mapping fields in source systems to the appropriate fields in target systems. In the present data integration system, because application functions can automatically correlate input and output data semantically, the system does not require a process to capture or implement data mapping rules as in traditional systems. These differences are explained in more detail in the examples below.
  • However, rules that perform some operation other than a direct move between input and output are still needed. The semantic data integration system optimizes the definition and employment of rules by semantically orienting them explicitly to output calculation as described.
  • Recalling the example presented in discussion of FIG. 6, the transformation logic associated with converting a last name to title case could be captured as a reusable output-oriented rule for the last_name semantic identity. This rule could be used in the sample application, as well as other applications in the same or different projects.
  • FIG. 8 is a relationship diagram (as described above) that extends FIG. 6 with the rule concept of FIG. 7 to depict the preferred embodiment of output-oriented rule employment within a function [601].
  • Transformation logic is configured for the function to perform various transformations or manipulations on the input data [603] in order to produce the correct output data [604]. Rules refer to input data and output data using semantic identities [501.1] and [501.2] respectively.
  • As described above, an example of a rule used in a function may be something trivial such as changing the case of last name. It may also be used for something more complex such as calculating a weighted account average balance. Preferably, rules are specified using a standard programming language that has been extended to include primitives that operate on typical database fields.
  • Using our sample application [101] as an example, when the Transform-Data [103] function is being configured by a developer, the calculations for its individual output fields are defined. The predefined output-oriented rule for title-casing last name may be referenced and used to define the calculation for that field in the function. In an alternative embodiment, a new output-oriented rule may be defined at the same time that the transformation logic for Transform-Data is being configured. In that case the title-casing rule might not already exist, and the developer might add it and save it to the repository for general use.
  • As further examples of output-oriented rules, consider the following:
  • TABLE 6
    Example Output-Oriented Rules
    Target                         Rule
    master_account_type_code      master_account_type_code
    account_type_code             account_type_code
    account_start_date            datetime.moment(account_open_date, "C")
    account_expiration_date       account_expiration_date
    account_ever_activated_code   if is.empty(account_date_first_active) then "N" else "Y"
    next_account_number           account_number + 1
  • These rules are written in an untyped programming language, and type-conversions are performed by the system as necessary. Because the system performs type-conversions automatically, the application developer does not need to know the semantic datatypes of the semantic identities used in a rule. For example, the semantic identities account_number and next_account_number might have semantic datatypes of string, and would therefore be represented internally as sequences of characters. However, a developer might treat account_number as an integer, as illustrated in Table 6, where next_account_number is defined as account_number+1. In this case, the system will recognize that “+” is an operator that applies to integers, convert account_number to an integer and perform the requested calculation. Finally, it will convert the result to a string, since the semantic datatype of next_account_number is string.
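  • A minimal sketch of this kind of coercion follows, assuming semantic datatypes are tracked alongside values; the conversion logic shown is illustrative and far simpler than what a production engine would perform.
    # Hypothetical semantic datatypes for two identities.
    semantic_datatypes = {"account_number": str, "next_account_number": str}

    def add_one(account_number):
        # The rule author writes "account_number + 1" without caring that the
        # semantic datatype is string; the engine coerces before and after.
        as_int = int(account_number)        # string -> integer for the "+" operator
        result = as_int + 1
        return semantic_datatypes["next_account_number"](result)  # back to string

    print(add_one("123"))   # '124'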
  • It is not necessary to include rules in the rules repository that merely pass the value of a semantic identity from input to output without applying a transformation (e.g., the rule for master_account_type_code in Table 6, above). This “pass-through” operation is the default behavior for semantic identities and will be applied if no rule is specified. Thus, although no rule for account_number is specified above, any function that receives account_number as part of its input semantic record will pass the received value through to its output semantic record.
  • By contrast, traditional integration systems require the source and target locations, table names, and datatypes used in a rule to be stored with the rule logic. Using the traditional approach, the rules described in Table 6 might be represented as follows:
  • TABLE 7
    Traditional Representation of Transformation Rules
    Target Table   Target Column    Target Type   Rule                                       Source Table   Source Column   Source Type
    ACC_INFO       MST_ACC_TYPE     INTEGER       MST_ACC_TYPE                               ACCT_INF       MST_ACC_TYPE    INTEGER
    ACC_INFO       ACC_TYPE_CD      CHAR(1)       ACC_CD                                     ACCT_INF       ACC_CD          CHAR(1)
    ACC_INFO       ACC_ST_DATE      DATE          time(ACC_OP, "C")                          ACCT_INF       ACC_OP          DATE
    ACC_INFO       ACC_EXP_DATE     DATE          ACC_EXP_DT                                 ACCT_INF       ACC_EXP_DT      DATE
    ACC_INFO       ACCT_EVER_ACT    CHAR(1)       if is.empty(ACC_ACT) then "N" else "Y"     ACCT_INF       ACC_ACT         CHAR(1)
    ACC_INFO       NXT_ACC_NO       CHAR(20)      tochar(toint(ACC_NO) + 1)                  ACCT_INF       ACC_NO          CHAR(20)
    ACC_INFO       ACC_NO           CHAR(20)      ACCT_NUM                                   ACCT_INF       ACCT_NUM        CHAR(20)
    ACC_INFO       ACC_NO           CHAR(20)      tochar(ACCT_NO)                            AC_DATA        ACCT_NO         INTEGER

    Specifying rules in the traditional way (as illustrated in Table 7) requires that the developer know not only the physical locations and names of the business objects being referenced, but their internal data format as well. In such a system, the developer would be forced to perform type conversions explicitly: for example, to add “1” to ACC_NO, the rule “tochar(toint(ACC_NO)+1)” might be used (as opposed to the untyped rule definition account_number+1, as used above).
  • Also, it is necessary to specify pass-through rules using the traditional approach: for example, two rules are defined for ACC_NO in Table 7, both of which read source data from different physical sources whose data are stored using different formats. As explained above, output-oriented rules do not require pass-through rules to be specified, because the physical data sources and datatypes are included as part of a semantic identity.
  • VI. Synthetic Debugging
  • FIG. 9 is a relationship diagram (as described above) that extends FIG. 6 to depict the preferred embodiment of function-level synthetic debugging and testing for semantic data integration. A function [601] and its transformation logic [602] (as described above) may be tested using test data [901]. Test data for each input semantic identity [501.1] may come from a variety of sources, including: derived test data [902], which is automatically derived from the input semantic record by a generator function [905]; specified test data [903], which is manually specified by the user [906]; and existing test data [904], which is retrieved from the repository [201].
  • The system preserves data security by not exposing actual business data values within the development tool or while a data integration application is being executed by the engine. In order to debug and test applications, synthetic debugging and testing is employed at the function level. The ability to provide synthetic test data also allows for offline development in situations when the actual data sources might not be available.
  • After initiating a debugging exercise from within the development tool, the user will assign test data [901] for each input semantic identity [501.1]. Test data values can come from multiple sources. A test data generator function [905] can use information derived from the input semantic record [305.1] to synthetically generate test values [902], the user [906] may manually specify the test data values [903], or existing test data values [904] that are cataloged in the repository [201] by semantic identity may be used. The user may choose to store test data values for each semantic identity back to the repository for future debugging and testing. Once test data has been assigned, the user can test the function with these test values. In this test one iteration of the function will run using the input test data to produce the function's output data [604] which can then be displayed by the development tool and validated by the user.
  • Using the sample application [101] as an example, a developer may want to synthetically debug the Transform-Data function [103], in particular the logic described above that changes the case of last_name. The developer may first try to re-use existing test data [904] from the repository. If no test data for last_name is found, the developer may try to generate test data [902]. Using the metadata from the input semantic record, the test data generator [905] may generate test data [902] that looks like this: ‘AaBbCcDdEeFfGgHhIiJj’. Upon testing this data with the function, the output correctly produces ‘Aabbccddeeffgghhiijj’. In order to further validate the function, the developer specifies his own test data [903]: ‘sT. jOHn’. Upon testing this data with the function, the output correctly produces ‘St. John’.
  • The developer saves this new test data to the repository so that it may be re-used the next time a developer needs test data for last_name. This is accomplished by, e.g., associating the new test data value with the associated semantic identity in a relational database table.
  • The ability to enter custom test data values allows the developer to ensure that a function responds appropriately to certain problematic input values that might not be generated by the random test data generator (e.g. integers that include random letters, negative account numbers, etc.). These custom values are associated with a semantic identity (e.g., last_name), so once entered, they can automatically be reused as test data for any function, in any project, that uses the same semantic identity.
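  • As a sketch of how a generator like the one described above might derive the value ‘AaBbCcDdEeFfGgHhIiJj’ from semantic record metadata, consider the following; the metadata shape (a semantic datatype plus a field length) is an assumption, and real generators would cover many more datatypes.
    import string

    def generate_test_value(semantic_datatype, length):
        # Derive a synthetic value from semantic record metadata: for a string
        # field of length 20 this yields 'AaBbCcDdEeFfGgHhIiJj', matching the
        # example in the text; other datatypes would have their own generators.
        if semantic_datatype == "string":
            letters = string.ascii_uppercase
            out = []
            for i in range(length):
                ch = letters[(i // 2) % 26]
                out.append(ch if i % 2 == 0 else ch.lower())
            return "".join(out)
        raise NotImplementedError(semantic_datatype)

    print(generate_test_value("string", 20))   # AaBbCcDdEeFfGgHhIiJj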
  • VII. Enterprise Maintenance
  • FIG. 10 combines a relationship diagram (as described above) and UML use-case diagram to depict the high-level separation of semantic data integration user activities, according to a preferred embodiment. Activities performed by system users can be classified either as enterprise maintenance or as application development.
  • Enterprise maintenance [1001] is performed with the semantic maintenance tool [204] and the project maintenance tool [205], and has two basic subcategories. Semantic maintenance deals with the maintenance of the semantic model [202] including semantic identities [501] and physical identities [502] (FIG. 5), and output-oriented semantic rules [701] (FIG. 7). Project maintenance is concerned with the maintenance of the project state and architecture-level objects such as semantic records [305] (FIG. 5) and data access definition [309] (FIG. 3) which may be defined at the project level or across the enterprise when reusability is possible.
  • Application development [1002] is performed with the development tool [206] (FIG. 2) and is concerned with the development of data integration drawings [303] within or across projects. Application development involves many of the objects that fall within the purview of enterprise maintenance, such as semantic identities and output-oriented rules. However, physical identities are never referenced in the context of application development.
  • As a result of this enforced separation between application development and physical identities, the application developer does not require knowledge of physical identities [502] or physical data locations [1003]. Applications may be developed independent of the physical data sources that they will integrate, providing a level of insulation from physical data sources whose location, connectivity, structure, and metadata may be unstable.
  • Recall the example first provided in the discussion for FIG. 3. In that example the project manager and data architect were performing enterprise maintenance [1001] activities as described above, including project maintenance. Additionally, a data steward would perform semantic maintenance, such that many of the semantic bindings [503] needed for the semantic records [305] created during project maintenance would already exist. When creating the semantic records that describe the VSAM file [105], the data architect may discover that a semantic binding does not yet exist between the last_name semantic identity and the LST_NM field in the VSAM file. This binding could have been defined by a data steward during regular semantic maintenance activities. But if the binding does not exist, the data architect can also create that binding. Once all of the necessary semantic bindings exist, the data architect can complete the task of creating the semantic record for the VSAM file. Once the semantic record is complete and exists as a project artifact, the developer can be told to use that record.
  • The developer would then use that semantic record when creating the drawing [303] that describes the actual application [101]. At no point does the developer need to know anything about the physical nature of the VSAM file structure including the physical identities of its data. The developer can work strictly with semantic identities to define the data integration application.
  • VIII. Data Integration Engine
  • FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when the example application [101] is executed on a single host, according to a preferred embodiment. A control flow relationship diagram is a hybrid UML class/activity diagram that conveys the directionality of communication or contact between objects or components. Solid arrows indicate synchronous communication or contact and broken arrows indicate asynchronous communication or contact.
  • The data integration engine [207] has a single parent process [1102] which is a top-level operating system process whose task is to execute a data integration application defined by artifacts [302] within a specific project [203.1] (FIG. 3). The data integration engine uses these artifacts (e.g., the application drawing) to set up, initialize, and perform the data integration.
  • Distributed shared memory [1101] is a structured area of operating system memory that is used by the parent process [1102] and the child processes [1103.1, 1103.2, 1103.3] running on that host. Each of these child processes is responsible for performing a single function within the application. In the sample application [101], child process A [1103.1] executes the Read-Data function [102], child process B [1103.2] executes the Transform-Data function [103], and child process C [1103.3] executes the Write-Data function [104]. Worker threads [1104.1-1104.9] subdivide the processing for each child process.
  • When the parent process [1102] starts, it analyzes the application drawing [303] (FIG. 2) and related metadata in other project artifacts to initialize and run the application. The parent process creates and initializes a shared properties file (not shown) with control flow characteristics for the child processes and threads. The parent process also creates and initializes the distributed shared memory [1101], a section of which is specifically created for and assigned to each child process and thread. Each child process writes information about its execution status to its assigned portion of the distributed shared memory, and this is used by the parent process [1102] to provide updates about the execution status of the data integration engine.
  • After initialization, the parent process will create each child process [1103.1, 1103.2, 1103.3], synchronously or asynchronously depending on the nature of the function. One child process is created for each function in the drawing (e.g. Read-Data, Write-Data, and Transform-Data, in the example application [101]). When possible, the engine runs each function in parallel so that one function does not need to complete in order for the next to begin.
  • Upon creation, each child process will read characteristics relevant to its execution from the shared properties file. These characteristics include information about how many threads should be running simultaneously to maximize parallelism. For example, if the shared properties file indicates that there are three different physical data sources for the data read by child process A [1103.1], then child process A will spawn three worker threads [1104.1, 1104.2, 1104.3], each of which loads its data from a different source.
  • Continuing this example, because child process A is reading data from three sources using three different threads, it has three outputs. So, child process B, which transforms the data read by child process A, has three sources of input. Child process B accordingly spawns three worker threads [1104.4, 1104.5, 1104.6], each thread reading the data output by one of the worker threads spawned by child process A. Finally, child process C, which writes the output of child process B to the specified target, spawns three threads, each of which corresponds to a thread spawned by child process B. This thread system allows the data integration engine to take advantage of the parallelism made possible by multiple data sources.
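  • As an illustrative sketch of the thread fan-out described above, assume the shared properties file has been parsed into a dictionary listing one URI per parallel path; Python threads stand in here for the engine's own process and thread management, and the reader body is a placeholder.
    import threading

    def read_source(uri, results, index):
        # Placeholder for the real reader: each worker thread loads data from one URI.
        results[index] = f"data read from {uri}"

    def run_read_function(properties):
        # One worker thread is spawned per physical data source listed for the function.
        uris = properties["Read-Data"]["uris"]
        results = [None] * len(uris)
        threads = [
            threading.Thread(target=read_source, args=(uri, results, i))
            for i, uri in enumerate(uris)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    shared_properties = {
        "Read-Data": {"uris": ["vsam://host1/A", "vsam://host2/B", "vsam://host3/C"]}
    }
    print(run_read_function(shared_properties))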
  • When a function involves reading from or writing to a data source, the data integration engine examines the application drawing to determine the type of the data source involved. Based on the type of the data source, the appropriate interface methods are selected, and the data is read or written accordingly.
  • Control flow is asynchronous and non-locking between parent process, child processes, and threads. This is achieved by combining an update-then-signal asynchronous protocol for all communication (except for communication between threads, described above) and signaler-exclusive distributed shared memory segmentation. Under the update-then-signal protocol, when a parent or child process needs to communicate with its child process or thread, respectively, it may update the distributed shared memory of the child and then asynchronously signal the child. When the child handles the signal, it will read its updated distributed shared memory (if necessary) and react. Communication in the other direction is the same. When a thread or child process needs to communicate with its parent (child process or parent process, respectively), it may first update its distributed shared memory and then asynchronously signal the parent. When the parent handles the signal, it will read the updated distributed shared memory (if necessary) and react. The distributed shared memory areas used for communication are exclusively written by the signaler, ensuring that two processes never attempt to access the same memory simultaneously.
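  • A rough sketch of the update-then-signal idea follows, using an event in place of operating-system signals and a per-writer segment so that only the signaler ever writes its own area; this is a simplification of the mechanism described, intended only to show the ordering of the update and the signal.
    import threading

    class SharedSegment:
        # Each communicating party writes only its own segment (signaler-exclusive),
        # so reader and writer never contend for the same area of memory.
        def __init__(self):
            self.status = {}
            self.updated = threading.Event()

        def update_then_signal(self, **fields):
            self.status.update(fields)    # 1. update the shared memory
            self.updated.set()            # 2. then signal the other side

        def handle_signal(self):
            self.updated.wait()
            self.updated.clear()
            return dict(self.status)      # reader reacts to the updated contents

    child_segment = SharedSegment()

    def child_work():
        child_segment.update_then_signal(rows_processed=1000, state="running")

    t = threading.Thread(target=child_work)
    t.start()
    print(child_segment.handle_signal())  # parent reads the child's segment
    t.join()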
  • FIG. 12 is a data flow relationship diagram that extends FIG. 11 to depict the flow of data within the data integration engine [207] when the example application is executed on a single host. A data flow relationship diagram is an extension of the control flow relationship diagram (as described above) whose purpose is to convey the directionality and flow of data between objects or components. Relevant, previously described, control flow may be shown as muted or grayed, while the objects pertinent to the data flow within that control flow will be prominent or black. The data flow is captured with an arrow indicating the source of the data (no arrow pointer) and the target of the data (arrow pointer) that is optionally labeled with the resource responsible for the data flow.
  • The only additional annotations in FIG. 12 are channels [1201.1-1201.6] which are resources that are used for passing data from a worker thread for one function to a worker thread for another. Recall from above that the role of child process A [1103.1] is to read data (see the Read-Data function [102] in FIG. 1), the role of child process B [1103.2] is to produce new data by applying transformation logic to that data (see the Transform-Data function [103] in FIG. 1), and the role of child process C [1103.3] is to write the data produced by child process B (see the Write-Data function [104] in FIG. 1). In this model, data flows through a dedicated channel from a worker thread spawned by one child process to a worker thread spawned by another child process. Channels are implemented directly or indirectly through any means of interprocess communication, e.g. named pipes, sockets, riiop, rpc, soap, oob, and mpi.
  • As described above, each child process subdivides its work using parallel worker threads. In addition, because the characteristics of each function in this example allow for simultaneous processing, child process B [1103.2] does not wait for child process A [1103.1] to read all of the data before it begins; it can start transforming data received from child process A as soon as child process A outputs any data. Similarly, child process C [1103.3] does not wait for child process B to transform all of the data before it begins; it can start writing data received from child process B as soon as child process B outputs any data.
  • In the application drawing, each semantic record is associated with a list of Uniform Resource Identifiers (URIs) that point to the relevant data. These URIs might point to redundant copies of identical data or to data sources containing different data, but all of the indicated data sources must conform to the semantic record format specified in the drawing. Generally, each URI in the list will be unique, allowing the engine to leverage parallelism by reading data simultaneously from several different locations. However, this is not a requirement, and if desired, two or more identical URIs can be listed.
  • Channel data flow in the sample application is structured as follows: each worker thread on child process A will read data in parallel from a data source specified by one of the listed URIs. As each thread [1104.1, 1104.2, 1104.3] spawned by child process A reads data, it makes that data available as output from child process A to be used as input for child process B [1103.2] by moving the data through a dedicated channel. In this example, channel A1B1 [1201.1] is a resource that is defined to pass data from thread A1 [1104.1] on child process A to thread B1 [1104.4] on child process B, channel A2B2 [1201.2] passes data from thread A2 [1104.2] to thread B2 [1104.5], and channel A3B3 [1201.3] passes data from thread A3 [1104.3] to thread B3 [1104.6].
  • When each worker thread is spawned, it receives information that can be used to identify an input channel and an output channel, using a predetermined channel identification scheme. The thread connects to both of these channels, reads data from the input channel, and writes output data to the output channel. Thus, each thread connects on startup to the appropriate data channels.
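  • A sketch of one possible predetermined channel identification scheme appears below, assuming channels are named after the two threads they join (as in channel A1B1 above) and implemented here with in-process queues rather than the named pipes or sockets an engine would actually use between processes.
    import queue

    channels = {}

    def channel(name):
        # Channels such as "A1B1" are created on first use; both endpoints derive
        # the same name from their function letter and thread number.
        return channels.setdefault(name, queue.Queue())

    def channel_name(producer_function, producer_thread, consumer_function, consumer_thread):
        return f"{producer_function}{producer_thread}{consumer_function}{consumer_thread}"

    # Thread A1 writes to its output channel; thread B1 reads the same channel as input.
    out_channel = channel(channel_name("A", 1, "B", 1))
    out_channel.put({"last_name": "SMITH"})

    in_channel = channel(channel_name("A", 1, "B", 1))
    print(in_channel.get())   # {'last_name': 'SMITH'}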
  • FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when the sample application is executed in a distributed environment comprising three hosts [1301.1, 1301.2, 1301.3], according to a preferred embodiment. UML collaboration diagrams are used to convey the order of messages passed between object or components. In the collaboration diagrams used here, existing control flow relationships may also be depicted in gray in order to preserve useful context.
  • The primary difference between this scenario and that depicted in FIG. 11 is that processing will be distributed across 3 hosts [1301.1, 1301.2, 1301.3] in a networked environment. In particular, the first function is specified to run on host A [1301.1], the second function is specified to run on host B [1301.2], and the third function is specified to run on host C [1301.3].
  • The application is started by executing the data integration engine [207.1] on host A, the “master” host. During the initial setup of the integration engine, the master host reads the application drawing to determine which application functions will be executed on the master host. Each function is associated with a list of URIs, each of which represents a host on which the function can be executed. For each function, the application developer selects one of the listed hosts, the selection is recorded in the application drawing, and the corresponding host is used by the data integration engine to execute the function. If no host is specified, the function will execute by default on the same host as the previous function, if possible.
  • Binary data is passed between hosts using any standardized protocol and byte-order. Preferably, network byte-order is used to transfer binary data between hosts and to temporarily store data on execution hosts. When an operation must be performed that operates on data in machine-native format, the data is automatically converted to machine-native byte-order for the operation, and converted back to the standardized byte-order (e.g., network byte-order) afterwards.
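  • A small sketch of the byte-order handling described here, using Python's struct module; the record layout (a single 4-byte integer) is hypothetical and stands in for whatever binary record format the engine transfers between hosts.
    import struct

    def to_network(value):
        # Pack a machine-native integer into network byte-order (big-endian) for
        # transfer between hosts or temporary storage on an execution host.
        return struct.pack("!i", value)

    def to_native(payload):
        # Convert back to machine-native form just before an operation needs it.
        return struct.unpack("!i", payload)[0]

    wire = to_network(1234)
    print(wire.hex())        # 000004d2
    print(to_native(wire))   # 1234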
  • In the particular case of the example application [101], the host A parent process [1102.1] determines that only the Read-Data function will run as a child process on host A. The host A parent process creates a full structure for distributed shared memory [1101.1] but only the sections relevant to child processes that need to run on host A will be initialized, in this case the single child process for the first function. A child process [1103.1] for the Read-Data function is then started on host A in the manner described above. Note that the host A parent process [1102.1], distributed shared memory [1101.1], and child process A [1103.1] are analogous to the parent process [1102], distributed shared memory [1101], and child process A [1103.1] described above in FIG. 11 and FIG. 12.
  • The host A parent process then starts a new engine parent process [1102.2] on host B, passing input indicating that functions already reserved for host A should be ignored. During initial analysis of the input application, the host B parent process ignores the Read-Data function since it is marked for host A and determines that only the Transform-Data function should run as a child process on host B. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.
  • The host B parent process creates a full structure for distributed shared memory [1101.2] but only the sections relevant to child processes that need to run on host B will be initialized (in this case, the child process for the Transform-Data function). A child process for the Transform-Data function is then started on host B in the manner described above. Note that the host B parent process [1102.2], distributed shared memory [1101.2], and child process B [1103.2] are analogous to the parent process [1102], distributed shared memory [1101], and child process B [1103.2] described above in FIG. 11 and FIG. 12.
  • The host B parent process then starts a new engine parent process [1102.3] on host C, passing input indicating that functions already reserved for hosts A and B should be ignored. During initial analysis of the input application, the host C parent process ignores the Read-Data and Transform-Data functions because they have been reserved for the other hosts, and determines that only the Write-Data function should run as a child process on host C. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.
  • The host C parent process creates a full structure for distributed shared memory [1101.3] but only the sections relevant to child processes that need to run on host C will be initialized (in this case, the child process for the Write-Data function). A child process for the Write-Data function is then started on host C in the manner described above. Note that the host C parent process [1102.3], distributed shared memory [1101.3], and child process C [1103.3] are analogous to the parent process [1102], distributed shared memory [1101], and child process C [1103.3] described above in FIG. 11 and FIG. 12.
  • Because there are no more functions to be allocated at this point, the distributed startup sequence is complete.
  • FIG. 14 is a modified UML collaboration diagram (as described above) that extends FIG. 13 to illustrate the process of distributed shared memory replication when the sample application is executed in a distributed environment comprising three hosts.
  • In a distributed processing scenario, additional control flow is needed to communicate the status of each host. The parent process [1102.1] on the master host A [1301.1] is responsible for directing the entire application across hosts. As a result of this, its distributed shared memory [1101.1] must reflect the state of all child processes on all child hosts. To do this distributed shared memory is partially replicated from host to host.
  • Each child process is responsible for updating its portion of the distributed shared memory structure at regular intervals. In the example application, the replication process begins on host C [1301.3] when its update interval arrives. At this point, the parent process [1102.3] writes [1401] its output [1402]. When the update interval for host B [1301.2] is reached, the host B parent process [1102.2] will read [1403] the output from the host C parent process, and update [1404] its distributed shared memory [1101.2] with the control data it read as output from host C. A cumulative update of control data including control data from host B and host C is then written [1405] as output [1406] from the parent process. When the update interval for host A is reached, the host A parent process [1102.1] will read [1407] the output from the host B parent process (which it started), and update [1408] its own host A distributed shared memory [1101.1] with the control data it read as output from host B.
  • The parent process on the master host periodically reads the contents of the distributed shared memory to obtain information related to each of the child processes. This process occurs at regular intervals, and is timed according to a user-configurable parameter. The information read from the distributed shared memory is used to monitor the progress of the child processes and to provide status updates. Using a distributed shared memory structure to provide status updates allows the child processes to process data in an uninterrupted fashion, without pausing periodically to send status messages. Essentially, this creates a system by which data are transferred from child process to child process “in-band” while status messages and updates are transferred “out-of-band,” separate from the flow of data.
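  • A sketch of the cascading status replication follows, with dictionaries standing in for each host's distributed shared memory and ordinary function calls standing in for the reads and writes between hosts; update intervals and transport are omitted, and the row counts are invented for illustration.
    # Each host keeps its own copy of the distributed shared memory structure.
    dsm_host_a, dsm_host_b, dsm_host_c = {}, {}, {}

    def host_c_update():
        # Host C records control data for its own child process and outputs it.
        dsm_host_c["Write-Data"] = {"state": "running", "rows": 500}
        return dict(dsm_host_c)

    def host_b_update(output_from_c):
        # Host B merges host C's output into its own memory, adds its own control
        # data, and outputs the cumulative update.
        dsm_host_b.update(output_from_c)
        dsm_host_b["Transform-Data"] = {"state": "running", "rows": 800}
        return dict(dsm_host_b)

    def host_a_update(output_from_b):
        # The master host's memory ends up reflecting all child processes on all hosts.
        dsm_host_a.update(output_from_b)
        dsm_host_a["Read-Data"] = {"state": "running", "rows": 1000}
        return dict(dsm_host_a)

    print(host_a_update(host_b_update(host_c_update())))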
  • FIG. 15 is a data flow relationship diagram (as described above) that merges FIG. 12 and FIG. 13 to illustrate the flow of data in the data integration engine when the sample application is run in a distributed environment comprising three hosts. The channel data flow method being employed is identical to the single-host/single-instance method described above (FIG. 12), except that the channels now operate to pass data between threads running on different hosts. The communication channels that pass data between threads across hosts can be implemented using any inter-host communication means, including sockets, riiop, rpc, soap, oob, and mpi.
  • It will be appreciated that the scope of the present invention is not limited to the above-described embodiments, but rather is defined by the appended claims; and that these claims will encompass modifications of and improvements to what has been described.

Claims (6)

1. A method of developing data integration applications utilizing semantic identifiers to represent application data fields and variables, the method comprising:
a. receiving a set of physical data identifiers, each physical data identifier specifying the name of a physical data field, the network location of the physical data field, and a method of accessing the physical data field;
b. storing in a persistent data store a set of semantic names for use in defining data integration applications, wherein each semantic name is associated in the persistent data store with a physical identifier, and wherein each semantic name indicates the meaning of the data contained in the data field specified by said physical identifier;
c. defining a data integration application comprising functional rules to extract, transform, and store data, wherein each functional rule is a sequence of programmatic expressions and wherein each programmatic expression comprises semantic names, which are stored in the persistent data store and represent input values, and functional operators to transform and combine said input values to create output values; and
d. executing said functional rules by replacing each of said semantic names with data from the data field specified by the physical identifier that is associated in the persistent data store with the semantic name;
e. such that a user may define a data integration application utilizing said semantic names and be ignorant of the information contained in the physical identifier.
2. The method according to claim 1, further comprising automatically converting the input values from one datatype to another as required by the functional operators.
3. The method according to claim 1, further comprising:
a. analyzing the received physical data identifiers;
b. using the results of said analysis to provide a set of suggested semantic names for at least some of the physical data identifiers; and
c. responsive to user input, selecting a semantic name from the set of suggested semantic names and associating the selected semantic name with a physical identifier in the persistent data store.
4. A system for developing data integration applications utilizing semantic identifiers to represent application data fields and variables, the system comprising:
a. a persistent data store that is operable to store computer-readable data and to store relations among said data;
b. logic for receiving a set of physical data identifiers, each data identifier specifying the name of a physical data field, the network location of the physical data field, and a method of accessing the physical data field;
c. logic for storing in the persistent data store a set of semantic names for use in defining data integration applications, wherein each semantic name is associated in the persistent data store with a physical identifier, and wherein the semantic name indicates the meaning of the data field specified by said physical identifier;
d. logic for defining a data integration application comprising functional rules to extract, transform, and store data, wherein each functional rule is a sequence of programmatic expressions, each programmatic expression comprising semantic names, which are stored in the persistent data store and represent input values, and functional operators to transform and combine said input values to create output values; and
e. logic for executing said functional rules by replacing each of said semantic names with data from the data field specified by the physical identifier that is associated with the semantic name in the persistent data store;
f. such that a user may define a data integration application utilizing said semantic names and be ignorant of the information contained in the physical identifier.
5. The system according to claim 4, further comprising logic for automatically converting the input values from one datatype to another as required by the functional operators.
6. The system according to claim 4, further comprising:
a. logic for analyzing the received physical data identifiers;
b. logic for using the results of said analysis to provide a set of suggested semantic names for at least some of the physical data identifiers; and
c. responsive to user input, selecting a semantic name from the set of suggested semantic names and associating the selected semantic name with a physical identifier in the persistent data store.
US12/177,712 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables Abandoned US20090282066A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US12/177,712 US20090282066A1 (en) 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables
PCT/US2009/043267 WO2009140154A1 (en) 2008-05-12 2009-05-08 Methods and systems for developing, debugging, and executing data integration applications
CA2723933A CA2723933C (en) 2008-05-12 2009-05-08 Methods and systems for developing, debugging, and executing data integration applications
EP09747239.3A EP2300913A4 (en) 2008-05-12 2009-05-08 Methods and systems for developing, debugging, and executing data integration applications

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5254808P 2008-05-12 2008-05-12
US12/177,712 US20090282066A1 (en) 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables

Publications (1)

Publication Number Publication Date
US20090282066A1 true US20090282066A1 (en) 2009-11-12

Family

ID=41267723

Family Applications (6)

Application Number Title Priority Date Filing Date
US12/177,659 Abandoned US20090282042A1 (en) 2008-05-12 2008-07-22 Method and system for managing the development of data integration projects to facilitate project development and analysis thereof
US12/177,712 Abandoned US20090282066A1 (en) 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables
US12/177,665 Abandoned US20090282058A1 (en) 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable functional rules that are managed according to their output variables
US12/177,708 Active 2031-01-04 US8141029B2 (en) 2008-05-12 2008-07-22 Method and system for executing a data integration application using executable units that operate independently of each other
US12/177,711 Active 2030-12-07 US8112742B2 (en) 2008-05-12 2008-07-22 Method and system for debugging data integration applications with reusable synthetic data values
US13/420,972 Active US8312414B2 (en) 2008-05-12 2012-03-15 Method and system for executing a data integration application using executable units that operate independently of each other

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US12/177,659 Abandoned US20090282042A1 (en) 2008-05-12 2008-07-22 Method and system for managing the development of data integration projects to facilitate project development and analysis thereof

Family Applications After (4)

Application Number Title Priority Date Filing Date
US12/177,665 Abandoned US20090282058A1 (en) 2008-05-12 2008-07-22 Method and system for developing data integration applications with reusable functional rules that are managed according to their output variables
US12/177,708 Active 2031-01-04 US8141029B2 (en) 2008-05-12 2008-07-22 Method and system for executing a data integration application using executable units that operate independently of each other
US12/177,711 Active 2030-12-07 US8112742B2 (en) 2008-05-12 2008-07-22 Method and system for debugging data integration applications with reusable synthetic data values
US13/420,972 Active US8312414B2 (en) 2008-05-12 2012-03-15 Method and system for executing a data integration application using executable units that operate independently of each other

Country Status (4)

Country Link
US (6) US20090282042A1 (en)
EP (1) EP2300913A4 (en)
CA (1) CA2723933C (en)
WO (1) WO2009140154A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110131247A1 (en) * 2009-11-30 2011-06-02 International Business Machines Corporation Semantic Management Of Enterprise Resourses
US20120095973A1 (en) * 2010-10-15 2012-04-19 Expressor Software Method and system for developing data integration applications with reusable semantic types to represent and process application data
CN103473362A (en) * 2013-09-26 2013-12-25 杭州意能软件有限公司 Data migration method and system and receiving port module
US9767222B2 (en) 2013-09-27 2017-09-19 International Business Machines Corporation Information sets for data management
US10223401B2 (en) 2013-08-15 2019-03-05 International Business Machines Corporation Incrementally retrieving data for objects to provide a desired level of detail
US10628437B1 (en) * 2016-03-28 2020-04-21 United Services Automobile Association (Usaa) Systems and methods for traceability and impact analysis across data islands

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060009991A1 (en) * 2004-05-25 2006-01-12 Jun-Jang Jeng Method and apparatus for using meta-rules to support dynamic rule-based business systems
WO2010028395A1 (en) * 2008-09-08 2010-03-11 Credit Suisse Securities (Usa) Llc Apparatuses, methods and systems for providing a virtual development and deployment environment including real and synthetic data
US20120041989A1 (en) * 2010-08-16 2012-02-16 Tata Consultancy Services Limited Generating assessment data
US8819056B2 (en) * 2010-11-19 2014-08-26 International Business Machines Corporation Facilitation of search, list, and retrieval operations on persistent data set using distributed shared memory
US20150154097A1 (en) * 2012-10-05 2015-06-04 Compact Solutions, Llc System and method for automated testing
US9690945B2 (en) 2012-11-14 2017-06-27 International Business Machines Corporation Security analysis using relational abstraction of data structures
US9367433B2 (en) * 2013-02-13 2016-06-14 International Business Machines Corporation Generating input values for a test dataset from a datastore based on semantic annotations
US9928283B2 (en) 2013-04-15 2018-03-27 Validus Solutions, Inc. Tracing data through a transformation process using tracer codes
GB201312776D0 (en) * 2013-07-17 2013-08-28 Qatar Foundation Data integration and cleansing
US9449060B2 (en) 2013-08-06 2016-09-20 International Business Machines Corporation Post-migration validation of ETL jobs and exception management
US9516143B2 (en) * 2013-08-08 2016-12-06 Red Hat, Inc. System and method for assembly and use of integration applications
DE102013110571A1 (en) * 2013-09-24 2015-03-26 Iqser Ip Ag Automatic data harmonization
US9785719B2 (en) * 2014-07-15 2017-10-10 Adobe Systems Incorporated Generating synthetic data
EP2977888B1 (en) * 2014-07-21 2018-11-07 Siemens Aktiengesellschaft Method and system for storage and version management of computer programs
US9483250B2 (en) * 2014-09-15 2016-11-01 International Business Machines Corporation Systems management based on semantic models and low-level runtime state
CN104933098A (en) * 2015-05-28 2015-09-23 浪潮软件集团有限公司 Data cleaning platform design method based on elimination of repeated records
US9680799B2 (en) * 2015-09-21 2017-06-13 Bank Of America Corporation Masking and unmasking data over a network
CN105528424B (en) * 2015-12-07 2019-03-05 普元信息技术股份有限公司 The system and method for data persistence is realized under big data environment
US10467198B2 (en) * 2016-09-15 2019-11-05 Oracle International Corporation Network partition tolerance in a high available centralized VCS implementation
US10621195B2 (en) 2016-09-20 2020-04-14 Microsoft Technology Licensing, Llc Facilitating data transformations
US10621562B2 (en) * 2016-09-23 2020-04-14 Ncr Corporation Secure payment processing within messaging systems
US10706066B2 (en) 2016-10-17 2020-07-07 Microsoft Technology Licensing, Llc Extensible data transformations
US10776380B2 (en) 2016-10-21 2020-09-15 Microsoft Technology Licensing, Llc Efficient transformation program generation
US11170020B2 (en) 2016-11-04 2021-11-09 Microsoft Technology Licensing, Llc Collecting and annotating transformation tools for use in generating transformation programs
US11163788B2 (en) 2016-11-04 2021-11-02 Microsoft Technology Licensing, Llc Generating and ranking transformation programs
US10782960B2 (en) * 2017-06-06 2020-09-22 Ab Initio Technology Llc User interface that integrates plural client portals in plural user interface portions through sharing of one or more log records
CN109840196B (en) * 2017-11-29 2022-08-12 北京京东尚科信息技术有限公司 Method and device for testing business logic
US10783161B2 (en) * 2017-12-15 2020-09-22 International Business Machines Corporation Generating a recommended shaping function to integrate data within a data repository
US11137998B2 (en) * 2018-10-08 2021-10-05 Ebay Inc. Dependency lock in CICD pipelines
US10824541B1 (en) * 2018-10-18 2020-11-03 State Farm Mutual Automobile Insurance Company System and method for test data fabrication
CN109960708A (en) * 2019-03-22 2019-07-02 蔷薇智慧科技有限公司 Data processing method, device, electronic equipment and storage medium
US11100009B2 (en) 2020-01-03 2021-08-24 Bank Of America Corporation Intelligent detection and ejection of unused application components
US11354227B2 (en) 2020-10-12 2022-06-07 Bank Of America Corporation Conducting software testing using dynamically masked data
US11789838B2 (en) * 2022-01-31 2023-10-17 Microstrategy Incorporated Systems and methods for server crash prevention

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5712971A (en) * 1995-12-11 1998-01-27 Ab Initio Software Corporation Methods and systems for reconstructing the state of a computation
US5819021A (en) * 1995-12-11 1998-10-06 Ab Initio Software Corporation Overpartitioning system and method for increasing checkpoints in component-based parallel applications
US5857204A (en) * 1996-07-02 1999-01-05 Ab Initio Software Corporation Restoring the state of a set of files
US5897638A (en) * 1997-06-16 1999-04-27 Ab Initio Software Corporation Parallel virtual file system
US6014670A (en) * 1997-11-07 2000-01-11 Informatica Corporation Apparatus and method for performing data transformations in data warehousing
US6044374A (en) * 1997-11-14 2000-03-28 Informatica Corporation Method and apparatus for sharing metadata between multiple data marts through object references
US6088716A (en) * 1997-04-28 2000-07-11 Ab Initio Software Corporation Method for preventing buffer deadlock in dataflow computations
US6208990B1 (en) * 1998-07-15 2001-03-27 Informatica Corporation Method and architecture for automated optimization of ETL throughput in data warehousing applications
US6266804B1 (en) * 1997-12-23 2001-07-24 Ab Initio Software Corporation Method for analyzing capacity of parallel processing systems
US6301701B1 (en) * 1999-11-10 2001-10-09 Tenfold Corporation Method for computer-assisted testing of software application components
US6339775B1 (en) * 1997-11-07 2002-01-15 Informatica Corporation Apparatus and method for performing data transformations in data warehousing
US6584581B1 (en) * 1999-12-06 2003-06-24 Ab Initio Software Corporation Continuous flow checkpointing data processing
US20050155035A1 (en) * 2004-01-14 2005-07-14 International Business Machines Corp. Information kit objects for publish/subscribe integration architecture
US20050232046A1 (en) * 2003-08-27 2005-10-20 Ascential Software Corporation Location-based real time data integration services
US20050262192A1 (en) * 2003-08-27 2005-11-24 Ascential Software Corporation Service oriented architecture for a transformation function in a data integration platform
US7047232B1 (en) * 1999-01-13 2006-05-16 Ab Initio Software Corporation Parallelizing applications of script-driven tools
US20060179075A1 (en) * 2005-02-07 2006-08-10 Fay Jonathan E Method and system for obfuscating data structures by deterministic natural data substitution
US7164422B1 (en) * 2000-07-28 2007-01-16 Ab Initio Software Corporation Parameterized graphs with conditional components
US7167850B2 (en) * 2002-10-10 2007-01-23 Ab Initio Software Corporation Startup and control of graph-based computation
US20070276667A1 (en) * 2003-06-19 2007-11-29 Atkin Steven E System and Method for Configuring Voice Readers Using Semantic Analysis
US20080016048A1 (en) * 2006-07-12 2008-01-17 Dettinger Richard D Intelligent condition pruning for size minimization of dynamic, just in time tables

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6212556B1 (en) * 1995-11-13 2001-04-03 Webxchange, Inc. Configurable value-added network (VAN) switching
US20050005266A1 (en) * 1997-05-01 2005-01-06 Datig William E. Method of and apparatus for realizing synthetic knowledge processes in devices for useful applications
US6591278B1 (en) * 2000-03-03 2003-07-08 R-Objects, Inc. Project data management system and method
US6968363B1 (en) * 2000-05-09 2005-11-22 Microsoft Corporation File propagation tool
US7546598B2 (en) * 2003-09-03 2009-06-09 Sap Aktiengesellschaft Measuring software system performance using benchmarks
US7610477B2 (en) * 2004-09-15 2009-10-27 Microsoft Corporation Deploying and receiving software over a network susceptible to malicious communication
US7167422B2 (en) * 2004-09-24 2007-01-23 Benq Corporation Optical recording apparatus capable of optimizing write defocus and method thereof
US8719769B2 (en) * 2009-08-18 2014-05-06 Hewlett-Packard Development Company, L.P. Quality-driven ETL design optimization

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5819021A (en) * 1995-12-11 1998-10-06 Ab Initio Software Corporation Overpartitioning system and method for increasing checkpoints in component-based parallel applications
US5712971A (en) * 1995-12-11 1998-01-27 Ab Initio Software Corporation Methods and systems for reconstructing the state of a computation
US5857204A (en) * 1996-07-02 1999-01-05 Ab Initio Software Corporation Restoring the state of a set of files
US6088716A (en) * 1997-04-28 2000-07-11 Ab Initio Software Corporation Method for preventing buffer deadlock in dataflow computations
US5897638A (en) * 1997-06-16 1999-04-27 Ab Initio Software Corporation Parallel virtual file system
US6339775B1 (en) * 1997-11-07 2002-01-15 Informatica Corporation Apparatus and method for performing data transformations in data warehousing
US6014670A (en) * 1997-11-07 2000-01-11 Informatica Corporation Apparatus and method for performing data transformations in data warehousing
US6044374A (en) * 1997-11-14 2000-03-28 Informatica Corporation Method and apparatus for sharing metadata between multiple data marts through object references
US6665862B2 (en) * 1997-12-23 2003-12-16 Ab Initio Software Corporation Method for analyzing capacity of parallel processing systems
US6266804B1 (en) * 1997-12-23 2001-07-24 Ab Initio Software Corporation Method for analyzing capacity of parallel processing systems
US6208990B1 (en) * 1998-07-15 2001-03-27 Informatica Corporation Method and architecture for automated optimization of ETL throughput in data warehousing applications
US7047232B1 (en) * 1999-01-13 2006-05-16 Ab Initio Software Corporation Parallelizing applications of script-driven tools
US6301701B1 (en) * 1999-11-10 2001-10-09 Tenfold Corporation Method for computer-assisted testing of software application components
US6584581B1 (en) * 1999-12-06 2003-06-24 Ab Initio Software Corporation Continuous flow checkpointing data processing
US6654907B2 (en) * 1999-12-06 2003-11-25 Ab Initio Software Corporation Continuous flow compute point based data processing
US7164422B1 (en) * 2000-07-28 2007-01-16 Ab Initio Software Corporation Parameterized graphs with conditional components
US7167850B2 (en) * 2002-10-10 2007-01-23 Ab Initio Software Corporation Startup and control of graph-based computation
US20070276667A1 (en) * 2003-06-19 2007-11-29 Atkin Steven E System and Method for Configuring Voice Readers Using Semantic Analysis
US20050232046A1 (en) * 2003-08-27 2005-10-20 Ascential Software Corporation Location-based real time data integration services
US20050262192A1 (en) * 2003-08-27 2005-11-24 Ascential Software Corporation Service oriented architecture for a transformation function in a data integration platform
US20050155035A1 (en) * 2004-01-14 2005-07-14 International Business Machines Corp. Information kit objects for publish/subscribe integration architecture
US20060179075A1 (en) * 2005-02-07 2006-08-10 Fay Jonathan E Method and system for obfuscating data structures by deterministic natural data substitution
US20080016048A1 (en) * 2006-07-12 2008-01-17 Dettinger Richard D Intelligent condition pruning for size minimization of dynamic, just in time tables

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110131247A1 (en) * 2009-11-30 2011-06-02 International Business Machines Corporation Semantic Management Of Enterprise Resources
US20120095973A1 (en) * 2010-10-15 2012-04-19 Expressor Software Method and system for developing data integration applications with reusable semantic types to represent and process application data
US8954375B2 (en) * 2010-10-15 2015-02-10 Qliktech International Ab Method and system for developing data integration applications with reusable semantic types to represent and process application data
US10223401B2 (en) 2013-08-15 2019-03-05 International Business Machines Corporation Incrementally retrieving data for objects to provide a desired level of detail
US10445310B2 (en) 2013-08-15 2019-10-15 International Business Machines Corporation Utilization of a concept to obtain data of specific interest to a user from one or more data storage locations
US10515069B2 (en) 2013-08-15 2019-12-24 International Business Machines Corporation Utilization of a concept to obtain data of specific interest to a user from one or more data storage locations
US10521416B2 (en) 2013-08-15 2019-12-31 International Business Machines Corporation Incrementally retrieving data for objects to provide a desired level of detail
CN103473362A (en) * 2013-09-26 2013-12-25 杭州意能软件有限公司 Data migration method and system and receiving port module
US9767222B2 (en) 2013-09-27 2017-09-19 International Business Machines Corporation Information sets for data management
US10628437B1 (en) * 2016-03-28 2020-04-21 United Services Automobile Association (Usaa) Systems and methods for traceability and impact analysis across data islands

Also Published As

Publication number Publication date
CA2723933C (en) 2016-08-16
US20090282383A1 (en) 2009-11-12
US8312414B2 (en) 2012-11-13
WO2009140154A1 (en) 2009-11-19
US20090282042A1 (en) 2009-11-12
US20090282058A1 (en) 2009-11-12
CA2723933A1 (en) 2009-11-19
US8141029B2 (en) 2012-03-20
US20120173557A1 (en) 2012-07-05
US20090282392A1 (en) 2009-11-12
EP2300913A4 (en) 2017-01-25
US8112742B2 (en) 2012-02-07
EP2300913A1 (en) 2011-03-30

Similar Documents

Publication Publication Date Title
US8112742B2 (en) Method and system for debugging data integration applications with reusable synthetic data values
US8954375B2 (en) Method and system for developing data integration applications with reusable semantic types to represent and process application data
US20210004383A1 (en) System and method for enabling extract transform and load processes in a business intelligence server
CA2690081C (en) Migration of legacy applications
US8813024B2 (en) System and a method for cross-platform porting of business application and making them contextually-aware on target platforms
US9292306B2 (en) System, multi-tier interface and methods for management of operational structured data
US7322024B2 (en) Generating reusable software assets from distributed artifacts
US20090083268A1 (en) Managing variants of artifacts in a software process
US8495559B2 (en) Extracting platform independent models from composite applications
US20120054147A1 (en) System and method for extract, transform, and load workflow generation
US20030177481A1 (en) Enterprise information unification
US20160170719A1 (en) Software database system and process of building and operating the same
US20050108684A1 (en) Method and system for generating an application object repository from application framework metadata
US20120124550A1 (en) Facilitating database application code translation from a first application language to a second application language
US20120060141A1 (en) Integrated environment for software design and implementation
Buchgeher et al. A platform for the automated provisioning of architecture information for large-scale service-oriented software systems
Tok et al. Microsoft SQL Server 2012 Integration Services
Mitchell et al. SQL Server Integration Services Design Patterns
Talby et al. The design and implementation of a metadata repository
Lee et al. Repository support for metadata-based legacy migration
Narayanan et al. Metadata exploitation in large-scale data migration projects
Alfiadi Teacher's evaluation management system at NPIC
Ballard et al. Data Warehousing with the Informix Dynamic Server
Pleehajinda Database centric software test management framework for test metrics
Alanen et al. Issues on the design of an XML-based configuration management system for model driven engineering

Legal Events

Date Code Title Description
AS Assignment

Owner name: EXPRESSOR SOFTWARE, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RUSSELL, JOHN D;FRECHETTE, STEVEN R;DELLERT, BRIAN R;AND OTHERS;REEL/FRAME:021707/0868

Effective date: 20081014

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: QLIKTECH INTERNATIONAL AB, SWEDEN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:EXPRESSOR SOFTWARE CORPORATION;REEL/FRAME:034507/0665

Effective date: 20121231