METHOD AND SYSTEM FOR AUTOMATICALLY REGENERATING DATA ON-DEMAND
TECHNICAL FIELD
The present invention relates generally to a computer system for generating data and, more particularly, to regenerating data on-demand.
BACKGROUND OF THE INVENTION
Computer systems are often used to generate vast amounts of data. Computer programs often input data and generate output data corresponding to that input data. In complex computer systems, many computer programs may be used to generate data based on data generated from other computer programs. These computer programs for generating the data are referred to as "tools" or "services." A description of such complex computer systems will help illustrate the computational inefficiencies that are often encountered. One such complex computer system may be a management information system ("MIS") for a large organization. The MIS may collect raw data that is generated at various locations throughout the organization. The MIS may have a variety of report generating tools that input subsets of the raw data and may generate reports. A report itself may be stored as a data set and used as input into another report generating tool. Thus, the reports and the raw data combine to form a hierarchy of data sets. The various reports may be accessible to managers at different levels within the organization. For example, a low-level manager may need access to a detailed report relating to a specific location within the organization, whereas a high-level manager may need a high-level report that summarizes the detailed reports of many locations. When a manager requests that the MIS system generate a report, it may be important that the report is up-to-date. However, it may be very computationally expensive to regenerate all intermediate reports that are used to generate the requested report. It would be desirable to have a technique that would ensure that a requested report is up-to-date, but that would avoid the high computational expense of regenerating all the intermediate reports.
Another such complex computer system is a development environment for computer programs. The development environment allows programmers to write, compile, debug, and maintain computer programs. The development environment may use a word processor to generate the source code for the computer program, a parser to generate an intermediate representation of the computer program from the source code, a translator to generate object code from the intermediate representation, an optimizer to generate optimized object code from the object code, and a linker to link optimized object code from different functions into executable code. Large computer programs, such as operating systems, can have thousands of functions which need to be compiled and linked into executable code. The process of compiling and linking such large computer programs can be very computationally intensive. As a result, the compiling and linking of the large computer program can take a very long time and have a significant negative impact on the development of the computer program. For example, if a new executable code is needed because of changes to the source code of the computer program, the source code for each of the functions may need to be parsed, translated, optimized, and linked to generate executable code that is up-to-date. Such generation of executable code may take many hours of computer time, which can significantly slow the development of the computer programs. Some tools for the development system may be able to check the time when an input (e.g., source code) for the tool (e.g., parser) was last written. If none of the inputs used to generate an output (e.g., intermediate representation) has been written since the output was last written, then the tool does not need to regenerate the output because it is already up-to-date. If, however, one of the inputs was written after the output was last written, then the output may be out-of-date and the tool needs to regenerate the output.
Although such tools help to reduce the time needed to generate the executable code, it would be desirable to further reduce that time.
SUMMARY OF THE INVENTION
Embodiments of the present invention provide a replay method and system for monitoring the generating of a data set from input data sets and, when the data set is subsequently accessed, automatically regenerating the data set if the data set is out-of-date. The replay system only regenerates those input data sets that are determined to be out-of-date and only regenerates the output data set if it is determined to be out-of-date. A data set is determined to be out-of-date only when an input data set has actually changed since the data set was last generated.
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1A-1C illustrate the replaying of a session.
Figure 2 illustrates a computer system on which the replay system may be executed.
Figure 3 is a flow diagram illustrating an example implementation of a service.
Figure 4 is a flow diagram illustrating an example implementation of the needs_replay routine of the replay system.
Figure 5 is a flow diagram of an example implementation of the replay routine.
Figure 6 illustrates a program library that the development environment uses to store information about a computer program.
Figure 7 illustrates the data set object and replay object organization for the program library of Figure 6.
Figure 8 is a flow diagram of an example implementation of a routine of a service to translate functions within a module of a program library.
Figure 9 is a flow diagram of an example implementation of a routine to create a translation session.
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention provide a replay method and system for monitoring the generating of a data set from input data sets and, when the data set is subsequently accessed, automatically regenerating the data set if the data set is out-of-date. The replay system only regenerates those input data sets that are determined to be out-of-date and only regenerates the output data set if it is determined to be out-of-date. A data set is determined to be out-of-date only when an input data set has actually changed since the data set was last generated.
The replay system monitors the generating of data sets so that it knows how to regenerate the data set if one of its input data sets has changed. When the replay system monitors the generation of a data set, it records information describing how the data set is generated. The generating of a data set is referred to as a "session." During a session, the replay system records which data sets are input data sets from which the data set is generated and records which service (e.g., computer program) is used to generate the data set from the input data sets. When the data set is initially generated, the service initializes a session and notifies the replay system of the input data sets, of the generating service, of the output data set, and of any arguments passed to the service to control the generation of the output data set. To generate the output data set, the generating program invokes session routines provided by the replay system to record all the information that is necessary to "replay" the session when the data set needs to be updated, that is, when the data set is accessed and it is out-of-date. When the data set is accessed (e.g., as input to the generation of a different data set), the replay system determines whether the accessed data set may be out-of-date. A data set may be out-of-date when any of its input data sets has changed since the session was last replayed or any of its input data sets are themselves out-of-date. To replay the session that generated the accessed data set, the replay system first ensures that all the input data sets of the session are up-to-date. After the replay system ensures that the input data sets are up-to-date, the replay system determines whether any of the input data sets have changed since the session was last replayed. If an input data set has changed, then the replay system replays the session to regenerate the data set to be accessed so that it can be ensured that it is up-to-date. The process of ensuring that an input data set is up-to-date may require the replaying of the session that was used to generate that input data set. Before replaying that session, the replay system ensures that its input data sets are also up-to-date. Thus, the replay system will replay any session whose output data set needs to be updated and is used directly or indirectly to regenerate the accessed data set. In this way, data sets are regenerated on-demand when they are needed and only those data sets that are actually needed as input are regenerated. Also, the replay system collects all the information necessary to regenerate the data sets when they are generated.
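The information recorded during a session can be pictured with a short sketch. The following Python fragment is illustrative only: the class names ReplayObject and Session, the method names note_input and note_output, and the report names are assumptions made for this example and are not taken from the described embodiment.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ReplayObject:
    """Record of one session: what is needed to regenerate its output."""
    service: Optional[str] = None                      # name of the generating service
    args: Dict[str, Any] = field(default_factory=dict)  # arguments passed to the service
    inputs: List[str] = field(default_factory=list)    # names of the input data sets
    outputs: List[str] = field(default_factory=list)   # names of the output data sets


class Session:
    """Hypothetical session interface that a service calls while it runs."""

    def __init__(self, replay_object: ReplayObject, service: str, **args: Any) -> None:
        # Opening a session overwrites whatever an earlier session recorded.
        self.replay_object = replay_object
        replay_object.service = service
        replay_object.args = dict(args)
        replay_object.inputs = []
        replay_object.outputs = []

    def note_input(self, name: str) -> None:
        self.replay_object.inputs.append(name)

    def note_output(self, name: str) -> None:
        self.replay_object.outputs.append(name)


# A hypothetical summary-report service records how it produced its output.
record = ReplayObject()
session = Session(record, "summarize", level="regional")
session.note_input("report_east")
session.note_input("report_west")
session.note_output("report_summary")
```

After the session, the replay object holds the service name, its arguments, and the input and output data sets, which is the information the replay system would later need to invoke the same service again.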
Figures 1A-1C illustrate the replaying of a session. Each of the squares represents an object 101-116. These objects contain either a data set or information describing how to replay a session. The objects 102, 104, 106, 108, 110, 112, 114, and 116 represent replay objects that each correspond to a session and contain information describing how to replay the session. The objects 101, 103, 105, 107, 109, 111, 113, and 115 represent data set objects that contain data sets that are input to and output of the various replay objects. For example, data set object 101 is the output of replay object 102, and data set objects 103 and 109 are inputs of the replay object 102. The data set objects 103 and 109 are also outputs of replay objects 104 and 110, respectively. In this example, each replay object has only one output data set object. In general, however, a replay object can have multiple output data set objects. Each object contains a time stamp. The time stamp in the data set objects indicates the time when the data in the data set object was last updated in a way that actually changed the data of the data set object. In particular, if a data set object is regenerated because an input data set object has changed, but that change had no effect on the content of the output data set object, then the replay system does not change the time stamp of the output data set object. The time stamp in the replay objects indicates the time when the session was last replayed or when the data set was initially generated. When determining whether to replay a session, the replay system, after ensuring that the input data sets are up-to-date, checks the time stamp on each of the input data set objects. If the time stamp of an input data set object is later than the time stamp of the replay object, then the input data set object has changed since the session represented by the replay object was last replayed. Therefore, the output data set object of that replay object may be out-of-date, and the session used to generate that output data set object needs to be replayed. If a session is replayed, but the output data set is not changed, then the time stamp in the output data set object is not updated. Even though an input data set is regenerated, the replay system can look at its time stamp to determine whether the session that it is input to needs to be replayed. Thus, the replay system can avoid replaying sessions whose input data sets have been regenerated but not changed.
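The two time stamp rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names DataSet, store, and session_is_stale are hypothetical, the contents are placeholder strings, and the time stamp of 50 given to data set object 103 is assumed for the example (the time stamps 51, 61, 64, and 81 are those given for Figures 1A and 1B).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSet:
    content: str
    time_stamp: int  # last time the content actually changed


def store(data_set: DataSet, new_content: str, now: int) -> bool:
    """Write regenerated content; advance the time stamp only on a real change."""
    if new_content == data_set.content:
        return False
    data_set.content = new_content
    data_set.time_stamp = now
    return True


def session_is_stale(replay_time_stamp: int, inputs: List[DataSet]) -> bool:
    """A session needs replaying if any input changed after the session last ran."""
    return any(d.time_stamp > replay_time_stamp for d in inputs)


# Replay object 104 (time stamp 51) has input data set object 107 (time stamp 64).
ds107 = DataSet("changed input", 64)
stale_104 = session_is_stale(51, [ds107])

# Replaying that session at time 81 regenerates data set object 103 with the
# same content, so its time stamp (assumed to be 50 here) is left alone.
ds103 = DataSet("same output", 50)
changed_103 = store(ds103, "same output", now=81)

# Replay object 102 (time stamp 61) therefore sees no change in input 103.
stale_102_due_to_103 = session_is_stale(61, [ds103])
```

Comparing content before advancing the time stamp is what allows a downstream session to be skipped even though its input was just regenerated.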
In the example of Figure 1A, the replay object 102 has a time stamp of 61 and its output data set object 101 has a time stamp of 60. When data set object 101 is accessed, the replay system determines whether data set object 101 is up-to-date. If not up-to-date, the replay system replays the session that generated data set object 101, which is represented by replay object 102. To determine whether data set object 101 is up-to-date, the replay system determines whether the data in the input data set objects 103 and 109 are up-to-date. To make this determination, the replay system determines whether any of the input data set objects to the replay objects for the sessions that generated data set objects 103 and 109 are themselves up-to-date. For example, to determine whether data set object 103 is up-to-date, the replay system determines whether any of the input data set objects 105 and 107 of replay object 104 are up-to-date. The replay system continues this determination for all data sets that are input to a data set that is used to generate data set object 101 either directly or indirectly. For example, data set object 103 is used to directly generate data set object 101, and data set object 105 is used to indirectly generate data set object 101. When it is determined that an input data set object is up-to-date and has a time stamp that is greater than the time stamp of the replay object to which it is input, the replay system replays the session of that replay object. For example, replay object 104 has a time stamp of 51 and its input data set object 107 has a time stamp of 64. Thus, input data set object 107 has changed since the session for replay object 104 was last replayed. However, the data set object 107 may not necessarily be itself up-to-date. Thus, the replay system needs to ensure that the data set object is up-to-date before it replays the session for replay object 104.
Figure 1B shows the updated time stamps for the input data set objects 103 and 109. When replay object 104 was replayed, the replay system changed the time stamp of replay object 104 to 81. However, because the replaying of the session did not change the contents of output data set object 103, the replay system did not change the time stamp for output data set object 103. Thus, replay object 102 does not need to be replayed as a result of input data set object 103 being regenerated. However, it does need to be replayed as a result of input data set object 109 being regenerated. The time stamp of input data set object 109 is now larger than the time stamp of replay object 102. Thus, the session of replay object 102 needs to be replayed. In determining whether replay object 110 needed to be replayed, the replay system ensured that input data set objects 111, 113, and 115 were up-to-date and then compared their time stamps to the time stamp of replay object 110. Since the time stamp of replay object 110 was initially 41 and the time stamp of data set object 115 was initially 70, the replay system knew that the replay object needed to be replayed but did not know whether any of its input data sets were up-to-date. The replay system determined that data set object 115 may not have been up-to-date and replayed the session of replay object 116. As indicated by the change in time stamp on data set object 115, the contents of data set object 115 were changed when it was regenerated. In contrast, when the session of replay object 114 was replayed, data set object 113 was not changed, as indicated by no change in its time stamp.
Figure 1C illustrates the time stamps after data set object 101 has been updated. The replay system replayed the session of replay object 102, which resulted in a change of data set object 101. Thus, the replay system updated the time stamp for data set object 101. In the process of bringing data set object 101 up-to-date, the replay system replayed the sessions associated with replay objects 102, 104, 108, 110, 114, and 116, but did not replay the sessions associated with replay objects 106 and 112. Thus, rather than regenerating all the data set objects that are used directly or indirectly to generate data set object 101, the replay system only regenerated those data sets whose input data sets were actually out-of-date.
Figure 2 illustrates a computer system on which the replay system may be executed. The computer system 201 includes a central processing unit 202, input/output devices 203, and memory 204. The memory contains the replay system 205, object store 206, and services 207. The replay system includes session interface 209 and replay component 210. The object store contains data set objects and replay objects that are maintained by the replay system. The services are computer programs that are used to generate the data set objects of the object store. A service can be invoked independently from the replay system when, for example, a user wants to update a data set object if it is not currently up-to-date (e.g., in a development environment, the user wants to generate a new executable code). The service uses the replay system to record a session that is used to generate a data set object. When invoked by the replay system during the replay of a session, the service is passed a flag indicating that its input data set objects are up-to-date. When the flag indicates that the input data set objects are up-to-date, the service regenerates the data set objects without checking if the input data set objects are up-to-date. When a data set object in the object store is accessed when determining whether to replay a session, the replay component ensures that the input data set objects of that data set object are up-to-date. If an input data set object has changed since the session for that data set object was last replayed, then the replay component replays the session for that data set object.
Figure 3 is a flow diagram illustrating an example implementation of a service. When the service is executed, it is passed arguments that specify the input data sets, the output data sets, and flags to control the actions performed by the service. The service is also passed a flag indicating whether the input data sets are up-to-date. If the passed flag indicates that the input data sets are up-to-date, then the service creates a session and regenerates the data set object. If, however, the passed flag indicates that the input data sets are not necessarily up-to-date, then the service ensures that the input data sets are up-to-date. If an input data set has changed since the output data set was last generated, then the service creates a session and regenerates the output data set. In step 301, the service locates the output data set object that is to be updated or creates that output data set object if it does not exist in the object store. In step 302, if the passed flag indicates that the input data set objects are up-to-date, then the service continues at step 304, else the service continues at step 303. In step 303, the service invokes the needs_replay routine passing the output data set object and an indication that the output data set object should not be updated. The routine returns an indication as to whether an input data set object has changed since the output data set object was last generated, and thus whether the session needs to be replayed. If the session needs to be replayed, then the service continues at step 304, else the service returns. In step 304, the service retrieves the replay object for the located data set object. If no replay object currently exists for the located data set object, then the service creates a replay object in the object store. In step 305, the service requests the replay system to create a session and record the session in the opened replay object. The service also identifies the service that is to be used to regenerate the output data set object when the session is replayed and identifies any arguments that should be passed to the service when the output data set object is regenerated. In step 306, the service notifies the session of the input data set objects that are used to generate the output data set object. In step 307, the service performs the main processing of the service, which includes the generating of the output data set object. In step 308, the service notifies the session of the output data set object. In step 309, the service requests the replay system to close the session. The replay system closes the session by recording in the replay object all the information necessary to regenerate the output data set object. The replay system also updates the time stamp of the replay object and, if the output data set object has changed, the replay system updates the time stamp of the output data set object.
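The steps of Figure 3 can be sketched as follows. This Python fragment is a simplified illustration, not the described implementation: the object store is modeled as two hypothetical dictionaries, concat_service is an invented example service that joins its inputs, and direct_inputs_changed is a stand-in for the needs_replay routine that compares only the direct inputs and does not recurse as the routine of Figure 4 does.

```python
import itertools
from typing import Any, Dict, List

# Hypothetical in-memory object store: data set objects and replay objects by name.
data_sets: Dict[str, Dict[str, Any]] = {}
replays: Dict[str, Dict[str, Any]] = {}
_ticks = itertools.count(1)  # logical clock used for time stamps


def now() -> int:
    return next(_ticks)


def direct_inputs_changed(output: str) -> bool:
    """Simplified stand-in for needs_replay: checks direct inputs only."""
    r = replays.get(output)
    if r is None:
        return True  # never generated, so a session must be run
    return any(data_sets[i]["time_stamp"] > r["time_stamp"] for i in r["inputs"])


def concat_service(inputs: List[str], output: str, inputs_up_to_date: bool = False) -> bool:
    """Joins the inputs into the output; returns True if a session was run."""
    data_sets.setdefault(output, {"content": None, "time_stamp": 0})  # step 301
    if not inputs_up_to_date and not direct_inputs_changed(output):   # steps 302-303
        return False
    session = replays.setdefault(output, {})                          # step 304
    session.update(service="concat_service",                          # step 305
                   args={"inputs": inputs, "output": output},
                   inputs=[])
    session["inputs"].extend(inputs)                                  # step 306
    content = "".join(data_sets[i]["content"] for i in inputs)        # step 307
    session["output"] = output                                        # step 308
    session["time_stamp"] = now()                                     # step 309
    if content != data_sets[output]["content"]:
        data_sets[output].update(content=content, time_stamp=now())
    return True


data_sets["a"] = {"content": "hello", "time_stamp": now()}
data_sets["b"] = {"content": "world", "time_stamp": now()}
first = concat_service(["a", "b"], "ab")    # no replay object yet: session runs
second = concat_service(["a", "b"], "ab")   # inputs unchanged: nothing to do
data_sets["a"].update(content="HELLO", time_stamp=now())
third = concat_service(["a", "b"], "ab")    # input "a" changed: session runs again
```

In this sketch the second invocation returns without doing any work because no input time stamp is later than the time stamp recorded for the session.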
Figure 4 is a flow diagram illustrating an example implementation of the needs_replay routine of the replay system. When a service is requested to ensure that an output data set object is up-to-date, the service invokes this routine to ensure that all the input data set objects necessary to generate the output data set objects are up-to-date and to indicate whether the output data set object needs to be regenerated. This routine also optionally regenerates the output data set as indicated by a passed flag. If a replay object generates multiple output data sets that are input to another replay object, then when the session for that other replay object is replayed, it may ensure that the input data sets of that replay object are up-to-date multiple times, once for each of the multiple output data sets. Of course, once it is ensured that the output data sets are up-to-date during a session, there is no need to repeat the ensuring. To prevent the redundant ensuring, the routine tracks which replay objects have already been brought up-to-date during the session. In step 401, the routine retrieves the replay object for the passed output data set object. The routine also initializes a flag indicating that the retrieved replay object does not need to be replayed. In step 401A, if the retrieved replay object has already been brought up-to-date during this session, then the routine returns a flag indicating that the passed object does not need to be brought up-to-date, else the routine continues at step 401B. In step 401B, the routine marks the replay object as already being brought up-to-date. In one embodiment, the routine tracks the replay objects that have already been brought up-to-date by creating an entry in a hash table for an object identifier of the replay object. In steps 402-406, the routine loops ensuring that any input data set objects of the passed output data set object are up-to-date. The routine recursively invokes the needs_replay routine for each input data set object indicating that the input data set object is to be brought up-to-date, if not currently up-to-date. In step 402, the routine selects the next input data set object of the retrieved replay object. In step 403, if all the input data set objects have already been selected, then the routine continues at step 407, else the routine continues at step 404. In step 404, the routine recursively invokes the needs_replay routine passing the selected input data set object and an indication to bring the selected input data set object up-to-date, if not already up-to-date. In step 405, if the time stamp of the retrieved replay object is before the time stamp of the selected input data set object, then the passed output data set object may be out-of-date and the routine continues at step 406, else the routine loops to step 402 to select the next input data set object of the retrieved replay object. In step 406, the routine sets a flag to indicate that the session corresponding to the retrieved replay object needs to be replayed and then loops to step 402 to select the next input data set object. In step 407, if the passed replay flag indicates that the passed output data set object should be brought up-to-date and the passed output data set object may not be up-to-date, then the routine continues at step 408, else the routine returns a flag indicating whether the passed output data set object needs to be brought up-to-date.
In step 408, the routine invokes a routine to replay the session of the retrieved replay object to bring the passed output data set object up-to-date and then returns an indication that the data set object is up-to-date.
Figure 5 is a flow diagram of an example implementation of the replay routine. This routine is passed a replay object and replays the session recorded in the replay object. In step 501, the routine retrieves the identification of the service from the passed replay object. In step 502, the routine retrieves the arguments from the passed replay object. In step 503, the routine invokes the identified service passing the arguments and a flag indicating that the input data set objects are up-to-date. The routine then returns.
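The routines of Figures 4 and 5 can be sketched together in Python. The fragment below is a minimal sketch under several assumptions: data sets hold a single integer, a recorded service is modeled as a plain Python function, a set of object identities plays the role of the hash table of step 401B, and the names DataSet, Replay, absolute, and double and all time stamps are hypothetical. Closing the session is folded into the replay function for brevity.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

_ticks = itertools.count(101)  # logical clock; later than every stamp used below


@dataclass
class DataSet:
    value: int
    time_stamp: int
    replay: Optional["Replay"] = None  # None for raw data produced by no session


@dataclass
class Replay:
    service: Callable[[List[int]], int]  # the recorded service
    inputs: List[DataSet]
    output: DataSet
    time_stamp: int


replayed: List[str] = []  # trace of replayed services, for the demonstration only


def replay(r: Replay) -> None:
    """Figure 5: invoke the recorded service; its inputs are already up-to-date."""
    replayed.append(r.service.__name__)
    new_value = r.service([d.value for d in r.inputs])
    r.time_stamp = next(_ticks)
    if new_value != r.output.value:  # stamp the output only on a real change
        r.output.value = new_value
        r.output.time_stamp = next(_ticks)


def needs_replay(ds: DataSet, update: bool, done: Set[int]) -> bool:
    """Figure 4: bring the inputs of ds up-to-date; optionally update ds itself."""
    r = ds.replay                          # step 401
    if r is None or id(r) in done:         # step 401A
        return False
    done.add(id(r))                        # step 401B
    stale = False
    for inp in r.inputs:                   # steps 402-403
        needs_replay(inp, True, done)      # step 404
        if r.time_stamp < inp.time_stamp:  # step 405
            stale = True                   # step 406
    if update and stale:                   # step 407
        replay(r)                          # step 408
        return False
    return stale


def absolute(values: List[int]) -> int:
    return abs(values[0])


def double(values: List[int]) -> int:
    return 2 * values[0]


# raw changed from 5 to -5 at time 64, after mid was last generated at time 51.
raw = DataSet(value=-5, time_stamp=64)
mid = DataSet(value=5, time_stamp=40)
mid.replay = Replay(absolute, [raw], mid, time_stamp=51)
top = DataSet(value=10, time_stamp=60)
top.replay = Replay(double, [mid], top, time_stamp=61)

result = needs_replay(top, True, set())
```

In this example only the session that produces mid is replayed. Because the regenerated value of mid equals the old one, its time stamp is not advanced, and the session that produces top is skipped.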
Figures 6-9 illustrate an example use of the replay system by a computer program development environment. The development environment provides services for generating front-end intermediate language code from source code in different programming languages and for generating object code from the front-end intermediate language code. Figure 6 illustrates a program library, a collection of objects, that the development environment uses to store information about a computer program. The information in the program library is organized hierarchically. Node 601 represents the computer program "lib.a" and is the root of the hierarchy. Node 602 contains the external symbol dictionary for the computer program "lib.o". The external symbol dictionary contains the name of each symbol of the computer program that is declared to be external by a function of the computer program and an indication of where the external symbols are defined. The external symbol for a function may be defined within another function within the computer program or defined in a library external to the computer program. Node 604 contains interface records for the computer program. The interface records contain information used by an automatic inliner service indicating where functions should be stored inline rather than having an invocation to the functions. When a function is updated, each of the functions that have the updated function stored inline becomes out-of-date and needs to be updated. Node 603 contains a list of each of the modules, "lib1.o" of node 605 and "lib2.o" of node 606, of the computer program. Node 605 contains the functions of the module in node 607 and a source tree map for the module in node 608. The source tree map contains a cross reference between the front-end intermediate language code and the source files (not shown) and line numbers. In this example, the module "lib1.o" contains two functions, "func1" of node 609 and "func2" of node 610. Node 611 contains the front-end intermediate language code of the function "func1." Node 612 contains object code generated from the front-end intermediate language code for the function "func1." Node 613 contains the information necessary to regenerate the object code for function "func1." The function inline table of node 614 contains information primarily for indicating that an inline function of function "func1" has been changed. Node 615 is a replay object for the automatic inliner service. When the front-end translator generates the front-end intermediate language code, it generates the automatic inliner replay object so that whenever relevant portions of the interface records of node 604 change, the function inline table of node 614 is also changed. In this way, the development environment can determine to regenerate the implementation when an inline function has changed. The dashed lines indicate the inputs and outputs of the replay objects.
Figure 7 illustrates the data set object and replay object organization for the program library of Figure 6. Data set object 701 corresponds to the node 612 and is the output data set of the implementation replay object 702 corresponding to node 613. The replay object 702 represents a session for generating output data set object 701 from input data set objects 703 and 704. Data set object 703 represents the front-end intermediate language code of node 611. Data set object 704 represents the function inline table of node 614. Data set object 704 is an output data set object of the automatic inliner replay object 705 of node 615. Data set object 706 represents the interface records of node 604. When the implementation represented by data set object 701 is accessed, the replay system determines whether data set object 701 is up-to-date.
In doing so, the replay system determines whether input data set object 704 is up-to-date. Data set object 704 is up-to-date if the time stamp of data set object 706 is less than the time stamp of replay object 705. Because the interface records of node 604 contain information relating to all the functions in the program library, the interface records may be updated whenever any function in the program library is modified. If the data set object 706 was input directly into replay object 702 (as indicated by the dashed line), then anytime a function was changed in a way that caused any change to the interface records, the session of replay object 702 would need to be replayed to bring data set object 701 up-to-date. Thus, the replay system would likely regenerate data set object 701 many times when a change in the interface records would result in no change in the implementation represented by data set object 701. However, the function inline table of node 614 represents only the portions of the interface records of node 604 that pertain to the function of node 609. Thus, when updating the translations of the front-end intermediate language of the functions of the program library, a function will only be retranslated when the portion of the interface records represented by its function inline table has changed. Thus, by separating an aggregate data set object, like a global function inline table, into data set objects representing subsets of the aggregate data set object, like data set object 704, the replaying of complex sessions, such as the translation session of replay object 702, is not performed as a result of a non-relevant portion of the aggregate data set object changing.
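The benefit of splitting an aggregate data set into per-function subsets can be illustrated with a small Python sketch. The layout of the interface records is not specified above, so the fragment assumes, purely for illustration, that the records map a function name to a revision number of its inlinable body; the names Stamped, extract_inline_table, and func3 and all time stamps are likewise hypothetical.

```python
from typing import Dict, List


class Stamped:
    """A data set object whose time stamp advances only on a real change."""

    def __init__(self, content: Dict[str, int], time_stamp: int) -> None:
        self.content = content
        self.time_stamp = time_stamp

    def store(self, content: Dict[str, int], now: int) -> None:
        if content != self.content:
            self.content = content
            self.time_stamp = now


def extract_inline_table(records: Dict[str, int], inlined: List[str]) -> Dict[str, int]:
    """Keep only the interface records that one function depends on."""
    return {name: records[name] for name in inlined}


# Assumed interface records: function name -> revision of its inlinable body.
records = {"func2": 1, "func3": 1}
inlined_by_func1 = ["func2"]
table = Stamped(extract_inline_table(records, inlined_by_func1), time_stamp=10)
translation_time_stamp = 20  # when func1 was last translated

# An unrelated function changes: the table is re-extracted but stays identical.
records["func3"] = 2
table.store(extract_inline_table(records, inlined_by_func1), now=31)
retranslate_after_func3 = table.time_stamp > translation_time_stamp

# A function that func1 inlines changes: the table changes, so func1 is stale.
records["func2"] = 2
table.store(extract_inline_table(records, inlined_by_func1), now=41)
retranslate_after_func2 = table.time_stamp > translation_time_stamp
```

The inexpensive extraction session is replayed whenever the interface records change, while the expensive translation session is replayed only when the extracted subset differs.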
Figure 8 is a flow diagram of an example implementation of a routine of a service to translate functions within a module of a program library. This routine is passed arguments which specify the program library and module name, and a flag indicating whether the input data set objects of the functions within the module are up-to-date. In step 801, the routine selects the root node of the program library indicated by the passed arguments. In step 802, the routine selects the module node within the program library. In step 803, the routine selects a certain module as indicated by the passed arguments within the selected program library. In steps 804-809, the routine loops selecting each function within the selected module and retranslating the selected function if any input used to generate the implementation has changed. In step 804, the routine selects the next function of the selected module. In step 805, if all the functions of the selected module have already been selected, then the translation is complete, else the routine continues at step 806. In step 806, the routine selects the implementation of the selected function. In step 807, if the passed flag indicates that the input data sets are up-to-date, then the routine continues at step 809, else the routine continues at step 808. In step 808, the routine invokes the needs_replay routine passing the selected implementation along with an indication that the implementation should not be regenerated. The invoked routine returns an indication as to whether the selected implementation needs to be regenerated. If the implementation is not up-to-date, then the routine continues at step 809, else the routine loops to step 804 to select the next function of the selected module. In step 809, the routine invokes the routine to translate the function and create a session. The routine then loops to step 804 to select the next function in the selected module.
Figure 9 is a flow diagram of an example implementation of a routine to create a translation session. This routine is passed the identification of a front-end intermediate language data set object. In step 901, the routine creates a replay object for the implementation data set object, if one does not exist. In step 902, the routine opens a session. In step 903, the routine notifies the session of the name of the translator service and notifies the session of the arguments that are to be provided to the translator when this session is replayed. In step 904, the routine performs the translation of the front-end intermediate language data set object into the implementation data set object. In step 905, the routine notifies the session to save the implementation data set object. In step 906, the routine closes the session, which updates the time stamp of the implementation data set object (if it has changed) and updates the time stamp of the replay object. The routine then returns. Although not shown, the routine also creates a replay object for a session that determines whether any of the inlined functions of the implementation have changed.
From the foregoing it will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims.