US20070083856A1 - Dynamic temporal optimization framework - Google Patents
- Publication number
- US20070083856A1 (application Ser. No. 11/539,111)
- Authority
- US
- United States
- Prior art keywords
- hot
- program
- profiling
- data stream
- language element
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3612—Software analysis for verifying properties of programs by runtime analysis
Definitions
- the present invention relates to temporal profiling and memory access optimization of computer programs, and particularly to dynamic optimization during program execution.
- With processor speed increasing much more rapidly than memory access speed, there is a growing performance gap between processor and memory in computers. More particularly, processor speed continues to adhere to Moore's law (approximately doubling every 18 months). By comparison, memory access speed has been increasing at the relatively glacial rate of 10% per year. Consequently, there is a rapidly growing processor-memory performance gap. Computer architects have tried to mitigate the performance impact of this imbalance with small high-speed cache memories that store recently accessed data. This solution is effective only if most of the data referenced by a program is available in the cache. Unfortunately, many general-purpose programs, which use dynamic, pointer-based data structures, often suffer from high cache miss rates, and therefore are limited by memory system performance.
- memory system optimizations have the potential to significantly improve program performance.
- One such optimization involves prefetching data ahead of its use by the program, which has the potential of alleviating the processor-memory performance gap by overlapping long latency memory accesses with useful computation.
- Successful prefetching is accurate (i.e., correctly anticipates the data objects that will be accessed in the future) and timely (fetching the data early enough so that it is available in the cache when required). For example, T. Mowry, M. Lam and A. Gupta, “Design And Analysis Of A Compiler Algorithm For Prefetching,” Architectural Support For Programming Languages And Operating Systems (ASPLOS) (1992), describe an automatic prefetching technique for scientific codes that access dense arrays in tightly nested loops, which relies on static compiler analyses to predict the program's data accesses and insert prefetch instructions at appropriate program points.
- Dynamic optimization uses profile information from the current execution of a program to decide what and how to optimize. This can provide an advantage over static and even feedback-directed optimization, such as for programs with distinct phase behavior.
- dynamic optimization must be more concerned with the profiling overhead, since the slow-down from profiling has to be recovered by the speed-up from optimization.
- With sampling, instead of recording all the information that may be useful for optimization, only a small but representative fraction of it is recorded.
- sampling counts the frequency of individual events such as calls or loads.
- the sequence of all events occurring during execution of a program is generally referred to as the “trace.”
- a “burst” is a subsequence of the trace.
- Arnold and Ryder present a framework that samples bursts. (See, M. Arnold and B. Ryder, “A Framework For Reducing The Cost Of Instrumented Code,” Programming Languages Design And Implementation (PLDI) (2001).)
- the other version only contains checks at procedure entries and loop back-edges that decrement a counter “nCheck,” which is initialized to “nCheck0.” Most of the time, the (non-instrumented) checking code is executed. Only when the nCheck counter reaches zero is a single intraprocedural acyclic path of the instrumented code executed, and nCheck is reset to nCheck0.
- a limitation of the Arnold-Ryder framework is that it stays in the instrumented code only for the time between two checks. Since it has checks at every procedure entry and loop back-edge, the framework captures a burst of only one acyclic intraprocedural path's worth of trace. In other words, only the burst between the procedure entry check and a next loop back-edge is captured. This limitation can fail to profile many longer “hot data stream” bursts, and thus fail to optimize such hot data streams.
- the Arnold-Ryder framework was implemented for a Java virtual machine execution environment, where the program is a set of Java class files. Such Java programs typically have a higher execution overhead, so the relative overhead of the instrumentation checks is smaller for these slower-executing programs.
- the overhead of the Arnold-Ryder framework's instrumentation checks may make dynamic optimization with the framework impractical in other settings for programs with lower execution overhead (such as statically compiled machine code programs).
- a further problem is that the overhead of hot data stream detection has been overly high for use in dynamic optimization systems, such as the Arnold-Ryder framework.
- Techniques described herein provide low-overhead temporal profiling and analysis, such as for use in dynamic memory access optimization.
- temporal profiling of longer bursts in a program trace is achieved by incorporating symmetric “checking code” and “instrumented code” counters in a temporal profiling framework employing non-instrumented (checking) code and instrumented code versions of a program.
- Rather than immediately transitioning back to the checking code at the next check in the instrumented code, as in the prior Arnold-Ryder framework, a counter also is placed on checks in the instrumented code. After transitioning to the instrumented code, a count of plural checks in the instrumented code is made before returning to the checking code. This permits the instrumented code to profile longer continuous bursts sampled out of the program trace.
- the overhead of temporal profiling is reduced by intelligently eliminating checks.
- checks were placed at all procedure entries and loop back-edges in the code to ensure that the program can never loop or recurse for an unbounded amount of time without executing a check.
- the techniques intelligently eliminate checks from procedure entries and loop back-edges.
- the intelligent check elimination performs a static call graph analysis of the program to determine where checks should be placed on procedure entries to avoid unbounded execution without checking. Based on the call graph analysis, the intelligent check elimination places checks at entries to root procedures, procedures whose address is taken, and procedures with recursion from below.
- the intelligent check elimination does not place checks on leaf procedures (that call no other code in the program) in the call graph. Further, the intelligent check elimination eliminates checks at loop back-edges of tight inner loops, and at “k-boring loops” (loops with no calls and at most k profiling events of interest, since these are easy for a compiler to statically optimize). Other techniques to reduce checks also can be employed. This reduction in temporal profiling overhead can make dynamic optimization practical for faster executing programs (e.g., binary code), as well as improving efficiency of dynamic optimization of just-in-time compiled (JITed) code and interpreted programs.
- an improved hot data stream detection more quickly identifies hot data streams from profiled bursts of a program, which can make dynamic prefetching practical for dynamic optimization of programs.
- the improved hot data stream detection constructs a parse tree of the profiled bursts, then forms a Sequitur grammar from the parse tree. The improved hot stream detection then traverses the grammar tree in reverse postorder numbering order.
- the improved hot stream detection calculates a regularity magnitude or “heat” of the element based on a length of the burst sequence represented by the element multiplied by its number of “cold” uses (i.e., number of times the element occurs in the complete parse tree, not counting occurrences as sub-trees of another “hot” element).
- the improved hot stream detection identifies elements as representing “hot data streams” if their heat exceeds a heat threshold.
- FIG. 1 is a data flow diagram of a dynamic optimizer utilizing a low overhead, long burst temporal profiling framework and fast hot data stream detection to dynamically optimize a program with dynamic hot data stream prefetching.
- FIG. 2 is a block diagram of a program modified according to the prior Arnold-Ryder framework for burst profiling.
- FIG. 3 is a block diagram of a program modified according to an improved framework for longer burst profiling in the dynamic optimizer of FIG. 1 .
- FIG. 4 is a program code listing for a check to control transitions between checking and instrumented code versions in the improved framework of FIG. 3 for longer burst profiling.
- FIG. 5 is a call graph of an example program to be modified according to an improved framework for low-overhead burst profiling.
- FIG. 6 is an illustration of an analysis of the call graph of FIG. 5 for modifying the example program according to the improved framework for low-overhead burst profiling.
- FIG. 7 is a data flow diagram illustrating processing for dynamic optimization of a program image in the dynamic optimizer of FIG. 1 .
- FIG. 8 is a timeline showing phases of the low-overhead, long burst temporal profiling by the dynamic optimizer of FIG. 1 .
- FIG. 9 is an illustration of grammar analysis of an exemplary data reference sequence in bursts profiled with the low-overhead, long burst temporal profiling forming part of the processing by the dynamic optimizer shown in FIG. 7 .
- FIG. 10 is a program code listing for fast hot data stream detection in the processing by the dynamic optimizer shown in FIG. 7 .
- FIG. 11 is an illustration of the fast hot data stream detection performed according to the program code listing of FIG. 10 on the grammar of the exemplary data reference sequence from FIG. 9 .
- FIG. 12 is a table listing results of the fast hot data stream detection illustrated in FIG. 11 .
- FIG. 13 is a block diagram of a suitable computing device environment for devices in the network device architecture of FIG. 1 .
- the following description is directed to techniques for low-overhead, long burst temporal profiling and fast hot data stream detection, which can be utilized in dynamic optimization of computer programs. More particularly, these techniques are described in their particular application to a dynamic optimization involving hot data stream prefetching to optimize a program's memory accesses. However, the techniques can be applied in contexts other than the described hot data stream prefetching dynamic optimization.
- an exemplary dynamic optimizer 100 utilizes techniques described more fully herein below for low-overhead, long burst temporal profiling and fast hot data stream detection in a process of dynamically optimizing a computer program.
- the exemplary dynamic optimizer 120 includes a program editing tool 122 to build a program image 130 in accordance with a low-overhead temporal profiling framework described below, including inserting instrumentation and checking code for profiling long burst samples of a trace of the program's execution.
- the program editing tool 122 inserts the instrumentation and checking code for the low-overhead temporal profiling framework by editing an executable or binary version 115 of the program to be optimized, after compiling and linking by a conventional compiler from the program's source code version.
- the source code 105 of the program to be optimized may be initially written by a programmer in a high level programming language, such as C or C++.
- Such program source code is then compiled using an appropriate conventional compiler 110 , such as a C/C++ compiler available in the Microsoft® Visual Studio development platform, to produce the machine-executable program binary 115 .
- the executable editing tool for the instrumentation insertion 122 can be the Vulcan executable editing tool for x86 computer platform program binaries, which is described in detail by A. Srivastava, A. Edwards, and H. Vo, “Vulcan: Binary Transformation In A Distributed Environment,” Technical Report MSR-TR-2001-50, Microsoft Research (2001).
- This has the advantage that the dynamic optimizer does not require access to the source code, and can be employed to optimize programs where only an executable binary version is available.
- the profiling framework can be built into the program image 130 as part of the process of compiling the program from source code or an intermediate language form, such as for use with programs written in Java, or intermediate code representations for the Microsoft .NET platform.
- the compiler that inserts instrumentation and checks embodies the tool 122 .
- the temporal profiling framework provided in the program image 130 produces profiled burst data 135 representing sampled bursts of the program's execution trace.
- the exemplary dynamic optimizer 120 includes a hot data stream analyzer 140 and hot stream prefetching code injection tool 142 .
- the hot data stream analyzer 140 implements the fast hot data stream detection described herein below, which processes the profiled burst data to identify “hot data streams,” i.e., frequently recurring sequences of data accesses by the program.
- the hot stream prefetching code injection tool 142 then dynamically modifies the program image 130 to perform prefetching so as to optimize cache utilization and data accesses by the program, based on the identified hot data streams.
- the program image 130 ( FIG. 1 ) is structured according to a low-overhead, long burst temporal profiling framework 300 illustrated in FIG. 3 , which is an improvement on the prior Arnold-Ryder framework 200 ( FIG. 2 ).
- the code of each procedure from an original program version (e.g., original procedure 210 with code blocks 212 - 213 ) is duplicated.
- Both duplicate versions of the code in the framework 200 contain the original instructions, but only one version is instrumented to also collect profile information (referred to herein as the “instrumented code” 220 ).
- the other version (referred to herein as the “checking code” 230 ) only contains checks 240 - 241 at procedure entries and loop back-edges that decrement a counter “nCheck,” which is initialized to “nCheck0.” Most of the time, the (non-instrumented) checking code 230 is executed.
- Only when the nCheck counter reaches zero is a single intraprocedural acyclic path of the instrumented code 220 executed, and nCheck is reset to nCheck0. All back-edges 250 in the instrumented code 220 transition back to the checking code 230 .
- While executing in the instrumented code 220 , the Arnold-Ryder framework 200 profiles a burst out of the program execution trace, which begins at a check (e.g., procedure entry check 240 or back-edge check 241 ) and extends to the next check. In other words, the profiling captures one intraprocedural acyclic path.
- the profile of the program captured during execution of this path can be, for example, the data accesses made by the program.
- the improved framework 300 extends the prior Arnold-Ryder framework 200 ( FIG. 2 ) so that profiled bursts can extend over multiple checks, possibly crossing procedure boundaries. This way, the improved framework can obtain interprocedural, context-sensitive and flow-sensitive profiling information.
- the improved framework 300 is structured to include duplicate non-instrumented (“checking code”) 330 and instrumented code 320 versions of at least some original procedures 310 of the program. Further, checks 340 - 341 are placed at procedure entry and loop back-edges.
- the extension in the improved framework 300 adds a second “profiling phase” counter (labeled “nInstr”) to make execution flow in the instrumented code 320 symmetric with the checking code 330 . Further, the loop back-edges 350 from the instrumented code 320 do not transition directly back to the procedure entry as in the prior Arnold-Ryder framework 200 , but instead go to a back-edge check 341 .
- the program logic or code 400 for the checks 340 - 341 is shown in FIG. 4 .
- Initially, the checking phase counter (“nCheck”) is set to its initial value, “nCheck0.”
- the framework 300 decrements the checking phase counter (nCheck) (statement 410 ) at every check 340 - 341 .
- the framework 300 continues to execute in the checking code (statement 420 ) as long as the value of the checking phase counter has not yet reached zero. For example, from the entry and back-edge checks 340 - 341 , the framework 300 takes the paths 360 - 361 to the checking code 330 .
- the framework 300 initializes the profiling phase counter (nInstr) to an initial value, nInstr0, and transitions to the instrumented code 320 (statement 430 ).
- While executing in the instrumented code, the framework 300 decrements the profiling phase counter (nInstr) at every check 340 - 341 (statement 440 ). The framework 300 continues to execute in the instrumented code (statement 450 ) as long as the value of the profiling phase counter has not yet reached zero. For example, from the entry and back-edge checks 340 - 341 , the framework 300 takes the paths 370 - 371 to the instrumented code 320 . When the profiling phase counter reaches zero, the framework again initializes the checking phase counter to the initial value, nCheck0, and returns to the checking code 330 (statement 460 ).
- the check code 400 is structured so that in the common case where the framework is executing in the checking code and is to continue executing the checking code (checking phase), the check consists of a decrement of the checking phase counter and a conditional branch.
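The check logic described above can be modeled as follows (a hypothetical Python sketch of the FIG. 4 behavior; the class name, attribute names, and counter values are illustrative, not from the patent):

```python
class ProfilingState:
    """Hypothetical model of the two-counter check of FIG. 4."""

    def __init__(self, ncheck0, ninstr0):
        self.ncheck0 = ncheck0    # checking-phase length (nCheck0)
        self.ninstr0 = ninstr0    # profiling-phase length (nInstr0)
        self.n_check = ncheck0
        self.n_instr = 0

    def check(self):
        """Executed at procedure entries and loop back-edges;
        returns which code version runs until the next check."""
        if self.n_check > 0:
            # Common case: a decrement and a conditional branch.
            self.n_check -= 1
            if self.n_check > 0:
                return "checking"
            # Checking phase over: begin a profiling phase.
            self.n_instr = self.ninstr0
            return "instrumented"
        # Profiling phase: stay in the instrumented code for nInstr0
        # checks, so one burst can span several checks and procedures.
        self.n_instr -= 1
        if self.n_instr > 0:
            return "instrumented"
        self.n_check = self.ncheck0   # back to the checking phase
        return "checking"
```

With nCheck0 = 3 and nInstr0 = 2, for example, each burst period of five checks comprises three checks executed in the checking code and two in the instrumented code.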
- the improved framework 300 profiles longer bursts of the program trace and provides more precise profiles.
- the overhead imposed by the temporal profiling framework desirably is relatively small compared to the overall program execution, so that performance gains are achieved from dynamically optimizing the program.
- the overhead of the temporal profiling framework can be particularly significant in the exemplary dynamic optimizer 120 in which the program image 130 is built from editing an executable program binary 115 , to which the compiler 110 has already applied many static optimizations.
- the overhead of the prior Arnold-Ryder framework may be too high for effective dynamic optimization.
- the prior Arnold-Ryder framework has checks at all procedure entries and loop back-edges to ensure that the program can never loop or recurse for an unbounded amount of time without executing a check. Otherwise, sampling could miss too much profiling information (when the program spends an unbounded amount of time in the checking code), or the overhead could become too high (when the program spends an unbounded amount of time in the instrumented code).
- the low-overhead temporal profiling framework described herein decreases the overhead of the burst sampling by intelligently eliminating some checks (i.e., placing checks at fewer than all procedure entries and loop back-edges), while still ensuring that the program does not spend an unbounded amount of time without executing a check.
- the instrumentation tool 122 places checks at an approximated minimum set of procedure entries so that the program cannot recurse for an unbounded amount of time without executing a check.
- the instrumentation tool 122 performs a static call graph analysis of the program 115 to determine this approximate minimum set (C ⊆ N) of nodes in the program's call graph, such that every cycle in the call graph contains at least one node of the set.
- the instrumentation tool 122 does not place any check on any entry to a leaf procedure (i.e., a procedure that calls nothing), since such leaf procedures cannot be part of a recursive cycle. Otherwise, the instrumentation tool 122 places a check on entries to all root procedures (i.e., procedures that are only called from outside the program), so as to ensure that execution starts in the correct version of the code. Also, the tool places a check on entry to every procedure whose address is taken, since such procedures may be part of recursion with indirect calls. Further, the tool places a check on entry to every procedure with recursion from below.
- a procedure f has recursion from below iff it is called by a procedure g in the same strongly connected component as f, where g is at least as far away from the roots as f.
- the distance of a procedure f from the roots is the length of the shortest path from any root to f.
- the “recursion_from_below” heuristic in this criterion guarantees that there is no recursive cycle without a check and breaks ties to determine where in the cycle to put the check (similarly to back-edges in loops).
- the tool breaks ties so that checks are as far up in the call-stack as possible. This should reduce the number of dynamic checks.
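The entry-check placement criteria above can be sketched as follows (hypothetical Python; the function names, and the example call graph in the usage note, are illustrative assumptions, not the actual graph of FIG. 5):

```python
from collections import deque

def scc_map(nodes, edges):
    """Kosaraju's algorithm: map each procedure to a representative of
    its strongly connected component in the call graph."""
    order, visited = [], set()
    def dfs(u):
        visited.add(u)
        for v in edges.get(u, []):
            if v not in visited:
                dfs(v)
        order.append(u)
    for u in nodes:
        if u not in visited:
            dfs(u)
    rev = {u: [] for u in nodes}
    for u in nodes:
        for v in edges.get(u, []):
            rev[v].append(u)
    comp = {}
    for u in reversed(order):
        if u in comp:
            continue
        stack = [u]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp[x] = u
                stack.extend(rev[x])
    return comp

def entry_check_set(nodes, edges, address_taken=()):
    """Approximate minimum set of procedure entries needing checks:
    roots, address-taken procedures, and procedures with recursion
    from below, excluding leaf procedures."""
    callers = {u: [] for u in nodes}
    for u in nodes:
        for v in edges.get(u, []):
            callers[v].append(u)
    roots = [u for u in nodes if not callers[u]]
    leaves = {u for u in nodes if not edges.get(u)}
    dist, queue = {r: 0 for r in roots}, deque(roots)
    while queue:                      # breadth-first distances from roots
        u = queue.popleft()
        for v in edges.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    comp = scc_map(nodes, edges)
    inf = float("inf")
    below = {f for f in nodes         # recursion from below
             if any(comp[g] == comp[f]
                    and dist.get(g, inf) >= dist.get(f, inf)
                    for g in callers[f])}
    return (set(roots) | set(address_taken) | below) - leaves
```

On a made-up graph with root main, leaf procedure leaf, and recursive component {check, match, substitute} (main calls f and leaf; f calls check and leaf; check calls match; match calls substitute; substitute calls check), the computed set is {main, check}: main as a root, and check because its caller substitute lies in the same strongly connected component farther from the root.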
- FIG. 5 illustrates a call graph 500 of an exemplary program being structured by the tool 122 according to the low-overhead temporal profiling framework.
- the only root is procedure main 510
- the only leaf procedure is delete-digram 520 .
- the only non-trivial strongly connected component in the call graph 500 is the component 650 (of procedures {check, match, substitute} 530 - 532 ).
- FIG. 6 illustrates an analysis 600 of the call graph 500 by the tool 122 to determine the set of procedures for entry check placement.
- the tool 122 begins with a breadth-first search of the call graph.
- the tool calculates the distances (e.g., from 0 to 4 in this example) of each procedure from the root procedure (main 510 ), and determines that only the procedure check 530 has recursion from below, since it is called from the procedure substitute 532 which is further away from the root procedure main 510 .
- the instrumentation tool 122 also places checks at fewer than all loop back-edges in the program.
- the instrumentation tool 122 eliminates checks for some tight inner loops. This is because a dynamic optimizer that complements a static optimizer may often find the profiling information from tight inner loops to be of little interest because static optimization excels at optimizing such loops.
- checks at the back-edges of tight inner loops can become extremely expensive (i.e., create excessive overhead relative to potential optimization performance gain).
- loops that compare or copy arrays preferably should not have checks. Such loops typically are easy to optimize statically, the check on the back-edge is almost as expensive as the loop body, and the loop body contains too little work to overlap with the prefetch.
- the instrumentation tool 122 eliminates checks on loop back-edges of loops meeting a “k-boring loops” criteria.
- k-boring loops are defined as loops with no calls and at most a number (k) of profiling events of interest.
- the instrumentation tool 122 does not instrument either version of the code of a k-boring loop, and does not place a check on its back-edge. Since the loop is not included in the instrumented code 320 ( FIG. 1 ) version, the program image 130 does not spend an unbounded amount of time executing in instrumented code. The program image may spend an unbounded amount of time executing such a loop in uninstrumented code (checking code 330 of FIG. 3 ).
- the instrumentation tool 122 may eliminate additional checks on loop back-edges. For example, the instrumentation tool may eliminate back-edge checks from a loop that has only a small, fixed number of iterations. Further, if a check is always executing within a loop body, the loop does not need a check on the loop's back-edge.
- the instrumentation tool 122 can combine the loop counter with the profiling phase counter; if the counters are linearly related, the program image can execute checks for the loop via a predicate on the loop counter, rather than updating the profiling counter each iteration of the loop.
- the temporal profiling 710 using the above-described low-overhead, long burst temporal profiling framework 300 is a first phase in an overall dynamic optimization process 700 based on hot data stream pre-fetching.
- the dynamic optimization process 700 operates in three phases—profiling 710 , analysis and optimization 720 , and hibernation 730 .
- the profiling phase collects ( 740 ) a temporal data reference profile 135 from a running program with low-overhead, which is accomplished using the program image 130 ( FIG. 1 ) structured according to the improved temporal profiling framework 300 .
- a grammar analysis using the Sequitur compression process 750 incrementally builds an online grammar representation 900 of the traced data references.
- a fast hot data stream detection 140 extracts hot data streams 760 from the Sequitur grammar representation 900 .
- a prefetching engine 142 builds a stream prefix matching deterministic finite state machine (DFSM) 770 for these hot data streams, and dynamically injects checks at appropriate program points to detect and prefetch these hot data streams in the program image 130 .
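A prefix-matching DFSM of the kind just described might be sketched as follows (hypothetical Python; the function names, the two-reference prefix length, and the use of single characters in place of (r.pc, r.addr) pairs are all illustrative assumptions):

```python
def build_prefix_dfsm(hot_streams, head_len=2):
    """Build a table-driven prefix matcher: once the first head_len
    references of a hot data stream are seen, the addresses of the
    remaining references are prefetched. A production DFSM handling
    several overlapping streams needs failure transitions
    (Aho-Corasick style); this sketch restarts crudely on a mismatch."""
    table = {}
    for stream in hot_streams:
        head, tail = stream[:head_len], stream[head_len:]
        state = ()
        for sym in head:
            table.setdefault((state, sym), state + (sym,))
            state = state + (sym,)
        table[("accept", state)] = list(tail)  # prefetch these on a match
    return table

def run_matcher(table, refs, head_len=2):
    """Feed a sequence of data references through the matcher and
    collect every address the DFSM would have prefetched."""
    state, prefetched = (), []
    for sym in refs:
        nxt = table.get((state, sym))
        if nxt is None:
            nxt = table.get(((), sym), ())  # crude restart on mismatch
        state = nxt
        if len(state) == head_len and ("accept", state) in table:
            prefetched.extend(table[("accept", state)])
            state = ()
    return prefetched
```

For a single hot stream "abcde", feeding the reference sequence "xxabcdexy" through the matcher triggers one match of the prefix "ab" and yields the prefetch list ['c', 'd', 'e'].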
- the process enters the hibernation phase where no profiling or analysis is performed, and the program continues to execute ( 780 ) as optimized with the added prefetch instructions.
- the program image 130 is de-optimized ( 790 ) to remove the inserted checks and prefetch instructions, and control returns to the profiling phase 710 .
- this profiling 710 , analysis and optimization 720 and hibernate 730 cycle may repeat multiple times.
- FIG. 8 shows a timeline 800 for the three phase profiling, analysis and optimization, and hibernation cycle operation of the dynamic optimizer 100 ( FIG. 1 ).
- the low-overhead, long burst temporal profiling framework uses the checking phase and profiling phase counters (nCheck, nInstr) to control its overhead and sampling rate of profiling, by transitioning between a checking phase 810 in which the program image 130 ( FIG. 1 ) executes in its non-instrumented checking code 330 ( FIG. 3 ) and a profiling phase 820 in which it executes in its instrumented code 320 ( FIG. 3 ).
- the time spent for one iteration of the checking and profiling phase (nCheck0 + nInstr0) is referred to as a burst period 850 .
- the above-described low-overhead temporal profiling framework 300 ( FIG. 3 ) is further extended to alternate between two additional phases, awake 830 and hibernating 840 , which are controlled via two additional (awake and hibernating) counters.
- the temporal profiling framework starts out in the awake phase 830 , and continues operating in the awake phase for a number (nAwake0) of burst-periods, yielding (nAwake0 × nInstr0) checks ( 860 ) worth of traced data references (or “bursts”). Then, as described above and illustrated in FIG. 7 , the dynamic optimizer 100 performs the optimizations, and then the profiler hibernates while the optimized program executes.
- Hibernation is effected by setting nCheck0 to (nCheck0 + nInstr0 − 1) and nInstr0 to 1 for the next nHibernate0 burst-periods (which causes the check code 400 in FIG. 4 to keep the program image executing in the non-instrumented checking code 330 ), where nHibernate0 is much greater than nAwake0.
- the profiling framework is “woken up” by resetting nCheck0 and nInstr0 to their original values.
- During hibernation, the program image traces next to no data references and hence incurs only the basic overhead of executing the checks 400 ( FIG. 4 ).
- the burst-periods correspond to the same time (in executed checks 860 ) in both the awake and hibernating phases. This facilitates control over the relative length of the awake and hibernating phases by appropriately setting the initial value parameters nAwake0 and nHibernate0 of the awake and hibernating counters relative to each other.
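The effect of this reparameterization on the profiling rate can be checked with simple arithmetic (illustrative counter values, not from the patent):

```python
def sampling_rate(ncheck0, ninstr0):
    """Fraction of executed checks that fall in the instrumented code
    during one burst period of (nCheck0 + nInstr0) checks."""
    return ninstr0 / (ncheck0 + ninstr0)

# Illustrative awake-phase parameters (not values from the patent).
nCheck0, nInstr0 = 9900, 100
awake_rate = sampling_rate(nCheck0, nInstr0)
# Hibernation reparameterizes to (nCheck0 + nInstr0 - 1, 1): the burst
# period stays the same length in checks, but profiling nearly stops.
hibernate_rate = sampling_rate(nCheck0 + nInstr0 - 1, 1)
assert awake_rate == 0.01            # 1% of checks profiled while awake
assert hibernate_rate == 1 / 10000   # one check in 10,000 hibernating
```

Because the burst-period length in executed checks is unchanged (9900 + 100 = 9999 + 1 = 10000), the awake-to-hibernating ratio is governed purely by nAwake0 and nHibernate0.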
- a data reference r is a load or store operation on a particular address, represented in the exemplary dynamic optimizer 120 as a data pair (r.pc, r.addr).
- the “pc” value (i.e., r.pc) is the program counter of the load or store operation.
- the “addr” value is the memory location accessed by the load or store operation.
- the profiled burst is a temporal sequence or stream of these data references.
- FIG. 9 illustrates an example of a grammar 900 produced from an input data reference sequence (input string 910 ).
- the grammar 900 represents a hierarchical structure (a directed acyclic graph 920 ) of the data references.
- each observed data reference (r.pc, r.addr) is conceptually represented as a symbol in a grammar, and the concatenation of the profiled bursts is a string w of symbols ( 910 ).
- the Sequitur grammar analysis constructs a context-free grammar for the language {w} consisting of exactly one word, the string w.
- the Sequitur grammar analysis runs in time O(w.length). It is incremental (one symbol can be appended at a time) and deterministic.
- the grammar analysis can be performed as the profiled data is sampled during the profiling phase 710 ( FIG. 7 ).
- the grammar 900 is a compressed representation of the input burst 910 . Further, it is unambiguous and acyclic in the sense that no non-terminal directly or indirectly defines itself.
- the terminal nodes represent individual data references (r.pc, r.addr), which may be repeated in the profiled burst.
- the intermediate nodes represent temporal sequences of the data references.
- the grammar 900 produced from the example input string 910 shows that the string S consists of the sequence “AaBB.” A, in turn, consists of the data references a and b.
- the intermediate node B represents a sequence with two occurrences of the intermediate node C, which is a sequence of the intermediate node A and data reference c.
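The hierarchical structure just described can be checked by expanding the grammar back into its input string (a small Python sketch; the rules are taken from the description above, and the reconstructed string rests on the assumption that these four rules are the entire grammar):

```python
# The grammar of FIG. 9 as described in the text: S derives "AaBB",
# with A -> ab, B -> CC, C -> Ac. Lower-case symbols are terminals
# (individual data references).
grammar = {"S": ["A", "a", "B", "B"],
           "A": ["a", "b"],
           "B": ["C", "C"],
           "C": ["A", "c"]}

def expand(sym):
    """Expand a symbol back into the data reference string it derives."""
    if sym not in grammar:            # terminal: a single data reference
        return sym
    return "".join(expand(s) for s in grammar[sym])

w = expand("S")
assert w == "abaabcabcabcabc"         # the single word of the language {w}
# Subsequence length represented by each non-terminal:
assert {n: len(expand(n)) for n in grammar} == \
       {"S": 15, "A": 2, "B": 6, "C": 3}
```

The expansion shows the grammar is an exact, lossless compression: fifteen data references are represented by four short rules, and each non-terminal carries the length of the subsequence it derives.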
- the dynamic optimizer 100 After construction of the grammar 900 in the profiling phase 710 , the dynamic optimizer 100 performs a fast hot data stream detection 140 ( FIGS. 1 and 7 ) to identify frequently recurring data reference subsequences (the “hot data streams”) in the profiled bursts.
- the exemplary dynamic optimizer For the fast hot data stream detection, the exemplary dynamic optimizer performs analysis of the grammar as represented in a hot data stream detection code 1000 shown in FIG. 10 .
- the purpose of the fast hot data stream analysis is to identify hot data streams, which are data reference subsequences in the profiled bursts whose regularity magnitude exceeds a predetermined “heat” threshold, H.
- the result of the analysis is the set { wA | A is a hot non-terminal } of hot data streams, where wA denotes the subsequence derived from non-terminal A.
- FIGS. 11 and 12 show an example 1100 of the analysis in the code 1000 ( FIG. 10 ) for the input data reference sequence 910 and grammar 900 in FIG. 9 .
- in the Sequitur grammar analysis 750 ( FIG. 7 ), the input data reference sequence has been parsed (as shown by parse tree 1110 ) and sub-sequences grouped under intermediate (non-terminal) nodes into the Sequitur grammar ( 1120 ).
- the Sequitur grammar analysis also yields the length of the subsequence represented in each non-terminal node of the grammar 1120 .
- the information shown in the first three columns (the non-terminal nodes, their children, and their lengths) of the table 1200 is provided to the fast hot data stream detection analysis.
- a non-terminal node is considered the child of another non-terminal node if it is listed on the right-hand side of the grammar rule of the other non-terminal node in Sequitur grammar 900 ( FIG. 9 ).
- the analyzer 140 ( FIG. 1 ) first executes instructions ( 1010 ) to perform a reverse post-order numbering of the non-terminal nodes in the grammar.
- This numbering guarantees that the analysis does not visit a non-terminal node before having visited all of its predecessors.
- the analyzer 140 next determines at instructions 1020 in code 1000 how often each non-terminal node occurs in the parse-tree 1110 ( FIG. 11 ), which is represented in the “use” column of the table 1200 ( FIG. 12 ). Each of the non-terminal nodes is now associated with two values, its number of “hot uses” and its length, which are depicted conceptually in the uses:length tree 1140 ( FIG. 11 ).
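The detection pass just described can be sketched end-to-end in Python on the example grammar. This is a sketch of the analysis, not a transcription of the code 1000 in FIG. 10; the threshold H is a tunable parameter the text leaves open, and the MAX_LEN bound is an assumption added here only so the example does not trivially report the whole burst as a single hot stream.

```python
# Sketch of fast hot data stream detection over the example grammar:
# S -> A a B B, A -> a b, B -> C C, C -> A c.
# heat(A) = cold_uses(A) x length(A); occurrences inside a node already
# marked hot do not count as cold uses. H and MAX_LEN are illustrative
# assumptions, not values from the text.
grammar = {
    "S": ["A", "a", "B", "B"],
    "A": ["a", "b"],
    "B": ["C", "C"],
    "C": ["A", "c"],
}
H, MAX_LEN = 10, 10

def detect_hot(grammar, root="S"):
    length, post = {}, []

    def walk(sym):
        """Compute subsequence lengths and a post-order of non-terminals."""
        if sym not in grammar:
            return 1
        if sym not in length:
            length[sym] = sum(walk(s) for s in grammar[sym])
            post.append(sym)
        return length[sym]

    walk(root)

    def expand(sym):
        return sym if sym not in grammar else \
            "".join(expand(s) for s in grammar[sym])

    uses = {sym: 0 for sym in grammar}
    uses[root] = 1
    hot = {}
    for sym in reversed(post):          # reverse post-order: parents first
        if length[sym] <= MAX_LEN and uses[sym] * length[sym] >= H:
            hot[sym] = expand(sym)      # record w_A; hot uses stop propagating
        else:
            for child in grammar[sym]:  # each cold occurrence feeds children
                if child in grammar:
                    uses[child] += uses[sym]
    return hot

print(detect_hot(grammar))              # -> {'B': 'abcabc'}
```

With these settings, B (two cold uses of a length-6 subsequence, heat 12) is the only hot non-terminal: C occurs only inside the hot node B, so its uses are not counted cold, and A's single cold use gives heat 2.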
- FIG. 13 illustrates a generalized example of a suitable computing environment 1300 in which the described techniques can be implemented.
- the computing environment 1300 is not intended to suggest any limitation as to scope of use or functionality of the invention, as the present invention may be implemented in diverse general-purpose or special-purpose computing environments.
- the computing environment 1300 includes at least one processing unit 1310 and memory 1320 .
- the processing unit 1310 executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power.
- the memory 1320 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two.
- the memory 1320 stores software 1380 implementing the dynamic optimizer 100 ( FIG. 1 ).
- a computing environment may have additional features.
- the computing environment 1300 includes storage 1340 , one or more input devices 1350 , one or more output devices 1360 , and one or more communication connections 1370 .
- An interconnection mechanism such as a bus, controller, or network interconnects the components of the computing environment 1300 .
- operating system software provides an operating environment for other software executing in the computing environment 1300 , and coordinates activities of the components of the computing environment 1300 .
- the storage 1340 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment 1300 .
- the storage 1340 stores instructions for the dynamic optimizer software 1380 .
- the input device(s) 1350 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment 1300 .
- the input device(s) 1350 may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment.
- the output device(s) 1360 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 1300 .
- the communication connection(s) 1370 enable communication over a communication medium to another computing entity.
- the communication medium conveys information such as computer-executable instructions, audio/video or other media information, or other data in a modulated data signal.
- a modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
- Computer-readable media are any available media that can be accessed within a computing environment.
- Computer-readable media include memory 1320 , storage 1340 , communication media, and combinations of any of the above.
- program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- the functionality of the program modules may be combined or split between program modules as desired in various embodiments.
- Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
Abstract
A temporal profiling framework useful for dynamic optimization with hot data stream prefetching provides profiling of longer bursts and lower overhead. For profiling longer bursts, the framework employs a profiling phase counter, as well as a checking phase counter, to control transitions to and from instrumented code for sampling bursts of a program execution trace. The temporal profiling framework further intelligently eliminates some checks at procedure entries and loop back-edges, while still avoiding unbounded execution without executing checks for transition to and from instrumented code. Fast hot data stream detection analyzes a grammar of a profiled data reference sequence, calculating a heat metric for recurring subsequences based on length and number of unique occurrences outside of other hot data streams in the sequence with sufficiently low-overhead to permit use in a dynamic optimization framework.
Description
- This is a divisional of U.S. patent application Ser. No. 10/305,056, filed Nov. 25, 2002, which application is incorporated herein in its entirety.
- The present invention relates to temporal profiling and memory access optimization of computer programs, and particularly for dynamic optimization during program execution.
- With processor speed increasing much more rapidly than memory access speed, there is a growing performance gap between processor and memory in computers. More particularly, processor speed continues to adhere to Moore's law (approximately doubling every 18 months). By comparison memory access speed has been increasing at the relatively glacial rate of 10% per year. Consequently, there is a rapidly growing processor-memory performance gap. Computer architects have tried to mitigate the performance impact of this imbalance with small high-speed cache memories that store recently accessed data. This solution is effective only if most of the data referenced by a program is available in the cache. Unfortunately, many general-purpose programs, which use dynamic, pointer-based data structures, often suffer from high cache miss rates, and therefore are limited by memory system performance.
- Due to the increasing processor-memory performance gap, memory system optimizations have the potential to significantly improve program performance. One such optimization involves prefetching data ahead of its use by the program, which has the potential of alleviating the processor-memory performance gap by overlapping long latency memory accesses with useful computation. Successful prefetching is accurate (i.e., correctly anticipates the data objects that will be accessed in the future) and timely (fetching the data early enough so that it is available in the cache when required). For example, T. Mowry, M. Lam and A. Gupta, “Design And Analysis Of A Compiler Algorithm For Prefetching,” Architectural Support For Programming Languages And Operating Systems (ASPLOS) (1992) describe an automatic prefetching technique for scientific codes that access dense arrays in tightly nested loops, which relies on static compiler analyses to predict the program's data accesses and insert prefetch instructions at appropriate program points. However, the reference pattern of general-purpose programs, which use dynamic, pointer-based data structures, is much more complex, and the same techniques do not apply.
- An alternative to static analyses for predicting data access patterns is to perform program data reference profiling. Recent research has shown that programs possess a small number of “hot data streams,” which are data reference sequences that frequently repeat in the same order, and these account for around 90% of a program's data references and more than 80% of cache misses. (See, e.g., T. M. Chilimbi, “Efficient Representations And Abstractions For Quantifying And Exploiting Data Reference Locality,” Proceedings Of The ACM SIGPLAN '01 Conference On Programming Language Design And Implementation (June 2001); and S. Rubin, R. Bodik and T. Chilimbi, “An Efficient Profile-Analysis Framework For Data-Layout Optimizations,” Principles Of Programming Languages, POPL '02 (January 2002).) These hot data streams can be prefetched accurately since they repeat frequently in the same order and thus are predictable. They are long enough (15-20 object references on average) so that they can be prefetched ahead of use in a timely manner.
- In prior work, Chilimbi instrumented a program to collect the trace of its data memory references; then used a compression technique called Sequitur to process the trace off-line and extract hot data streams. (See, T. M. Chilimbi, “Efficient Representations And Abstractions For Quantifying And Exploiting Data Reference Locality,” Proceedings Of The ACM SIGPLAN '01 Conference On Programming Language Design And Implementation (June 2001).) Chilimbi further demonstrated that these hot data streams are fairly stable across program inputs and could serve as the basis for an off-line static prefetching scheme. (See, T. M. Chilimbi, “On The Stability Of Temporal Data Reference Profiles,” International Conference On Parallel Architectures And Compilation Techniques (PACT) (2001).) However, this off-line static prefetching scheme may not be appropriate for programs with distinct phase behavior.
- Dynamic optimization uses profile information from the current execution of a program to decide what and how to optimize. This can provide an advantage over static and even feedback-directed optimization, such as in the case of the programs with distinct phase behavior. On the other hand, dynamic optimization must be more concerned with the profiling overhead, since the slow-down from profiling has to be recovered by the speed-up from optimization.
- One common way to reduce the overhead of profiling is through use of sampling: instead of recording all the information that may be useful for optimization, sample a small, but representative fraction of it. In a typical example, sampling counts the frequency of individual events such as calls or loads. (See, J. Anderson et al., “Continuous Profiling: Where Have All The Cycles Gone?,” ACM Transactions On Computer Systems (TOCS) (1997).) Other dynamic optimizations exploit causality between two or more events. One example is prefetching with Markov-predictors using pairs of data accesses. (See, D. Joseph and D. Grunwald, “Prefetching Using Markov Predictors,” International Symposium On Computer Architecture (ISCA) (1997).) Some recent transparent native code optimizers focus on single-entry, multiple-exit code regions. (See, e.g., V. Bala, E. Duesterwald and S. Banerjia, “Dynamo: A Transparent Dynamic Optimization System,” Programming Languages Design And Implementation (PLDI) (2000); and D. Deaver, R. Gorton and N. Rubin, “Wiggins/Redstone: An On-Line Program Specializer,” Hot Chips (1999).) Another example provides cache-conscious data placement during generational garbage collection to lay out sequences of data objects. (See, T. Chilimbi, B. Davidson and J. Larus, “Cache-Conscious Structure Definition,” Programming Languages Design And Implementation (PLDI) (1999); and T. Chilimbi and J. Larus, “Using Generational Garbage Collection To Implement Cache-Conscious Data Placement,” International Symposium On Memory Management (ISMM) (1998).) However, for lack of low-overhead temporal profilers, these systems usually employ event profilers. But, as Ball and Larus point out, event (node or edge) profiling may misidentify frequencies of event sequences. (See, T. Ball and J. Larus, “Efficient Path Profiling,” International Symposium On Microarchitecture (MICRO) (1996).)
- The sequence of all events occurring during execution of a program is generally referred to as the “trace.” A “burst” on the other hand is a subsequence of the trace. Arnold and Ryder present a framework that samples bursts. (See, M. Arnold and B. Ryder, “A Framework For Reducing The Cost Of Instrumented Code,” Programming Languages Design And Implementation (PLDI) (2001).) In their framework, the code of each procedure is duplicated. (Id., at
FIG. 2 .) Both versions of the code contain the original instructions, but only one version is instrumented to also collect profile information. The other version only contains checks at procedure entries and loop back-edges that decrement a counter “nCheck,” which is initialized to “nCheck0.” Most of the time, the (non-instrumented) checking code is executed. Only when the nCheck counter reaches zero is a single intraprocedural acyclic path of the instrumented code executed, and nCheck is reset to nCheck0.
- A limitation of the Arnold-Ryder framework is that it stays in the instrumented code only for the time between two checks. Since it has checks at every procedure entry and loop back-edge, the framework captures a burst of only one acyclic intraprocedural path's worth of trace. In other words, only the burst between the procedure entry check and a next loop back-edge is captured. This limitation can fail to profile many longer “hot data stream” bursts, and thus fail to optimize such hot data streams. Consider for example the code fragment:
- for (i=0; i<n; i++)
- if ( . . . ) f( );
- else g( );
Because the Arnold-Ryder framework ends burst profiling at loop back-edges, the framework would be unable to distinguish the traces fgfgfgfg and ffffgggg. For optimizing single-entry multiple-exit regions of programs, this profiling limitation may make the difference between executing optimized code most of the time or not.
- Another limitation of the Arnold-Ryder framework is that the overhead of the framework can still be too high for dynamic optimization of machine executable code binaries. The Arnold-Ryder framework was implemented for a Java virtual machine execution environment, where the program is a set of Java class files. These Java programs typically have a higher execution overhead, so that the overhead of the instrumentation checks is smaller compared to a relatively slow executing program. The overhead of the Arnold-Ryder framework's instrumentation checks may make dynamic optimization with the framework impractical in other settings for programs with lower execution overhead (such as statically compiled machine code programs).
- A further problem is that the overhead of hot data stream detection has been overly high for use in dynamic optimization systems, such as the Arnold-Ryder framework.
- Techniques described herein provide low-overhead temporal profiling and analysis, such as for use in dynamic memory access optimization.
- In accordance with one technique described herein, temporal profiling of longer bursts in a program trace is achieved by incorporating symmetric “checking code” and “instrumented code” counters in a temporal profiling framework employing non-instrumented (checking) code and instrumented code versions of a program. Rather than immediately transitioning back to the checking code at a next proximate check in the instrumented code as in the prior Arnold-Ryder framework, a counter also is placed on checks in the instrumented code. After transitioning to the instrumented code, a count of plural checks in the instrumented code is made before returning to the checking code. This permits the instrumented code to profile longer continuous bursts sampled out of the program trace.
- In accordance with further techniques, the overhead of temporal profiling is reduced by intelligently eliminating checks. In the prior Arnold-Ryder framework, checks were placed at all procedure entries and loop back-edges in the code to ensure that the program can never loop or recurse for an unbounded amount of time without executing a check. The techniques intelligently eliminate checks from procedure entries and loop back-edges. In one implementation, the intelligent check elimination performs a static call graph analysis of the program to determine where checks should be placed on procedure entries to avoid unbounded execution without checking. Based on the call graph analysis, the intelligent check elimination places checks at entries to root procedures, procedures whose address is taken, and procedures with recursion from below. On the other hand, the intelligent check elimination does not place checks on leaf procedures (that call no other code in the program) in the call graph. Further, the intelligent check elimination eliminates checks at loop back-edges of tight inner loops, and at “k-boring loops” (loops with no calls and at most k profiling events of interest, since these are easy for a compiler to statically optimize). Other techniques to reduce checks also can be employed. This reduction in temporal profiling overhead can make dynamic optimization practical for faster executing programs (e.g., binary code), as well as improving efficiency of dynamic optimization of just-in-time compiled (JITed) code and interpreted programs.
- In accordance with another technique, an improved hot data stream detection more quickly identifies hot data streams from profiled bursts of a program, which can make dynamic prefetching practical for dynamic optimization of programs. In one implementation, the improved hot data stream detection constructs a parse tree of the profiled bursts, then forms a Sequitur grammar from the parse tree. The improved hot stream detection then traverses the grammar tree in reverse postorder numbering order. At each grammar element, the improved hot stream detection calculates a regularity magnitude or “heat” of the element based on a length of the burst sequence represented by the element multiplied by its number of “cold” uses (i.e., number of times the element occurs in the complete parse tree, not counting occurrences as sub-trees of another “hot” element). The improved hot stream detection identifies elements as representing “hot data streams” if their heat exceeds a heat threshold.
- Additional features and advantages of the invention will be made apparent from the following detailed description that proceeds with reference to the accompanying drawings.
- FIG. 1 is a data flow diagram of a dynamic optimizer utilizing a low overhead, long burst temporal profiling framework and fast hot data stream detection to dynamically optimize a program with dynamic hot data stream prefetching.
- FIG. 2 is a block diagram of a program modified according to the prior Arnold-Ryder framework for burst profiling.
- FIG. 3 is a block diagram of a program modified according to an improved framework for longer burst profiling in the dynamic optimizer of FIG. 1 .
- FIG. 4 is a program code listing for a check to control transitions between checking and instrumented code versions in the improved framework of FIG. 3 for longer burst profiling.
- FIG. 5 is a call graph of an example program to be modified according to an improved framework for low-overhead burst profiling.
- FIG. 6 is an illustration of an analysis of the call graph of FIG. 5 for modifying the example program according to the improved framework for low-overhead burst profiling.
- FIG. 7 is a data flow diagram illustrating processing for dynamic optimization of a program image in the dynamic optimizer of FIG. 1 .
- FIG. 8 is a timeline showing phases of the low-overhead, long burst temporal profiling by the dynamic optimizer of FIG. 1 .
- FIG. 9 is an illustration of grammar analysis of an exemplary data reference sequence in bursts profiled with the low-overhead, long burst temporal profiling forming part of the processing by the dynamic optimizer shown in FIG. 7 .
- FIG. 10 is a program code listing for fast hot data stream detection in the processing by the dynamic optimizer shown in FIG. 7 .
- FIG. 11 is an illustration of the fast hot data stream detection performed according to the program code listing of FIG. 10 on the grammar of the exemplary data reference sequence from FIG. 9 .
- FIG. 12 is a table listing results of the fast hot data stream detection illustrated in FIG. 11 .
- FIG. 13 is a block diagram of a suitable computing device environment for implementing the dynamic optimizer of FIG. 1 .
- The following description is directed to techniques for low-overhead, long burst temporal profiling and fast hot data stream detection, which can be utilized in dynamic optimization of computer programs. More particularly, these techniques are described in their particular application to a dynamic optimization involving hot data stream prefetching to optimize a program's memory accesses. However, the techniques can be applied in contexts other than the described hot data stream prefetching dynamic optimization.
- 1. Overview of Dynamic Optimizer
- With reference to FIG. 1 , an exemplary dynamic optimizer 100 utilizes techniques described more fully herein below for low-overhead, long burst temporal profiling and fast hot data stream detection in a process of dynamically optimizing a computer program. The exemplary dynamic optimizer 120 includes a program editing tool 122 to build a program image 130 in accordance with a low-overhead temporal profiling framework described below, including inserting instrumentation and checking code for profiling long burst samples of a trace of the program's execution. In the exemplary dynamic optimizer, the program editing tool 122 inserts the instrumentation and checking code for the low-overhead temporal profiling framework by editing an executable or binary version 115 of the program to be optimized, after compiling and linking by a conventional compiler from the program's source code version. For example, the source code 105 of the program to be optimized may be initially written by a programmer in a high level programming language, such as C or C++. Such program source code is then compiled using an appropriate conventional compiler 110 , such as a C/C++ compiler available in the Microsoft® Visual Studio development platform, to produce the machine-executable program binary 115 . The executable editing tool for the instrumentation insertion 122 can be the Vulcan executable editing tool for x86 computer platform program binaries, which is described in detail by A. Srivastava, A. Edwards, and H. Vo, “Vulcan: Binary Transformation In A Distributed Environment,” Technical Report MSR-TR-2001-50, Microsoft Research (2001). This has the advantage that the dynamic optimizer does not require access to the source code, and can be employed to optimize programs where only an executable binary version is available.
- In other embodiments, the profiling framework can be built into the program image 130 as part of the process of compiling the program from source code or an intermediate language form, such as for use with programs written in Java, or intermediate code representations for the Microsoft .NET platform. In such other embodiments, the compiler that inserts instrumentation and checks embodies the tool 122 .
- The temporal profiling framework provided in the program image 130 produces profiled burst data 135 representing sampled bursts of the program's execution trace. The exemplary dynamic optimizer 120 includes a hot data stream analyzer 140 and hot stream prefetching code injection tool 142 . The hot data stream analyzer 140 implements the fast hot data stream detection described herein below that processes the profiled burst data to identify “hot data streams,” which are frequently recurring sequences of data accesses by the program. The hot stream prefetching code injection tool 142 then dynamically modifies the program image 130 to perform prefetching so as to optimize cache utilization and data accesses by the program, based on the identified hot data streams.
- 2. Temporal Profiling Framework
- The program image 130 ( FIG. 1 ) is structured according to a low-overhead, long burst temporal profiling framework 300 illustrated in FIG. 3 , which is an improvement on the prior Arnold-Ryder framework 200 ( FIG. 2 ).
- In the prior Arnold-Ryder framework 200 , the code of each procedure from an original program version (e.g., original procedure 210 with code blocks 212-213) is duplicated. Both duplicate versions of the code in the framework 200 contain the original instructions, but only one version is instrumented to also collect profile information (referred to herein as the “instrumented code” 220 ). The other version (referred to herein as the “checking code” 230 ) only contains checks 240-241 at procedure entries and loop back-edges that decrement a counter “nCheck,” which is initialized to “nCheck0.” Most of the time, the (non-instrumented) checking code 230 is executed. Only when the nCheck counter reaches zero is a single intraprocedural acyclic path of the instrumented code 220 executed, and nCheck is reset to nCheck0. All back-edges 250 in the instrumented code 220 transition back to the checking code 230 .
- While executing in the instrumented code 220 , the Arnold-Ryder framework 200 profiles a burst out of the program execution trace, which begins at a check (e.g., procedure entry check 240 or back-edge check 241 ) and extends to the next check. In other words, the profiling captures one intraprocedural acyclic path. The profile of the program captured during execution of this path can be, for example, the data accesses made by the program.
- Profiling Longer Bursts
- The improved framework 300 extends the prior Arnold-Ryder framework 200 ( FIG. 2 ) so that profiled bursts can extend over multiple checks, possibly crossing procedure boundaries. This way, the improved framework can obtain interprocedural, context-sensitive and flow-sensitive profiling information.
- As in the Arnold-Ryder framework 200 , the improved framework 300 is structured to include duplicate non-instrumented (“checking code”) 330 and instrumented code 320 versions of at least some original procedures 310 of the program. Further, checks 340-341 are placed at procedure entry and loop back-edges.
- The extension in the improved framework 300 adds a second “profiling phase” counter (labeled “nInstr”) to make execution flow in the instrumented code 320 symmetric with the checking code 330 . Further, the loop back-edges 350 from the instrumented code 320 do not transition directly back to the procedure entry as in the prior Arnold-Ryder framework 200 , but instead go to a back-edge check 341 .
- The program logic or code 400 for the checks 340-341 is shown in FIG. 4 . Initially, the value of the checking phase counter (“nCheck”) is set to its initial value, “nCheck0.” While in the checking code, the framework 300 decrements the checking phase counter (nCheck) (statement 410 ) at every check 340-341 . The framework 300 continues to execute in the checking code (statement 420 ) as long as the value of the checking phase counter has not yet reached zero. For example, from the entry and back-edge checks 340-341 , the framework 300 takes the paths 360-361 to the checking code 330 .
- When the checking phase counter (nCheck) reaches zero, the framework 300 initializes the profiling phase counter (nInstr) to an initial value, nInstr0, and transitions to the instrumented code 320 (statement 430 ). In general, the checking phase counter's initial value is selected to be much greater than that of the profiling phase counter (i.e., nInstr0<<nCheck0), which determines the sampling rate of the framework (r=nInstr0/(nCheck0+nInstr0)).
- While executing in the instrumented code, the framework 300 decrements the profiling phase counter (nInstr) at every check 340-341 (statement 440 ). The framework 300 continues to execute in the instrumented code (statement 450 ) as long as the value of the profiling phase counter has not yet reached zero. For example, from the entry and back-edge checks 340-341 , the framework 300 takes the paths 370-371 to the instrumented code 320 . When the profiling phase counter reaches zero, the framework again initializes the checking phase counter to the initial value, nCheck0, and returns to the checking code 330 (statement 460 ).
- The check code 400 is structured so that in the common case where the framework is executing in the checking code and is to continue executing the checking code (checking phase), the check consists of a decrement of the checking phase counter and a conditional branch.
- Compared to the prior Arnold-Ryder framework 200 , the improved framework 300 profiles longer bursts of the program trace and provides more precise profiles. For example, consider the following code fragment:
-
- if ( . . . ) f( );
- else g( );
In this example code fragment, the Arnold-Ryder framework returns to the checking code upon the back-edge path from each execution of the procedures, f( ) and g( ). Accordingly, the Arnold-Ryder framework profiles only on acyclic intraprocedural path of the program trace, and would be unable to distinguish the traces, fgfgfgfg and ffffgggg. Theimproved framework 300 profiles longer bursts across procedure boundaries. In the dynamic optimizer 120 (FIG. 1 ), this can make a difference between executing optimized code most of the time or not.
- Low-overhead Temporal Profiling
- For the dynamic optimization to effectively enhance the performance of the program, the overhead imposed by the temporal profiling framework desirably is relatively small compared to the overall program execution, so that performance gains are achieved from dynamically optimizing the program. The overhead of the temporal profiling framework can be particularly significant in the exemplary
dynamic optimizer 120 in which theprogram image 130 is built from editing anexecutable program binary 115, to which thecompiler 110 has already applied many static optimizations. In such case, the overhead of the prior Arnold-Ryder framework may be too high for effective dynamic optimization. The prior Arnold-Ryder framework has checks at all procedure entries and loop back-edges to insure that the program can never loop or recurse for an unbounded amount of time without executing a check. Otherwise, sampling could miss too much profiling information (when the program spends an unbounded amount of time in the checking code), or the overhead could become too high (when the program spends an unbounded amount of time in the instrumented code). - The low-overhead temporal profiling framework described herein decreases the overhead of the burst sampling by intelligently eliminating some checks (i.e., placing checks at fewer than all procedure entries and loop back-edges), while still ensuring that the program does not spend an unbounded amount of time without executing a check.
- Eliminating Checks at Procedure Entries
- In the low-overhead temporal profiling framework, the
instrumentation tool 122 places checks at an approximated minimum set of procedure entries so that the program cannot recurse for an unbounded amount of time without executing a check. Theinstrumentation tool 122 performs a static call graph analysis of theprogram 115 to determine this approximate minimum set (C⊂N) of nodes in the program's call graph, such that every cycle in the call graph contains at least one node of the set. - In the
dynamic optimizer 120, the instrumentation tool 122 selects this set (C⊂N) of procedures f at which to place procedure entry checks, according to the criteria represented in the following expression: - In accordance with these criteria, the
instrumentation tool 122 does not place a check on entry to any leaf procedure (i.e., a procedure that calls nothing), since leaf procedures cannot be part of a recursive cycle. Otherwise, the instrumentation tool 122 places a check on entries to all root procedures (i.e., procedures that are only called from outside the program), so as to ensure that execution starts in the correct version of the code. Also, the tool places a check on entry to every procedure whose address is taken, since such procedures may be part of recursion with indirect calls. Further, the tool places a check on entry to every procedure with recursion from below. A procedure f has recursion from below iff it is called by a procedure g in the same strongly connected component as f that is at least as far away from the roots as f. The distance of a procedure f from the roots is the length of the shortest path from a root to f. - The "recursion_from_below" heuristic in these criteria guarantees that there is no recursive cycle without a check and breaks ties to determine where in the cycle to put the check (similarly to back-edges in loops). The tool breaks ties so that checks are as far up in the call-stack as possible. This should reduce the number of dynamic checks.
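The check-placement criteria above can be sketched as a small graph analysis. This is a hedged illustration: the call-graph representation, function names, and the example edges (patterned on the FIG. 5 description, whose exact edges are not reproduced in the text) are assumptions, not the patent's implementation.

```python
from collections import deque

def bfs_distances(graph, roots):
    """Distance of each procedure from the nearest root (breadth-first search)."""
    dist = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def reachable(graph, src):
    """Procedures reachable from src through one or more calls."""
    seen, stack = set(), [src]
    while stack:
        for v in graph.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def entry_check_set(graph, roots, address_taken=frozenset()):
    """Approximate minimum set C of procedures that need entry checks."""
    dist = bfs_distances(graph, roots)
    reach = {f: reachable(graph, f) for f in graph}
    checks = set()
    for f in graph:
        if not graph.get(f):
            continue                  # leaf: cannot be on a recursive cycle
        if f in roots or f in address_taken:
            checks.add(f)
            continue
        # recursion from below: f has a caller g in the same strongly connected
        # component (f and g mutually reachable) at least as far from the roots
        for g in graph:
            if (f in graph.get(g, ()) and g in reach[f] and f in reach[g]
                    and dist.get(g, 0) >= dist.get(f, 0)):
                checks.add(f)
                break
    return checks

# Hypothetical edges modeled on the FIG. 5 example ("run" is an assumed name).
graph = {
    "main": ["run"],
    "run": ["check", "delete_digram"],
    "check": ["match"],
    "match": ["substitute"],
    "substitute": ["check", "delete_digram"],
    "delete_digram": [],
}
assert entry_check_set(graph, {"main"}) == {"main", "check"}
```

Under these assumed edges, the sketch reproduces the outcome described for the example: only main (a root) and check (the one procedure with recursion from below in the {check, match, substitute} component) receive entry checks.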
- For example,
FIG. 5 illustrates a call graph 500 of an exemplary program being structured by the tool 122 according to the low-overhead temporal profiling framework. In this call graph 500, the only root is the procedure main 510, and the only leaf procedure is delete-digram 520. The only non-trivial strongly connected component in the call graph 500 is the component 650 (of procedures {check, match, substitute} 530-532). -
FIG. 6 illustrates an analysis 600 of the call graph 500 by the tool 122 to determine the set of procedures for entry check placement. For this analysis, the tool 122 begins with a breadth-first search of the call graph. The tool calculates the distance (e.g., from 0 to 4 in this example) of each procedure from the root procedure (main 510), and determines that only the procedure check 530 has recursion from below, since it is called from the procedure substitute 532, which is further away from the root procedure main 510. The tool 122 thus determines that for this example with call graph 500, only the procedures main 510 and check 530 meet the above criteria for placing an entry check (i.e., the above expression evaluates to the minimum set C={main,check} for this call graph). Accordingly, by placing a check on entry to every procedure in this minimum set C={main,check}, the program cannot recurse indefinitely without executing checks. - Eliminating Checks at Loop Back-Edges
- In the low-overhead temporal profiling framework, the
instrumentation tool 122 also places checks at fewer than all loop back-edges in the program. In particular, the instrumentation tool 122 eliminates checks for some tight inner loops. This is because a dynamic optimizer that complements a static optimizer often finds the profiling information from tight inner loops to be of little interest, because static optimization excels at optimizing such loops. At the same time, checks at the back-edges of tight inner loops can become extremely expensive (i.e., create excessive overhead relative to the potential optimization performance gain). With the dynamic optimizer 100 that prefetches data into cache memory based on hot data streams, loops that compare or copy arrays preferably should not have checks. Such loops typically are easy to optimize statically, the check on the back-edge is almost as expensive as the loop body, and the loop body contains too little work to overlap with the prefetch. - More particularly, the
instrumentation tool 122 eliminates checks on loop back-edges of loops meeting a "k-boring loop" criterion. According to this criterion, k-boring loops are defined as loops with no calls and at most a number (k) of profiling events of interest. The instrumentation tool 122 does not instrument either version of the code of a k-boring loop, and does not place a check on its back-edge. Since the loop is not included in the instrumented code 320 (FIG. 3) version, the program image 130 does not spend an unbounded amount of time executing in instrumented code. The program image may spend an unbounded amount of time executing such a loop in uninstrumented code (checking code 330 of FIG. 3) without executing a check. But, if the k-boring loop hypothesis holds (i.e., there is little or no gain from optimizing such loops with hot data stream prefetching), the dynamic optimizer 120 does not miss interesting profiling information. Experiments have shown that the quality of the profile actually improved when back-edge checks were eliminated from 4-boring loops (i.e., k=4) in an experimental program image, where the quality of the profile is measured by the ability to detect hot data streams. Accordingly, eliminating k-boring loops from profiling helps focus sampling on more interesting events (for optimizing with hot data stream prefetching). - In alternative implementations, the
instrumentation tool 122 may eliminate additional checks on loop back-edges. For example, the instrumentation tool may eliminate back-edge checks from a loop that has only a small, fixed number of iterations. Further, if a check is always executed within a loop body, the loop does not need a check on its back-edge. In yet further alternative implementations, the instrumentation tool 122 can combine the loop counter with the profiling phase counter; if the counters are linearly related, the program image can execute checks for the loop via a predicate on the loop counter, rather than updating the profiling counter each iteration of the loop. - 3. Hot Data Stream Prefetching
- With reference now to
FIG. 7, the temporal profiling 710 using the above-described low-overhead, long burst temporal profiling framework 300 (FIG. 3) is the first phase in an overall dynamic optimization process 700 based on hot data stream prefetching. The dynamic optimization process 700 operates in three phases: profiling 710, analysis and optimization 720, and hibernation 730. First, the profiling phase collects (740) a temporal data reference profile 135 from a running program with low overhead, which is accomplished using the program image 130 (FIG. 1) structured according to the improved temporal profiling framework 300. As described in more detail below, a grammar analysis using the Sequitur compression process 750 incrementally builds an online grammar representation 900 of the traced data references. - Once sufficient data references have been traced, profiling is turned off, and the analysis and
optimization phase 720 commences. First, a fast hot data stream detection 140 extracts hot data streams 760 from the Sequitur grammar representation 900. Then, a prefetching engine 142 builds a stream prefix matching deterministic finite state machine (DFSM) 770 for these hot data streams, and dynamically injects checks at appropriate program points to detect and prefetch these hot data streams in the program image 130. This dynamic prefetching based on a DFSM is described in more detail in DYNAMIC PREFETCHING OF HOT DATA STREAMS, U.S. Pat. No. 7,058,936, issued on Jun. 6, 2006, which is hereby incorporated herein by reference. - Finally, the process enters the hibernation phase, where no profiling or analysis is performed, and the program continues to execute (780) as optimized with the added prefetch instructions. At the end of the hibernation phase, the
program image 130 is de-optimized (790) to remove the inserted checks and prefetch instructions, and control returns to the profiling phase 710. For long-running programs, this profiling 710, analysis and optimization 720, and hibernation 730 cycle may repeat multiple times. -
FIG. 8 shows a timeline 800 for the three-phase profiling, analysis and optimization, and hibernation cycle operation of the dynamic optimizer 100 (FIG. 1). As discussed above, the low-overhead, long burst temporal profiling framework uses the checking phase and profiling phase counters (nCheck, nInstr) to control its overhead and sampling rate of profiling, by transitioning between a checking phase 810, in which the program image 130 (FIG. 1) executes in its non-instrumented checking code 330 (FIG. 3), and a profiling phase 820, in which it executes in its instrumented code 320 (FIG. 3). The time periods for these checking and profiling phases are parameterized by the nCheck0 and nInstr0 counter initialization values. For example, setting nCheck0 to 9900 and nInstr0 to 100 results in a sampling rate of profiling of 100/10000=1% and a burst length of 100 dynamic checks. The time spent for one iteration of the checking and profiling phases (nCheck0+nInstr0) is referred to as a burst period 850. - For dynamic optimization, the above-described low-overhead temporal profiling framework 300 (
FIG. 3) is further extended to alternate between two additional phases, awake 830 and hibernating 840, which are controlled via two additional (awake and hibernating) counters. The temporal profiling framework starts out in the awake phase 830, and continues operating in the awake phase for a number (nAwake0) of burst periods, yielding (nAwake0×nInstr0) checks (860) worth of traced data references (or "bursts"). Then, as described above and illustrated in FIG. 7, the dynamic optimizer 100 performs the optimizations, and the profiler hibernates while the optimized program executes. This is done by setting nCheck0 to (nCheck0+nInstr0−1) and nInstr0 to 1 for the next nHibernate0 burst periods (which causes the check code 400 in FIG. 4 to keep the program image executing in the non-instrumented checking code 330), where nHibernate0 is much greater than nAwake0. When the hibernating phase 840 is over, the profiling framework is "woken up" by resetting nCheck0 and nInstr0 to their original values. - While the profiling framework is hibernating, the program image traces next to no data references and hence incurs only the basic overhead of executing the checks 400 (
FIG. 4 ). With the values of nCheck0 and nInstr0 set as described above during hibernation, the burst-periods correspond to the same time (in executed checks 860) in both awake and hibernating phases. This facilitates control over the relative length of the awake and hibernating phases by appropriately setting the initial value parameters nAwake0 and nHibernate0 of the awake and hibernating counters relative to each other. - Fast Hot Data Stream Detection
- When the
temporal profiling framework 300 executes in the instrumented code 320 (FIG. 3), the temporal profiling instrumentation produces data reference bursts, or temporal data reference sequences 135 (FIGS. 1 and 7). A data reference r is a load or store operation on a particular address, represented in the exemplary dynamic optimizer 120 as a data pair (r.pc, r.addr). The "pc" value (i.e., r.pc) is the value of the program counter, which indicates the address in the executing program of the data load or store instruction being executed. The "addr" value (i.e., r.addr) is the memory location accessed by the load or store operation. The profiled burst is a temporal sequence or stream of these data references. - During the profiling phase 710 (
FIG. 7) as discussed above, this data reference sequence is incrementally processed into a compressed "Sequitur" grammar representation 900 using the Sequitur grammar analysis processing, as described in T. M. Chilimbi, "Efficient Representations And Abstractions For Quantifying And Exploiting Data Reference Locality," Proceedings Of The ACM SIGPLAN '01 Conference On Programming Language Design And Implementation (June 2001). FIG. 9 illustrates an example of a grammar 900 produced from an input data reference sequence (input string 910). The grammar 900 represents a hierarchical structure (a directed acyclic graph 920) of the data references. - More particularly, each observed data reference (r.pc, r.addr) is conceptually represented as a symbol in a grammar, and the concatenation of the profiled bursts is a string w of symbols (910). The Sequitur grammar analysis constructs a context-free grammar for the language {w} consisting of exactly one word, the string w. The Sequitur grammar analysis runs in time O(w.length). It is incremental (one symbol can be appended at a time) and deterministic. Thus, the grammar analysis can be performed as the profiled data is sampled during the profiling phase 710 (
FIG. 7). The grammar 900 is a compressed representation of the input burst 910. Further, it is unambiguous and acyclic, in the sense that no non-terminal directly or indirectly defines itself. - In the
Sequitur grammar 900, the terminal nodes (denoted by lowercase letters) represent individual data references (r.pc, r.addr), which may be repeated in the profiled burst. The intermediate nodes (denoted by capital letters) represent temporal sequences of the data references. For example, the grammar 900 produced from the example input string 910 shows that the string S consists of the sequence "AaBB." A, in turn, consists of the data references a and b. The intermediate node B represents a sequence with two occurrences of the intermediate node C, which is a sequence of the intermediate node A and the data reference c. - After construction of the
grammar 900 in the profiling phase 710, the dynamic optimizer 100 performs a fast hot data stream detection 140 (FIGS. 1 and 7) to identify frequently recurring data reference subsequences (the "hot data streams") in the profiled bursts. For the fast hot data stream detection, the exemplary dynamic optimizer performs an analysis of the grammar as represented in a hot data stream detection code 1000 shown in FIG. 10. The purpose of the fast hot data stream analysis is to identify hot data streams, which are data reference subsequences in the profiled bursts whose regularity magnitude exceeds a predetermined "heat" threshold, H. The regularity magnitude, given a data reference subsequence v, is defined as v.heat=v.length*v.frequency, where v.frequency is the number of non-overlapping occurrences of v in the profiled bursts. - The analysis in
code 1000 is based on the observation that each non-terminal node (A) of a Sequitur grammar generates a language L(A)={wA} with just one word wA. For the fast hot data stream detection analysis, the regularity magnitude of a non-terminal A is defined instead as A.heat=wA.length*A.coldUses, where A.coldUses is the number of times A occurs in the (unique) parse tree of the complete grammar, not counting occurrences in sub-trees belonging to hot non-terminals other than A. A non-terminal A is hot iff minLen<=wA.length<=maxLen and H<=A.heat, where H is the predetermined heat threshold. The result of the analysis is the set {wA|A is a hot non-terminal} of hot data streams. -
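The heat computation described above can be sketched in Python. The grammar rules below are reconstructed from the FIG. 9 description in the text (S→AaBB, A→ab, B→CC, C→Ac); the function names and dictionary representation are assumptions for illustration, not the actual code 1000 of FIG. 10.

```python
def find_hot_data_streams(grammar, start, H, min_len, max_len):
    """Sketch of the fast hot data stream detection: expand each non-terminal
    to its unique word, number non-terminals in reverse postorder (parents
    before children), then propagate "cold uses" downward, withholding the
    uses of any non-terminal already classified as hot."""
    words = {}
    def expand(sym):
        if sym not in grammar:          # terminal: an individual data reference
            return sym
        if sym not in words:
            words[sym] = "".join(expand(s) for s in grammar[sym])
        return words[sym]
    expand(start)

    postorder, visited = [], set()
    def dfs(sym):
        if sym in visited or sym not in grammar:
            return
        visited.add(sym)
        for s in grammar[sym]:
            dfs(s)
        postorder.append(sym)
    dfs(start)
    order = postorder[::-1]             # reverse postorder numbering

    cold_uses = dict.fromkeys(grammar, 0)
    cold_uses[start] = 1
    hot = {}
    for nt in order:                    # parents are processed first
        w = words[nt]
        if min_len <= len(w) <= max_len and cold_uses[nt] * len(w) >= H:
            hot[nt] = w                 # hot: its uses are not passed down
        else:
            for s in grammar[nt]:       # cold: each child occurrence inherits
                if s in grammar:        # this node's cold uses
                    cold_uses[s] += cold_uses[nt]
    return hot, order

# Grammar reconstructed from the FIG. 9 example in the text.
grammar = {"S": ["A", "a", "B", "B"], "A": ["a", "b"],
           "B": ["C", "C"], "C": ["A", "c"]}
hot, order = find_hot_data_streams(grammar, "S", H=8, min_len=2, max_len=7)
assert order == ["S", "B", "C", "A"]    # numbering 0..3 as in the example
assert hot == {"B": "abcabc"}           # B is the single hot data stream
```

With the example's heat threshold (H=8) and length restrictions (minLen=2, maxLen=7), the sketch reproduces the outcome worked through below: S is too long to be hot, B's two cold uses of a length-6 word give heat 12, and C and A then fall short of the threshold.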
FIGS. 11 and 12 show an example 1100 of the analysis in the code 1000 (FIG. 10) for the input data reference sequence 910 and grammar 900 in FIG. 9. As a result of the Sequitur grammar analysis 750 (FIG. 7), the input data reference sequence has been parsed (as shown by parse tree 1110) and sub-sequences grouped under intermediate (non-terminal) nodes into the Sequitur grammar (1120). Further, the Sequitur grammar analysis also yields the length of the subsequence represented by each non-terminal node of the grammar 1120. Accordingly, the information shown in the first three columns (the non-terminal nodes, their children, and their lengths) of the table 1200 is provided to the fast hot data stream detection analysis. As shown in FIGS. 11 and 12, a non-terminal node is considered the child of another non-terminal node if it is listed on the right-hand side of the grammar rule of the other non-terminal node in the Sequitur grammar 900 (FIG. 9). - In the fast hot data
stream analysis code 1000, the analyzer 140 (FIG. 1) first executes instructions (1010) to perform a reverse postorder numbering of the non-terminal nodes in the grammar. For the example grammar, this results in numbering the nodes S, A, B, and C as 0, 3, 1, and 2, respectively, as shown in the index column of the table 1200 (FIG. 12) and illustrated in the reverse postorder numbering tree 1130 (FIG. 11). This results in the non-terminal nodes being numbered such that whenever a non-terminal node (e.g., node C) is a child of another non-terminal node (e.g., B), the number of the child node is greater (e.g., B.index<C.index). This property guarantees that the analysis does not visit a non-terminal node before having visited all its predecessors. - The
analyzer 140 next determines at instructions 1020 in code 1000 how often each non-terminal node occurs in the parse tree 1110 (FIG. 11), which is represented in the "use" column of the table 1200 (FIG. 12). Each of the non-terminal nodes is now associated with two values, its number of uses and its length, which are depicted conceptually in the uses:length tree 1140 (FIG. 11). - Finally, the
analyzer 140 finds the number of "cold uses" for each non-terminal node, which is the number of uses not already accounted for in the "cold uses" of a "hot" predecessor node. More specifically, the analyzer finds hot non-terminal nodes such that a non-terminal node is only considered hot if it accounts for enough of the trace on its own, where it is not part of the expansion of other hot non-terminals. In the example grammar with a heat threshold (H=8) and length restrictions (minLen=2, maxLen=7), only the non-terminal node B is considered "hot," since its "heat" (cold uses×length=2×6=12) exceeds the heat threshold (12>8). All uses of the non-terminal node C are completely subsumed by its predecessor "hot" non-terminal node B, and C therefore is not considered hot (its heat=cold uses×length=0×3=0). The non-terminal node A has a single use apart from as a subsequence of the "hot" non-terminal node B, but this single use is not sufficient to exceed the heat threshold (A's cold uses×length=1×2=2<8). The single hot non-terminal node B represents the hot data stream wB=abcabc, which accounts for 12/15=80% of all data references in this example burst. - 4. Computing Environment
-
FIG. 13 illustrates a generalized example of a suitable computing environment 1300 in which the described techniques can be implemented. The computing environment 1300 is not intended to suggest any limitation as to scope of use or functionality of the invention, as the present invention may be implemented in diverse general-purpose or special-purpose computing environments. - With reference to
FIG. 13, the computing environment 1300 includes at least one processing unit 1310 and memory 1320. In FIG. 13, this most basic configuration 1330 is included within a dashed line. The processing unit 1310 executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. The memory 1320 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory 1320 stores software 1380 implementing the dynamic optimizer 100 (FIG. 1). - A computing environment may have additional features. For example, the
computing environment 1300 includes storage 1340, one or more input devices 1350, one or more output devices 1360, and one or more communication connections 1370. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 1300. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 1300, and coordinates activities of the components of the computing environment 1300. - The
storage 1340 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment 1300. The storage 1340 stores instructions for the dynamic optimizer software 1380. - The input device(s) 1350 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the
computing environment 1300. For audio, the input device(s) 1350 may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment. The output device(s) 1360 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 1300. - The communication connection(s) 1370 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio/video or other media information, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
- The techniques herein can be described in the general context of computer-readable media. Computer-readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, with the
computing environment 1300, computer-readable media include memory 1320, storage 1340, communication media, and combinations of any of the above. - The techniques herein can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
- For the sake of presentation, the detailed description uses terms like “determine,” “generate,” “adjust,” and “apply” to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.
- In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.
Claims (20)
1. A method of detecting a hot data stream in a data reference sequence from sampled bursts of a program execution trace, the method comprising:
parsing the data reference sequence to extract a compressed grammar representation of the data reference sequence, the compressed grammar representation comprising a plurality of language elements each representing a number of occurrences of unique subsequences and related as a directed acyclic graph;
numbering the language elements according to a reverse postorder numbering;
calculating a heat measure of each language element related to a product of the length of the subsequence represented by the language element together with a number of occurrences of the subsequence represented by the language element that are not included in a heat measure of a predecessor language element according to the numbering that meets a hot criteria;
comparing the heat measure of each language element to the hot criteria; and
identifying the subsequence represented by a language element meeting the hot criteria as a hot data stream.
2. The method of claim 1 wherein the reverse postorder numbering results in non-terminal nodes being numbered such that whenever a non-terminal node is a child of another non-terminal node, a number assigned to the child node is greater.
3. The method of claim 1 wherein a data reference comprises a data pair of a program counter value which indicates an address of a data load or store instruction and a memory location accessed by the data load or store instruction.
4. The method of claim 1 wherein the hot data stream is a data reference subsequence in a profiled burst with a regularity magnitude that exceeds the hot criteria.
5. The method of claim 4 wherein the regularity magnitude of a data reference subsequence v, is defined as v.heat=v.length*v.frequency, where v.frequency is the number of non-overlapping occurrences of v in the profiled burst.
6. The method of claim 4 wherein the regularity magnitude of a non-terminal node A comprises A.heat=wA.length*A.coldUses, where A.coldUses is a number of times A occurs in a unique parse tree not counting occurrences in sub-trees belonging to hot non-terminals other than A.
7. A dynamic optimizer comprising:
a temporal profiling framework insertion tool operating to modify a program to provide instrumentation for capturing a temporal data reference sequence for sampled bursts of an execution trace of the program;
a hot data stream detector operating to parse the temporal data reference sequence to extract a compressed grammar representation of the data reference sequence, the compressed grammar representation comprising a plurality of language elements each representing a number of occurrences of unique subsequences and related as a directed acyclic graph, the hot data stream detector further numbering the language elements according to a reverse postorder numbering, the hot data stream detector further calculating a heat measure of each language element related to a product of the length of the subsequence represented by the language element together with a number of occurrences of the subsequence represented by the language element that are not included in a heat measure of a predecessor language element according to the numbering that meets a hot criteria, the hot data stream detector further comparing the heat measure of each language element to the hot criteria, and identifying the subsequence represented by a language element meeting the hot criteria as a hot data stream; and
a prefetching code injector for inserting prefetching instructions at locations in the program corresponding to occurrences of the identified hot data stream in the data reference sequence.
8. The dynamic optimizer of claim 7 wherein the temporal profiling framework insertion tool modifies an executable binary version of the program.
9. The dynamic optimizer of claim 8 wherein the modified executable binary version of the program comprises a profiling phase counter and a checking phase counter controlling transitions to and from instrumented code for sampled bursts.
10. The dynamic optimizer of claim 7 wherein the reverse postorder numbering results in numbering language elements such that when a non-terminal language element is a child of another non-terminal language element, a number assigned to the child is greater.
11. The dynamic optimizer of claim 7 operating in phases comprising a profiling phase, analysis and optimization phase, and a hibernation phase.
12. The dynamic optimizer of claim 11 wherein the profiling phase comprises sampling bursts of the execution trace of the program.
13. The dynamic optimizer of claim 11 wherein the analysis and optimization phase comprises operation of the hot data stream detector.
14. The dynamic optimizer of claim 11 wherein the hibernation phase comprises the program executing as optimized with prefetch instructions.
15. A computer-readable program carrying medium having a program carried thereon executable on a computer to perform a method of detecting a hot data stream in a data reference sequence from sampled bursts of a program execution trace, the method comprising:
parsing the data reference sequence to extract a compressed grammar representation of the data reference sequence, the compressed grammar representation comprising a plurality of language elements each representing a number of occurrences of unique subsequences and related as a directed acyclic graph;
numbering the language elements according to a reverse postorder numbering;
calculating a heat measure of each language element related to a product of the length of the subsequence represented by the language element together with a number of occurrences of the subsequence represented by the language element that are not included in a heat measure of a predecessor language element according to the numbering that meets a hot criteria;
comparing the heat measure of each language element to the hot criteria; and
identifying the subsequence represented by a language element meeting the hot criteria as a hot data stream.
16. The computer-readable program carrying medium of claim 15 wherein the reverse postorder numbering comprises non-terminal nodes being numbered such that when a non-terminal node is a child of another non-terminal node the child node is assigned a greater number.
17. The computer-readable program carrying medium of claim 15 wherein a data reference in a sequence comprises:
a program counter value which indicates an address of a data load or store instruction; and
a memory location accessed by the data load or store instruction.
18. The computer-readable program carrying medium of claim 15 wherein the hot data stream is a data reference subsequence in a profiled burst with a regularity magnitude that exceeds the hot criteria.
19. The computer-readable program carrying medium of claim 18 wherein the regularity magnitude of a data reference subsequence v comprises:
v.heat=v.length*v.frequency where v.frequency is the number of non-overlapping occurrences of v in the profiled burst.
20. The computer-readable program carrying medium of claim 18 wherein the regularity magnitude of a non-terminal node A comprises A.heat=wA.length*A.coldUses, where A.coldUses is the number of times A occurs in a unique parse tree not counting occurrences in sub-trees belonging to hot non-terminals other than A.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/539,111 US20070083856A1 (en) | 2002-11-25 | 2006-10-05 | Dynamic temporal optimization framework |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/305,056 US7140008B2 (en) | 2002-11-25 | 2002-11-25 | Dynamic temporal optimization framework |
US11/539,111 US20070083856A1 (en) | 2002-11-25 | 2006-10-05 | Dynamic temporal optimization framework |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/305,056 Division US7140008B2 (en) | 2002-11-25 | 2002-11-25 | Dynamic temporal optimization framework |
Publications (1)
Publication Number | Publication Date |
---|---|
US20070083856A1 true US20070083856A1 (en) | 2007-04-12 |
Family
ID=32325361
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/305,056 Expired - Fee Related US7140008B2 (en) | 2002-11-25 | 2002-11-25 | Dynamic temporal optimization framework |
US11/539,111 Abandoned US20070083856A1 (en) | 2002-11-25 | 2006-10-05 | Dynamic temporal optimization framework |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/305,056 Expired - Fee Related US7140008B2 (en) | 2002-11-25 | 2002-11-25 | Dynamic temporal optimization framework |
Country Status (1)
Country | Link |
---|---|
US (2) | US7140008B2 (en) |
Families Citing this family (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001069376A2 (en) * | 2000-03-15 | 2001-09-20 | Arc International Plc | Method and apparatus for processor code optimization using code compression |
US7278137B1 (en) * | 2001-12-26 | 2007-10-02 | Arc International | Methods and apparatus for compiling instructions for a data processor |
US7043682B1 (en) * | 2002-02-05 | 2006-05-09 | Arc International | Method and apparatus for implementing decode operations in a data processor |
US7140008B2 (en) * | 2002-11-25 | 2006-11-21 | Microsoft Corporation | Dynamic temporal optimization framework |
US7051322B2 (en) * | 2002-12-06 | 2006-05-23 | @Stake, Inc. | Software analysis framework |
US7114150B2 (en) * | 2003-02-13 | 2006-09-26 | International Business Machines Corporation | Apparatus and method for dynamic instrumenting of code to minimize system perturbation |
US7343598B2 (en) * | 2003-04-25 | 2008-03-11 | Microsoft Corporation | Cache-conscious coallocation of hot data streams |
US7194732B2 (en) * | 2003-06-26 | 2007-03-20 | Hewlett-Packard Development Company, L.P. | System and method for facilitating profiling an application |
US20050028148A1 (en) * | 2003-08-01 | 2005-02-03 | Sun Microsystems, Inc. | Method for dynamic recompilation of a program |
US7478371B1 (en) * | 2003-10-20 | 2009-01-13 | Sun Microsystems, Inc. | Method for trace collection |
US7587709B2 (en) * | 2003-10-24 | 2009-09-08 | Microsoft Corporation | Adaptive instrumentation runtime monitoring and analysis |
US9026467B2 (en) * | 2004-02-13 | 2015-05-05 | Fis Financial Compliance Solutions, Llc | Systems and methods for monitoring and detecting fraudulent uses of business applications |
US20050182750A1 (en) * | 2004-02-13 | 2005-08-18 | Memento, Inc. | System and method for instrumenting a software application |
US8612479B2 (en) * | 2004-02-13 | 2013-12-17 | Fis Financial Compliance Solutions, Llc | Systems and methods for monitoring and detecting fraudulent uses of business applications |
US9978031B2 (en) | 2004-02-13 | 2018-05-22 | Fis Financial Compliance Solutions, Llc | Systems and methods for monitoring and detecting fraudulent uses of business applications |
US7765534B2 (en) * | 2004-04-30 | 2010-07-27 | International Business Machines Corporation | Compiler with cache utilization optimizations |
US7971191B2 (en) * | 2004-06-10 | 2011-06-28 | Hewlett-Packard Development Company, L.P. | System and method for analyzing a process |
GB0418306D0 (en) * | 2004-08-17 | 2004-09-15 | Ibm | Debugging an application process at runtime |
US20060048114A1 (en) * | 2004-09-02 | 2006-03-02 | International Business Machines Corporation | Method and apparatus for dynamic compilation of selective code blocks of computer programming code to different memory locations |
US7716647B2 (en) * | 2004-10-01 | 2010-05-11 | Microsoft Corporation | Method and system for a system call profiler |
US7721268B2 (en) * | 2004-10-01 | 2010-05-18 | Microsoft Corporation | Method and system for a call stack capture |
US7661097B2 (en) * | 2005-04-05 | 2010-02-09 | Cisco Technology, Inc. | Method and system for analyzing source code |
US7607119B2 (en) * | 2005-04-26 | 2009-10-20 | Microsoft Corporation | Variational path profiling |
US7770153B2 (en) * | 2005-05-20 | 2010-08-03 | Microsoft Corporation | Heap-based bug identification using anomaly detection |
US8490065B2 (en) * | 2005-10-13 | 2013-07-16 | International Business Machines Corporation | Method and apparatus for software-assisted data cache and prefetch control |
US8341605B2 (en) * | 2005-12-15 | 2012-12-25 | Ca, Inc. | Use of execution flow shape to allow aggregate data reporting with full context in an application manager |
US7770163B2 (en) * | 2006-03-24 | 2010-08-03 | International Business Machines Corporation | Method of efficiently performing precise profiling in a multi-threaded dynamic compilation environment |
US20070240141A1 (en) * | 2006-03-30 | 2007-10-11 | Feng Qin | Performing dynamic information flow tracking |
US20070250820A1 (en) * | 2006-04-20 | 2007-10-25 | Microsoft Corporation | Instruction level execution analysis for debugging software |
US7818722B2 (en) * | 2006-06-09 | 2010-10-19 | International Business Machines Corporation | Computer implemented method and system for accurate, efficient and adaptive calling context profiling |
US8613080B2 (en) | 2007-02-16 | 2013-12-17 | Veracode, Inc. | Assessment and analysis of software security flaws in virtual machines |
US8095910B2 (en) * | 2007-04-10 | 2012-01-10 | Microsoft Corporation | Interruptible client-side scripts |
WO2008129635A1 (en) * | 2007-04-12 | 2008-10-30 | Fujitsu Limited | Performance failure factor analysis program and performance failure factor analysis apparatus |
US8813041B2 (en) * | 2008-02-14 | 2014-08-19 | Yahoo! Inc. | Efficient compression of applications |
CN101546287A (en) * | 2008-03-26 | 2009-09-30 | 国际商业机器公司 | Code modification method and code modification equipment |
US8065565B2 (en) * | 2008-10-03 | 2011-11-22 | Microsoft Corporation | Statistical debugging using paths and adaptive profiling |
US8549464B2 (en) | 2010-11-22 | 2013-10-01 | Microsoft Corporation | Reusing expression graphs in computer programming languages |
US8677335B1 (en) * | 2011-12-06 | 2014-03-18 | Google Inc. | Performing on-stack replacement for outermost loops |
US9286063B2 (en) | 2012-02-22 | 2016-03-15 | Veracode, Inc. | Methods and systems for providing feedback and suggested programming methods |
US9836379B2 (en) * | 2012-04-26 | 2017-12-05 | Nxp Usa, Inc. | Method and system for generating a memory trace of a program code executable on a programmable target |
JP2014075046A (en) * | 2012-10-04 | 2014-04-24 | International Business Maschines Corporation | Trace generation method, device, and program, and multilevel compilation using the method |
CN104008058B (en) * | 2014-06-16 | 2016-06-08 | 东南大学 | A kind of framework evaluation method based on prototype emulation |
US9823998B2 (en) * | 2015-12-02 | 2017-11-21 | International Business Machines Corporation | Trace recovery via statistical reasoning |
US10380347B2 (en) | 2016-06-08 | 2019-08-13 | Salesforce.Com., Inc. | Hierarchical runtime analysis framework for defining vulnerabilities |
US10140456B2 (en) | 2016-06-08 | 2018-11-27 | Salesforce.Com, Inc. | Runtime analysis of software security vulnerabilities |
US10241890B2 (en) * | 2016-07-28 | 2019-03-26 | Salesforce.Com, Inc. | Hybrid code modification in intermediate language for software application |
US11340814B1 (en) | 2017-04-27 | 2022-05-24 | EMC IP Holding Company LLC | Placing data in a data storage array based on detection of different data streams within an incoming flow of data |
CN107368319B (en) * | 2017-07-25 | 2020-09-18 | 苏州浪潮智能科技有限公司 | Method and device for realizing soft backup and switching of code base |
US11561778B1 (en) * | 2021-11-23 | 2023-01-24 | International Business Machines Corporation | Instrumentation for nested conditional checks |
- 2002
  - 2002-11-25 US US10/305,056 patent/US7140008B2/en not_active Expired - Fee Related
- 2006
  - 2006-10-05 US US11/539,111 patent/US20070083856A1/en not_active Abandoned
Patent Citations (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5220667A (en) * | 1989-09-25 | 1993-06-15 | Mitsubishi Denki Kabushiki Kaisha | Computer system |
US5333311A (en) * | 1990-12-10 | 1994-07-26 | Alsoft, Inc. | Optimizing a magnetic disk by allocating files by the frequency a file is accessed/updated or by designating a file to a fixed location on a disk |
US5774685A (en) * | 1995-04-21 | 1998-06-30 | International Business Machines Corporation | Method and apparatus for biasing cache LRU for prefetched instructions/data based upon evaluation of speculative conditions |
US5713008A (en) * | 1995-06-08 | 1998-01-27 | Sun Microsystems | Determination of working sets by logging and simulating filesystem operations |
US5950007A (en) * | 1995-07-06 | 1999-09-07 | Hitachi, Ltd. | Method for compiling loops containing prefetch instructions that replaces one or more actual prefetches with one virtual prefetch prior to loop scheduling and unrolling |
US5740443A (en) * | 1995-08-14 | 1998-04-14 | International Business Machines Corporation | Call-site specific selective automatic inlining |
US5950003A (en) * | 1995-08-24 | 1999-09-07 | Fujitsu Limited | Profile instrumentation method and profile data collection method |
US6886167B1 (en) * | 1995-12-27 | 2005-04-26 | International Business Machines Corporation | Method and system for migrating an object between a split status and a merged status |
US20020133639A1 (en) * | 1995-12-27 | 2002-09-19 | International Business Machines Corporation | Method and system for migrating an object between a split status and a merged status |
US5815720A (en) * | 1996-03-15 | 1998-09-29 | Institute For The Development Of Emerging Architectures, L.L.C. | Use of dynamic translation to collect and exploit run-time information in an optimizing compilation system |
US5925100A (en) * | 1996-03-21 | 1999-07-20 | Sybase, Inc. | Client/server system with methods for prefetching and managing semantic objects based on object-based prefetch primitive present in client's executing application |
US5909578A (en) * | 1996-09-30 | 1999-06-01 | Hewlett-Packard Company | Use of dynamic translation to burst profile computer applications |
US5953524A (en) * | 1996-11-22 | 1999-09-14 | Sybase, Inc. | Development system with methods for runtime binding of user-defined classes |
US6216219B1 (en) * | 1996-12-31 | 2001-04-10 | Texas Instruments Incorporated | Microprocessor circuits, systems, and methods implementing a load target buffer with entries relating to prefetch desirability |
US6073232A (en) * | 1997-02-25 | 2000-06-06 | International Business Machines Corporation | Method for minimizing a computer's initial program load time after a system reset or a power-on using non-volatile storage |
US5960198A (en) * | 1997-03-19 | 1999-09-28 | International Business Machines Corporation | Software profiler with runtime control to enable and disable instrumented executable |
US6026234A (en) * | 1997-03-19 | 2000-02-15 | International Business Machines Corporation | Method and apparatus for profiling indirect procedure calls in a computer program |
US6404455B1 (en) * | 1997-05-14 | 2002-06-11 | Hitachi Denshi Kabushiki Kaisha | Method for tracking entering object and apparatus for tracking and monitoring entering object |
US5940618A (en) * | 1997-09-22 | 1999-08-17 | International Business Machines Corporation | Code instrumentation system with non intrusive means and cache memory optimization for dynamic monitoring of code segments |
US6651243B1 (en) * | 1997-12-12 | 2003-11-18 | International Business Machines Corporation | Method and system for periodic trace sampling for real-time generation of segments of call stack trees |
US6148437A (en) * | 1998-05-04 | 2000-11-14 | Hewlett-Packard Company | System and method for jump-evaluated trace designation |
US6079032A (en) * | 1998-05-19 | 2000-06-20 | Lucent Technologies, Inc. | Performance analysis of computer systems |
US6628835B1 (en) * | 1998-08-31 | 2003-09-30 | Texas Instruments Incorporated | Method and system for defining and recognizing complex events in a video sequence |
US6233678B1 (en) * | 1998-11-05 | 2001-05-15 | Hewlett-Packard Company | Method and apparatus for profiling of non-instrumented programs and dynamic processing of profile data |
US6311260B1 (en) * | 1999-02-25 | 2001-10-30 | Nec Research Institute, Inc. | Method for perfetching structured data |
US6321240B1 (en) * | 1999-03-15 | 2001-11-20 | Trishul M. Chilimbi | Data structure partitioning with garbage collection to optimize cache utilization |
US6330556B1 (en) * | 1999-03-15 | 2001-12-11 | Trishul M. Chilimbi | Data structure partitioning to optimize cache utilization |
US6360361B1 (en) * | 1999-03-15 | 2002-03-19 | Microsoft Corporation | Field reordering to optimize cache utilization |
US6370684B1 (en) * | 1999-04-12 | 2002-04-09 | International Business Machines Corporation | Methods for extracting reference patterns in JAVA and depicting the same |
US20040025145A1 (en) * | 1999-05-12 | 2004-02-05 | Dawson Peter S. | Dynamic software code instrumentation method and system |
US6675374B2 (en) * | 1999-10-12 | 2004-01-06 | Hewlett-Packard Development Company, L.P. | Insertion of prefetch instructions into computer program code |
US6560693B1 (en) * | 1999-12-10 | 2003-05-06 | International Business Machines Corporation | Branch history guided instruction/data prefetching |
US6848029B2 (en) * | 2000-01-03 | 2005-01-25 | Dirk Coldewey | Method and apparatus for prefetching recursive data structures |
US6658652B1 (en) * | 2000-06-08 | 2003-12-02 | International Business Machines Corporation | Method and system for shadow heap memory leak detection and other heap analysis in an object-oriented environment during real-time trace processing |
US7181730B2 (en) * | 2000-06-21 | 2007-02-20 | Altera Corporation | Methods and apparatus for indirect VLIW memory allocation |
US6704860B1 (en) * | 2000-07-26 | 2004-03-09 | International Business Machines Corporation | Data processing system and method for fetching instruction blocks in response to a detected block sequence |
US6571318B1 (en) * | 2001-03-02 | 2003-05-27 | Advanced Micro Devices, Inc. | Stride based prefetcher with confidence counter and dynamic prefetch-ahead mechanism |
US6598141B1 (en) * | 2001-03-08 | 2003-07-22 | Microsoft Corporation | Manipulating interior pointers on a stack during garbage collection |
US20040015930A1 (en) * | 2001-03-26 | 2004-01-22 | Youfeng Wu | Method and system for collaborative profiling for continuous detection of profile phase transitions |
US7032217B2 (en) * | 2001-03-26 | 2006-04-18 | Intel Corporation | Method and system for collaborative profiling for continuous detection of profile phase transitions |
US20020144245A1 (en) * | 2001-03-30 | 2002-10-03 | Guei-Yuan Lueh | Static compilation of instrumentation code for debugging support |
US20040015897A1 (en) * | 2001-05-15 | 2004-01-22 | Thompson Carlos L. | Method and apparatus for verifying invariant properties of data structures at run-time |
US6951015B2 (en) * | 2002-05-30 | 2005-09-27 | Hewlett-Packard Development Company, L.P. | Prefetch insertion by correlation of cache misses and previously executed instructions |
US20040103408A1 (en) * | 2002-11-25 | 2004-05-27 | Microsoft Corporation | Dynamic prefetching of hot data streams |
US20040103401A1 (en) * | 2002-11-25 | 2004-05-27 | Microsoft Corporation | Dynamic temporal optimization framework |
US20040111444A1 (en) * | 2002-12-06 | 2004-06-10 | Garthwaite Alexander T. | Advancing cars in trains managed by a collector based on the train algorithm |
US20040133556A1 (en) * | 2003-01-02 | 2004-07-08 | Wolczko Mario I. | Method and apparatus for skewing a bi-directional object layout to improve cache performance |
US20040215880A1 (en) * | 2003-04-25 | 2004-10-28 | Microsoft Corporation | Cache-conscious coallocation of hot data streams |
US20050091645A1 (en) * | 2003-10-24 | 2005-04-28 | Microsoft Corporation | Adaptive instrumentation runtime monitoring and analysis |
US20050246696A1 (en) * | 2004-04-29 | 2005-11-03 | International Business Machines Corporation | Method and apparatus for hardware awareness of data types |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8046752B2 (en) | 2002-11-25 | 2011-10-25 | Microsoft Corporation | Dynamic prefetching of hot data streams |
US20050229165A1 (en) * | 2004-04-07 | 2005-10-13 | Microsoft Corporation | Method and system for probe optimization while instrumenting a program |
US7590521B2 (en) * | 2004-04-07 | 2009-09-15 | Microsoft Corporation | Method and system for probe optimization while instrumenting a program |
US20060265438A1 (en) * | 2005-05-20 | 2006-11-23 | Microsoft Corporation | Leveraging garbage collection to dynamically infer heap invariants |
US7912877B2 (en) | 2005-05-20 | 2011-03-22 | Microsoft Corporation | Leveraging garbage collection to dynamically infer heap invariants |
US7694281B2 (en) * | 2005-09-30 | 2010-04-06 | Intel Corporation | Two-pass MRET trace selection for dynamic optimization |
US20070079293A1 (en) * | 2005-09-30 | 2007-04-05 | Cheng Wang | Two-pass MRET trace selection for dynamic optimization |
US7818721B2 (en) * | 2006-02-01 | 2010-10-19 | Oracle America, Inc. | Dynamic application tracing in virtual machine environments |
US20070180439A1 (en) * | 2006-02-01 | 2007-08-02 | Sun Microsystems, Inc. | Dynamic application tracing in virtual machine environments |
US7962901B2 (en) | 2006-04-17 | 2011-06-14 | Microsoft Corporation | Using dynamic analysis to improve model checking |
US20070244942A1 (en) * | 2006-04-17 | 2007-10-18 | Microsoft Corporation | Using dynamic analysis to improve model checking |
US20080005208A1 (en) * | 2006-06-20 | 2008-01-03 | Microsoft Corporation | Data structure path profiling |
US7926043B2 (en) | 2006-06-20 | 2011-04-12 | Microsoft Corporation | Data structure path profiling |
US9697058B2 (en) | 2007-08-08 | 2017-07-04 | Oracle International Corporation | Method, computer program and apparatus for controlling access to a computer resource and obtaining a baseline therefor |
US20140026185A1 (en) * | 2008-08-13 | 2014-01-23 | International Business Machines Corporation | System, Method, and Apparatus for Modular, String-Sensitive, Access Rights Analysis with Demand-Driven Precision |
US9858419B2 (en) * | 2008-08-13 | 2018-01-02 | International Business Machines Corporation | System, method, and apparatus for modular, string-sensitive, access rights analysis with demand-driven precision |
US20100083236A1 (en) * | 2008-09-30 | 2010-04-01 | Joao Paulo Porto | Compact trace trees for dynamic binary parallelization |
US8332558B2 (en) | 2008-09-30 | 2012-12-11 | Intel Corporation | Compact trace trees for dynamic binary parallelization |
US20100095278A1 (en) * | 2008-10-09 | 2010-04-15 | Nageshappa Prashanth K | Tracing a calltree of a specified root method |
US8347273B2 (en) * | 2008-10-09 | 2013-01-01 | International Business Machines Corporation | Tracing a calltree of a specified root method |
US20120109639A1 (en) * | 2009-01-20 | 2012-05-03 | Oracle International Corporation | Method, computer program and apparatus for analyzing symbols in a computer system |
US8825473B2 (en) * | 2009-01-20 | 2014-09-02 | Oracle International Corporation | Method, computer program and apparatus for analyzing symbols in a computer system |
US20150100584A1 (en) * | 2009-01-20 | 2015-04-09 | Oracle International Corporation | Method, computer program and apparatus for analyzing symbols in a computer system |
US9600572B2 (en) * | 2009-01-20 | 2017-03-21 | Oracle International Corporation | Method, computer program and apparatus for analyzing symbols in a computer system |
US8504984B1 (en) * | 2009-05-29 | 2013-08-06 | Google Inc. | Modifying grammars to correct programming language statements |
US20110004869A1 (en) * | 2009-07-02 | 2011-01-06 | International Business Machines Corporation | Program, apparatus, and method of optimizing a java object |
US8479182B2 (en) | 2009-07-02 | 2013-07-02 | International Business Machines Corporation | Program, apparatus, and method of optimizing a java object |
US8336039B2 (en) * | 2009-07-02 | 2012-12-18 | International Business Machines Corporation | Program, apparatus, and method of optimizing a Java object |
US8510721B2 (en) | 2010-08-25 | 2013-08-13 | Microsoft Corporation | Dynamic calculation of sample profile reports |
US10346222B2 (en) | 2010-11-30 | 2019-07-09 | Microsoft Technology Licensing, Llc | Adaptive tree structure for visualizing data |
US8941657B2 (en) | 2011-05-23 | 2015-01-27 | Microsoft Technology Licensing, Llc | Calculating zoom level timeline data |
US20140195788A1 (en) * | 2013-01-10 | 2014-07-10 | Oracle International Corporation | Reducing instruction miss penalties in applications |
US8978022B2 (en) * | 2013-01-10 | 2015-03-10 | Oracle International Corporation | Reducing instruction miss penalties in applications |
US20200065076A1 (en) * | 2018-08-23 | 2020-02-27 | Oracle International Corporation | Method for performing deep static analysis with or without source code |
US10768913B2 (en) * | 2018-08-23 | 2020-09-08 | Oracle International Corporation | Method for performing deep static analysis with or without source code |
Also Published As
Publication number | Publication date |
---|---|
US20040103401A1 (en) | 2004-05-27 |
US7140008B2 (en) | 2006-11-21 |
Similar Documents
Publication | Title |
---|---|
US7140008B2 (en) | Dynamic temporal optimization framework |
Chilimbi et al. | Dynamic hot data stream prefetching for general-purpose programs |
Hirzel et al. | Bursty tracing: A framework for low-overhead temporal profiling |
US7607119B2 (en) | Variational path profiling |
Chen et al. | Data dependence profiling for speculative optimizations |
US8046752B2 (en) | Dynamic prefetching of hot data streams |
Larus | Whole program paths |
US9946523B2 (en) | Multiple pass compiler instrumentation infrastructure |
Arnold et al. | Online feedback-directed optimization of Java |
US5530964A (en) | Optimizing assembled code for execution using execution statistics collection, without inserting instructions in the code and reorganizing the code based on the statistics collected |
US11003428B2 (en) | Sample driven profile guided optimization with precise correlation |
JP3790683B2 (en) | Computer apparatus, exception handling program thereof, and compiling method |
Nagpurkar et al. | Online phase detection algorithms |
Moseley et al. | Identifying potential parallelism via loop-centric profiling |
Georges et al. | Method-level phase behavior in Java workloads |
Moseley et al. | Loopprof: Dynamic techniques for loop detection and profiling |
Watterson et al. | Goal-directed value profiling |
Choi et al. | Design and experience: Using the Intel Itanium 2 processor performance monitoring unit to implement feedback optimizations |
Mutlu et al. | Address-value delta (AVD) prediction: A hardware technique for efficiently parallelizing dependent cache misses |
Zhou et al. | Code size efficiency in global scheduling for ILP processors |
Porto et al. | Trace execution automata in dynamic binary translation |
Mysore et al. | Profiling over adaptive ranges |
Becker et al. | Optimizing worst-case execution times using mainstream compilers |
Feigin | A Case for Automatic Run-Time Code Optimization |
Kaplow et al. | Program optimization based on compile-time cache performance prediction |
Legal Events
Code | Title | Description |
---|---|---|
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001; Effective date: 20141014 |