WO2001022213A2 - Optimized bytecode interpreter of virtual machine instructions - Google Patents


Info

Publication number
WO2001022213A2
WO2001022213A2 PCT/EP2000/008976
Authority
WO
WIPO (PCT)
Prior art keywords
bytecodes
bytecode
implementation
interpreter
virtual machine
Prior art date
Application number
PCT/EP2000/008976
Other languages
French (fr)
Other versions
WO2001022213A3 (en)
Inventor
Fabio Riccardi
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to EP00966006A priority Critical patent/EP1183598A2/en
Priority to KR1020017006400A priority patent/KR20010080525A/en
Priority to JP2001525514A priority patent/JP2003510681A/en
Publication of WO2001022213A2 publication Critical patent/WO2001022213A2/en
Publication of WO2001022213A3 publication Critical patent/WO2001022213A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45516Runtime code conversion or optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/451Stack data

Definitions

  • Macros have to be made out of sequences of simple bytecodes, since there is no point in reducing the dispatching cost of complex ones.
  • Macros must not contain instructions that are possible branch targets, otherwise one would have to radically change the VM execution mechanism.
  • A macro itself can be a branch target.
  • Macros must terminate with control statements or method calls, since the cost of a native branch is equivalent to that of a dispatch operation.
  • The maximal length of a macro should be approximately 15 bytecodes,
  • the "natural" average macro length being 4-5 bytecodes. From these criteria it is very simple to construct such macro sequences, taking very little - and bounded - CPU time. A simple scan of a method's bytecode is indeed enough, and most of the parsing can be table-driven and single-bytecode based.
  • A two-byte representation can be used for the new bytecodes representing the new macro-instructions.
  • The operands of the original sequence are grouped right after the new sequence, which leaves them easily accessible by incrementing the program counter of the virtual machine.
  • Macros can be constructed by simply cutting and pasting together the binary code produced by the compiler for the threaded code interpreter. Macros are just considered as normal bytecodes by the threading dispatcher.
  • FIG. 2 summarizes the preferred embodiment of a virtual machine according to the invention.
  • The VM is implemented to load programs containing bytecodes to be interpreted by the VM interpreter.
  • The invention brings some additional advantages.
  • The number of processor branch instructions can also be reduced by about a factor of five. Since the code to be executed has been linearized, the performance of the processor's pipeline and memory subsystem may be significantly improved. The actual gain depends on the architecture of the processor for the cost of a pipeline stall and on the memory subsystem architecture for the cost of a cache line fill. On "memory challenged" systems, like most embedded applications, these costs are quite high and definitively worth reducing.
  • The residual dispatching cost essentially depends on the control statements present in the Java code. To fully translate the bytecode into binary code, as in classical dynamic recompilation, branch statements would have to be introduced in the executable code. This would have more or less the same cost as the remaining dispatches that are left.
  • Macros are generic sequences of bytecodes, and the probability that one of these sequences can be found elsewhere, in the context of another process or even in the same process, is quite high. Tests were made for Java bytecodes. It was found that a significant part of the macros can be reused. Therefore, by taking the reuse factor into account, the memory footprint used by the macro code implementation may be reduced. A full translation to binary code would consume at least twice as much memory, and would very likely have only a negligible performance advantage. For instance, even assuming that it were possible to further cut the cost of dispatching by another factor of two, the total observable increment in speed would be very small. Most likely, it is not worth trading against the doubling of the memory footprint.
  • Another advantage of macros is that they do not have any impact on the normal bytecode dispatching mechanism. There is no need to add another execution mechanism to those already existing in the VM. There is no need to distinguish between compiled and non-compiled processes, and no need to resort to the intricacies and overhead of native code interfaces.
  • Object-oriented languages like Java are characterized by the presence of very small units of code. Java processes are also very hard to inline, since they are almost always potentially polymorphic. Therefore, even if a fully optimizing compiler were able to better map the process execution semantics onto the underlying processor architecture, the overhead of the preamble and conclusion of binary-translated processes would often cancel any advantage.
  • A stack caching technique can be used, which keeps the first three locations of the Java stack inside the processor's register file, considerably reducing the number of memory accesses.
  • The technique exploits the fact that the target processor is a stack machine itself.
  • The original bytecode implementations are substituted with equivalent processor instruction sequences.
  • The cost reduction of memory input/output will now be described, in the case of Java as an example, according to another alternative embodiment of the invention.
  • An example of a receiver according to the invention is shown in fig. 3. It is a set top box receiver 20 for interactive video transmission. It comprises a decoder, e.g. compatible with the MPEG-2 (Moving Pictures Experts Group, ISO/IEC 13818-2) recommendation, for receiving via a cable transmission channel 23 an encoded signal from a video transmitter 24 and for decoding the received signal in order to retrieve the transmitted data to be displayed on a video display 25.
  • The functions of the set top box can be efficiently implemented in software using a system that executes an interpreted language such as Java in the form of bytecodes.
  • The system comprises a main processor CPU and a memory MEM for storing software code portions representing instructions for causing the main processor CPU to carry out the methods according to the invention as described in figure 1 or 2.
  • The set top box 20 can receive Java applications containing bytecodes as part of the received signal.
  • The set top box would comprise a loader to load the bytecode-based programs received from a remote sender.

Abstract

The invention relates to a method of optimizing interpreted programs, in a virtual machine interpreter of a bytecode-based language, comprising means for dynamically reconfiguring said virtual machine with macro operation codes by replacing an original sequence of simple operation codes with a new sequence of said macro operation codes. The virtual machine interpreter is coded as an indirect threading interpreter thanks to a translation table containing the implementation addresses of the operation codes for translating the bytecodes into the operation codes' implementation addresses. Application : embedded systems using any bytecode-based programming language, set top boxes for interactive video transmissions.

Description

Optimized bytecode interpreter of virtual machine instructions
FIELD OF THE INVENTION
The invention relates to run-time optimization of interpreted programs. It relates, more particularly, to a method for optimizing interpreted programs by means of a virtual machine which dynamically reconfigures itself with new macro operation codes. The invention applies to any bytecode-based programming language.
BACKGROUND OF THE INVENTION
Bytecode-based languages with programmer-visible stacks are popular as intermediate languages for compilers, and also as machine-independent executable program representations. They offer significant advantages for network computing. The article
"Optimizing direct threaded code by selective in-lining", by I. Piumarta and F. Riccardi, in Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 17, 1998, pp. 291-300, describes a technique as mentioned in the opening paragraph, for optimizing interpreted programs. A virtual machine (VM) is used to interpret the programs thanks to a VM interpreter. The VM is a software implementation representing the architecture of a virtual processor on which applications especially compiled for this architecture are executed. The instructions of the virtual processor / machine are called bytecodes. The VM interpreter is the part of the VM which implements the bytecodes' execution mechanism. The bytecodes are said to be interpreted by the VM interpreter. The bytecodes' execution mechanism is commonly implemented as an infinite loop with a switch-case block. The technique described in the cited article applies to direct threaded interpreters. Threaded code interpreters execute the bytecodes in line : each bytecode translation contains the reference to the next bytecode. Therefore, the bytecode translation as executed by a threaded interpreter does not involve the infinite loop. Even though threaded interpreters offer a performance advantage, they require too much memory to be convenient for most embedded systems. In a direct threaded code interpreter, as described in the cited article, the VM bytecodes are represented by the address of their implementation, so that each bytecode can directly jump to the implementation of the next bytecode. A table is initialized before the translation operation with the addresses of each bytecode of the application, so that, when the bytecode translation takes place, the physical addresses of the bytecode implementations are quickly accessible. The table makes it possible to switch from one bytecode to another.
Direct threaded interpreters are rather fast but they involve code expansion. By changing bytecodes into direct threaded codes, the code size is increased by approximately 150%, because the operation codes are replaced with the addresses of their implementation code. In general, addresses need 4 bytes whereas operation codes need only 1 byte. Therefore, direct threaded interpreters increase memory consumption and are thus not very suitable for embedded systems.
SUMMARY OF THE INVENTION
It is an object of the invention to provide a method for optimizing the run-time of interpreted programs which is very convenient for embedded systems. Such systems may be, for example, satellite or cable transmission systems embedded into a digital video receiver, often called a set top box. But the invention also applies to any product whose operating system is based on a bytecode-based programming language. The invention also makes it possible to save memory and CPU resources and can improve the performance of the system.
In accordance with the invention, there is provided a method of optimizing interpreted programs in a virtual machine interpreter of a bytecode-based language, wherein the virtual machine dynamically reconfigures itself with new macro bytecodes (or opcodes) replacing sequences of simple bytecodes, and wherein the virtual machine interpreter is coded as a threaded code interpreter for translating the bytecodes into their implementation codes. The threaded code interpreter according to the invention is coded as an indirect threaded code interpreter thanks to a reference table which contains the implementation addresses of the bytecodes, so that during translation of a bytecode, the address of the next bytecode is retrieved in order to be able to jump to the next bytecode.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention and additional features, which may be optionally used to implement the invention, are apparent from and will be elucidated with reference to the drawings described hereinafter.
Fig. 1 is a block diagram illustrating the features of a method according to the invention. Fig. 2 is a block diagram illustrating the features of a method according to the preferred embodiment of the invention.
Fig. 3 is a schematic diagram illustrating an example of a receiver according to the invention.
DETAILED DESCRIPTION OF THE INVENTION
The invention will now be explained in greater detail, taking the Java language as an example, to illustrate a novel run-time optimization strategy applicable to any bytecode-based language. The approach normally taken by Just-In-Time (JIT) compilers is to discard the Java virtual machine (VM) interpreter altogether and to translate the application's bytecode into native machine code prior to its execution (hence the Just-In-Time denomination). This process involves understanding the original application's semantics and re-expressing it in a more convenient native form. While this may be an efficient way of attaining performance, it comes at the expense of a very large memory consumption on the one hand, because a bytecode-based language is more compact than native code, and of large CPU (Central Processing Unit) resources on the other hand, because re-mapping Java bytecodes onto the target machine is not an easy task.
The invention is also based on some sort of dynamic code generation, but its goal is not that of translating the application's Java bytecode into native machine code, but rather to dynamically adapt the Java VM to the execution of the application's specific bytecode sequences. The original application's Java bytecode is thus preserved, while the VM is dynamically enriched with novel bytecodes or operation codes (opcodes) improving its execution efficiency. There are several advantages to this approach :
It does not increase the size of the executable code : the application is left in Java's memory-efficient bytecoded representation,
The VM's execution mechanism is economical : there is only one execution mechanism, therefore the VM executing the application will not have to deal with multiple code representations, which contributes to reducing its size and improving its reliability,
The code generation technique is rather simple : the VM optimizer has a very simple structure, and the application's bytecode analysis is a one-pass, table-driven procedure taking very little CPU resources, which directly drives the synthesis of new bytecodes. These properties make the invention suitable for embedded applications. The foundation of the optimization technique according to the invention lies in the study of the costs of the very basic mechanisms of an interpreter with respect to a category of "typical" applications. The relevance of the application's profile lies in the potential benefit attainable from the various optimization techniques that might be envisaged. Since the target is embedded applications, what might be defined as "typical" applications are, for example, control applications, graphical user interfaces, and so forth.
It is assumed that the target applications are well mapped on the primitives offered by the underlying VM (object manipulations). Therefore, they will not benefit much from radical code transformations, but rather from a general improvement of the VM's execution mechanisms. To understand how to improve the efficiency of the VM, use was made of Amdahl's law. In the version stated by Hennessy and Patterson, Amdahl's law is expressed as follows : "the performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the faster mode can be used", or more synthetically : "make the common case fast".
An interpreter's performance depends on the representation chosen for the executable code and on the mechanism used to dispatch the bytecodes. The first approach to reducing the implementation cost is to reduce the cost of instruction dispatching, because the heart of an interpreter is its instruction dispatching mechanism. The typical interpreter, called a pure bytecode interpreter, is implemented like a processor simulation : a large switch statement sitting in an endless loop, dispatching instructions to their implementations. Therefore, the inner loop of a pure bytecode interpreter is very simple : fetch the next bytecode and dispatch to the implementation using a switch statement. The interpreter is an infinite loop containing a switch statement to dispatch successive bytecodes, and passes control to the next bytecode by breaking out of the switch back to the start of the infinite loop. The following set of instructions illustrates an implementation of a typical bytecode interpreter.
for (;;) {
    op = *pc++;
    switch (op) {
    case op_1 :
        // op_1's implementation
        break;
    case op_2 :
        // op_2's implementation
        break;
    case op_3 :
        // op_3's implementation
        break;
    }
}
Assuming the compiler optimizes the jump chains from the breaks through the implicit jump at the end of the loop back to its beginning, the overheads associated with this approach are as follows : increment the instruction pointer pc, fetch the next bytecode from memory, perform a redundant range check on the argument to switch, fetch the address of the destination case label from a table, jump to that address, and, at the end of each bytecode, jump back to the start of the loop to fetch the next bytecode.
In this case the cost of instruction dispatching, ignoring all other sources of inefficiency such as the actual implementation of the switch statement, consists of 2 memory accesses (one to retrieve the value of the next instruction, one to retrieve the address of the instruction's implementation) plus 2 branches (one to jump to the bytecode's implementation and another one to go back to the beginning of the loop). Jumps are among the most expensive instructions on modern architectures. Pure bytecode interpreters are easy to write and to understand. They are also highly portable, but rather slow. They are thus not convenient for embedded systems. In the case where most bytecodes perform simple operations, as in the example illustrated herein before, most of the execution time is wasted in performing the dispatch. Actually, in order to be aware of the real cost of the mechanism, it should be compared with the cost of the execution of a single bytecode. Java bytecodes have very low-level semantics, and their implementation is often trivial. Therefore, the most commonly executed bytecodes are actually less expensive than the dispatching mechanism itself.
A first improvement in efficiency according to the invention is the adoption of indirect threaded code, as illustrated with the set of instructions below :
Op_1_lbl :
    // op_1's implementation
    goto *opcode_table[*pc++];
Op_2_lbl :
    // op_2's implementation
    goto *opcode_table[*pc++];
Op_3_lbl :
    // op_3's implementation
    goto *opcode_table[*pc++];
where Op_1_lbl, Op_2_lbl and Op_3_lbl represent 3 different operation codes interpreted by the VM interpreter.
According to this implementation, called indirect threaded code, the VM is coded as an indirect threaded code interpreter. During bytecode translation, the address of the next bytecode is resolved. A reference table, denoted opcode_table, contains the bytecodes' implementation addresses. The reference table is indexed by the bytecode fetched through the program counter (*pc++). For each bytecode translation, the address of the next bytecode is retrieved in order to jump to the next bytecode. In this way each bytecode implementation directly jumps to the next bytecode implementation; this saves one branch, the outer loop, and the unnecessary inefficiency of the switch statement's implementation (range checking and default case handling). According to a preferred embodiment of the invention, the translation is carried out by exploiting unused bytecodes of the bytecode-based language VM specification.
The block diagram of figure 1 summarizes the main steps of the method according to the invention for translating a bytecode, e.g. the bytecode bipush, into native instructions with an indirect threaded code interpreter :
step K0 = BIPUSH : beginning of the method of translating the bytecode bipush, which consists of putting a ½ word on a stack, the ½ word being the bipush parameter (par) ;
step K1 = PAR : retrieve the bipush parameter (par) ;
step K2 = PUT : put the bipush parameter on the stack ;
step K3 = GOTO : go to the next bytecode (goto opcode_table (*pc)) by looking into a reference table opcode_table containing the address of the next bytecode's implementation.
The adoption of threaded code by itself can double the VM's performance, but as will be seen in the following, it can also offer other interesting optimization opportunities. A statistical analysis of Java's bytecodes shows that, on average, there is a branch about every 5-6 instructions. On any modern CPU, branches are intrinsically expensive instructions, since they can cause pipeline stalls and/or trigger external bus activity. Apart from loop unrolling or method call in-lining, there is not much that can really be done about it. Even when recompiling the code into a native representation, the control statements will still be there.
Recent studies on CPU utilization for object-oriented applications on high-end workstations show that the CPU can spend as much as 70% of its clock cycles recovering from pipeline stalls, as the effect of mispredicted branch statements, and waiting for data and instructions from main memory (cache misses). Additionally, CPUs available in embedded systems normally have very small caches, no hardware assistance for dynamic branch prediction, and slow and/or narrow memory interfaces with no L2 caches. These additional constraints reduce CPU utilization and performance even further.
Java bytecodes can be separated into two categories : simple operation codes (loads, stores, arithmetic and control statements) and complex operation codes (memory management, synchronization, etc.).
Simple bytecodes are typically less expensive than the dispatching mechanism itself. Complex bytecodes are instead much more expensive, the dispatching cost representing only a minimal fraction of their total execution cost. Simple bytecodes are also executed much more frequently (by about an order of magnitude) than complex ones, implying that a classical Java interpreter spends most of its time dispatching bytecodes rather than really doing anything useful. It is thus assumed that it would be definitively more effective to reduce the dispatching cost for simple bytecodes than for complex ones. Translating bytecodes into indirect threaded code also gives the opportunity to make arbitrary transformations on the executable code. One such transformation is to detect common sequences of bytecodes and translate them into a single threaded "macro code". This macro code performs the work of the entire sequence of original bytecodes. Therefore, according to a preferred embodiment of the invention, it is proposed to replace sequences of simple bytecodes by equivalent "macro codes". For example, as presented in the cited article, the bytecodes "push literal, push variable, add, store variable" can be translated into a single "add-literal-to-variable" macro code in the indirect threaded code. Such optimizations are effective because they avoid the overhead of the multiple dispatches implied by the original bytecodes, which are elided within the macro code. A single macro code translated from a sequence of N original bytecodes avoids N-1 bytecode dispatches at execution time. More details about how to build macro codes can be found in the cited article. Such macro codes will have to satisfy the following criteria :
Macros have to be made out of sequences of simple bytecodes, since there is no point in reducing the dispatching cost of complex ones.
Macros must not contain instructions that are possible branch targets, otherwise one would have to radically change the VM execution mechanism. A macro itself can be a branch target.
Macros must terminate with control statements or method calls, since the cost of a native branch is equivalent to that of a dispatch operation.
For implementation simplicity, the maximal length of a macro should be limited to approximately 15 bytecodes, the "natural" average macro length being 4-5 bytecodes. From these criteria it is very simple to construct such macro sequences, taking very little -and bounded- CPU time. A simple scan of a method's bytecode is indeed enough, and most of the parsing can be table driven and single-bytecode based.
According to a particular alternative of the preferred embodiment, which takes into account that unused bytecodes are very few (30-40 on average) a two-byte representation can be used for the new bytecodes representing the new macro-instruction. The operands of the original sequence are grouped right after the new sequence, which leaves them easily accessible by incrementing the program counter of the virtual machine.
Once a process is scanned, macros can be constructed by simply cutting and pasting together the binary code produced by the compiler for the threaded code interpreter. Macros are just considered as normal bytecodes by the threading dispatcher.
Figure 2 summarizes the preferred embodiment of a virtual machine according to the invention. The VM is implemented to load programs containing bytecodes to be interpreted by the VM interpreter. The main steps of the method are the following : step K0= INIT ; initialization of the procedure executed by the VM by loading the programs containing the bytecodes, step K1= OPCODE ; retrieve the bytecodes to be interpreted, step K2= MACRO ; replacement of sequences of simple bytecodes with macro bytecodes, step K3= TRANS ; interpretation of the macro bytecodes using the indirect threaded interpreter method as described in figure 1, step K4= RES ; get the result, end of the method. Statistical analysis performed on execution traces of actual Java applications shows that the typical macro length is 4-5 bytecodes, and that, after the code transformation, macros can be executed up to five times more often than the remaining bytecodes. The remaining bytecodes are those whose implementation is just too complex to be worth in-lining and those which are left behind by the branch target analysis. The total bytecode dispatching cost may thus be reduced by more than a factor of four. If the dispatching cost originally constituted about 50 % of the total execution cost, it can be significantly reduced by using the invention.
The invention brings some additional advantages. The number of processor branch instructions can also be reduced by about a factor of five. Since the code to be executed has been linearized, the performance of the processor's pipeline and memory subsystem may be significantly improved. The actual gain depends on the architecture of the processor for the cost of a pipeline stall and on the memory subsystem architecture for the cost of a cache line fill. On "memory challenged" systems, like most embedded applications, these costs are quite high and definitively worth reducing. The residual dispatching cost essentially depends on the control statements present in the Java code. To fully translate the bytecode into binary code, as in classical dynamic recompilation, branch statements would have to be introduced in the executable code. This would have more or less the same cost as the remaining dispatches that we are left with. One of the advantages of macros is that they are generic sequences of bytecodes, and that the probability that such a sequence can be found elsewhere in the context of another process, or even in the same process, is quite high. Tests were made for Java bytecodes. It was found that a significant part of the macros can be reused. Therefore, by taking into account the reuse factor, the memory footprint of the macro code implementation may be reduced. A full translation to binary code would consume at least twice as much memory, and would very likely have only a negligible performance advantage. For instance, assuming that it were possible to further cut the cost of dispatching by another factor of two, the total observable increase in speed would be very small. Most likely, it is not worth trading against the doubling of memory footprint. Another advantage of macros is that they do not have any impact on the normal bytecode dispatching mechanism. There is no need to add another execution mechanism to those already existing in the VM.
There is no need to distinguish between compiled and non-compiled processes, and no need to resort to the oddities and overhead of native code interfaces. Object-oriented languages like Java are characterized by the presence of very small units of code. Java processes are also very hard to inline, since they are almost always potentially polymorphic. Therefore, even if a fully optimizing compiler were able to better map the process execution semantics onto the underlying processor architecture, the overhead of the preamble and conclusion of binary-translated processes would often cancel out any advantage.
To improve execution efficiency, a stack caching technique can be used, which keeps the first three locations of the Java stack inside the processor's register file, considerably reducing the number of memory accesses. The technique exploits the fact that the target processor is a stack machine itself. The original bytecode implementations are substituted with equivalent processor instruction sequences. By using a trivial translation table and a simple cost function (number of memory references), a very fast and efficient compilation technique can be achieved. The reduction of the cost of memory input/output will now be described, in the case of Java as an example, according to another alternative embodiment of the invention.
Java is a stack-based language: bytecodes communicate with each other using memory. Every single bytecode execution implies at least one memory access, which turns out to be very expensive. Consider, for instance, the following simple expression : c = a + b; In a stack-based language it is translated into :
Push a : 1 read, 1 write
Push b : 1 read, 1 write
Add : 2 reads, 1 write
Store c : 1 read, 1 write
which represents nine memory access operations. A CPU with a minimum of internal state can do the same with only three memory accesses. Considering the fact that on a modern processor architecture memory references are among the most expensive operations, this is an ideal field of optimization. With a little additional coding effort, a version of the Java bytecodes can be made to exchange data through machine registers instead of through external memory. Macros can then be created, starting from these specialized bytecodes, which are called strands, reducing the number of memory accesses within a macro by more than a factor of two.
An implementation of the "macroizer" and of the bytecode "strandifier" would not need too many lines of code. The partial rewrite of the interpreter's loop can be estimated, for example, at a few thousand lines of C code. Only a few lines of assembly are necessary for the implementation of the indirect threaded code dispatcher, and a few hundred are dedicated to the "strandifier".
Tests and measurements of the running time have been made which do not take into account the time spent on bytecode parsing and on the generation of the new macro bytecodes. Nevertheless the run-time was measured using a native code profiler. When running a large application, like a web browser, the total time spent on "macroization" remains limited to a very small percentage of the total execution time.
An example of a receiver according to the invention is shown in fig. 2. It is a set top box receiver 20 for interactive video transmission. It comprises a decoder, e.g. compatible with the MPEG 2 (Moving Pictures Experts Group, ISO/IEC 13818-2) recommendation, for receiving via a cable transmission channel 23 an encoded signal from a video transmitter 24 and for decoding the received signal in order to retrieve the transmitted data to be displayed on a video display 25. The functions of the set top box can be efficiently implemented in software using a system that executes an interpreted language such as Java in the form of bytecodes. The system comprises a main processor CPU and a memory MEM for storing software code portions representing instructions for causing the main processor CPU to carry out the methods according to the invention as described in figure 1 or 2.
According to another embodiment of the invention, the set top box 20 can receive Java applications containing bytecodes as part of the received signal. In this case, the set top box would comprise a loader to load the bytecode-based programs received from a distant sender.
CLAIMS:
1. A method of optimizing interpreted programs in a virtual machine interpreter of a bytecode-based language, wherein the virtual machine dynamically reconfigures itself by replacing an original sequence of simple bytecodes with a new sequence of macro bytecodes and wherein the virtual machine interpreter is coded as a threaded code interpreter for translating the bytecodes into their implementation code, comprising a reference table which contains references to the addresses of the implementation of the bytecodes in order that during translation of the current bytecode, the address of the implementation of the next bytecode is retrieved to be able to jump to the next bytecode.
2. A method according to claim 1 , wherein the bytecodes of the original sequence are grouped after the new sequence of said macro operation codes.
3. A method according to any of claims 1 or 2, wherein the virtual machine interpreter comprises a predetermined set of bytecodes, some of which are unused, and wherein said new sequence of macro operation codes is implemented by exploiting said unused bytecodes.
4. A method according to claim 3, wherein the unused bytecodes are encoded with at least a two-byte representation.
5. A method of optimizing interpreted programs, in a virtual machine for a bytecode-based language, comprising the following steps : initialization by loading programs containing the bytecodes, replacement of sequences of simple bytecodes with macro codes, interpretation of the macro bytecodes using an indirect threaded interpreter for translating the bytecodes into their implementation code, comprising a reference table which contains references to the addresses of the implementation of the bytecodes in order that during interpretation of the current bytecode, the address of the implementation of the next bytecode is retrieved to be able to jump to the next bytecode.
6. A computer program product for being loaded into a memory, comprising a set of instructions for causing a processor to carry out the method according to any one of claims 1 to 5.
7. A receiver for receiving transmission signals, the receiver comprising a processor (CPU) and a memory (MEM) for storing software code portions representing instructions for causing the processor to carry out the method according to any one of claims 1 to 5.
8. A method of making available for downloading a computer program comprising instructions for executing the method as claimed in any one of the claims 1 to 5, into a receiver as claimed in claim 7.
PCT/EP2000/008976 1999-09-21 2000-09-13 Optimized bytecode interpreter of virtual machine instructions WO2001022213A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00966006A EP1183598A2 (en) 1999-09-21 2000-09-13 Optimized bytecode interpreter of virtual machine instructions
KR1020017006400A KR20010080525A (en) 1999-09-21 2000-09-13 Optimized bytecode interpreter of virtual machine instructions
JP2001525514A JP2003510681A (en) 1999-09-21 2000-09-13 Optimized bytecode interpreter for virtual machine instructions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP99402309.1 1999-09-21
EP99402309 1999-09-21

Publications (2)

Publication Number Publication Date
WO2001022213A2 true WO2001022213A2 (en) 2001-03-29
WO2001022213A3 WO2001022213A3 (en) 2001-11-29

Family

ID=8242118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/008976 WO2001022213A2 (en) 1999-09-21 2000-09-13 Optimized bytecode interpreter of virtual machine instructions

Country Status (5)

Country Link
EP (1) EP1183598A2 (en)
JP (1) JP2003510681A (en)
KR (1) KR20010080525A (en)
CN (1) CN1173262C (en)
WO (1) WO2001022213A2 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2828296A1 (en) * 2001-08-03 2003-02-07 Trusted Logic Method for compression of interpreted object code by factorization of arborescent expressions so that program code can be stored more efficiently in available memory space
WO2003019366A1 (en) * 2001-08-30 2003-03-06 Gemplus Compression of a programme in intermediate language
WO2003019368A2 (en) * 2001-08-24 2003-03-06 Sun Microsystems, Inc Frameworks for generation of java macro instructions for storing values into local variables
WO2003019367A1 (en) * 2001-08-24 2003-03-06 Sun Microsystems, Inc., Replacing java bytecode sequences by macro instructions
EP1308838A2 (en) * 2001-10-31 2003-05-07 Aplix Corporation Intermediate code preprocessing apparatus, intermediate code execution apparatus, intermediate code execution system, and computer program product for preprocessing or executing intermediate code
US6957428B2 (en) 2001-03-27 2005-10-18 Sun Microsystems, Inc. Enhanced virtual machine instructions
US6996813B1 (en) 2000-10-31 2006-02-07 Sun Microsystems, Inc. Frameworks for loading and execution of object-based programs
US7058934B2 (en) 2001-08-24 2006-06-06 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions for instantiating Java objects
US7096466B2 (en) 2001-03-26 2006-08-22 Sun Microsystems, Inc. Loading attribute for partial loading of class files into virtual machines
US7228533B2 (en) 2001-08-24 2007-06-05 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions for performing programming loops
US7543288B2 (en) 2001-03-27 2009-06-02 Sun Microsystems, Inc. Reduced instruction set for Java virtual machines
WO2011008856A3 (en) * 2009-07-14 2011-03-24 Unisys Corporation Systems, methods, and computer programs for dynamic binary translation in an interpreter
CN110262533A (en) * 2019-06-25 2019-09-20 哈尔滨工业大学 A kind of method, apparatus and computer storage medium based on hierarchical task network planning modular reconfigurable satellite via Self-reconfiguration

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7275240B2 (en) * 2003-01-08 2007-09-25 Microsoft Corporation Method and system for recording macros in a language independent syntax
CN100356326C (en) * 2003-03-21 2007-12-19 清华大学 Method for transfering Java line based on recovering of operation stack record
KR100597413B1 (en) * 2004-09-24 2006-07-05 삼성전자주식회사 Method for translating Java bytecode and Java interpreter using the same
KR100678912B1 (en) * 2005-10-18 2007-02-05 삼성전자주식회사 Method for interpreting method bytecode and system by the same
CN102662830A (en) * 2012-03-20 2012-09-12 湖南大学 Code reuse attack detection system based on dynamic binary translation framework

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778232A (en) * 1996-07-03 1998-07-07 Hewlett-Packard Company Automatic compiler restructuring of COBOL programs into a proc per paragraph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PIUMARTA I ET AL: "OPTIMIZING DIRECT THREADED CODE BY SELECTIVE INLINING" ACM SIGPLAN NOTICES,US,ASSOCIATION FOR COMPUTING MACHINERY, NEW YORK, vol. 33, no. 5, 1 May 1998 (1998-05-01), pages 291-300, XP000766278 ISSN: 0362-1340 cited in the application *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6996813B1 (en) 2000-10-31 2006-02-07 Sun Microsystems, Inc. Frameworks for loading and execution of object-based programs
US7096466B2 (en) 2001-03-26 2006-08-22 Sun Microsystems, Inc. Loading attribute for partial loading of class files into virtual machines
US7941802B2 (en) 2001-03-27 2011-05-10 Stepan Sokolov Reduced instruction set for java virtual machines
US7543288B2 (en) 2001-03-27 2009-06-02 Sun Microsystems, Inc. Reduced instruction set for Java virtual machines
US6957428B2 (en) 2001-03-27 2005-10-18 Sun Microsystems, Inc. Enhanced virtual machine instructions
WO2003017097A3 (en) * 2001-08-03 2004-02-26 Trusted Logic Method for compression of object code interpreted by tree-structured expression factorization
WO2003017097A2 (en) * 2001-08-03 2003-02-27 Trusted Logic Method for compression of object code interpreted by tree-structured expression factorization
FR2828296A1 (en) * 2001-08-03 2003-02-07 Trusted Logic Method for compression of interpreted object code by factorization of arborescent expressions so that program code can be stored more efficiently in available memory space
WO2003019367A1 (en) * 2001-08-24 2003-03-06 Sun Microsystems, Inc., Replacing java bytecode sequences by macro instructions
WO2003019368A3 (en) * 2001-08-24 2004-02-05 Sun Microsystems Inc Frameworks for generation of java macro instructions for storing values into local variables
US6988261B2 (en) 2001-08-24 2006-01-17 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions in Java computing environments
US7039904B2 (en) 2001-08-24 2006-05-02 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions for storing values into local variables
US7058934B2 (en) 2001-08-24 2006-06-06 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions for instantiating Java objects
WO2003019368A2 (en) * 2001-08-24 2003-03-06 Sun Microsystems, Inc Frameworks for generation of java macro instructions for storing values into local variables
US7228533B2 (en) 2001-08-24 2007-06-05 Sun Microsystems, Inc. Frameworks for generation of Java macro instructions for performing programming loops
FR2829252A1 (en) * 2001-08-30 2003-03-07 Gemplus Card Int COMPRESSION OF A PROGRAM IN INTERMEDIATE LANGUAGE
WO2003019366A1 (en) * 2001-08-30 2003-03-06 Gemplus Compression of a programme in intermediate language
EP1308838A2 (en) * 2001-10-31 2003-05-07 Aplix Corporation Intermediate code preprocessing apparatus, intermediate code execution apparatus, intermediate code execution system, and computer program product for preprocessing or executing intermediate code
CN100382028C (en) * 2001-10-31 2008-04-16 亚普公司 Intermediate code pretreatment, executive device, executive system and computer program products
EP1308838A3 (en) * 2001-10-31 2007-12-19 Aplix Corporation Intermediate code preprocessing apparatus, intermediate code execution apparatus, intermediate code execution system, and computer program product for preprocessing or executing intermediate code
US7213237B2 (en) 2001-10-31 2007-05-01 Aplix Corporation Intermediate code preprocessing apparatus, intermediate code execution apparatus, intermediate code execution system, and computer program product for preprocessing or executing intermediate code
WO2011008856A3 (en) * 2009-07-14 2011-03-24 Unisys Corporation Systems, methods, and computer programs for dynamic binary translation in an interpreter
CN110262533A (en) * 2019-06-25 2019-09-20 哈尔滨工业大学 A kind of method, apparatus and computer storage medium based on hierarchical task network planning modular reconfigurable satellite via Self-reconfiguration
CN110262533B (en) * 2019-06-25 2021-06-15 哈尔滨工业大学 Modular reconfigurable satellite self-reconfiguration method and device based on hierarchical task network planning and computer storage medium

Also Published As

Publication number Publication date
KR20010080525A (en) 2001-08-22
EP1183598A2 (en) 2002-03-06
JP2003510681A (en) 2003-03-18
WO2001022213A3 (en) 2001-11-29
CN1347525A (en) 2002-05-01
CN1173262C (en) 2004-10-27

Similar Documents

Publication Publication Date Title
EP1183598A2 (en) Optimized bytecode interpreter of virtual machine instructions
US7003652B2 (en) Restarting translated instructions
US6332215B1 (en) Java virtual machine hardware for RISC and CISC processors
US8893079B2 (en) Methods for generating code for an architecture encoding an extended register specification
US7134119B2 (en) Intercalling between native and non-native instruction sets
US7487330B2 (en) Method and apparatus for transferring control in a computer system with dynamic compilation capability
US8578351B2 (en) Hybrid mechanism for more efficient emulation and method therefor
US7000094B2 (en) Storing stack operands in registers
Hoogerbrugge et al. A code compression system based on pipelined interpreters
US7823140B2 (en) Java bytecode translation method and Java interpreter performing the same
KR970703561A (ko) Object-Code Compatible Representation of Very Long Instruction Word Programs
KR100258650B1 (en) A method and system for performing an emulation context save and restore that is transparent to the operating system
US8769508B2 (en) Virtual machine hardware for RISC and CISC processors
US20040215444A1 (en) Hardware-translator-based custom method invocation system and method
KR20040045467A (en) Speculative execution for java hardware accelerator
KR20120064446A (en) Appratus and method for processing branch of bytecode on computing system
Gregg et al. Implementing an efficient Java interpreter
Gregg et al. The case for virtual register machines
Wang et al. Code size reduction by compressing repeated instruction sequences
Casey et al. The Case for Virtual Register Machines
GB2367658A (en) Intercalling between native and non-native instruction sets
Gregg et al. Implementing an Efficient Java Interpreter

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 00802974.1

Country of ref document: CN

AK Designated states

Kind code of ref document: A2

Designated state(s): CN IN JP KR

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWE Wipo information: entry into national phase

Ref document number: 2000966006

Country of ref document: EP

ENP Entry into the national phase in:

Ref document number: 2001 525514

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: IN/PCT/2001/696/CHE

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 1020017006400

Country of ref document: KR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office

Ref document number: 1020017006400

Country of ref document: KR

AK Designated states

Kind code of ref document: A3

Designated state(s): CN IN JP KR

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWP Wipo information: published in national office

Ref document number: 2000966006

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2000966006

Country of ref document: EP