Runtime Optimization

Applications must now be portable across platforms that differ in the number and types of processors and caches. Under these conditions, integrated compiled executables will perform suboptimally on most architectures, because important cache, memory and processor information only becomes available at runtime. Far-reaching structural decisions can be made at compile time, but the output of this stage should remain reconfigurable. Even when targeting a single Instruction Set Architecture (ISA), performance differs dramatically depending on whether optional extensions (such as SSE) can be exploited, whether working sets fit in the caches and whether processes can be scheduled effectively onto the available cores, to give just a handful of examples. Optimizations that depend on such information must be delayed until runtime. This is not to say that all optimization has to be delayed, of course: both stages have their uses. In this work we take a systems view and leave the compiler out of focus. We do not interpret the term `at runtime' to mean strictly `during execution', but use it in the broader sense of `after compilation and installation'. We explicitly include last-minute optimization prior to execution.
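As a minimal sketch of such last-minute optimization, the fragment below derives an execution plan from platform facts gathered just before execution. It is an illustration only, not the system described in this work: the names Platform, probe_platform and make_plan are hypothetical, only the core count is actually probed, and the cache size and SIMD flag are placeholder defaults that a real probe would have to replace.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hardware facts that are only known on the target machine."""
    cores: int
    cache_bytes: int
    has_simd: bool


def probe_platform(cache_bytes=32 * 1024, has_simd=False):
    """Gather platform facts prior to execution.

    Only the core count is really probed. cache_bytes and has_simd are
    assumed placeholder values, not measurements.
    """
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return Platform(cores=cores, cache_bytes=cache_bytes, has_simd=has_simd)


def make_plan(platform, item_bytes):
    """Turn platform facts into decisions a compiler could not make."""
    # Size blocks so that an input and an output block fit in cache together.
    block_items = max(1, platform.cache_bytes // (2 * item_bytes))
    return {
        "workers": platform.cores,  # one worker per available core
        "block_items": block_items,
        "kernel": "simd" if platform.has_simd else "scalar",
    }


plan = make_plan(probe_platform(), item_bytes=8)
```

The decisions live in one small function that runs after installation, so the same distributed binary can be reconfigured per machine without recompilation.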

A radical approach to runtime optimization is to switch exclusively to just-in-time compilation of bytecode-interpreted languages [Sta]. The approach is promising, but requires pervasive changes to software. The Inferno distributed operating system [WP97] is the only mature implementation of this idea known to us, but no performance numbers have been published for it for high-rate applications and it does not target complex (multicore) computer architectures. We investigate a more moderate approach to software modification, in which important computation and communication optimizations are delayed until runtime, but applications are compiled and distributed in their current form. Applications incorporate self-optimizing control code in the software that manages (parallel) scheduling, memory management and communication. We show that optimization of performance-critical elements can take place behind common APIs, shielding applications from the new complexity. Just-in-time compilation is more generic and will more easily incorporate fully heterogeneous architectures, but the pragmatic approach fits systems built from one dominant ISA supported by special-purpose processors. So far, the products on the shelves and in development all fall into this category.

willem 2010-02-03