An important task of computer systems is the swift processing of
sequentially ordered data. Such streaming I/O
is the common pattern behind networking
and multimedia, for instance:
tasks that are increasingly commonplace and whose
bandwidth demands
continue to grow with
Internet uptake and service fidelity.
Streaming I/O applications
cannot scale service naturally with Moore's Law, because they stress
multiple hardware components (CPU cores, caches, main memory and
I/O devices), some of which do not scale with
Moore's Law themselves (such as memory access latency).
This thesis proposes a number of new techniques that ensure
high streaming application throughput on diverse systems.
In I/O, stress originates not only in application code, but
accrues systemwide, especially
at crossings between tasks and processors, in the form
of signaling (task switching, polling and interrupting) and memory
operations (copying, page fault handling and cache/TLB misses). Adapting
application behavior to match hardware features
can significantly reduce this cost and increase throughput.
Cache aware transfers are 3 times as fast
as their default equivalents in Linux
(Section 4.5Ring Types); video player communication is
67% cheaper (Section 7.2Performance). But diversity among
computer system makes these opportunities moving targets. Reaching high
throughput in the face of core and cache variation
demands careful optimization -- which for most users is
too daunting or tedious. Automation offers consistent diligence and is
cost-effective for these long lived streaming I/O applications.
false
Virtualization of I/O shows that
the difficulty of achieving high I/O performance with isolation ;
An end-to-end solution that minimizes these cost factors can reach higher rates than
an isolated approach.
This thesis proposes the Streamline system I/O software layer to
allow applications to achieve high
throughput while hiding most tedious work from the programmer.
Streamline consistently reaches high throughput by reducing
data movement and signaling between applications, kernel and devices.
It is based on Unix [Rit84] and built into Linux.
Streamline extends Unix pipelines to kernel and device contexts, because much
I/O processing takes place here. By spanning across all contexts, it controls
and can optimize all data movement and processing.
Over Unix, it reduces crossing cost
by replacing copying with shared memory and
indirect access and
by replacing context switches with userlevel threading
where possible and by batching switches elsewhere.
It increases processing efficiency by moving logic to the most
suitable context.
It identifies hardware, calculates capacity and plans
execution to maximize end-to-end throughput.
In short, Streamline adapts Unix I/O to diverse systems.
A reference implementation is open source (BSD/LGPL)
software released for Linux 2.6.19 and higher. Streamline
features can be grouped in
interface, communication, computation and control concerns.
For each of these groups, this
thesis contributes new ideas. We briefly summarize all
here (but defer thorough explanation of technical terms to the
respective chapters):
- Interface (Ch. 3Interface)
- A virtual filesystem opens up all kernel I/O to user monitoring
and control using any common tools or scripts. Pipes into the kernel
give processes live access to all OS streams. For application construction,
Streamline revisits the Unix Shell pipeline syntax. It extends it
with parallel
and conditional paths to support
branching, e.g., for protocol demultiplexing, filtering
and connection handling, all without data touching.
Split-join parallelism and backlinks
build arbitrary directed graphs; at joins, duplicate data
can be transparently filtered with Boolean logic.
Loop constructs render the language Turing complete and enable
multi-pass filtering.
To apply all performance
optimizations to existing applications, common socket and
packet filter interfaces are built on top of these interfaces.
- Communication (Ch. 4Communication)
- Large shared buffers replace frequent copying and virtual memory operations,
increasing pipe throughput 4x.
Software indirection
emulates pointer queues across protection domains
to avoid data touching systemwide;
splicing uses the optimization for a 3-10x throughput increase over common
copying.
Buffer specialization optimizes the implementation of
the large shared buffers to match access patterns; for example,
cache scaling increases throughput by 200%,
fragmentation avoidance wins 20%.
- Computation (Ch. 5Computation)
- Userlevel threading
replaces signaling with function calling (up to 100x faster) and
event batching amortizes the remaining cost (3x as cheap).
Write substitution replaces data edits with metadata operations, appends or bookkeeping
to reduce data contention (locking) and update collision (copying).
Parallelism extraction detects safe speedups:
stateless kernels and session-handling filters are identified
along with the task parallelism inherent in the pipeline.
- Control (Ch. 6Control)
- Reification enables runtime selection of filter implementations. Buffer
specialization and stacking do the same for streams.
A graph transformation algorithm maps
all such options onto a single linear network flow program and
computes an end-to-end (theoretical) optimal mapping.
State space grows quickly, but practical problems are calculated
in tens of milliseconds.
Execution logic turns plans into real applications;
for embedded operation and safe hotspot optimization, it
integrates compilation and code loading.
Unix file permissions on filters and
streams enable safe restricted access to kernel and devices
by untrusted users,
e.g., for task offload and zerocopy network reception.
They make trading of isolation for performance
possible on a case-by-case basis.
Collectively, these changes improve
legacy
application throughput on common hardware
1.4x for the Bind DNS daemon, 3x for the MPlayer
video player and 3-10x for the Tcpdump traffic monitor.
Heterogeneous computer systems
reach Gigabit linerates for
deep packet inspection,
application layer intrusion prevention and
cryptographic token-based switching
as a result of employing all hardware
(Chapter 7Evaluation).
In short, Streamline presents an adaptive operating system design
for high throughput I/O.
This thesis follows a standard
challenge-design-implementation-evaluation layout, where
the implementation part is divided into the four mentioned
concerns:
interface, communication, computation and control.
Before we clarify and experimentally verify
all introduced
mechanisms,
the next section
first
examines the application
domain and argues
why pipeline
optimization is uniquely suited in principle to
increase throughput across diverse systems.
willem
2010-02-03