Many applications are not truly
bound by physical memory limits, but are bogged down by
poor software coordination:
the same data is loaded from memory on multiple
occasions and stored in multiple locations, and data access cost is multiplied
by layer upon layer of software invocation overhead.
No sane application developer creates duplicates needlessly, of course: these
issues arise as undesirable side effects of software modularity and isolation.
Section 2.1.1 (Opportunities) illustrates this point through specific
examples.
Unlike pure computation, communication cannot avoid hitting
isolation boundaries because it involves multiple parties.
Whether it takes the form of interprocess communication (IPC)
or peripheral device access, it crosses security perimeters and therefore
requires operating system mediation.
Building applications that span groups of CPU tasks
(each with individual memory protection
domains and access control privileges) and physical devices requires
complicated communication networks.
When networks must be constructed from hardware-specific bridges
interspersed with provisional glue code, it should surprise no one when
end-to-end application performance
fails to reach physical limits.
I/O bottlenecks take one of two forms. Communication overhead accrues where
data is forwarded: at the crossings between (hardware or software)
compartments. Computation inefficiency occurs when logic fails to
exploit specific hardware features (such as ISA extensions or coprocessors).
On strictly layered systems such as Windows, Linux or BSD
(including Mac OS X), applications encounter one or more
of the communication and computation bottlenecks
shown in Figure 2.3 (Layered OS Architecture with Bottlenecks), each of which deserves a brief
clarification.
- 1. ABI
- For security reasons, applications must communicate with devices by
calling into gateway code that separates untrusted application from
trusted system logic: the Application Binary Interface, or ABI. These
system calls transition execution into and out of the protected
OS kernel environment (a kernel mode switch)
and, to maintain strict memory isolation, they
copy data between the two environments. Both
actions incur cost: kernel mode switches add tens to hundreds of
CPU cycles to save state and incur secondary cost by requiring
cache flushes. Copies cause round trips to main memory and unnecessary
cache capacity misses from having to store duplicate data.
The combined overhead can form a sizable portion of
total processing (Section 7.2, Performance).
- 2. IPC
- ABI overhead affects not just device access. The trusted kernel also
isolates applications by making Inter Process Communication (IPC)
primitives available in the ABI. Traffic between applications is copied
both into and out of the kernel, resulting in three versions of the same
data. Because buffering in the kernel is minimal,
switching into and out of the kernel is frequent. What's more, besides
kernel mode switches between CPU protection rings, IPC causes task
switches between user processes. These are more expensive: on the order of
1000 cycles on modern x86 processors.
- 3. Group I/O
- Multiprocess access to the same data is seen in group
communication, a generalization of two-party IPC in which the number of
data copies and task switches grows linearly with group membership.
It is common in some HPC applications, but also arises simply when a
virus scanner is enabled. The gains from
shared memory and reduced synchronization therefore increase with
group size. Efficient group communication is especially important on
manycore systems.
- 4. Kernel
- Within the operating system kernel,
data is copied unnecessarily when subsystems cannot
communicate effectively. A classic example and practical bottleneck [PDZ00]
is data hand-off between the file and network subsystems. When
a process transmits TCP data, the OS has to construct packets,
copy over payloads and add the packets to the
(re)transmission queue. If these packets hold file contents, their
payload already resides in the file cache. In that case, creating private
copies for TCP constitutes unnecessary data duplication:
pinning the file cache page that holds the data suffices.
Simpler copying is generally used, however, because
the file cache and network subsystems are functionally isolated.
Similar cost accrues on network reception of file data: TCP
segments are reassembled, transferred to user space, transferred
back to the kernel and added to the file cache, causing much
copying and switching.
- 5. Direct I/O
- Data is forced to traverse the trusted kernel even when no trusted operation
has to be performed. Virtual machines make a telling example. These are known
to be poor at I/O intensive tasks. This derives at least in part from the
fact that
I/O devices always trap into the host OS kernel, which does not
touch the data at all, but must now issue a switch to the guest, thereby
degrading
guest VM I/O performance by factors [MCZ06,LHAP06].
The same double task switch occurs in non-virtualized environments when
devices and applications want an exclusive communication channel (so that
no resource multiplexing by the kernel is required)
but must traverse the kernel nonetheless.
High-speed devices
(e.g., DAG cards [CDG+00])
sometimes bundle software libraries that can
avoid the kernel stack,
but those generally suffer from one or more of these issues:
they require superuser privileges,
they require exclusive device access,
or they introduce vendor-specific application programming interfaces (APIs).
- 6. Strict Software Layers
- Applications frequently hit the above bottlenecks unnecessarily because
I/O logic is strictly layered between devices, kernel and user tasks,
irrespective of application profiles or system abilities.
To give two practical examples of how strict layering hurts
application throughput: a network fileserver saves two
copies and one task switch when it moves fast-path logic from a
user process to its network processing kernel task;
a DNS daemon reduces latency and cache pollution by bypassing the
kernel completely and performing its (minimal) packet processing inline in its user process.
In the figure, we show an expensive IPsec operation that would benefit from using
the cryptographic co-processors on
some network cards (network processors), but cannot, because the implementation is
kernel-specific.
The first optimization has been carried out frequently, but always in a non-portable
manner requiring superuser privileges. The second can be built with
user level networking [vEBBV95], but not without
considerable changes to application interfaces. The challenge is to
design a system that executes both cases efficiently, preferably
with few or no interface changes.
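
The file-to-network hand-off discussed under the Kernel bottleneck can be made concrete with a small sketch. It is not taken from the system described here: it assumes Linux and Python 3, and the function names (`send_with_copies`, `send_via_sendfile`, `demo`) are hypothetical. The first function moves file data through a user-space buffer with read and send calls; the second asks the kernel to feed the socket directly from the file cache with `sendfile`. Whether the kernel avoids a private copy altogether depends on the socket type and driver; what the second path removes in any case is the round trip through user space.

```python
import os
import socket
import tempfile


def send_with_copies(path, sock, chunk=64 * 1024):
    """read() + send(): every chunk crosses the ABI twice
    (kernel -> user on read, user -> kernel on send)."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)          # copy out of the file cache
            if not buf:
                break
            sock.sendall(buf)            # copy back into the kernel
            sent += len(buf)
    return sent


def send_via_sendfile(path, sock):
    """sendfile(): the kernel moves data from the file cache to the
    socket; the payload never enters a user-space buffer."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:                   # unexpected end of file
                break
            sent += n
    return sent


def demo(size=4096):
    """Send one temporary file over both paths.

    Returns a list of (bytes_sent, payload_intact) tuples. The payload
    is kept small (an assumption of this sketch) so that it fits in the
    socket buffer and the single-threaded demo cannot block.
    """
    payload = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    results = []
    try:
        for send in (send_with_copies, send_via_sendfile):
            tx, rx = socket.socketpair()
            with tx, rx:
                sent = send(path, tx)
                tx.shutdown(socket.SHUT_WR)   # signal end of stream
                received = bytearray()
                while True:
                    chunk = rx.recv(65536)
                    if not chunk:
                        break
                    received += chunk
                results.append((sent, bytes(received) == payload))
    finally:
        os.unlink(path)
    return results
```

Calling `demo()` transmits the same temporary file over both paths and reports, for each, the byte count and whether the receiver saw the original contents. Both paths deliver identical data; they differ only in how often the payload crosses the ABI.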
Figure 2.3: Layered OS Architecture with Bottlenecks

![Layered OS Architecture with Bottlenecks](img13.png)
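
The linear growth in copies noted under Group I/O can be illustrated with a single-process simulation. This is a minimal sketch under stated assumptions, not a measurement: the receivers are simulated inside one process, the function names (`fan_out_via_pipes`, `fan_out_via_shared_memory`) are hypothetical, and the message is assumed to fit in a pipe buffer. The first function delivers a message to each receiver through its own pipe and counts the bytes copied across the ABI; the second writes the message once into an anonymous memory mapping and hands every receiver a view of the same bytes.

```python
import mmap
import os


def fan_out_via_pipes(message, receivers):
    """Deliver `message` to each receiver through a private pipe.

    Returns (received_messages, bytes_copied_across_abi). Each delivery
    costs one user -> kernel copy (write) and one kernel -> user copy
    (read), so the total grows linearly with the number of receivers.
    """
    copied = 0
    received = []
    for _ in range(receivers):
        r, w = os.pipe()
        try:
            copied += os.write(w, message)     # user -> kernel
            data = os.read(r, len(message))    # kernel -> user
            copied += len(data)
            received.append(data)
        finally:
            os.close(r)
            os.close(w)
    return received, copied


def fan_out_via_shared_memory(message, receivers):
    """Write `message` once into an anonymous mapping and give each
    receiver a view of it.

    Returns (region, views, bytes_copied). The views alias the mapping,
    so adding receivers adds no further copies.
    """
    if not message:
        raise ValueError("message must not be empty")
    region = mmap.mmap(-1, len(message))       # anonymous mapping
    region[:] = message                        # the single copy
    base = memoryview(region)
    views = [base[:] for _ in range(receivers)]  # slices, no copying
    return region, views, len(message)
```

For a 1024-byte message and four receivers, the pipe variant copies 8192 bytes while the shared variant copies 1024, and the gap widens with every additional receiver. In a real deployment the receivers would be separate processes that map the same region, which additionally requires the synchronization this sketch leaves out.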
willem
2010-02-03