Avoidable Bottlenecks

Many applications are not truly memory bound in the physical sense, but are bogged down by poor software coordination: the same data is loaded from memory on multiple occasions and stored in multiple locations, data access cost is multiplied by layer upon layer of software invocation overhead, and so on. No sane application developer creates duplicates needlessly, of course: these issues arise as undesirable side effects of software modularity and isolation. Section 2.1.1 (Opportunities) illustrates this point through specific examples.

Unlike pure computation, communication cannot avoid hitting isolation boundaries, because it involves multiple parties. Regardless of whether it takes the form of interprocess communication (IPC) or peripheral device access, it crosses security perimeters and therefore requires operating system mediation. Building applications that span groups of CPU tasks (each with its own memory protection domain and access control privileges) and physical devices requires complicated communication networks. When these networks must be constructed from hardware-specific bridges interspersed with provisional glue code, it should surprise no one that end-to-end application performance fails to reach physical limits.

I/O bottlenecks take one of two forms. Communication overhead accrues where data is forwarded: at the crossings between (hardware or software) compartments. Computation inefficiency occurs when logic fails to exploit specific hardware features (such as ISA extensions or coprocessors). On strictly layered systems such as Windows, Linux or BSD (including Mac OS X), applications encounter one or more of the communication and computation bottlenecks shown in Figure 2.3 (Layered OS Architecture with Bottlenecks), each of which deserves a brief clarification.

1. ABI
For security reasons, applications must communicate with devices by calling into gateway code that separates untrusted application logic from trusted system logic: the Application Binary Interface, or ABI. These system calls transition execution into and out of the protected OS kernel environment (kernel mode switch) and, to maintain strict memory isolation, they copy data between the two environments. Both actions incur cost: kernel mode switches add tens to hundreds of CPU cycles to save state and incur secondary cost by requiring cache flushes. Copies cause roundtrips to main memory and unnecessary cache capacity misses from having to store duplicate data. The combined overhead can form a sizable portion of total processing (Section 7.2, Performance).
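
The per-call component of this cost can be made visible by counting kernel crossings. The following is a minimal sketch, assuming a POSIX-like system on which each `os.write` invocation results in one write system call; the helper names `write_chunked` and `demo` are hypothetical and introduced only for illustration. It sends the same payload to the null device once in small pieces and once in a single call.

```python
import os


def write_chunked(fd: int, payload: bytes, chunk_size: int) -> int:
    """Write payload in chunk_size pieces; return the number of os.write calls."""
    calls = 0
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        # os.write may accept only part of the buffer; retry until it is gone.
        while chunk:
            written = os.write(fd, chunk)
            calls += 1
            chunk = chunk[written:]
    return calls


def demo():
    """Return (calls with 64-byte writes, calls with one large write)."""
    payload = b"x" * 4096
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        small = write_chunked(fd, payload, 64)
        large = write_chunked(fd, payload, len(payload))
    finally:
        os.close(fd)
    return small, large
```

Both variants move the same 4096 bytes, but the first crosses the ABI once per 64-byte piece. The sketch only counts crossings; it does not measure the cycles each crossing costs.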

2. IPC
ABI overhead affects more than device access. The trusted kernel also isolates applications from each other by making interprocess communication (IPC) primitives available in the ABI. Traffic between applications is copied both into and out of the kernel, resulting in three versions of the same data. Because buffering in the kernel is minimal, switching into and out of the kernel is frequent. What's more, besides kernel mode switches between CPU protection rings, IPC causes task switches between user processes. These are more expensive: on the order of 1000 cycles on modern x86 processors.
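
The three versions of the data can be traced through an ordinary pipe. This is a minimal sketch, assuming a POSIX-like pipe and a message small enough to fit in the kernel pipe buffer; for brevity both ends live in one process, so the task switch between sender and receiver is not shown. The function name `pipe_roundtrip` is hypothetical.

```python
import os


def pipe_roundtrip(message: bytes) -> bytes:
    """Send message through a pipe and return the receiver's copy."""
    read_fd, write_fd = os.pipe()
    try:
        # Version 1 is the sender's buffer (message).
        # The write copies it into the kernel pipe buffer: version 2.
        os.write(write_fd, message)
        # The read copies it out into a new user buffer: version 3.
        received = os.read(read_fd, len(message))
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return received
```

Each of the two calls is also a kernel mode switch, so a single small message already costs two crossings and two copies.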

3. Group I/O
Multiprocess access to the same data is seen in group communication, which subsumes two-party IPC. This is common in some HPC applications, but also occurs simply when a virus scanner is enabled. Because group I/O generalizes IPC, the number of data copies and task switches grows linearly with group membership. Gains from shared memory and reduced synchronization therefore increase with group size. Efficient group communication is especially important on manycore systems.
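
The difference between per-member copies and a shared region can be sketched as follows. This is an illustration under simplifying assumptions: an anonymous `mmap` region stands in for a shared memory segment, and the group members are views within one process rather than separate processes that each map the segment. The function name `share_with_group` is hypothetical.

```python
import mmap


def share_with_group(payload: bytes, members: int):
    """Deliver payload to a group by copying and by sharing one region."""
    # One anonymous mapping stands in for a shared memory segment.
    region = mmap.mmap(-1, len(payload))
    region[:] = payload
    # Copy-based delivery: one private duplicate per group member.
    copies = [bytes(region[:]) for _ in range(members)]
    # Shared delivery: every member gets a view onto the same bytes.
    views = [memoryview(region) for _ in range(members)]
    return region, copies, views
```

With copies, memory use and copy work grow with the number of members; with views, a later update by the producer is visible to every member without further copying. The views must be released before the region is closed.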

4. Kernel
Within the operating system kernel, data is copied unnecessarily when subsystems cannot communicate effectively. A classic example and practical bottleneck [PDZ00] is data hand-off between the file and network subsystems. When a process transmits TCP data, the OS has to construct packets, copy over payloads and add the packets to the (re)transmission queue. If these packets hold file contents, their payload already resides in the file cache. Creating private copies for TCP then constitutes unnecessary data duplication: pinning the file cache page that holds the data suffices. Simple copying is generally used, however, because the file cache and network subsystems are functionally isolated. Similar cost accrues on network reception of file data. TCP segments are reassembled, transferred to userspace, transferred back to the kernel and added to the file cache, causing much copying and switching.
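
One existing interface that at least removes the detour through userspace on the transmit path is `sendfile`. The sketch below assumes Linux, where `os.sendfile(out_fd, in_fd, offset, count)` accepts a socket as destination; a local socket pair stands in for a TCP connection, and the temporary file is only test setup. Whether the kernel also avoids its internal copy into the transmission queue depends on the implementation and is not shown here. The function name `send_file_contents` is hypothetical.

```python
import os
import socket
import tempfile


def send_file_contents(payload: bytes) -> bytes:
    """Push a file into a socket with os.sendfile; return what arrives."""
    sender, receiver = socket.socketpair()
    try:
        with tempfile.TemporaryFile() as f:
            f.write(payload)
            f.flush()
            offset = 0
            # sendfile may move fewer bytes than requested; loop until done.
            while offset < len(payload):
                sent = os.sendfile(sender.fileno(), f.fileno(), offset,
                                   len(payload) - offset)
                if sent == 0:
                    break
                offset += sent
        # Signal end of stream so the receive loop below terminates.
        sender.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
```

The sending side never reads the file contents into a user buffer; it only names the file and the socket. The payload is kept small here so that it fits in the socket buffer before the receiver starts reading.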

5. Direct I/O
Data is forced to traverse the trusted kernel even when no trusted operation has to be performed. Virtual machines make a telling example. These are known to be poor at I/O-intensive tasks. This derives at least in part from the fact that I/O devices always trap into the host OS kernel, which does not touch the data at all, but must now issue a switch to the guest, thereby degrading guest VM I/O performance by factors [MCZ06, LHAP06]. The same double task switch occurs in non-virtualized environments when devices and applications want an exclusive communication channel (and thus ask for no resource multiplexing by the kernel) but must traverse the kernel anyway. High-speed devices (e.g., DAG cards [CDG+00]) sometimes bundle software libraries that can avoid the kernel stack, but those generally suffer from one or more of these issues: they require superuser privileges, they require exclusive device access, or they introduce vendor-specific application programming interfaces (APIs).

6. Strict Software Layers
Applications frequently hit the above bottlenecks unnecessarily because I/O logic is strictly layered between devices, kernel and user tasks, irrespective of application profiles or system abilities. To give two practical examples of how strict layering hurts application throughput: a network fileserver saves two copies and one task switch when it moves fast-path logic from a user process to its network processing kernel task; a DNS daemon reduces latency and cache pollution by bypassing the kernel completely and performing its (minimal) packet processing inline in its user process. In the figure, we show an expensive IPsec operation that would benefit from using the cryptographic co-processors on some network cards (network processors), but that cannot, because the implementation is kernel-specific. The first optimization has been carried out frequently, but always in a non-portable manner requiring superuser privileges. The second can be built with user level networking [vEBBV95], but not without considerable changes to application interfaces. The challenge is to design a system that executes both cases efficiently, preferably with few or no interface changes.

Figure 2.3: Layered OS Architecture with Bottlenecks 



willem 2010-02-03