Streamline
:: Networking_05
1223
FPL-3: towards language support for distributed packet processing

Mihai-Lucian Cristea and Willem de Bruijn and Herbert Bos

cristea -_at_- liacs.nl, +31715277037, Leiden University, Niels Bohrweg 1, 2333CA
wdb -_at_- few.vu.nl, +31204447790, Vrije Universiteit Amsterdam, De Boelelaan 1081HV
herbertb -_at_- cs.vu.nl, +31204447746, Vrije Universiteit Amsterdam, De Boelelaan 1081HV
The Netherlands

Abstract

The FPL-3 packet filtering language incorporates explicit support for distributed processing into the language. FPL-3 supports not only generic header-based filtering, but also more demanding tasks, such as payload scanning, packet replication and traffic splitting. By distributing FPL-3 based tasks across a possibly heterogeneous network of processing nodes, the NET-FFPF2 network monitoring architecture facilitates very high speed packet processing. Results show that NET-FFPF2 can perform complex processing at gigabit speeds. The proposed framework can be used to execute such diverse tasks as load balancing, traffic monitoring, firewalling and intrusion detection directly at the critical high-bandwidth links (e.g., in enterprise gateways).

Keywords: High-speed packet processing, traffic splitting, network monitoring

1 Introduction

There exists a widening gap between advances in network speeds and those in bus, memory and processor speeds. This makes it ever more difficult to process packets at line rate. At the same time, we see that demand for packet processing tasks such as network monitoring, intrusion detection and firewalling is growing. Commodity hardware is not able to process packet data at backbone speeds, a situation that is likely to get worse rather than better in the future. Therefore, more efficient and scalable packet processing solutions are needed.

It has been recognised that parallelism can be exploited to deal with processing at high speeds.
A network processor (NP), for example, is a device specifically designed for packet processing at high speeds by sharing the workload between a number of independent RISC processors. However, for very demanding applications (e.g., payload scanning for worm signatures) more power is needed than any one processor can offer. For reasons of cost-efficiency it is infeasible to develop NPs that can cope with backbone link rates for such applications. An attractive alternative is to increase scalability by exploiting parallelism at a coarser granularity.

We have previously introduced an efficient monitoring framework, Fairly Fast Packet Filters (FFPF) [], that can reach high speeds by pushing as much of the work as possible to the lowest levels of the processing stack. The NIC-FIX architecture [] showed how this monitoring framework could be extended all the way down to the network card. To support such an extensible programmable environment, we introduced the special purpose FPL-2 language.

In this paper, we exploit packet processing parallelism at the level of individual processing units (NPs or commodity PCs) to build a heterogeneous distributed monitoring architecture: NET-FFPF2. Incoming traffic is divided into multiple streams, each of which is forwarded to a different processing node (Fig. 1). Simple processing occurs at the lower levels (root nodes), while more complex tasks take place in the higher levels where more cycles are available per packet.

The main contribution of this paper consists of a novel language that explicitly facilitates distribution of complex packet processing tasks: FPL-3. Also, with NET-FFPF2 we extend the NIC-FIX architecture upwards, with packet transmission support, to create a distributed filtering platform. Experiments show NET-FFPF2 to be able to handle complex tasks at gigabit line-rate.
2 Architecture

2.1 High-level overview

At present, high speed network packet processing solutions need to be based on special purpose hardware such as dedicated ASIC boards or network processors (see Fig. 1a). Although faster than commodity hardware, even solutions based on these platforms are not sustainable in the long run because of a widening gap between growth rates in networking (link speed, usage patterns) and computing (CPU, main memory and bus speed). To counter this scalability trend we propose the solution shown in Fig. 1b, which consists of splitting the incoming traffic into multiple sub-streams, and then processing these individually. Processing nodes are organised in a tree-like structure, as shown in Figure 2. By distributing these nodes over a number of possibly simple hardware devices, a flexible, scalable and cost-effective network monitoring platform can be built.
2.2 Distributed Abstract Processing Tree

The networked processing system introduced above can be expressed as a distributed abstract processing tree (d-APT), as depicted in Figure 3. The name is derived from a close resemblance to ASTs (abstract syntax trees), as we will see later in this paper. To make the d-APT easier to follow, we use the following notation throughout the text. A d-APT is a tree composed of individual APTs, each of which has its own dedicated hardware device. An APT is built up of multiple processing elements (e.g., filters) and may be interconnected to other APTs through so-called in-nodes and out-nodes. For example, N0.3, N0.5 are out-nodes, while N1.1, N2.1 are in-nodes.
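The d-APT notation can be made concrete with a small data-structure sketch. The Python below is only an illustration: the class name `APT`, its fields, the `walk` helper and the particular wiring (N0.3 to N1.1, N0.5 to N2.1) are assumptions chosen to echo the node names in the text, not a description of how NET-FFPF2 represents the tree internally.

```python
from dataclasses import dataclass, field


@dataclass
class APT:
    """One abstract processing tree, bound to a single hardware device."""
    name: str                                       # e.g. "N0"
    elements: list = field(default_factory=list)    # processing elements (filters)
    # Maps an out-node id to (downstream APT, in-node id on that APT).
    links: dict = field(default_factory=dict)

    def connect(self, out_node: str, target: "APT", in_node: str) -> None:
        """Attach one of this APT's out-nodes to an in-node of another APT."""
        self.links[out_node] = (target, in_node)


def walk(root: APT, depth: int = 0):
    """Yield (depth, APT name) pairs for the whole d-APT, root first."""
    yield depth, root.name
    for target, _in_node in root.links.values():
        yield from walk(target, depth + 1)


# Hypothetical layout: the pairing of out-nodes and in-nodes is assumed.
n0 = APT("N0", ["N0.1", "N0.2", "N0.3", "N0.4", "N0.5"])
n1 = APT("N1", ["N1.1"])
n2 = APT("N2", ["N2.1"])
n0.connect("N0.3", n1, "N1.1")
n0.connect("N0.5", n2, "N2.1")

tree = list(walk(n0))
```

Here `tree` lists the root APT first and the two downstream APTs one level below it, which is the shape a splitter with two processing nodes would take.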
2.3 The FPL-3 language

As our architectural design relies on explicit breakpointing, we needed to introduce this functionality into our framework. With FPL-3 we adopted a language-based approach, following our earlier experiences in this field. We designed FPL-3 specifically with these observations in mind: first, there is a need for executing tasks (e.g., payload scanning) that existing packet languages like BPF [], Snort [] or Windmill [] cannot perform. Second, special purpose devices such as network processors can be quite complex and thus are not easy to program directly. Third, we should facilitate on-demand extensions, for instance through hardware assisted functions. Finally, security issues such as user authorisation and resource constraints should be handled effectively. The previous version of the FPL-3 language, FPL-2, addressed many of these concerns. However, it lacked features fundamental to distributed processing like packet mangling and retransmission.

We will introduce the language design with an example. First, a program written for a single machine (N0) is introduced in Figure 4. Then, the same example is 'mapped' onto a distributed abstract processing tree by using FPL-3 language extensions in Figure 5.
3 Implementation details

NET-FFPF2 builds on FFPF, a monitoring framework designed for efficient packet processing on commodity hardware, such as PCs. FFPF offers support for commonly used packet filtering tools (e.g., tcpdump, snort, libpcap) and languages (like BPF), as well as for special purpose hardware like network processors (NPs). However, it is a single-device solution. NET-FFPF2 extends it with distributed processing capabilities through language constructs. Currently, we have support for two target platforms: IXP1200 network processors and off-the-shelf PCs running Linux. We briefly introduce the first in the next section.

3.1 Network Processors

Network processors are designed to take advantage of the parallelism inherent in packet processing tasks by providing a number of independent stream processors (μEngines) on a single die. The Radisys ENP 2506 network card that was used to implement NET-FFPF2 is displayed in Figure 7. For input, the board is equipped with two 1Gb/s Ethernet ports. The card also contains a 232 MHz Intel IXP1200 network processor with 8 MB of SRAM and 256 MB of SDRAM and is plugged into a 1.2 GHz PIII over a PCI bus. The IXP is built up of a single StrongARM processor running Linux and six μEngines running no operating system whatsoever.
3.2 The FPL-3 language

The FFPF programming language (FPL) was devised to give the FFPF platform a more expressive packet processing language than available in existing solutions. The latest version, FPL-2, conceptually uses a register-based virtual machine, but compiles to fully optimised object code. However, FPL-2 was designed for single node processing. We now introduce its direct descendant, FPL-3, which extends FPL-2 with constructs for distributed processing.
3.2.1 EXTERN() construct.

External functions allow users to call efficient C or hardware assisted implementations of computationally expensive functions, such as checksum calculation or pattern matching. The concept of an 'external function' is another key to speed and system extensibility.

3.2.2 HASH() construct.

HASH() applies a hash function to a sequence of bytes in the packet data. This function is hardware-accelerated on IXP1200 network processors.

3.2.3 TX() construct.

The purpose of this construct is to schedule the current packet for transmission. Currently, this operation involves overwriting the Ethernet destination address (ETH_DEST) of a packet with an entry from the MAC_table (TX_MAC). A simple use of TX() is illustrated in the example below:
TX_MAC[3] = {00:00:E2:8D:6C:F9, 00:02:03:04:05:03, 00:02:B3:50:1D:7A};
// extracted by the compiler from the configuration file
IF (PKT.IP_PROTO == PROTO_TCP) // if pkt is TCP
THEN TX (Mac, 2); // schedule it to be forwarded to the 3rd
ELSE TX (Mac, 1); // or 2nd MAC address from the TX_MAC table
FI
In this example, the first TX parameter selects a table (MAC, or another field such as IP_DEST in a future implementation) and the second parameter is the index into that table. Note that by inserting multiple TX() calls into the same program we can easily implement packet replication and load-balancing.
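The effect of TX() on a frame, as described above, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NET-FFPF2 implementation: the helper names `tx`, `forward` and `make_frame` are hypothetical, frames are assumed to be untagged Ethernet carrying IPv4, and actual transmission is left out so that only the destination-address rewrite is shown.

```python
# MAC table with the same three entries as TX_MAC in the FPL example above.
TX_MAC = [
    bytes.fromhex("0000E28D6CF9"),
    bytes.fromhex("000203040503"),
    bytes.fromhex("0002B3501D7A"),
]

PROTO_TCP = 6                          # IANA protocol number for TCP
ETH_HEADER_LEN = 14                    # dst MAC (6) + src MAC (6) + EtherType (2)
IP_PROTO_OFFSET = ETH_HEADER_LEN + 9   # protocol field of the IPv4 header


def tx(frame: bytearray, index: int) -> bytearray:
    """Return a copy of the frame with its Ethernet destination rewritten."""
    out = bytearray(frame)
    out[0:6] = TX_MAC[index]
    return out


def forward(frame: bytearray) -> bytearray:
    """TCP goes to table entry 2 (third MAC), everything else to entry 1."""
    if frame[IP_PROTO_OFFSET] == PROTO_TCP:
        return tx(frame, 2)
    return tx(frame, 1)


def make_frame(proto: int) -> bytearray:
    """Build a minimal, zero-filled Ethernet + IPv4 frame for demonstration."""
    frame = bytearray(ETH_HEADER_LEN + 20)
    frame[12:14] = b"\x08\x00"         # EtherType: IPv4
    frame[ETH_HEADER_LEN] = 0x45       # IPv4, header length 5 words
    frame[IP_PROTO_OFFSET] = proto
    return frame


tcp_out = forward(make_frame(PROTO_TCP))
udp_out = forward(make_frame(17))      # 17 is UDP
```

Calling `tx` several times on the same frame with different indices would give one copy per destination, which is how packet replication could be modelled in this sketch.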
3.2.4 SPLIT() construct.

To explain SPLIT we will step through the example in Figure 5. When trying to match the given FPL-3 filter to a distributed system, the compiler detects the SPLIT construct. SPLIT tells the compiler that the code following and bounded by its TILPS construct can be split off from the main program. The example script is split into subscripts as follows: one script for the splitter node N0, and three more, one for each processing node N1, N2 and N3, as shown in Fig. 9. Then, the scripts are compiled into object code by the native compilers.
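The evaluation filter in Section 4 selects a SPLIT branch with hash(flowfields)%3. A rough Python sketch of that selection step follows. It is an assumption-laden illustration: the source does not say which hash function HASH() uses, so CRC32 from the standard library stands in for it, and `flow_key` and `split_index` are hypothetical helper names.

```python
import zlib
from collections import Counter

NUM_NODES = 3   # processing nodes N1, N2 and N3 behind the splitter


def flow_key(src_ip, dst_ip, src_port, dst_port, proto) -> bytes:
    """Serialise the fields that identify a flow into a byte string."""
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()


def split_index(key: bytes, num_nodes: int = NUM_NODES) -> int:
    """Map a flow to a node index; CRC32 is only a stand-in hash."""
    return zlib.crc32(key) % num_nodes


# 300 synthetic web flows from different clients to one server.
flows = [("10.0.0.%d" % i, "192.0.2.1", 40000 + i, 80, 6) for i in range(300)]
load = Counter(split_index(flow_key(*f)) for f in flows)
```

Because the index depends only on the flow fields, every packet of a flow reaches the same node, which matters for stateful tasks such as scanning a TCP stream for worm signatures. How evenly `load` is spread depends on the hash function and on the traffic mix.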
3.2.5 Memory data addressing

The FPL-3 language hides most of the complexities of the underlying hardware. For instance, users need not worry about reading data into a μEngine's registers before accessing it. Similarly, accessing bytes in memory that is not byte addressable is handled automatically by the compiler. When deemed useful, however, users may choose to expose some of the complexity: it is, for instance, wise to declare additional arrays in fast hardware when that is available, like in the IXP1200's on-board SRAM.

Note that NET-FFPF2 has no inter-node shared memory. Synchronising data across nodes is too costly relative to the high-speed packet processing task itself. To guarantee safe execution, enough memory is allocated at each node to support the whole program, not just the subtask at hand. In the Ethernet implementation, limited data sharing is made possible by overwriting a now unused block in the ETH_SRC_ADDR field to communicate a 32 bit word of processing state to the next hop.

3.3 The FPL-3 compiler

The FPL-3 source-to-source compiler generates plain C target code that can be further handled by any C compiler. Programs can therefore benefit from the advanced optimisers in the Intel μEngine C compiler for IXP devices and gcc for commodity PCs. As a result, the object code will be heavily optimised even though we did not write an optimiser ourselves. In the near future FPL-3 compiler support will be extended to the latest Intel NP generation (e.g., IXP2400, IXP28xx).

4 Evaluation

To evaluate NET-FFPF2, we execute the filter shown in Figure 10.a on various packet sizes and measure throughput (Fig. 10.b). This filter is mapped onto a distributed system as shown in Figure 2, and the compilation results are the code objects of the three sub-filters highlighted in the same figure.
IF (PKT.IP_PROTO==TCP && PKT.IP_DEST_PORT==80) THEN
    R[0] = hash(flowfields)%3;
    SPLIT(R[0]);                    // thus, the main stream is equally distributed
    <scan for web-traffic worms>    // across 3 processing nodes

5 Related work

Using traffic splitters for increased packet processing efficiency on a single host was previously explored in [] and []. Our architecture differs in that it supports distributed and heterogeneous multi-level processing. In addition, we provide explicit language support for this purpose. As shown in [], it is efficient to use a source-to-source compiler from a generic language (Snort Intrusion Detection System) to a back-end language supported by the targeted hardware compiler (Intel μEngineC). We propose a more flexible and easy to use language as front-end for users. Moreover, our FPL-3 language is designed and implemented for heterogeneous targets in a distributed multi-level system.

Many tools for monitoring are based on BPF in the kernel []. Filtering and processing in network cards is also promoted by some Juniper routers []. However, they lack features introduced in NET-FFPF2 such as extended language constructs, in-place packet handling and a distributed processing environment. BPF+ [] shows how an intermediate representation of BPF can be optimised, and how just-in-time-compilation can be used to produce efficient native filtering code. FPL-3 relies on gcc's optimisation techniques and on external functions for expensive operations.

Like FPL-3 and DPF, Windmill protocol filters also target high performance by compiling filters into native code []. And like MPF, Windmill explicitly supports multiple applications with overlapping filters. However, compared to FPL-3, Windmill filters are fairly simple conjunctions of header field predicates. Nprobe is aimed at monitoring multiple protocols [] and is therefore, like Windmill, geared towards spanning protocol stacks.
Also, Nprobe focuses on disk bandwidth limitations and for this reason captures as few bytes of the packets as possible. NET-FFPF2 has no a priori notion of protocol stacks and supports payload processing.

The SCAMPI architecture also pushes processing to the NIC []. It assumes that hardware can write packets immediately into host memory (e.g., by using DAG cards []) and implements access to packet buffers through a userspace daemon. SCAMPI does not support user-provided external functions, powerful languages such as FPL-3 or complex filtergraphs.

Related to the distributed architecture of NET-FFPF2 are the Lobster EU project and Netbait Distributed Service [] that aim at European-scale passive monitoring and at planetary-scale worm detection, respectively. Netbait, for instance, targets high-level processing using commodity hardware. Therefore, these initiatives could benefit from using the FPL-3 language and its NET-FFPF2 execution environment as a low-level layer.

6 Conclusions and future work

This paper presented the NET-FFPF2 distributed network processing environment and its FPL-3 programming language, which enable users to process network traffic at high speeds by distributing tasks over a network of commodity and/or special purpose devices such as PCs and network processors. A task is distributed by constructing a processing tree that executes simple tasks such as splitting traffic near the root of the tree while executing more demanding tasks at the lesser-travelled leaves. Explicit language support in FPL-3 enables us to efficiently map a program to such a tree. The experimental results show that NET-FFPF2 can outperform traditional packet filters by processing at Gbps line rate even on a small-scale (two node) testbed.

In the future, we plan to extend NET-FFPF2 with a management environment that can take care of object code loading and program instantiation. A first version of this management subsystem will act only when a user issues a recompile request.
We envision a later version to be able to automatically respond to changes in its environment like the increase of specific traffic (e.g., TCP because of a malicious worm) or availability of new hardware in the system (e.g., a system upgrade). We also plan to have the FPL-3 compiler optimise code placement. As a result, tasks that are known to be CPU intensive (such as packet inspection, hashing or CRC generation) will be automatically sent to optimal target machines, for instance those with hardware assisted hash functions (NPs).

Acknowledgements

This work was supported by the EU SCAMPI project IST-2001-32404, and the EU LOBSTER project, while Intel donated the network cards.