Preprint ANL/MCS-P747-1298
The GRF, a high-performance IP switch from Ascend
and IBM, is the first backbone network switch to offer a media card that
can directly connect to an SP Switch. This enables switch-attached hosts in an SP complex to communicate at near SP Switch speeds with other GRF-attached hosts and networks.
The SP Switch is the key communications component of an SP complex, providing the high-performance interconnect used by parallel and distributed applications. For applications running on a single SP complex with little or moderate external communication requirements, conventional Ethernet and ATM networking may be adequate. As applications become increasingly parallel and distributed, however, the demand for connectivity between an SP complex and external networks and hosts grows.
The SP Switch offers attached hosts a theoretical 960 Mb/second bidirectional data-transfer rate (for our P2SC nodes). Given the switch design, conducting an aggregate switch throughput test would be a difficult task. Our division's backbone is entirely ATM OC-12 based (622 Mb/second) with an OC-12 uplink. Most divisional servers are either on switched 100 Mb/second Ethernet or ATM OC-3, with one large graphics machine on OC-12.
In this environment the best we could offer between the SP and the division backbone or other network hosts was ATM OC-3 (155 Mb/second). The GRF offers 1 Gbit/second per card, with an aggregate bandwidth of 4 or 16 Gbits/second (GRF400 and GRF1600, respectively). Using the GRF therefore has the potential of increasing our SP's external connectivity by up to a factor of 4.
The primary reasons we evaluated the GRF were to determine the theoretical network performance improvements we could achieve over our existing networking, to determine what real-world gains applications could expect, and to assess the overall installation, configuration, and management process of the GRF within an SP complex.
Our IBM SP contains:
Processor Type    Number  Memory  Memory Banks  Clock    CPU Type  Disk
SP3 thin nodes    80      256 MB  2             120 MHz  P2SC      9 GB
SP3 thin nodes    8       512 MB  4             120 MHz  P2SC      9 GB
SP3 wide nodes    8       512 MB  4             135 MHz  P2SC      9 GB
8-way SMP         2       1 GB    4             66 MHz   604       6 GB/node

It is important to note that the 80 SP3 thin nodes are a compute pool managed through a scheduler. This set of nodes, plus some of the other nodes used for interactive development work, may all require high bandwidth communications.
In our environment we used the gateway approach
by setting aside a single SP wide node with an ATM OC-3 interface providing
TCP/IP gateway services for nodes on the switch. Key disadvantages of this
approach include the (mis)use of a host as a network device and the capacity limitations and bottlenecks imposed by that host. It also limited our ability to provide high-performance switch-to-multipoint connectivity. Under heavy network load our single wide node could easily reach 70-80% CPU utilization just forwarding traffic through its single ATM interface. Alternatively, we could have outfitted additional hosts with ATM interfaces, an approach that would have imposed management headaches and significant costs for interfaces and switch ports, and would still have limited individual connections to OC-3 speeds.
The GRF Embedded OS is a Unix variant, making configuration and management similar to configuring network interfaces on a Unix host. The high-level configuration steps are:
After upgrading our SP to the required software level, we encountered one configuration issue: ARP has to be running over the SP Switch. Our SP was not running ARP on the switch, and SP networking requires that ARP be running on all the switch interfaces or on none at all. One of the key implications of running ARP is how TCP/IP addresses are assigned to SP switch nodes. Without ARP, addresses must sequentially match node numbers, whereas with ARP more flexibility is possible. Our main difficulty came from a lack of reconfiguration documentation; all we could find was how to enable ARP during initial SP complex configuration. To avoid the outage associated with reconfiguring the entire complex and pushing the changes through a process called a customize, we contacted IBM technical support looking for a faster and simpler approach. The procedure suggested should have required minimal downtime but ended up causing significant problems because the configuration change disappeared on a reboot. It took extensive investigation by us and IBM to discover that the changed ARP parameter needed to be saved to the boot device using the command savebase. We would recommend following the standard customize process for changing ARP on the entire complex, in spite of the outage it would impose.
Since the GRF behaves as a switch node, it must be attached to a switch port on a node frame and cannot be attached to a switch frame. This made it necessary to attach it to a frame containing wide or high nodes with unused switch ports. The last node in our last frame was a wide node, so we were able to attach the GRF to the switch port for the 16th node slot, and therefore above all other nodes in the complex. When the GRF is attached to a switch port on a node frame, it must also be assigned a unique node number and switch node number within the complex. These numbers must be determined and entered via PSSP when configuring the GRF extension node and its SP interface. It is essential to understand the relationship of SP node number to SP switch node number as described in the PSSP Installation and Configuration Guide.
The GRF connects to an SP switch port using a standard switch cable.
Because we evaluated the GRF attached to a production
system we were concerned about maintaining external connectivity during
GRF reconfigurations or faults. To achieve this we elected to maintain
all existing external network connections (primarily our ATM OC-3 gateway)
and to configure the Unix gated daemon to enable most of our nodes to use
the GRF by default, but to automatically fail over to the ATM OC-3 gateway
if the GRF wasn't available. This type of route configuration was also
necessary between the GRF and our router to ensure that inbound traffic
used the GRF whenever possible instead of the ATM OC-3. This multiple gateway
approach did introduce more management complexity into our environment.
To determine how effectively the GRF switch interface handles switch traffic, we first measured the performance of the switch interface on an SP wide node. Although this value most likely does not represent the actual switch port capability, it was the closest we could come, and it also provided us with a comparison value. SP wide node to SP wide node tests measured up to 610 Mb/second (single direction). Multiple SP wide nodes to a single SP wide node achieved 620 Mb/second. Bidirectional test results were 640 Mb/second. The published bidirectional limit is 960 Mb/second; this was not achievable because our SP wide node was CPU bound at 640 Mb/second.
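For reference, measurements of this sort reduce to a timed memory-to-memory TCP transfer between a sender and a receiver. The following sketch (in Python, with arbitrary port and transfer sizes; it is not the tool used for the numbers above) illustrates the basic single-direction test:

# Minimal single-direction TCP throughput probe (illustrative only; not the
# tool used for the measurements above). Run "server" on one node and
# "client <server-host>" on another; the receiver reports Mb/second.
import socket, sys, time

PORT = 5678            # arbitrary test port
CHUNK = 1 << 20        # 1 MB per send/recv call
TOTAL = 512 << 20      # 512 MB transferred per run

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        s.listen(1)
        conn, _ = s.accept()
        with conn:
            received, start = 0, time.time()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                received += len(data)
            elapsed = time.time() - start
            print("%.1f Mb/second" % (received * 8 / elapsed / 1e6))

def client(host):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((host, PORT))
        buf = b"\0" * CHUNK
        sent = 0
        while sent < TOTAL:
            s.sendall(buf)
            sent += len(buf)

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])

Multi-sender aggregate figures are obtained by running several such sender/receiver pairs concurrently and summing the receiver rates.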
In an attempt to measure aggregate switch capacity we conducted a small 8-way test between nodes on different frames. This experiment measured aggregate throughput near 3.5 Gbit/second, with near-linear scaling in the 1- to 8-way range. This is consistent with the switch design.
Graph 1 - switch performance.
To test the GRF's HIPPI interface we attached it to an 8-CPU Onyx. The following graph shows SP node to HIPPI node performance results.
Graph 2 - GRF performance forwarding between SP switch adapter and HIPPI.
The first issue we encountered with the HIPPI test dealt with SP to HIPPI traffic. A single SP node to the HIPPI host achieved 380 Mb/second, two SP nodes achieved 660 Mb/second aggregate, and three SP nodes crashed our switch. We suspect that three nodes were enough to saturate the HIPPI interface, which probably exposed a backoff or throttling problem. The second issue involved tests in the opposite direction, HIPPI host to SP. A single process achieved 370 Mb/second. Two HIPPI host processes to two SP hosts showed 190 Mb/sec each (380 aggregate). Three HIPPI host processes to three SP hosts achieved 127 Mb/sec each (still roughly 380 aggregate). It appears that we encountered either a host HIPPI outbound or a GRF HIPPI inbound limit of 380 Mb/second.
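The per-stream rates quoted above multiply back to essentially the same aggregate, which is what points to a single shared limit rather than a per-stream one:

# HIPPI host -> SP per-stream rates from the text; the aggregate stays pinned
# near 380 Mb/second regardless of the number of streams.
for streams, per_stream in [(1, 370), (2, 190), (3, 127)]:
    print(streams, "stream(s):", streams * per_stream, "Mb/s aggregate")
# -> 370, 380, 381 Mb/s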
To test the GRF's performance forwarding from the SP switch adapter to OC-12, we attached both the GRF and a 16-CPU Origin2000 host via OC-12 to our ATM OC-12 backbone and configured an ATM PVC between the GRF and the Origin2000 host. It is important to note that this test involved a beta ATM driver for the Origin2000 and no tuning. The graph below shows performance between 1, 2, and 3 SP hosts, through the GRF switch adapter, the GRF OC-12 adapter, two switches at OC-12, and a single OC-12 adapter on the Origin2000 host.
Graph 3 - GRF performance forwarding between SP switch adapter and OC-12.
The above numbers demonstrate the GRF's ability to effectively forward SP switch adapter traffic through its OC-12 adapter. A reverse benchmark, from the OC-12 host through the switching and GRF fabric to multiple SP switch hosts, yielded very poor performance, which we attributed primarily to the beta ATM driver on the Origin2000 host.
In the above tests we benchmarked specific pairs of GRF adapters, looking for point-to-point performance between an SP complex and a target HIPPI or ATM host. To measure the performance limits of the GRF SP switch adapter and to benchmark more than two GRF interfaces at a time, we conducted a test from 5 SP nodes to 2 HIPPI targets and 3 OC-12 targets. The observed result of 900 Mb/sec was the highest we benchmarked through a single GRF SP switch adapter.
GRF backplane performance limits were not a factor in our tests, since the backplane theoretically scales at 1 Gb/second per port. We suspect that a three-card test (HIPPI, OC-12, SP Switch) would not strain the backplane; additional fast interfaces would be necessary to conduct that type of test.
We conducted application-level benchmarks using MPI RIO. These tests involved one or more instances of an application running on a client performing remote I/O to a server. In the first set of benchmarks the clients ran on the SP and communicated with the RIO server, which was directly attached to the GRF via ATM OC-3. Key measurements included a round-trip latency of 1.214 ms and throughput ranging from 10.69 MB/s (64% of the peak network rate) up to 98% of peak. We found the ability to achieve 98% of the theoretical ATM OC-3 peak of 133 Mb/second to be quite impressive.
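The quoted efficiency figures follow directly from the 133 Mb/second peak (assuming 1 MB/s = 8 Mb/s):

# Quick check of the efficiency figures against the theoretical ATM OC-3
# payload rate used above (133 Mb/second).
PEAK_MBIT = 133.0
low = 10.69 * 8                                      # 10.69 MB/s expressed in Mb/s
print("%.0f%% of peak" % (100 * low / PEAK_MBIT))    # ~64%
print("%.1f MB/s at 98%% of peak" % (0.98 * PEAK_MBIT / 8))  # ~16.3 MB/s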
A second set of MPI RIO tests used another OC-3-attached server, except that this server was one ATM switch away from the GRF, and the path involved a hop through our router, which is OC-3 attached. These tests achieved about 66% of theoretical peak performance for ATM OC-3. We believe these results suggest several things about downstream connections through the GRF. The most obvious is that, for best results, machines collaborating with the SP should be attached directly to the GRF. When the peer is not attached to the GRF, extra care should be taken to understand the route and the relative performance of the links. In this case, putting peers on the other end of an OC-3-limited router, which serves many other hosts, imposed a significant bottleneck. If SVCs had been available we could have avoided going through our router. Given our large number of ATM hosts and our use of LANE, we could not have effectively managed a manual PVC-based configuration.
Based on the above results and IBM/Ascend-provided theoretical limits, we believe that if multiple SP hosts concurrently communicate with multiple GRF-attached hosts, the first component to saturate would be the SP switch interface on the GRF. GRF literature indicates a capacity of 1 Gbit/second per adapter card, so in theory an SP switch adapter should be able to operate at near peak bidirectional performance. We were unable to perform real-world tests demonstrating adapter saturation. Our
best results through the GRF were 660 Mb/second, a value slightly above
our best switch performance for a single SP node. If aggregate throughput
beyond a single SP switch interface is required it is technically feasible
to have multiple GRF switch adapters on the same switch. In such
a configuration performance could be optimized by connecting the GRF to
multiple SP node frames.
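The fan-in argument can be summarized with a back-of-the-envelope capacity model (a sketch only; the rates are the ones quoted in this report, protocol overhead is ignored, and the external card mix is illustrative):

# Toy model: traffic from several SP nodes converges on the GRF's single SP
# switch adapter (quoted at 1 Gb/s) before spreading across multiple external
# media cards, so the SP switch adapter is the first link to saturate.
# Rates in Mb/second.
SP_NODE_RATE   = 610                        # best single-node switch rate measured above
GRF_SP_ADAPTER = 1000                       # quoted GRF per-card capacity
EXTERNAL_CARDS = [800, 800, 622, 622, 622]  # e.g. two HIPPI and three OC-12 cards

for senders in range(1, 6):
    offered = senders * SP_NODE_RATE
    delivered = min(offered, GRF_SP_ADAPTER, sum(EXTERNAL_CARDS))
    limit = "SP switch adapter" if delivered == GRF_SP_ADAPTER else "senders"
    print("%d SP sender(s): %4d Mb/s offered, %4d Mb/s delivered (%s limited)"
          % (senders, offered, delivered, limit))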
On a couple of occasions we found the GRF's date and time to be off by as much as a month. We were unable to determine the cause, although it appears to happen on reboot.
A software bug caused broadcasts out the HIPPI interface to fill the ARP table and bring down the interface. The temporary solution was to disable outbound broadcasts on the HIPPI interface.
In a large ATM infrastructure managing PVCs is an unwieldy process. In our environment with 3+ switches, a router, and many ATM hosts, creating PVCs between the SP and various ATM hosts would require us to define many PVCs. Since this would be a network management headache we feel that SVCs are essential for large ATM environments wanting to integrate the GRF.
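To give a sense of scale, a PVC has to be provisioned as a cross-connect on every ATM switch along the path for every GRF-to-host pair, so the manual configuration burden grows with both the host count and the path length (the host count below is hypothetical):

# Rough count of manually configured cross-connects for a PVC-only setup.
atm_hosts        = 40   # hypothetical number of ATM-attached peers
switches_on_path = 3    # our backbone has 3+ switches
print(atm_hosts * switches_on_path, "cross-connects to define and maintain")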
Additional information on how to enable ARP on the switch would have been useful.
In an environment such as ours, where high-speed
connectivity from an SP complex to external hosts is required, nothing
can compete with the GRF for performance. Using the GRF to connect multiple
SP switches would also provide unrivaled networking performance.
October 1998