Voltaire Fabric Collective Accelerator™ (FCA™)

Features & Benefits

  • Reduce the runtime of MPI collective operations by up to 10 times
  • Improve collective function scalability above and beyond any proprietary interconnect
  • Eliminate congestion caused by collective traffic
  • No need for any additional hardware to install or manage
  • No space/power/cooling penalty
  • Seamless integration with MPI job scheduler
  • Zero provisioning penalty (parallel to job scheduler initialization)
  • Supports Open MPI 1.4.1 & up and Platform MPI 8.0 & up

Enabling Extreme Application Scalability

Voltaire Fabric Collective Accelerator™ (FCA™) software enables Voltaire switches to offload significant parts of group communication onto the switching fabric and Voltaire Unified Fabric Manager™ (UFM™) software to orchestrate an efficient, topology-based collective flow. Working in concert, these products ensure that all bottlenecks are removed, at both the node and interconnect levels, and only a single message is generated over each physical wire. The computational acceleration is achieved transparently without requiring changes to the application.

Previous solutions for accelerating collectives over standards-based clusters relied on host-based offload, which addresses only the small part of the challenge that resides at the node level. The Voltaire fabric-wide solution leverages Voltaire’s unique switch design, which incorporates a CPU on every module, to eliminate congestion and reduce latency throughout the fabric.

The MPI Scalability Challenge

Collective functions in the MPI API involve communication between all processes in a process group (which can mean the entire process pool or a program-defined subset). These calls are often useful at the beginning or end of a large distributed calculation, where each process operates on part of the data and the partial results are then combined into a single result. Examples of popular collective functions are MPI_Barrier, MPI_Bcast (broadcast), MPI_Reduce and MPI_Allreduce.
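To make the semantics of these calls concrete, the following is a minimal sketch in plain Python that models what each rank holds after a broadcast, a reduce and an allreduce. It is an illustration only: the function names bcast, reduce_to_root and allreduce are hypothetical helpers invented here, the list index stands in for the rank, and no real MPI library is called.

```python
from functools import reduce
import operator

# Toy model: values[i] is the data held by rank i before the collective.
# The return value is what each rank holds afterwards.

def bcast(values, root):
    """Every rank ends up with the root's value (the role MPI_Bcast plays)."""
    return [values[root]] * len(values)

def reduce_to_root(values, root, op=operator.add):
    """Only the root receives the combined result (compare MPI_Reduce)."""
    combined = reduce(op, values)
    return [combined if rank == root else None for rank in range(len(values))]

def allreduce(values, op=operator.add):
    """Every rank receives the combined result (compare MPI_Allreduce)."""
    combined = reduce(op, values)
    return [combined] * len(values)

partial_sums = [10, 20, 30, 40]  # one partial result per rank
print(bcast(partial_sums, root=0))           # [10, 10, 10, 10]
print(reduce_to_root(partial_sums, root=0))  # [100, None, None, None]
print(allreduce(partial_sums))               # [100, 100, 100, 100]
```

In every case all ranks in the group take part in the call, which is why the cost of these operations grows with the size of the job.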

The performance of collective communication operations is known to have a significant impact on the scalability of some applications. Indeed, the global, synchronous nature of some collective operations directly implies that they will become the bottleneck when scaling to thousands of ranks (where a rank is an MPI process, typically running on a single core).

A common approach to this challenge improves the implementation of MPI collective operations by using intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s) to the NIC. However, these hardware-based implementations are typically limited in coverage and flexibility, and cannot support the growing variety of functions and topologies.

Slashing MPI Job Runtime with FCA

Voltaire designed Fabric Collective Accelerator (FCA) with the aim of scaling out fabrics and improving performance from an application perspective, not just a server/network perspective.

Using the FCA algorithm with Voltaire switches and UFM ensures a single message per physical wire for any collective function, as opposed to the potentially hundreds or thousands of messages per wire generated by traditional algorithms for collective function handling. This non-blocking collective architecture allows InfiniBand to scale collective communication to thousands of nodes better than any interconnect (standard or proprietary) on the market.
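The difference in per-wire message counts can be illustrated with a toy model. The sketch below assumes a simple two-level tree (nodes attached to leaf switches, leaf switches attached to one spine) and compares a naive reduce, in which every node sends its own message to the root node, with a reduce in which each switch combines its inputs and forwards one message. The topology, the function names flat_reduce and aggregated_reduce, and the wire labels are all assumptions made for illustration; this is not the FCA algorithm itself.

```python
from collections import Counter

ROOT = (0, 0)  # (leaf switch index, node index) of the rank collecting the result

def flat_reduce(num_leaves, nodes_per_leaf):
    """Count messages per wire when every node sends directly to the root."""
    wire = Counter()
    for leaf in range(num_leaves):
        for node in range(nodes_per_leaf):
            if (leaf, node) == ROOT:
                continue
            wire[("node", leaf, node, "to leaf", leaf)] += 1
            if leaf != 0:
                # Message must cross the spine to reach the root's leaf switch.
                wire[("leaf", leaf, "to spine")] += 1
                wire[("spine", "to leaf", 0)] += 1
            wire[("leaf", 0, "to root node")] += 1
    return wire

def aggregated_reduce(num_leaves, nodes_per_leaf):
    """Count messages per wire when each switch combines its inputs first."""
    wire = Counter()
    for leaf in range(num_leaves):
        for node in range(nodes_per_leaf):
            if (leaf, node) != ROOT:
                wire[("node", leaf, node, "to leaf", leaf)] += 1
        if leaf != 0:
            wire[("leaf", leaf, "to spine")] += 1  # one combined message
    if num_leaves > 1:
        wire[("spine", "to leaf", 0)] += 1
    wire[("leaf", 0, "to root node")] += 1
    return wire

# 32 leaf switches x 32 nodes = 1024 nodes
print(max(flat_reduce(32, 32).values()))        # 1023
print(max(aggregated_reduce(32, 32).values()))  # 1
```

In this model the busiest wire in the naive scheme carries one message per participating node, while in-fabric combining keeps every wire at a single message regardless of job size.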

As a result, this well-integrated, application-oriented, fabric-wide solution can reduce the runtime of collective operations by more than 10X, resulting in up to a 2X reduction in total application runtime.
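How a 10X collective speedup translates into total application runtime depends on how much of the runtime the application spends in collectives. The following sketch applies the standard Amdahl's law relationship to make that dependency explicit; the function names and the example fractions are illustrative assumptions, not measured FCA results.

```python
def overall_speedup(collective_fraction, collective_speedup):
    """Amdahl's law: total speedup when only the collective part is accelerated."""
    remaining = (1.0 - collective_fraction) + collective_fraction / collective_speedup
    return 1.0 / remaining

def fraction_needed(target_speedup, collective_speedup):
    """Fraction of runtime that must be spent in collectives to hit a target speedup."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / collective_speedup)

# An application spending 20% of its time in collectives, accelerated 10X:
print(round(overall_speedup(0.20, 10), 2))  # 1.22

# Share of runtime in collectives needed for a 2X total gain at 10X:
print(round(fraction_needed(2, 10), 3))     # 0.556
```

Under this model, a 2X total reduction at a 10X collective speedup corresponds to an application that spends roughly 56% of its runtime in collective operations; applications with a smaller collective share would see a proportionally smaller gain.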

FCA slashes runtime of collective communications
