Parallel Computing

Volume 81, January 2019, Pages 32-57

Planning for performance: Enhancing achievable performance for MPI through persistent collective operations

https://doi.org/10.1016/j.parco.2018.08.001

Highlights

  • Definition and motivation for persistent collective communication in MPI.

  • Discussion of the prototype implementation, LibPNBC.

  • Design rationale for the API as it was designed and proposed for the standard.

  • A clear state transition diagram that explains the behavior of persistent operations and requests.

  • Performance results with a realistic benchmark and a complete application.

  • Discussion of related work (including extensions for Endpoints).

  • Discussion of potential future work (including extensions to MPI I/O).

  • Appendix A: API illustrations.

  • Appendix B: counter-examples for achieving persistence without a new API.

Abstract

Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, significant additional optimizations can be revealed through the use of persistence (planned transfers) not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, the prototype implementation LibPNBC (based on LibNBC), and the MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example application (based on a 3D conjugate gradient solver), illustrating the potential performance enhancements for such operations. Persistent operations enable MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., to support overlap of communication with computation and/or other communication). Further enhancement of the current reference implementation, as well as additional opportunities to enhance performance through the application of these new APIs, comprises future work.

Introduction

Advantages of nonblocking collective communication [5], [25], [26] in MPI [31] have been established over the past quarter century and even predate MPI-1 [30]. For data-parallel and regular computations with fixed communication patterns, more optimizations can be revealed through the use of persistence (planned transfers) not currently available in the MPI-3 interface except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1.

MPI presently defines the concepts of collective communication and persistent operations. However, the current MPI standard only supports persistent point-to-point operations. Support for persistent collective operations was not considered for MPI-1 since nonblocking collective operation prerequisites were not originally supported, nor were they accepted when first proposed in MPI-2. Nonblocking collective operations are now part of the MPI standard (since MPI-3.0) and, as a result, persistent collective operations are now under consideration for standardization. Several of the authors of this paper have an active proposal in the MPI Forum for precisely this functionality.

Semantically, a blocking point-to-point MPI operation is equivalent to a nonblocking point-to-point MPI operation combined with a “wait” that blocks until it is completed (i.e., a call to the function MPI_WAIT).

Syntactically, nonblocking operations in MPI are designated by prefixing the blocking function with “I” for immediate or incomplete—indicating that the call returns immediately, even if the operation has not yet completed. In addition, nonblocking operations in MPI (with some exceptions, for example, the probe routines and some of the nonblocking RMA routines) return a parameter (of type MPI_Request) to enable the subsequent call to MPI_WAIT (or any other completion function).
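
For concreteness, the following minimal C sketch contrasts the two forms; the buffer, count, destination, and tag names are illustrative only and not taken from the paper:

    #include <mpi.h>

    /* Blocking form: returns only once the send buffer may safely be reused. */
    void send_blocking(double *buf, int count, int dest, int tag, MPI_Comm comm)
    {
        MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
    }

    /* Equivalent nonblocking form: the "I"-prefixed call returns immediately
     * and hands back an MPI_Request; completion is established by MPI_WAIT. */
    void send_nonblocking(double *buf, int count, int dest, int tag, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, comm, &req);
        /* ... independent computation may overlap the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }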

Semantically, a nonblocking point-to-point MPI operation is equivalent to a persistent point-to-point MPI operation that is started exactly once and is automatically freed upon completion. In other words, a nonblocking point-to-point function call is equivalent to the sequence of a persistent point-to-point initialization function call plus a call to the function MPI_START. Furthermore, a completion function call for a nonblocking point-to-point operation is equivalent to the sequence of a completion function call for a persistent point-to-point operation plus a call to the function MPI_REQUEST_FREE.

Syntactically, persistent operations in MPI are designated by suffixing the blocking function with “_INIT” for initialization, thereby indicating that the call will only initialize the operation but not actually start an instance of it.
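
This equivalence can be written out directly; the sketch below (again with illustrative argument names) plans a persistent point-to-point send, starts it exactly once, and frees the request, matching the behavior of a single nonblocking send:

    #include <mpi.h>

    void send_persistent_once(double *buf, int count, int dest, int tag, MPI_Comm comm)
    {
        MPI_Request req;
        /* Planning step only: no data is transferred yet. */
        MPI_Send_init(buf, count, MPI_DOUBLE, dest, tag, comm, &req);
        /* MPI_Isend is equivalent to MPI_Send_init followed by MPI_Start. */
        MPI_Start(&req);
        /* Completing a nonblocking send is equivalent to completing the
         * persistent instance and then freeing its request. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }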

Syntax and semantics consistent with the point-to-point definitions have been accepted into the MPI standard for nonblocking collective operations, on a one-to-one basis with the blocking collective operations as well as with the neighborhood collective operations introduced in MPI-3.0. Here, we offer additional syntax and semantics for persistent collective operations based on these extant nonblocking collective operations. As with persistent point-to-point operations, persistent collective operations enable the MPI programmer to specify a collective communication operation with the same argument list that is repeatedly executed within a parallel computation. Persistent collective operations allow an MPI implementation to select a more efficient way to perform the collective operation, based on the parameters specified at initialization, than might otherwise be possible when a collective operation is issued without prior planning. This “planned-transfer” approach can offer significant performance benefits for programs with repetitive communication patterns in which the only change between invocations is the data buffer contents. The trade-off is that the operation must be precisely the same; for instance, counts and offsets cannot change from call to call.
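
As a sketch of this planned-transfer pattern, the fragment below assumes the proposed persistent collective initialization interface (here MPI_Allreduce_init, mirroring MPI_Iallreduce with an additional MPI_Info argument); the argument names and loop structure are illustrative, and only the buffer contents change between starts:

    #include <mpi.h>

    void iterate_with_persistent_allreduce(double *sendbuf, double *recvbuf,
                                           int count, int iterations, MPI_Comm comm)
    {
        MPI_Request req;
        /* Plan once: the implementation may choose an algorithm and set up
         * resources based on these fixed arguments. */
        MPI_Allreduce_init(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                           comm, MPI_INFO_NULL, &req);
        for (int i = 0; i < iterations; ++i) {
            MPI_Start(&req);
            /* ... overlap computation with the collective here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req); /* one-off step to free the planned operation */
    }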

First, we provide the motivations and use cases for this work in Section 2. The overall design of the persistent collective application programmer interfaces (APIs), and the rationale for choosing the specific semantics, are provided in Section 3. The prototype implementation of LibPNBC (based on LibNBC [20]) is described in Section 4, and the MPI-4 standardization status of persistent nonblocking collective operations is covered in Section 5. We provide early performance results in Section 6, using a modified version of NBCBench and an example illustrating the potential performance enhancements for such operations. This section also describes the potential benefits of persistent collectives for deep learning training and demonstrates early performance benefits of persistent collectives using the Baidu allreduce benchmark [2], [3]. Further enhancement of the current reference implementation, and additional opportunities to enhance performance through the application of these new APIs, are described in Section 7.

Section snippets

Motivations and use cases

The primary objective of persistence is to exploit temporal locality present in highly iterated operations; that is, those operations in which the same arguments are passed consistently among all processes participating in the operation. Several optimization opportunities accrue from the ability to persist and reuse communication operations in MPI, notably including the following:

  1. Rather than recreating an MPI request handle each time, a single request object is created and reused for each

Design

In MPI, a persistent operation consists of a one-off planning step, followed by zero or more pairs of start and completion operations (that transfer data), and finally a one-off step to recover (or free) the resources used. Persistent operations for point-to-point communication have been specified as part of MPI since MPI-1 [30]. However, this semantic has not yet officially been extended to other MPI operations, such as collective communication, single-sided communication, or file I/O. In this

Prototype implementation

Implementation of persistent nonblocking collectives, as carried out in this research [11], borrows heavily from the Modular Component Architecture (MCA) in Open MPI [13], in particular the LibNBC component [20]. The MCA provides extensibility to much of MPI’s specified functionality through the use of compiler toolchains, file nomenclature, and protocol that guides build automation. Our strategy leverages this automation and the existing capabilities of LibNBC by replicating its code into a

State of standardization

Together with the prototype described here (a required step for acceptance to the standard), a full proposal for persistent collectives has been presented over the past several meetings of the MPI-3/MPI-4 Forum. A formal reading was completed at the February 2018 MPI Forum meeting, and overall finalization is expected in 2018. All conceptual issues, syntax, semantics, and concerns from previous discussions and feedback have been addressed.

Results and discussions

A suitable starting point for performance evaluation and benchmarking of LibPNBC is NBCBench [23], the corresponding microbenchmark used initially to evaluate LibNBC. The use of NBCBench to measure the impact of persistence on collective operations leverages proven benchmarking methods [23], and allows for direct comparison and validation of results with prior benchmark tests. This fits well with the main objectives of this research, which are to measure overhead, determine if there is any

Future work

Here we consider two aspects of future work: the prototype implementation of LibPNBC, and the extension of the MPI-4 standard to add persistence more broadly to its APIs.

Conclusion

In this paper, we presented the design of the first prototype implementation of persistent collective communication operations targeted for MPI-4. This model implementation, LibPNBC, is based on LibNBC. We also based our initial implementation on Open MPI; porting to other MPI middleware systems remains as future work.

We justified why the persistent API syntax and semantics lead to useful expressions of optimizable collective operations, and why these properties cannot be inferred locally for a

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants Nos. CCF-1562659, CCF-1562306, CCF-1617690, CCF-1822191, CCF-1821431, OAC-1541310, and CNS-1229282. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was part-funded by the European Union’s Horizon 2020 Framework Programme Research and Innovation programme

References (38)

  • V. Bala et al.

    The IBM external user interface for scalable parallel systems

    Parallel Comput.

    (1994)
  • T. Hoefler et al.

    Optimization principles for collective neighborhood communications

    Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

    (2012)
  • A.A. Awan et al.

    S-caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters

    Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    (2017)
  • Baidu, Baidu DeepBench, 2017a, (https://github.com/baidu-research/DeepBench) Accessed:...
  • Baidu, Baidu DeepBench on KNL-OPA systems, 2017b,...
  • Baidu, Baidu TensorFlow, 2017c, (https://github.com/baidu-research/tensorflow-allreduce) Accessed:...
  • J. Bruck et al.

    Efficient algorithms for all-to-all communications in multiport message-passing systems

    IEEE Trans. Parallel Distributed Syst.

    (1997)
  • A. Carpen-Amarie et al.

    On the expected and observed communication performance with MPI derived datatypes

    Proceedings of the 23rd European MPI Users’ Group Meeting, EuroMPI 2016, Edinburgh, United Kingdom, September 25–28, 2016

    (2016)
  • I.A. Comprés, On-line application-specific tuning with the Periscope tuning framework and the MPI tools interface,...
  • D. Das et al.

    Distributed deep learning using synchronous stochastic gradient descent

    CoRR

    (2016)
  • dholmes-epcc-ed-ac-uk, INFO query, 2016. INFO query issue #63 in MPI Forum (created 7 October 2016, accessed 30...
  • dholmes-epcc-ed-ac-uk, LibPNBC - persistent collectives for Open MPI, 2017. Open MPI Pull Request #4515 (created 18...
  • R.P. Dimitrov

    Overlapping of communication and computation and early binding: Fundamental mechanisms for improving parallel performance on clusters of workstations

    (2001)
  • E. Gabriel et al.

    Open MPI: Goals, concept, and design of a next generation MPI implementation

    Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary

    (2004)
  • E. Gallardo et al.

    MPI advisor: A minimal overhead tool for MPI library performance tuning

    Proceedings of the 22nd European MPI Users’ Group Meeting

    (2015)
  • R. Ganian et al.

    Polynomial-time construction of optimal MPI derived datatype trees

    2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23–27, 2016

    (2016)
  • I. Goodfellow et al.

    Deep Learning

    (2016)
  • M. Hatanaka, M. Takagi, A. Hori, Y. Ishikawa, Offloaded MPI persistent collectives using persistent generalized request...
  • T. Hoefler et al.

    sPIN: high-performance streaming processing in the network
