MPI Symmetric Mode is widely used in systems equipped with Intel® Xeon Phi™ coprocessors. In a system where one or more coprocessors are installed on an Intel® Xeon® host, the Transmission Control Protocol (TCP) is used by default for MPI messages sent between the host and the coprocessors, or between coprocessors on the same host. For some critical applications, this MPI communication may not be fast enough.
In this blog, I show how we can improve MPI intra-node communication (between the Intel® Xeon® host and the Intel® Xeon Phi™ coprocessors) by installing the OFED stack and using the Direct Access Programming Library (DAPL) as the fabric instead. Even when the host does not have an InfiniBand* Host Channel Adapter (HCA) installed, the DAPL fabric can still be used to transfer MPI messages via scif0, a virtual InfiniBand* interface.
On an Intel® Xeon® E5-2670 system running Linux* kernel version 2.6.32-279 and equipped with two Intel® Xeon Phi™ C0-stepping 7120 coprocessors (named mic0 and mic1), I installed MPSS 3.3.2 and the Intel® MPI Library 5.0 on the host. The Intel® MPI Library includes the benchmark tool IMB-MPI1. For illustration purposes, I ran the Intel® MPI Benchmarks Sendrecv test before and after installing the OFED stack and compared the results. This test runs with two processes: each process sends a message to, and receives a message from, the other process. The tool reports the bidirectional bandwidth.
To run the test, I copied the coprocessor version of the Intel® MPI Benchmark tool (IMB-MPI1) to the coprocessors:
# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic0:/tmp
# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic1:/tmp
I enabled host-coprocessor and coprocessor-coprocessor communication:
# export I_MPI_MIC=enable
# /sbin/sysctl -w net.ipv4.ip_forward=1
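Optionally, you can verify that both settings took effect; these commands simply print back the current values (for net.ipv4.ip_forward, a value of 1 means forwarding is enabled):
# echo $I_MPI_MIC
# /sbin/sysctl net.ipv4.ip_forward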
For the first test, I ran the benchmark Sendrecv between host and mic0:
# mpirun -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 \
Sendrecv : -host mic0 -n 1 /tmp/IMB-MPI1
benchmarks to run Sendrecv
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.0, MPI-1 part
#------------------------------------------------------------
# Date : Mon Nov 24 12:26:53 2014
# Machine : x86_64
# System : Linux
# Release : 2.6.32-279.el6.x86_64
# Version : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version : 3.0
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 129.93 129.93 129.93 0.00
1 1000 130.12 130.12 130.12 0.01
2 1000 130.48 130.48 130.48 0.03
4 1000 130.63 130.63 130.63 0.06
8 1000 130.25 130.25 130.25 0.12
16 1000 130.40 130.40 130.40 0.23
32 1000 126.92 126.92 126.92 0.48
64 1000 121.18 121.18 121.18 1.01
128 1000 119.91 119.92 119.91 2.04
256 1000 118.83 118.83 118.83 4.11
512 1000 139.81 139.83 139.82 6.98
1024 1000 146.87 146.88 146.87 13.30
2048 1000 153.28 153.28 153.28 25.48
4096 1000 146.91 146.91 146.91 53.18
8192 1000 159.63 159.64 159.63 97.88
16384 1000 212.52 212.55 212.53 147.03
32768 1000 342.03 342.08 342.05 182.70
65536 640 484.54 484.78 484.66 257.85
131072 320 808.74 809.64 809.19 308.78
262144 160 1685.54 1688.78 1687.16 296.07
524288 80 2862.96 2875.35 2869.16 347.78
1048576 40 4978.17 5026.92 5002.55 397.86
2097152 20 8871.96 9039.75 8955.85 442.49
4194304 10 16531.30 17194.01 16862.65 465.28
# All processes entering MPI_Finalize
The above table shows the average time and bandwidth for different message lengths.
Next, I ran the benchmark to collect data between mic0 and mic1:
# mpirun -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv : -host mic1 -n 1 /tmp/IMB-MPI1
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 210.77 210.77 210.77 0.00
1 1000 212.45 212.69 212.57 0.01
2 1000 218.84 218.84 218.84 0.02
4 1000 209.84 209.84 209.84 0.04
8 1000 212.45 212.47 212.46 0.07
16 1000 208.90 209.15 209.03 0.15
32 1000 227.80 228.07 227.94 0.27
64 1000 223.61 223.62 223.62 0.55
128 1000 210.82 210.83 210.83 1.16
256 1000 211.61 211.61 211.61 2.31
512 1000 214.33 214.34 214.34 4.56
1024 1000 225.15 225.16 225.15 8.67
2048 1000 317.98 318.28 318.13 12.27
4096 1000 307.00 307.32 307.16 25.42
8192 1000 320.62 320.82 320.72 48.70
16384 1000 461.89 462.26 462.08 67.60
32768 1000 571.72 571.76 571.74 109.31
65536 640 1422.02 1424.80 1423.41 87.73
131072 320 1758.98 1759.17 1759.08 142.11
262144 160 4234.41 4234.99 4234.70 118.06
524288 80 5433.75 5453.23 5443.49 183.38
1048576 40 7511.45 7560.68 7536.06 264.53
2097152 20 12764.95 12818.46 12791.71 312.05
4194304 10 22333.29 22484.09 22408.69 355.81
# All processes entering MPI_Finalize
In the second phase, I downloaded the OFED stack OFED-3.5.2-MIC.gz from https://www.openfabrics.org/downloads/ofed-mic/ofed-3.5-2-mic/ and installed it (refer to Section 2.4 in the readme file) in order to use the DAPL fabric.
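The installation procedure itself is described in Section 2.4 of the readme. As a rough sketch, and assuming the download is a gzipped tar archive that extracts into a directory of the same name (an assumption on my part; follow the readme for the actual steps), unpacking would look like this:
# tar xzf OFED-3.5.2-MIC.gz
# cd OFED-3.5.2-MIC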
I started the OFED services and used ofa-v2-scif0 as the DAPL provider (the file /etc/dat.conf lists all available DAPL providers):
# service openibd start
# service ofed-mic start
# service mpxyd start
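At this point the scif0 provider should be registered; a quick way to confirm that it appears in /etc/dat.conf (assuming the default location of that file) is:
# grep scif0 /etc/dat.conf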
I enabled the DAPL fabric and specified the DAPL provider:
# export I_MPI_FABRICS=dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
With DAPL configured, I repeated the test between the host and mic0. Note that when the environment variable I_MPI_DEBUG is set, the output of an MPI program shows the underlying fabric used for communication:
# mpirun -genv I_MPI_DEBUG 2 -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 \
Sendrecv : -host mic0 -n 1 /tmp/IMB-MPI1
[0] MPI startup(): Single-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
benchmarks to run Sendrecv
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.0, MPI-1 part
#------------------------------------------------------------
# Date : Mon Nov 24 15:05:55 2014
# Machine : x86_64
# System : Linux
# Release : 2.6.32-279.el6.x86_64
# Version : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version : 3.0
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 19.11 19.11 19.11 0.00
1 1000 20.08 20.08 20.08 0.09
2 1000 20.09 20.09 20.09 0.19
4 1000 20.19 20.19 20.19 0.38
8 1000 19.89 19.89 19.89 0.77
16 1000 19.99 19.99 19.99 1.53
32 1000 21.37 21.37 21.37 2.86
64 1000 21.39 21.39 21.39 5.71
128 1000 22.40 22.40 22.40 10.90
256 1000 22.73 22.73 22.73 21.48
512 1000 23.34 23.34 23.34 41.84
1024 1000 25.33 25.33 25.33 77.11
2048 1000 27.48 27.49 27.49 142.11
4096 1000 33.70 33.72 33.71 231.72
8192 1000 127.15 127.16 127.16 122.88
16384 1000 133.82 133.84 133.83 233.49
32768 1000 156.29 156.31 156.30 399.85
65536 640 224.67 224.70 224.69 556.30
131072 320 359.13 359.20 359.16 696.00
262144 160 174.61 174.66 174.63 2862.76
524288 80 229.66 229.76 229.71 4352.29
1048576 40 303.32 303.55 303.44 6588.60
2097152 20 483.94 484.30 484.12 8259.35
4194304 10 752.81 753.69 753.25 10614.46
# All processes entering MPI_Finalize
Similarly, I ran the benchmark to collect data between mic0 and mic1:
# mpirun -genv I_MPI_DEBUG 2 -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv : -host mic1 -n 1 /tmp/IMB-MPI1
[0] MPI startup(): Single-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 30.13 30.13 30.13 0.00
1 1000 20.28 20.28 20.28 0.09
2 1000 20.43 20.43 20.43 0.19
4 1000 20.38 20.39 20.39 0.37
8 1000 20.70 20.70 20.70 0.74
16 1000 20.84 20.85 20.84 1.46
32 1000 21.79 21.80 21.79 2.80
64 1000 21.61 21.62 21.62 5.65
128 1000 22.63 22.63 22.63 10.79
256 1000 23.20 23.21 23.20 21.04
512 1000 24.74 24.74 24.74 39.47
1024 1000 26.14 26.15 26.15 74.69
2048 1000 28.94 28.95 28.94 134.95
4096 1000 44.01 44.02 44.01 177.49
8192 1000 149.33 149.34 149.34 104.63
16384 1000 192.89 192.91 192.90 162.00
32768 1000 225.52 225.52 225.52 277.13
65536 640 319.88 319.89 319.89 390.76
131072 320 568.12 568.20 568.16 439.99
262144 160 390.62 390.68 390.65 1279.81
524288 80 653.20 653.26 653.23 1530.78
1048576 40 1215.85 1216.10 1215.97 1644.61
2097152 20 2263.20 2263.70 2263.45 1767.02
4194304 10 4351.90 4352.00 4351.95 1838.24
# All processes entering MPI_Finalize
The MPI bandwidth improves significantly when running over the DAPL fabric. For example, for a 4 MB message, the host-coprocessor bandwidth went from 465.28 MB/sec to 10,614.46 MB/sec (roughly a 23x improvement), and the coprocessor-coprocessor bandwidth went from 355.81 MB/sec to 1,838.24 MB/sec (roughly 5x). The only changes needed to obtain this improvement were installing the OFED stack and configuring the DAPL fabric, regardless of whether an InfiniBand* HCA is installed on the host.
Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, the Intel logo, Cilk, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012 Intel Corporation. All rights reserved.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
This sample source code is released under the BSD 3 Clause License.
Copyright (c) 2012-2013, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.