Improving MPI Communication between the Intel® Xeon® Host and Intel® Xeon Phi™

MPI Symmetric Mode is widely used in systems equipped with Intel® Xeon Phi™ coprocessors. In a system where one or more coprocessors are installed on an Intel® Xeon® host, the Transmission Control Protocol (TCP) is used by default for MPI messages sent between the host and the coprocessors, or between coprocessors on that same host. For some critical applications this MPI communication may not be fast enough.

In this blog, I show how to improve MPI intra-node communication (between the Intel® Xeon® host and the Intel® Xeon Phi™ coprocessor) by installing the OFED stack so that the Direct Access Programming Library (DAPL) can be used as the fabric instead. Even when the host does not have an InfiniBand* Host Channel Adapter (HCA) installed, the DAPL fabric can still transfer MPI messages via scif0, a virtual InfiniBand* interface.

On an Intel® Xeon® E5-2670 system running Linux* kernel version 2.6.32-279 and equipped with two Intel® Xeon Phi™ C0-stepping 7120 coprocessors (named mic0 and mic1), I installed MPSS 3.3.2 and the Intel® MPI Library 5.0 on the host. The Intel® MPI Library includes the benchmark tool IMB-MPI1. For illustration purposes, I ran the Intel® MPI Benchmarks Sendrecv test before and after installing the OFED stack and compared the results. In this test, run with two processes, each process sends a message to and receives a message from the other process, and the tool reports the bidirectional bandwidth.
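
As a quick sanity check, the same test can first be run with two ranks on the host alone. The -time flag referred to in the IMB output below caps the run time per message size; the value used here is arbitrary:

# mpirun -n 2 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv -time 2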

To run the test, I copied the coprocessor version of the Intel® MPI Benchmark tool (IMB-MPI1) to the coprocessors:

# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic0:/tmp

# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic1:/tmp
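
Each coprocessor needs this MIC-native build of the benchmark in a path visible to its rank. A quick check that the binaries landed where mpirun will look for them (ssh access to the coprocessors is part of the standard MPSS setup):

# ssh mic0 ls -l /tmp/IMB-MPI1
# ssh mic1 ls -l /tmp/IMB-MPI1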

 

I enabled host-coprocessor and coprocessor-coprocessor communication:

 

# export I_MPI_MIC=enable

# /sbin/sysctl -w net.ipv4.ip_forward=1
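
Neither setting is persistent: the export applies to the current shell and the sysctl change lasts until reboot. Optionally, verify that IP forwarding is on and that the coprocessors answer over the virtual network (mic0 and mic1 are the MPSS host names used throughout this post):

# /sbin/sysctl net.ipv4.ip_forward
# ping -c 1 mic0
# ping -c 1 mic1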

 

For the first test, I ran the Sendrecv benchmark between the host and mic0. The two parts of the command, separated by a colon, launch one rank on the host and one rank on mic0:

 

# mpirun -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 \

Sendrecv : -host mic0 -n 1 /tmp/IMB-MPI1

 

benchmarks to run Sendrecv

#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0, MPI-1 part   
#------------------------------------------------------------
# Date                  : Mon Nov 24 12:26:53 2014
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-279.el6.x86_64
# Version               : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
 
# Calling sequence was:

# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM 
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000       129.93       129.93       129.93         0.00
            1         1000       130.12       130.12       130.12         0.01
            2         1000       130.48       130.48       130.48         0.03
            4         1000       130.63       130.63       130.63         0.06
            8         1000       130.25       130.25       130.25         0.12
           16         1000       130.40       130.40       130.40         0.23
           32         1000       126.92       126.92       126.92         0.48
           64         1000       121.18       121.18       121.18         1.01
          128         1000       119.91       119.92       119.91         2.04
          256         1000       118.83       118.83       118.83         4.11
          512         1000       139.81       139.83       139.82         6.98
         1024         1000       146.87       146.88       146.87        13.30
         2048         1000       153.28       153.28       153.28        25.48
         4096         1000       146.91       146.91       146.91        53.18
         8192         1000       159.63       159.64       159.63        97.88
        16384         1000       212.52       212.55       212.53       147.03
        32768         1000       342.03       342.08       342.05       182.70
        65536          640       484.54       484.78       484.66       257.85
       131072          320       808.74       809.64       809.19       308.78
       262144          160      1685.54      1688.78      1687.16       296.07
       524288           80      2862.96      2875.35      2869.16       347.78
      1048576           40      4978.17      5026.92      5002.55       397.86
      2097152           20      8871.96      9039.75      8955.85       442.49
      4194304           10     16531.30     17194.01     16862.65       465.28


# All processes entering MPI_Finalize

The above table shows the average time and bandwidth for different message lengths.
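
The reported bandwidth counts traffic in both directions and appears to be computed as 2 × message size / t_max, with MB meaning 2^20 bytes. For example, for the 4 MB row above:

# awk 'BEGIN { printf "%.2f\n", 2 * 4194304 / 17194.01e-6 / 1048576 }'
465.28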

Next, I ran the benchmark to collect data between mic0 and mic1:

# mpirun -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv :  -host mic1 -n 1 /tmp/IMB-MPI1

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000       210.77       210.77       210.77         0.00
            1         1000       212.45       212.69       212.57         0.01
            2         1000       218.84       218.84       218.84         0.02
            4         1000       209.84       209.84       209.84         0.04
            8         1000       212.45       212.47       212.46         0.07
           16         1000       208.90       209.15       209.03         0.15
           32         1000       227.80       228.07       227.94         0.27
           64         1000       223.61       223.62       223.62         0.55
          128         1000       210.82       210.83       210.83         1.16
          256         1000       211.61       211.61       211.61         2.31
          512         1000       214.33       214.34       214.34         4.56
         1024         1000       225.15       225.16       225.15         8.67
         2048         1000       317.98       318.28       318.13        12.27
         4096         1000       307.00       307.32       307.16        25.42
         8192         1000       320.62       320.82       320.72        48.70
        16384         1000       461.89       462.26       462.08        67.60
        32768         1000       571.72       571.76       571.74       109.31
        65536          640      1422.02      1424.80      1423.41        87.73
       131072          320      1758.98      1759.17      1759.08       142.11
       262144          160      4234.41      4234.99      4234.70       118.06
       524288           80      5433.75      5453.23      5443.49       183.38
      1048576           40      7511.45      7560.68      7536.06       264.53
      2097152           20     12764.95     12818.46     12791.71       312.05
      4194304           10     22333.29     22484.09     22408.69       355.81


# All processes entering MPI_Finalize

In the second phase, I downloaded the OFED stack OFED-3.5.2-MIC.gz from https://www.openfabrics.org/downloads/ofed-mic/ofed-3.5-2-mic/ and installed it (refer to Section 2.4 of its README file) in order to use the DAPL fabric.

I started the OFED services. For the DAPL fabric I used ofa-v2-scif0 as the provider (the file /etc/dat.conf lists all available DAPL providers):

# service openibd start
# service ofed-mic start
# service mpxyd start
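
Before rerunning the benchmark, it is worth confirming that the DAPL provider entry is present and that the SCIF-backed virtual InfiniBand* device shows up (ibv_devices is part of the OFED user-space tools; the device name it reports may differ from system to system):

# grep scif0 /etc/dat.conf
# ibv_devices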

I enabled the DAPL fabric and specified the DAPL provider:

# export I_MPI_FABRICS=dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
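
These exports only affect MPI jobs launched from the current shell; the same settings can be passed per run with -genv on the mpirun command line, as is done for I_MPI_DEBUG below. For jobs that also place several ranks on the same host or coprocessor, Intel MPI accepts the combined setting shm:dapl, which keeps shared memory for ranks sharing a device:

# export I_MPI_FABRICS=shm:dapl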

With DAPL configured, I repeated the test between the host and mic0. Note that when the environment variable I_MPI_DEBUG is set, the output of an MPI program shows the underlying protocol used for communication:

# mpirun -genv I_MPI_DEBUG 2 -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 \
Sendrecv : -host mic0 -n 1 /tmp/IMB-MPI1
 

[0] MPI startup(): Single-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
 benchmarks to run Sendrecv
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Nov 24 15:05:55 2014
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-279.el6.x86_64
# Version               : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time


# Calling sequence was:

# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        19.11        19.11        19.11         0.00
            1         1000        20.08        20.08        20.08         0.09
            2         1000        20.09        20.09        20.09         0.19
            4         1000        20.19        20.19        20.19         0.38
            8         1000        19.89        19.89        19.89         0.77
           16         1000        19.99        19.99        19.99         1.53
           32         1000        21.37        21.37        21.37         2.86
           64         1000        21.39        21.39        21.39         5.71
          128         1000        22.40        22.40        22.40        10.90
          256         1000        22.73        22.73        22.73        21.48
          512         1000        23.34        23.34        23.34        41.84
         1024         1000        25.33        25.33        25.33        77.11
         2048         1000        27.48        27.49        27.49       142.11
         4096         1000        33.70        33.72        33.71       231.72
         8192         1000       127.15       127.16       127.16       122.88
        16384         1000       133.82       133.84       133.83       233.49
        32768         1000       156.29       156.31       156.30       399.85
        65536          640       224.67       224.70       224.69       556.30
       131072          320       359.13       359.20       359.16       696.00
       262144          160       174.61       174.66       174.63      2862.76
       524288           80       229.66       229.76       229.71      4352.29
      1048576           40       303.32       303.55       303.44      6588.60
      2097152           20       483.94       484.30       484.12      8259.35
      4194304           10       752.81       753.69       753.25     10614.46


# All processes entering MPI_Finalize

Similarly, I ran the benchmark to collect data between mic0 and mic1:

# mpirun -genv I_MPI_DEBUG 2 -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv : -host mic1 -n 1 /tmp/IMB-MPI1

[0] MPI startup(): Single-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode

 

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        30.13        30.13        30.13         0.00
            1         1000        20.28        20.28        20.28         0.09
            2         1000        20.43        20.43        20.43         0.19
            4         1000        20.38        20.39        20.39         0.37
            8         1000        20.70        20.70        20.70         0.74
           16         1000        20.84        20.85        20.84         1.46
           32         1000        21.79        21.80        21.79         2.80
           64         1000        21.61        21.62        21.62         5.65
          128         1000        22.63        22.63        22.63        10.79
          256         1000        23.20        23.21        23.20        21.04
          512         1000        24.74        24.74        24.74        39.47
         1024         1000        26.14        26.15        26.15        74.69
         2048         1000        28.94        28.95        28.94       134.95
         4096         1000        44.01        44.02        44.01       177.49
         8192         1000       149.33       149.34       149.34       104.63
        16384         1000       192.89       192.91       192.90       162.00
        32768         1000       225.52       225.52       225.52       277.13
        65536          640       319.88       319.89       319.89       390.76
       131072          320       568.12       568.20       568.16       439.99
       262144          160       390.62       390.68       390.65      1279.81
       524288           80       653.20       653.26       653.23      1530.78
      1048576           40      1215.85      1216.10      1215.97      1644.61
      2097152           20      2263.20      2263.70      2263.45      1767.02
      4194304           10      4351.90      4352.00      4351.95      1838.24


# All processes entering MPI_Finalize

The MPI bandwidth improves significantly when the DAPL fabric is used. For example, for a message length of 4 MB, host-coprocessor bandwidth went from 465.28 MB/sec to 10,614.46 MB/sec, and coprocessor-coprocessor bandwidth went from 355.81 MB/sec to 1,838.24 MB/sec. Obtaining this improvement only required installing the OFED stack and enabling the DAPL fabric; it does not matter whether an InfiniBand* HCA is installed on the host.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.


Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, Cilk, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright © 2012 Intel Corporation. All rights reserved.

 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
 

This sample source code is released under the BSD 3 Clause License. 

Copyright (c) 2012-2013, Intel Corporation

All rights reserved.

 

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

    * Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND

ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED

WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION BE LIABLE FOR ANY

DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES

(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;

LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND

ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT

(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS

SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

 

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.  These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations.  Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.  Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.  Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. 

Notice revision #20110804

 

 

 

 

 

