Meshcentral - Secure Intel AMT IDE-R Virus Scan

August 27, 2014, 3:19 pm

Latest and popular articles on Intel Technologies

≫ Next: Meshcentral - New Agent Site with File Management and WebRTC transfers

Meshcentral continues to lead the way in cloud based security usages. Thanks to work from Jacob Gauthier, Meshcentral can now securely boot a trusted Linux operating system using Intel® AMT IDE-R and perform a AV scan of all attached disks on a remote system over the cloud. That is right, we now extended the Intel AMT IDE redirect feature of Meshcentral so that you could use it to trigger a trusted remote AV scan. Why is this interesting?

In most cases, anti-virus software run on the same operating system that is the target of viruses and malware. A better way to go is to boot a separate trusted operating system that would then scan the drives. The operating system would have to be sent over a trusted channel and use a set of tools that are downloaded and integrity checked. Today, we are announcing that we did just that. We use Intel AMT IDE Redirect feature as a way to remotely boot a trusted operating system, we then download ClamAV an open source anti-virus software that then automatically runs on all attached drives. This new feature builds on top of the Meshcentral cloud IDE-R support we announced a few weeks ago. The trusted Linux operating system is built on-the-fly into a single use ISO image that is then sent over the cloud to the target machine. Intel AMT is required to make all this work.

Jacob Gauthier built an innovative “package stuffing” system. Once the basic recovery OS is running, we want to try to limit IDE-R data transfer to boost boot speed. The recovery OS will check local disk storage or HTTP or IDE-R to get required application packages. The recovery OS checks the package hashes and pushes packages into local storage for future use. As a result, you always get the fastest possible boot speed over the cloud with the remote computer locally caching much of the data.

Check out our video demonstration and talk on this new feature:

Youtube: Overview of Meshcentral support for IDE-R (6 minutes)

With this release, Meshcentral continues blaze the path forward for innovative security usages. With just a few clicks, administrators can remotely run fully secure AV scans on machines. Intel AMT IDE-R session works over CIRA or agent relay making it easier than ever to perform an out-of-band AV scan over the cloud.

Questions and feedback appreciated,
Ylian Saint-Hilaire
info.meshcentral.com

In this YouTube video, Jacob Gauthier and myself demonstrate and talk about
Meshcentral Intel® AMT IDE-R feature and the new package stuffing system for accelerated boot.

Performing a trusted AV scan on a remote machine over the cloud has never been easier. With just a few
clicks you can remotely boot and launch a the scan using a fully verified trusted recovery OS.

Meshcentral uses an innovative “package stuffing” system to keep the IDE-R session fast.
Usage packages like anti-virus and others are pulled from local disk, HTTP or IDER-R and hash checked.
If downloaded & validated, they are pushed back into local storage for future use.

↧

Meshcentral - New Agent Site with File Management and WebRTC transfers

September 4, 2014, 11:07 am

Latest and popular articles on Intel Technologies

≫ Next: Four new virtualization technologies on the latest Intel® Xeon - are you ready to innovate?

≪ Previous: Meshcentral - Secure Intel AMT IDE-R Virus Scan

In the past months we have made huge improvements in making Meshcentral work for computer management and especially Internet-of-Things usages. I am very happy to relase the Mesh Agent v1.88 with improved local management on HTTPS port 16990. While many people know about Meshcentral.com, they don’t often know that the mesh agent itself hosts a small web site on port 16990 that can be used to login and remotely manage the device without cloud management. This local web site already offered remote desktop, remote terminal and GPIO control. Today we are announcing improvements to all of these and the addition to remote file management. Basically, it’s like having secure a web based file transfer tool build into all mesh devices.

The new file management tab allows you to navigate drives, create folders, rename and delete files. But the best is the new WebRTC upload / download feature. Users can now hit a button or drag & drop files into the page and have them uploaded to the remote device. You can also click a file or select many files and hit download to perform batch downloads. For internet-of-things devices like Intel Galileo boards, managing and transferring files into the device has never been easier. Once setup, point a browser to HTTPS:16990 and login. Works on any platform, you just need a WebRTC browser like Firefox, Chrome or Opera.

Some people will notice that we are doing all this while keeping the mesh agent very small and the local web site is blazing fast even when served from IoT devices. This is all because of Bryan Roe’s amazing Web Site Compiler. This most excellent tool takes a development web site and performs a series of steps that results in a single C header file with a set of highly optimized web sites. All you need to do is #define the web site you want to use at compile time. What’s great is the at site is actually a single compressed file that is served to the browser in compressed form. The mesh agent’s 289k web site is packed into a 30k block that is embedded inside the agent’s source code.

Check out our video demonstration and talk on this new feature:

Youtube: Meshcentral.com - Mesh Agent site and Web Site Compiler (6 minutes, 52 seconds)

Questions and feedback appreciated,
Ylian Saint-Hilaire
info.meshcentral.com

A demonstration of the new Mesh Agent site with File management and
a chat with Bryan Roe who authored the new web site compiler tool.
- Click here to watch video -

All new WebRTC based file management on the Mesh Agent local web site.
It’s like having a file management tool built into every mesh device.

The Web Site Compiler is the secret trick that makes the Mesh Agent’s local management so
amazing small and fast. All the steps for each build is all done in one click, takes less than 1 second.
The original site is developer with conditional markers for inclusion in select devices.

↧

Four new virtualization technologies on the latest Intel® Xeon - are you ready to innovate?

September 8, 2014, 12:00 am

Latest and popular articles on Intel Technologies

≫ Next: General Introduction to Networking Technologies for the Packet Processing Plane

≪ Previous: Meshcentral - New Agent Site with File Management and WebRTC transfers

Four new virtualization technologies on the latest Intel® Xeon - are you ready to innovate?

(By: Sunil Jain, Virtualization Marketing Manager, Intel Data Center Group)

In case you didn’t catch the news - the latest Intel® Xeon® E5-2600 v3 Product Family (formerly codename Haswell) has added four new technologies to the already strong Intel® Virtualization Technology (Intel VT) portfolio. Doesn’t matter whether you use containers and virtual machines (VMs), or you focus on servers, storage or networking, or you work in the cloud or enterprise or some hybrid environment – with the new Intel® VT technologies, you are in for a treat.

The additions are:

Cache Monitoring Technology (CMT)
Virtual Machine Control Structure (VMCS) Shadowing
Logging of the Accessed and Dirty bits in Extended Page Tables (EPT A/D bits), and
Data Direct IO (DDIO) enhancements

Intel® Xeon® E5-2600 v3 product family - coupled with Intel’s XL710 10/40GbE Ethernet controllers (code name Fortville), P3700 series enterprise SSDs (code name Fultondale), software optimizations (e.g. Intel^® DPDK and Intel^® CAS) and a broad hypervisor (HV) support in the industry – is moving virtualization into a whole new level of sophistication… the question is if you are ready to innovate and ride the wave!

Here is a brief overview of the new Intel® VT technologies:

Cache Monitoring Technology (CMT) - allows flexible real time monitoring of the last level cache (LLC) occupancy on per core, per thread, per application or per VM basis. Read the raw value from the IA32_QM_CTR register, multiply by a factor given in the CPUID field CPUID.0xF.1:EBX to convert to bytes, and voila! This monitoring can be quite useful in detecting the cache hungry “noisy neighbors,” characterizing the quiet threads, profiling the workloads in multi-tenancy environments, advancing cache-aware scheduling and/or all of the above. Based on the CMT readings, schedulers can take subsequent intelligent actions to move and balance the loads to meet any service level agreement (SLA) in a policy driven manner. Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) volume-3 chapter-17.14 provides the CMT programming details. CMT reference code is also available for evaluation under BSD license. For commercial use, please use the CMT cgroup and perf monitoring code being upstreamed for Linux, and both KVM and Xen.

VMCS Shadowing - accelerates nested virtualization - basically a hypervisor in a hypervisor. The root HV privileges are extended to the guest HV. Thanks to the acceleration that the shadow VMCS provides, a guest software can run with minimal performance impact and without needing any modification. But why would you do that? Because this technology enables you to consolidate heterogeneous application VMs, containers, and workloads within a single super host VM. You could reduce your cost of using the cloud by extracting more benefit from a single licensed host VM – “virtualization of the cloud” if you will. Your cloud service providers (CSP) could make you feel more empowered in controlling your HV and software choices without intervention from the CSP. Other practical use cases include creating web based labs, software development and test environments, trainings, make shift arrangements during migration, disaster recovery, rapid prototyping, and reduction of security attack surfaces, etc. VMCS Shadowing code is upstreamed in KVM-3.1 and Xen-4.3 onwards. More than 58% reduction in kernel build time, >50% reduction in cpu signaling, and >125% increase in IO throughput have been reported on Haswell with VMCS Shadowing applied to nested virtualization test cases. Please refer to Intel (SDM) volume-3 chapter-24 for VMCS Shadowing programming details.

Extended Page Table Accessed and Dirty bits (EPT A/D bits)– This technology improves performance during memory migration and creates interesting opportunities for virtualized fault tolerance usages. You probably already understand that guest OS expects contiguous physical memory, and the host VMM must preserve this illusion. EPT maps guest physical address to host address that allows guest OS to modify its own page tables freely, minimizes VM exits and saves memory. The new addition of (A)ccessed and (D)irty flag bits in EPT further optimizes the VM Exits during live migration, especially when high-freq resetting of permission bits is required. Up to date memory is pre-migrated leaving only the most recently modified pages to be migrated at the final migration stage. In turn, this minimizes the migration overhead and the migrated VM downtime. EPT(A) bits code has been upstreamed in KVM-3.6 and Xen-4.3; and EPT(D) bits code up-streaming is in the works. Programing details for EPT A/D bits can be found in Intel SDM volume-3, chapter-28.

Data Direct IO Enhancements - improve application bandwidth, throughput and CPU utilization. Now in addition to targeting the LLC for IO traffic, you can also control the LLC way assignment to specific cores. On Haswell, a direct memory access (DMA) transaction can end up in 8 ways of the LLC without hitting the memory first. Because both the memory and in-cache utilization due to networking IO is reduced, the IO transaction rate per socket improves, latency shrinks and power is saved. Cloud and data center customers can profusely benefit from the increased IO virtualization throughput performance. Storage targets and appliances can practically eliminate the need of full offload solutions. Data Plane application and appliance makers can improve and optimize transaction rates, especially for small packets and UDP transactions. DDIO use cases galore. For a detailed discussion about your specific application, please do contact your local Intel representative.

Happy virtualizing with the latest Intel® Xeon® E5-2600 v3 Product Family! At Intel, we’ll be eagerly waiting to hear about all those cool innovations and new businesses that you’ll be building around these newly introduced virtualization technologies. Comments are very welcome!

↧

General Introduction to Networking Technologies for the Packet Processing Plane

September 22, 2014, 1:08 pm

Latest and popular articles on Intel Technologies

≫ Next: Working with Mellanox* InfiniBand Adapter on System with Intel® Xeon Phi™ Coprocessors

≪ Previous: Four new virtualization technologies on the latest Intel® Xeon - are you ready to innovate?

Efficiently integrating network hardware and software components to provide a unified solution to support different workloads can be a daunting task within the data center environment. The network infrastructure typically will include a wide range of network elements (Switches, routers, firewalls, VPN etc.) on different, proprietary architectures with different configuration interfaces and they must operate as an integrated whole. Deployment, maintenance, integration, vendor support, and scaling to support new and varied requirements are common problems. It is easy to imagine the benefits of being able to design the majority of network functions on a single architecture providing a common code base and development tools.

To solve this problem Intel laid the groundwork with a 4:1 workload consolidation strategy by integrating a unified architecture solution that covers four key communications workloads; Application Processing, Control Processing, Packet Processing, and Signal Processing. Intel® processors, network adapters, chipsets, and switch silicon work together to tackle the Application and Control Processing Planes. The Intel® QuickAssist Technology, Intel® Data Plane Development Kit (Intel® DPDK), and Hyperscan tools work together to handle functions in the Packet Processing Plane. Lastly the Intel® Media SDK and Intel® System Studio tool suites assist with integration of the Signal Processing Plane. From a network and datacenter infrastructure perspective these components can help move from proprietary solutions to a general purpose “off the shelf” solution for savings in cost and maintenance and improvements in flexibility.

Figure 1 – The communications infrastructure consolidates four workloads simultaneously on an Intel® processor based platform.

Hardware Requirements

For best performance in an enterprise environment, you will need Intel® Xeon® E5-2600v2 or better processors with the 8920 to 8950 communications chipset. Together these are the hardware requirements for an optimally performing, unified solution. For workloads that may not be as processor-intensive, we recommend Intel® Atom™ processor C2000 product family, specifically product models with an 8 at the end of the product model number (i.e. C2758, C2738, etc.) which include the integrated communications hardware.

We will focus our discussion on the Packet Processing Plane. Once you have the required hardware these key components can be utilized to take advantage of the 4:1 strategy:

Intel® QuickAssist Technology accelerates cryptographic workloads by offloading the data to hardware capable of optimizing those functions. This makes it easier for developers to integrate built-in cryptographic accelerators into their designs. Intel® QuickAssist Technology is enabled for direct access or via open source frameworks. The integrated hardware acceleration includes support for the following Ciphers: AES, DES/3DES, Kasumi, RC4, Snow3G; Authentication: MD5, SHA1, SHA2, AES-XCBC; Public Key: Diffie-Hellman, RSA, DSA, ECC.

Intel® Data Plane Development Kit (Intel® DPDK) is a set of optimized data plane software libraries and drivers that can be used to accelerate packet processing on Intel® architecture. The performance of Intel® DPDK scales with improvements in processor technology from Intel® Atom^TM to Intel® Xeon® processors, and is offered under the open source BSD* license. Intel® DPDK can also be very useful when incorporated within virtualized environments.

A recent trend in software defined networks is increasing demand for fast host based packet handling and a move towards Network Functions Virtualization (NFV). NFV is a new way to provide network functions such as firewalls, domain name service and network address translation as a fully virtualized infrastructure. One example of this is Open vSwitch which is an open source solution capable of providing virtual switching. Intel® DPDK has been combined with Open vSwitch to provide an accelerated experience. For more information see Intel® DPDK vSwitch.

In the white paper NEC* Virtualized EPC Innovation Powered by Multi Core Intel Architecture Processors, NEC* was able to deploy a virtualized Evolved Packet Core (vEPC), which is a framework for converging data and voice on 4G Long-Term Evolution (LTE) networks, on a common Intel® architecture server platform and achieve carrier grade service. NEC adopted the Intel® DPDK for its vEPC in order to significantly improve the data plane forwarding performance in a virtualization environment.

Aspera* and Intel investigated ultra-high-speed data transfer solutions built on Aspera’s fasp* transport technology and the Intel® Xeon® processor E5-2600 v3 product family. The solution was able to achieve predictable ultra-high WAN transfer speeds on commodity Internet connections, on both bare metal and virtualized hardware platforms, including over networks with hundreds of milliseconds of round-trip time and several percentage points of packet loss characteristic of typical global-distance WANs. By using Intel® DPDK, software engineers were able to reduce the number of memory copies needed to send and receive a packet. This enabled Aspera to boost single stream data transfer speeds to 37.75 Gbps on the tested system**, which represents network utilization of 39 Gbps, when Ethernet framing and IP packet headers are accounted for. The team also began preliminary investigation of the transfer performance on virtualized platforms by testing on a kernel-based virtual machine (KVM) hypervisor and obtained initial transfer speeds of 16.1 Gbps. The KVM solution was not yet NUMA or memory optimized, and thus the team expects to obtain even faster speeds as it applies these optimizations in the future. For details about performance findings, system specifications, software specifications, etc. see the white paper Big Data Technologies for Ultra-High-Speed data Transfer and Processing.

Hyperscan is a software pattern matching library that can match large groups of regular expressions against blocks or streams of data. This library is ideal for applications that need to scan large amounts of data at high speed, such as Intrusion Prevention (IPS), Antivirus (AV), Unified Threat Management (UTM) and Deep Packet Inspection (DPI). Hyperscan runs entirely in software and is supported on a wide range of Intel® processors and operating systems.

Large scale pattern matching can be of value in several areas including deep packet inspection of data in real time. A practical use of this capability is deployment of firewalls with intrusion detection/prevention systems and network anti-virus and malware scanners. These articles go into a deeper discussion on this topic for both the Intel® Atom^TM processor in the white paper Delivering 36Gbps DPI (Pattern Matching) Throughput on the Intel® Atom^TMProcessor C2000 Product Family using HyperScan and for the Intel® Xeon® processor in the white paper Delivering 160Gbps DPI Performance on the Intel® Xeon® Processor E5-2600 Series using HyperScan. Intel® DPDK and Intel® QuickAssist can be combined with Hyperscan in these use cases to further increase network performance and hardware cryptographic offloading such as with Virtual Private Networks.

In addition, using Hyperscan to look at DPI can extend into the virtual environment such as with this example of a Next-Generation IPS on a software-defined data center (SDDC). Hyperscan is capable of scaling within a virtual environment, allowing you to allocate additional virtual machines as desired to improve performance. This can be of value in cases where your infrastructure grows over time or when you might need to adjust resources to meet a service level agreement.

Hyperscan is part of Intelligent Network Platform, please contact Wind River* for additional information.

Resources

Intel® System Studio: Main Website

https://software.intel.com/en-us/intel-system-studio

Intel® System Studio: Signal Processing Use Case

https://software.intel.com/sites/default/files/managed/09/46/signal-processing-with-intel-cilk-plus-2up.pdf

Intel® Media SDK: Main Website

https://software.intel.com/en-us/vcsource/tools/media-sdk-clients

Hyperscan: General Information

http://www.intel.com/content/www/us/en/communications/content-inspection-hyperscan-video.html

Hyperscan: Intel Atom Processor C2000 Product Family Use Case

http://www.intel.com/content/www/us/en/communications/atom-c2000-hyperscan-pattern-matching-brief.html

Hyperscan: Intel Xeon Processor Product Family Use Case

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/160gbps-dpi-performance-using-intel-architecture-paper.pdf

Hyperscan is Part of Wind River* Intelligent Network Platform

http://www.windriver.com/announces/intelligent-network-platform/

Intel® DPDK: Overview

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/dpdk-packet-processing-ia-overview-presentation.html

Intel® DPDK: Installation and Configuration Guide

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-getting-started-guide.html.

Intel® DPDK: Programmer’s Guide

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-programmers-guide.html.

Intel® DPDK: API Reference Documentation

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-api-reference.html.

Intel DPDK: Sample Applications

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-sample-applications-user-guide.html.

Intel® DPDK: Latest Source Code Packages for the Intel® DPDK Library

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/dpdk-source-code.html

Intel® QuickAssist Technology: General information

http://www.intel.com/content/www/us/en/io/quickassist-technology/quickassist-technology-developer.html?wapkw=quick+assist+technology

Intel® Communications Chipset 8900 to 8920 Series Software Programmer’s Guide

http://www.intel.com/content/www/us/en/embedded/technology/quickassist/communications-chipset-8900-8920-software-programmers-guide.html?wapkw=8920

Reference Designs Using Intel Hardware and Software for Network (Intel® Network Builders)

http://networkbuilders.intel.com/

** The specifications of the platform are as follows, for more details see this white paper

Big Data Technologies for Ultra-High-Speed data Transfer and Processing.

Hardware

Intel® Xeon® processor E5-2650 v2 (eight cores at 2.6 GHz with hyperthreading)
128-GB DDR3-1333 ECC (16 x 8 GB DIMM)
Twelve Intel® Solid-State Drives DC S3700 series (800GB, 6Gb/s, 2.5" MLC per server)
Two Intel® Ethernet Converged Network Adapters X520-DA2 (dual port, 10G NIC with four ports total)
Two Intel® Integrated RAID Modules RMS25PB080 (PCIe2 x8 with direct attach to disk)

Software

DPDK 1.4 from dpdk.org
Prototype Aspera Fasp* sender and receiver with Intel® DPDK integrated
XFS File system
1MB RAID Stripe size for 12MB blocks

*Other names and brands may be claimed as the property of others.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk, and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

This sample source code is released under the Intel Sample Source Code License Agreement

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

↧

Working with Mellanox* InfiniBand Adapter on System with Intel® Xeon Phi™ Coprocessors

September 23, 2014, 10:09 am

Latest and popular articles on Intel Technologies

≫ Next: The Ultimate Intel® Black Belt Software Developer Summit Experience

≪ Previous: General Introduction to Networking Technologies for the Packet Processing Plane

InfiniBand* is a network communications protocol commonly used in the HPC area because the protocol offers very high throughput. Intel and Mellanox* are among the most popular InfiniBand* adapter manufacturers. In this blog, I will share my experience of installing and testing Mellanox* InfiniBand* adapter cards with three different versions of OFED* (Open Fabrics Enterprise Distribution), OpenFabrics OFED-1.5.4.1, OpenFabrics OFED-3.5.2-mic and Mellanox* OFED 2.1, on systems containing Intel® Xeon Phi™ coprocessors.

In order to allow native applications on the coprocessors to communicate with the Mellanox* InfiniBand adapters, the Coprocessor Communication Link (CCL) must be enabled. All three mentioned above OFED stacks support CCL when used with the Mellanox* InfiniBand adapters.

1. Hardware Installation

Two systems, each equipped with an Intel® Xeon® E5-2670 2.60 GHz processor and two Intel® Xeon Phi™ coprocessors, were used. Both systems were running RHEL 6.3. They had Gigabit Ethernet adapters and were connected through a Gigabit Ethernet router.

Prior to the test, both systems were power-off and one Mellanox* ConnectX-3 VPI InfiniBand Adapter card was installed into an empty PCIe slot in each machine. Since there were only two systems, the ports of the adapters were connected using an InfiniBand cable with no intervening router (a back-to-back connection). After powering up the two systems, the “lspci” command was used on each system to check that the Mellanox* InfiniBand Adapter cards were correctly detected:

# lspci | grep Mellanox
84:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

The first field of the output (84:00.0) shows the PCI bus slot number; the second field shows the slot name (Network controller); the last field shows the device name, which is the Mellanox* InfiniBand Adapter in this case.

2. Installing MPSS with OFED Support

After installing the Mellanox*adapter successfully, each of the three different OFED stacks for use with Mellanox* InfiniBand Adapter supported under MPSS 3.3 (OFED-1.5.4.1, OFED-3.5.2-mic and Mellanox* OFED 2.1) was installed following the directions in the MPSS User Guide shipped with MPSS 3.3. Each of the three different OFED stacks was installed on a different hard drive, allowing the different versions to be verified independently without requiring complete uninstallation of the previous version between tests.

2.1 OFED 1.5.4.1

Follow the steps described in section 2.3 to download from www.openfabrics.org. Need to have zlib-devel and tcl-devel packages in order to install OFED-1.5.4.1.
Install the basic MPSS 3.3 according to section 2.2 in readme file.
Install Intel MPSS OFED from the folder mpss-3.3/ofed. The warning message is expected and can be ignored.

# cd mpss3-3
# cp ofed/modules/*`uname –r`*.rpm ofed
# rpm –Uvh ofed/*.rpm
warning: ofed/ofed-ibpd-3.3-r0.glibc2.12.2.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID d536787c: NOKEY
Preparing...             ########################################### [100%]
1:dapl                   ########################################### [ 11%]
2:ofed-ibpd              ########################################### [ 22%]
3:ofed-driver-2.6.32-279.
                         ########################################### [ 33%]
4:libibscif              ########################################### [ 44%]
5:libibscif-devel        ########################################### [ 56%]
6:ofed-driver-devel-2.6.3
                         ########################################### [ 67%]
7:dapl-devel             ########################################### [ 78%]
8:dapl-utils             ########################################### [ 89%]
9:dapl-devel-static      ########################################### [100%]

Reboot the system

2.2 OFED 3.5.2-MIC

Follow the instructions in User’s Guide Section 2.4 to install OFED-3.5.2-MIC from https://www.openfabrics.org/downloads/ofed-mic/ofed-3.5-2-mic/ and then reboot the system. Note that this package installation is different from the previous one and there is no need to install the additional package in mpss-3.3/ofed/.

2.3 Mellanox* OFED 2.1

Follow the instructions in User’s Guide Section 2.5 to install Mellanox* OFED 2.1.x:

From www.mellanox.com, navigate to Products > Software > InfiniBand VPI Driver, and download Mellanox OpenFrabrics Enterprise Distribution for Linux OFED software: MLNX_OFED_LINUX-2.1-1.0.6.-rhel6-3–x86_64.tgz
Untar and read the documentation
# tar xvf MLNX_OFED_LINUX-2.1-1.0.6.-rhel6-3–x86_64.tgz
# cd MLNX_OFED_LINUX-2.1-1.0.6.-rhel6-3–x86_64
Install the following packages tcl, tk and libnl-devel from the RHEL installation disk.
Install the stack:
```
# ./mlnxofedinstall
```
Install Intel MPSS OFED ibpd rmp:
```
# rpm –U mpss-3.3/ofed/ofed-ibpd*.rpm
```

From the mpss-3.3/src folder, compile dapl, libibscif and ofed-driver source RPMs:

# rpmbuild –rebuild –define “MOFED 1” mpss-3.3/src/dapl*.src.rpm mpss-3.3/src/libibscif*.src.rpm mpss-3.3/src/ofed-driver*.src.rpm

Install the resultant RPMs now in $HOME/rpmbuild/RPMS/x86_64

# ls $HOME/rpmbuild/RPMS/x86_64
dapl-2.0.42.2-1.el6.x86_64.rpm
dapl-devel-2.0.42.2-1.el6.x86_64.rpm
dapl-devel-static-2.0.42.2-1.el6.x86_64.rpm
dapl-utils-2.0.42.2-1.el6.x86_64.rpm
libibscif-1.0.0-1.el6.x86_64.rpm
libibscif-devel-1.0.0-1.el6.x86_64.rpm
ofed-driver-2.6.32-279.el6.x86_64-3.3-1.x86_64.rpm
ofed-driver-devel-2.6.32-279.el6.x86_64-3.3-1.x86_64.rpm
# rpm –U $HOME/rpmbuild/RPMS/x86_64/*.rpm

Reboot the system

3. Starting MPSS with OFED Support

For the versions of OFED discussed here, the process for starting the MPSS with OFED support is the same regardless of the version used. This section describes the steps used to bring up the MPSS and all InfiniBand related services. Prior to performing these steps, password-less SSH login to the coprocessors was set up following the instructions in section 2.5 of the MPSS readme file. The following steps where performed before testing for each version of OFED used:

First, if the MPSS service is not running, start it:

# service mpss start
Starting Intel(R) MPSS:                                    [  OK  ]
mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)

Next, start the InfiniBand and HCA service:

# service openibd start
Starting psmd:                                             [  OK  ]
Setting up InfiniBand network interfaces:                  [  OK  ]
No configuration found for ib0
Setting up service network . . .                           [  done  ]

On one (and only one) of the systems, start the subnet manager for the InfiniBand network:

# service opensmd start
Starting IB Subnet Manager.                                [  OK  ]

Start the ibscif virtual adapter for the coprocessor:

#service ofed-mic start
Starting OFED Stack:
host                                                       [  OK  ]
mic0                                                       [  OK  ]
mic1                                                       [  OK  ]

Finally, start the CCL-proxy service:

#service mpxyd start
Starting mpxyd daemon:                                     [  OK  ]

4. Basic Testing

For each version of OFED tested, after starting the MPSS with OFED support as shown in section 3, the InfiniBand device on each host was queried using the command “ibv_devinfo”. The output from one of the hosts is shown with a virtual device scif0 and a physical adapter mlx4_0:

# ibv_devinfo
hca_id: scif0
        transport:                      iWARP (1)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe14:0033
        sys_image_guid:                 4c79:baff:fe14:0033
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1000
                        port_lmc:               0x00
                        link_layer:             Ethernet
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.30.8000
        node_guid:                      f452:1403:007c:bd30
        sys_image_guid:                 f452:1403:007c:bd33
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Similarly, executing the ibv_devinfo command on coprocessors mic0 and mic1 on each system provided the following output (note that the physical adapter mlx4_0 is shown because of CCL enabled):

# ssh mic0 ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.30.8000
        node_guid:                      f452:1403:007c:bd30
        sys_image_guid:                 f452:1403:007c:bd33
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe14:0032
        sys_image_guid:                 4c79:baff:fe14:0032
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             SCIF

# ssh mic1 ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.30.8000
        node_guid:                      f452:1403:007c:bd30
        sys_image_guid:                 f452:1403:007c:bd33
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe1a:03da
        sys_image_guid:                 4c79:baff:fe1a:03da
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1002
                        port_lmc:               0x00
                        link_layer:             SCIF

The existence of InfiniBand HCA and virtual scif in /sys/class/infiniband were verified:

# ls /sys/class/infiniband
mlx4_0  scif0

# ssh mic0 ls /sys/class/infiniband
mlx4_0
scif0

# ssh mic1 ls /sys/class/infiniband
mlx4_0
scif0

To display all InfiniBand host nodes, the command “ibhosts” was used. In this case, it shows two hosts, knightscorner5 and knightscorner7:

#ibhosts
Ca      : 0xf4521403007d2b90 ports 1 "knightscorner5 HCA-1"
Ca      : 0xf4521403007cbd30 ports 1 "knightscorner7 HCA-1"

The command iblink was used to display link information. In this case, the output shows one port on knightscorner5 (0xf4521403007d2b91) and one port on knightscorner7 (0xf4521403007cbd31):

#iblinkinfo
CA: knightscorner5 HCA-1:
      0xf4521403007d2b91      2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "knightscorner7 HCA-1" ( )
CA: knightscorner7 HCA-1:
      0xf4521403007cbd31      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "knightscorner5 HCA-1" ( )

The command ibstat was used to query the status of only the local InfiniBand available link on a host, which shows “LinkUp” status in this case:

# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.30.8000
        Hardware version: 0
        Node GUID: 0xf4521403007cbd30
        System image GUID: 0xf4521403007cbd33
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0xf4521403007cbd31
                Link layer: InfiniBand

Finally, the utility “ibping” (equivalent to the traditional Internet Protocol ping utility) was used to test the connectivity between InfiniBand nodes. To do this, the ibping server must first be started on the system to be pinged (knightscorner5 in this case):

# ibping –S

Then on the system knightscorner7, ibping was started using the specific port GUID for the destination (knightscorner5) system (0xf4521403007d2b91) as shown in the ibstat output:

# ibping -G 0xf4521403007d2b91
Pong from knightscorner5.(none) (Lid 2): time 0.106 ms
Pong from knightscorner5.(none) (Lid 2): time 0.105 ms
Pong from knightscorner5.(none) (Lid 2): time 0.135 ms
. . . . . . . . . . . . . . . . . . . . . . . . . . .

This confirmed the InfiniBand connectivity between the two systems.

5. Conclusion

This blog briefly showed how Mellanox* InfiniBand Adapter cards were installed on two systems and connected back-to-back with a cable. Then three different OFED stacks that support coprocessors were installed with the Intel® MPSS. A set of standard commands were used to bring the necessary services up in order to enable CCL. Finally, the configuration and connectivity were checked with simple InfiniBand test commands to make sure the hardware was working.

↧

The Ultimate Intel® Black Belt Software Developer Summit Experience

September 26, 2014, 12:16 pm

Latest and popular articles on Intel Technologies

≫ Next: Server-side Java* (JDK 8) provides optimized performance on Intel® Xeon® processor E5-2600 v3 product family

≪ Previous: Working with Mellanox* InfiniBand Adapter on System with Intel® Xeon Phi™ Coprocessors

Twenty eight Intel® Black Belt Software Developers journeyed to San Francisco for IDF2014 and to attend the sixth annual Black Belt Summit Networking/Dinner Event. As Black Belt Program manager I had the pleasure of greeting them and spending time with them during the week. Here I am driving down to the big city. Even though I live in the Bay Area I am nevertheless thrilled every time I cross this amazing bridge!

From the moment they arrived at the airport to when they departed, Intel had a number of surprises for attendees this year!

To begin with, Black Belts were picked up by a judogi-clad driver who met them with a gift and drove them to their hotel.

At the hotel another gift was in their rooms and they settled in to get ready for the opening of IDF at Moscone the next day.

These VIPs were on hand for BK’s Keynote and then proceeded to enjoy the sights, sounds and learning that is the Intel Developer Forum. Marketing Manager Russ Beutler interviewed about nine of our Black Belts on videos that will shortly show on Intel® Developer Zone. I heard many compliments for the Ultimate Make Space! They attended various parties also and on Wednesday they prepared to attend their special event.

This year’s Black Belt Developers were delighted when the bus pulled up to the California Academy of Sciences in San Francisco’s Golden Gate Park. They were met with refreshments and the opportunity to mingle with their peers and Intel Staff. Their names were up in lights!

After a tour of the facility, including a stop at the Living Roof, guests were escorted to the Planetarium lobby where Morris Beton welcomed and awarded our newest Black Belts – Christopher Price, our first Android Black Belt lives in the Bay Area; Gregor Biswanger, hailing from Germany is our a Windows/HTML Black Belt. Taylor Kidd and Ron Green, both come from DPD and were nominated for long-time work in HPC and parallel computing, and Gergana Slavova for her contributions in Cluster and MPI software development. We were then treated to a couple of Speed Talks given by Black Belts (Clay Breshears and Gregor Biswanger) followed by a gourmet sit-down dinner.

Then special guest speaker Genevieve Bell, Vice President, Intel Labs Director, User Experience Research Intel Fellow took the podium and presented “A prehistory of robots & why it still matters for today’s technologists.” Genevieve’s talk was both informative and entertaining and certainly the highlight of the evening.

Lastly, all were treated to a special showing of Dark Universe and chocolates!

Comments on the Summit and IDF week:

“Off the charts!”

“Kathy, I felt like a Rock Star last week.”

“I enjoyed IDF of course, but really appreciated the opportunity to talk with [fellow Black Belts].”

I think it’s safe to say a good time was had by all. Many thanks to my Summit Planning Partner, Black Belt Program Marketing Manager Russ Beutler. Now, the question is, what shall we do next year?

In the meantime, might you be one of the elite group at Summit time next year? For more info check out http://software.intel.com/blackbelt.

↧

Server-side Java* (JDK 8) provides optimized performance on Intel® Xeon® processor E5-2600 v3 product family

November 5, 2014, 10:53 am

Latest and popular articles on Intel Technologies

≫ Next: Intel® Black Belt Software Developers on Film!

≪ Previous: The Ultimate Intel® Black Belt Software Developer Summit Experience

The introduction of Java* SE Development Kit (JDK) 8 in March 2014 is regarded as the most significant set of changes to the Java platform since the initial Java 1.0 release in 1995. Through co-engineering and joint innovation by Oracle and Intel, JDK 8 is highly tuned for performance on Intel® Xeon® processors.

With the release of the Intel® Xeon® processor E5-2600 v3 product family, dramatically accelerates Java workloads while also delivering energy-efficiency improvements.

↧

Intel® Black Belt Software Developers on Film!

November 6, 2014, 3:37 pm

Latest and popular articles on Intel Technologies

≫ Next: Developer API Documentation for Intel® Performance Counter Monitor

≪ Previous: Server-side Java* (JDK 8) provides optimized performance on Intel® Xeon® processor E5-2600 v3 product family

In the last couple of months we have captured a number of Black Belts on video. Some were done in the studio, some at September's Intel Developer Forum. No matter the circumstances and method, they are great opportunities to check the pulse of the members of our elite group and to find out what they are doing as well as how they feel about Intel technology and the Industry overall. Thanks to BB Marketing Manager Russ Beutler for working with me on this.

Recent Black Belt Stars:

↧

Developer API Documentation for Intel® Performance Counter Monitor

July 24, 2014, 6:10 am

Latest and popular articles on Intel Technologies

≫ Next: Introduction to OpenMP* on YouTube

≪ Previous: Intel® Black Belt Software Developers on Film!

The Intel® Performance Counter Monitor (Intel® PCM: www.intel.com/software/pcm) is an open-source tool set based on an API. This API can be used directly by developers in their software. Besides the API usage example in the article, other samples of code using the API can be found in pcm.cpp, pcm-tsx.cpp, pcm-power.cpp, pcm-memory.cpp and other sample tools contained in Intel PCM package.

An important resource for learning about Intel PCM API can be found in the embedded Doxygen documentation. For example it lists all functions to extract available processor metrics supported by Intel PCM. Generating HTML browsable documentation with Doxygen from the source code is very easy: a Doxygen project file is already in Intel PCM package and most of the source code is annotated with Doxygen tags (description of function parameters, return values, etc).

Here are the steps to generate the documentation:

Download doxygen tool for your operating system from www.doxygen.org (Doxygen is available on many operating systems including Windows, Linux, MacOS X, etc)
Run doxygen in the Intel PCM directory
Open generated html/classPCM.html in your favourite browser
Click on the classes and structure of your interest, browse class hierarchies, functions implementing access to processor metrics, etc

For the current Intel PCM 2.6, there is already documentation made available here.

Best regards,

Roman

↧

Introduction to OpenMP* on YouTube

December 3, 2014, 11:25 am

Latest and popular articles on Intel Technologies

≫ Next: Improving MPI Communication between the Intel® Xeon® Host and Intel® Xeon Phi™

≪ Previous: Developer API Documentation for Intel® Performance Counter Monitor

Tim Mattson (Intel), has authored an extensive series of excellent videos as in introduction to OpenMP*. Not only does he walk through a series of programming exercises in C, he also starts with a background introduction on parallel programming.

Check out the series: https://www.youtube.com/watch?v=nE-xN4Bf8XI&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=27

The slide set can be obtained here: http://openmp.org/mp-documents/Intro_To_OpenMP_Mattson.pdf

The exercise files can be downloaded here: http://openmp.org/mp-documents/Mattson_OMP_exercises.zip

↧

Improving MPI Communication between the Intel® Xeon® Host and Intel® Xeon Phi™

December 9, 2014, 11:07 am

Latest and popular articles on Intel Technologies

≫ Next: Enabling Virtual Machine Control Structure Shadowing On A Nested Virtual Machine With The Intel® Xeon® E5-2600 V3 Product Family

≪ Previous: Introduction to OpenMP* on YouTube

MPI Symmetric Mode is widely used in systems equipped with Intel® Xeon Phi™ coprocessors. In a system where one or more coprocessors are installed on an Intel® Xeon® host, Transmission Control Protocol (TCP) is used for MPI messages sent between the host and coprocessors or between coprocessors on that same host. For some critical applications this MPI communication may not be fast enough.

In this blog, I show how we can improve the MPI Intra-Node communication (between the Intel® Xeon® host and Intel® Xeon Phi™ Coprocessor) by installing the OFED stack in order to use the Direct Access Programming Library (DAPL) as a fabric instead. Even when the host does not have an InfiniBand* Host Channel Adapter (HCA) installed, the DAPL fabric can still be used to transfer MPI messages via scif0, to a virtual InfiniBand* interface.

On an Intel® Xeon® E5-2670 system running the Linux* kernel version 2.6.32-279 and equipped with two Intel® Xeon Phi™ C0 stepping 7120 coprocessors (named mic0 and mic1), I installed MPSS 3.3.2 and Intel® MPI Library 5.0 on the host. Included in the Intel® MPI Library the benchmark tool IMB-MPI1. For illustration purposes, I ran the Intel MPI Benchmark Sendrecv before and after installing the OFED stack obtained results for comparison. In this test used with two processes, each process sends a message and receives a message from the other process. The tool reports the bidirectional bandwidth.

To run the test, I copied the coprocessor version of the Intel® MPI Benchmark tool (IMB-MPI1) to the coprocessors:

# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic0:/tmp

# scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic1:/tmp

I enabled host-coprocessor and coprocessor-coprocessor communication:

# export I_MPI_MIC=enable

# /sbin/sysctl -w net.ipv4.ip_forward=1

For the first test, I ran the benchmark Sendrecv between host and mic0:

# mpirun -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 \

Sendrecv : -host mic0 -n 1 /tmp/IMB-MPI1

benchmarks to run Sendrecv

#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Nov 24 12:26:53 2014
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-279.el6.x86_64
# Version               : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time

# Calling sequence was:

# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]   Mbytes/sec
                  0         1000       129.93       129.93       129.93          0.00
                  1         1000      130.12       130.12       130.12          0.01
                  2         1000        130.48       130.48       130.48           0.03
                  4         1000       130.63       130.63       130.63           0.06
                  8         1000      130.25       130.25       130.25         0.12
                16         1000        130.40       130.40       130.40           0.23
                32         1000        126.92       126.92       126.92         0.48
                64         1000        121.18       121.18       121.18           1.01
              128         1000        119.91       119.92       119.91          2.04
              256         1000       118.83       118.83       118.83           4.11
              512         1000      139.81       139.83       139.82           6.98
            1024         1000        146.87       146.88       146.87         13.30
            2048         1000        153.28       153.28       153.28         25.48
            4096         1000        146.91       146.91       146.91         53.18
            8192         1000       159.63       159.64       159.63         97.88
          16384         1000      212.52       212.55       212.53       147.03
          32768         1000        342.03       342.08       342.05       182.70
          65536           640        484.54       484.78       484.66       257.85
        131072           320        808.74       809.64       809.19       308.78
        262144           160     1685.54     1688.78     1687.16       296.07
        524288             80      2862.96     2875.35     2869.16       347.78
      1048576             40      4978.17     5026.92     5002.55       397.86
      2097152             20      8871.96     9039.75     8955.85       442.49
      4194304           10    16531.30   17194.01   16862.65      465.28

# All processes entering MPI_Finalize

The above table shows the average time and bandwidth for different message lengths.

Next, I ran the benchmark to collect data between mic0 and mic1:

# mpirun -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv : -host mic1 -n 1 /tmp/IMB-MPI1

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]   Mbytes/sec
                  0         1000        210.77       210.77         210.77          0.00
                 1         1000         212.45      212.69         212.57           0.01
                  2         1000        218.84        218.84          218.84         0.02
                  4         1000         209.84        209.84          209.84           0.04
                  8         1000        212.45       212.47          212.46         0.07
                16         1000         208.90      209.15          209.03           0.15
                32         1000        227.80        228.07         227.94          0.27
                64         1000       223.61        223.62          223.62           0.55
              128         1000         210.82        210.83          210.83           1.16
             256         1000       211.61       211.61        211.61          2.31
              512         1000        214.33      214.34          214.34           4.56
           1024         1000         225.15        225.16          225.15           8.67
            2048         1000         317.98        318.28          318.13         12.27
            4096         1000         307.00       307.32          307.16        25.42
            8192         1000        320.62      320.82          320.72       48.70
          16384         1000         461.89        462.26        462.08         67.60
         32768         1000         571.72       571.76          571.74       109.31
          65536          640      1422.02      1424.80      1423.41         87.73
        131072           320     1758.98     1759.17     1759.08       142.11
      262144           160       4234.41       4234.99       4234.70       118.06
        524288             80       5433.75       5453.23       5443.49       183.38
      1048576             40      7511.45      7560.68      7536.06       264.53
      2097152             20     12764.95     12818.46     12791.71       312.05
      4194304            10     22333.29     22484.09     22408.69       355.81

# All processes entering MPI_Finalize

In the second phase, I downloaded the OFED stack OFED-3.5.2-MIC.gz from https://www.openfabrics.org/downloads/ofed-mic/ofed-3.5-2-mic/ and install it (refer to Section 2.4 in readme file) in order to use DAPL fabric.

I started OFED services and to set the fabric DAPL fabric, used ofa-v2-scif0 as the provider (the file /etc/dat.conf shows all DAPL providers):

# service openibd start
# service ofed-mic start
# service mpxyd start

I enabled DAPL and specified DAPL provider:

# export I_MPI_FABRICS=dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-scif0

With DAPL configured, I repeated the test between the host and mic0. Note that when the environment variable I_MPI_DEBUG is set, the output of an MPI program shows the underlying protocol used for communication:

# mpirun -genv I_MPI_DEBUG 2 -host localhost -n 1 /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv: \ -host mic0 -n 1 /tmp/IMB-MPI1

[0] MPI startup(): Single-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
benchmarks to run Sendrecv
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Nov 24 15:05:55 2014
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.32-279.el6.x86_64
# Version               : #1 SMP Wed Jun 13 18:24:36 EDT 2012
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# Calling sequence was:

# /opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 Sendrecv

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]   Mbytes/sec
                  0         1000         19.11         19.11         19.11           0.00
                  1         1000         20.08        20.08        20.08             0.09
                  2         1000        20.09       20.09       20.09            0.19
                  4         1000       20.19         20.19         20.19             0.38
                  8         1000         19.89         19.89         19.89             0.77
                16         1000         19.99        19.99        19.99             1.53
                32         1000        21.37       21.37       21.37             2.86
                64         1000       21.39         21.39         21.39             5.71
              128         1000         22.40         22.40         22.40           10.90
              256         1000         22.73        22.73        22.73           21.48
              512         1000         23.34       23.34       23.34          41.84
            1024         1000         25.33         25.33         25.33           77.11
           2048         1000        27.48         27.49         27.49        142.11
            4096         1000         33.70         33.72         33.71         231.72
            8192         1000       127.15       127.16       127.16        122.88
          16384         1000       133.82       133.84       133.83         233.49
         32768         1000       156.29       156.31       156.30        399.85
          65536           640       224.67       224.70       224.69         556.30
        131072          320       359.13       359.20       359.16       696.00
        262144           160       174.61       174.66       174.63     2862.76
        524288             80       229.66       229.76       229.71       4352.29
      1048576             40       303.32       303.55       303.44       6588.60
      2097152             20       483.94       484.30       484.12      8259.35
      4194304           10       752.81       753.69       753.25     10614.46

# All processes entering MPI_Finalize

Similarly, I ran the benchmark to collect data between mic0 and mic1:

# mpirun -genv I_MPI_DEBUG 2 -host mic0 -n 1 /tmp/IMB-MPI1 Sendrecv : -host mic1 -n 1 /tmp/IMB-MPI1

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]   Mbytes/sec
                  0         1000          30.13          30.13        30.13          0.00
                  1         1000         20.28        20.28         20.28           0.09
                  2         1000          20.43          20.43          20.43           0.19
                  4         1000          20.38          20.39         20.39         0.37
                8         1000         20.70        20.70          20.70           0.74
              16         1000          20.84          20.85          20.84          1.46
                32         1000         21.79          21.80        21.79           2.80
               64         1000          21.61         21.62          21.62           5.65
              128         1000          22.63          22.63          22.63         10.79
             256         1000        23.20          23.21        23.20         21.04
              512         1000          24.74         24.74          24.74        39.47
           1024         1000        26.14          26.15         26.15         74.69
            2048         1000          28.94         28.95          28.94       134.95
            4096         1000         44.01          44.02          44.01       177.49
          8192         1000       149.33        149.34        149.34       104.63
          16384         1000      192.89       192.91       192.90       162.00
          32768         1000        225.52      225.52      225.52       277.13
        65536           640        319.88        319.89        319.89       390.76
      131072           320        568.12        568.20        568.16       439.99
        262144         160       390.62       390.68       390.65      1279.81
      524288           80        653.20        653.26      653.23      1530.78
      1048576             40      1215.85      1216.10      1215.97      1644.61
      2097152             20      2263.20      2263.70      2263.45      1767.02
      4194304             10      4351.90      4352.00      4351.95      1838.24

# All processes entering MPI_Finalize

The MPI bandwidth is improved significantly when running DAPL fabric. For example, in the case of a message length of 4 MB, host-coprocessor bandwidth went from 465.28 MB/sec to 10,614.46 MB/sec and coprocessor-coprocessor bandwidth went from 355.81 MB/sec to 1,838.24 MB/sec. To obtain this improvement it was only necessary to enable the DAPL fabric by installing the OFED stack and configuring DAPL, without regard to whether an InfiniBand* HCA was installed on the host or not.

Notices

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, Cilk, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

This sample source code is released under the BSD 3 Clause License.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND

ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED

WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION BE LIABLE FOR ANY

DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES

(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;

LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND

ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT

(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS

SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Optimization Notice

↧

Enabling Virtual Machine Control Structure Shadowing On A Nested Virtual Machine With The Intel® Xeon® E5-2600 V3 Product Family

December 12, 2014, 8:59 am

Latest and popular articles on Intel Technologies

≫ Next: Intel’s Cache Monitoring Technology Software-Visible Interfaces

≪ Previous: Improving MPI Communication between the Intel® Xeon® Host and Intel® Xeon Phi™

Introduction

For those interested in creating a nested virtual machine setup, this blog covers some of the details that one should keep in mind when sequencing the set-up, as well as commands that are useful to inspect and adjust configuration. Recently I set up a nested VM in order to test a new feature known as Virtual Machine Control Structure (VMCS) Shadowing, which is available on the Intel® Xeon® Processor E5-2600 V3 Product Family. Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs. However, additional Virtual Machine (VM) exits can impact performance. VMCS shadowing directs the guest VMM VMREAD/VMWRITE to a VMCS shadow structure. This reduces nesting induced VM exits. As a result, VMCS shadowing increases efficiency by reducing virtualization latency.

Figure 1 – Reduction of VM Exits with VMCS Shadowing

So, why might you want to set up a nested virtual machine? A nested VM creates an additional layer of abstraction. This extra level of abstraction can be useful in separating the nested VM from the hypervisor and the cloud infrastructure. Infrastructure as a Service (IaaS) is one of the areas where nested VMs can be of benefit, as the VMs can more easily transfer between different clouds and different third-party hypervisors.

When describing the setup I will use the following terminology:

L0 refers to the host operating system
L1 refers to the first guest virtual machine that is created
L2 refers to the nested guest virtual machine that is created inside of L1.

For my particular situation I was using the following operating system versions.

L0 = Red Hat Enterprise Linux* 7 (3.17.4-2.el7.repo.x86_64)

L1 = Red Hat Enterprise Linux* 7 (3.10.0-123.9.3.el7.x86_64)

L2 = Red Hat Enterprise Linux* 7 (3.10.0-123.9.3.el7.x86_64)

Setting up the Host (L0)

The nested and VMCS Shadowing features need to be enabled at the host level (L0). To verify this is the case the following commands can be used.

To check if VMCS Shadowing is enabled.

L0 Command Line:cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs

You should get a response of ‘Y’

To check if the Nested feature is enabled.

L0 Command Line:cat /sys/module/kvm_intel/parameters/nested

You should get a response of ‘Y’

If you get a response of N for either parameter you can setup a configuration file for the module.

L0 Command Line:cd /etc/modprobe.d/

Create a file called kvm-intel.conf (if it doesn’t exist already). Edit kvm-intel.conf to include the following:

options kvm-intel nested=1

options kvm-intel enable_shadow_vmcs=1

save the file and reboot the host.

You will also need to make sure that the VMX feature is enabled. It should be enabled by default, but if you want to check you will find it listed as one of the cpu flags.

L0 Command Line:cat /proc/cpuinfo

Setting Up the First Guest VM (L1)

You will need to make sure the VMX feature is passed through to L1. If after you install L1 you find this is not the case use the virsh tool to edit the configuration file for the L1 guest.

L0 Command Line:virsh edit /etc/libvirt/qemu/L1_guest.xml

Alter the CPU information so that it appears as follows and then save the file.

<cpu mode='host-passthrough'>

</cpu>

You should be able to see vmx as one of the listed cpu flags for L1.

L1 Command Line:cat /proc/cpuinfo

Setting Up the Nested VM (L2)

There are a few different ways you might want to set up L2. You could set up as an image file, you could set it up as a raw partition, or you might want to set it up with an image file but then add in access to a raw partition. One of the reasons you may choose raw partitions over the default image file might be if your goal is to test performance.

Regardless of how you choose to set up your Nested VM (L2) you will need to set up a network bridge.

Setting Up a Network Bridge for the Nested VM (L2)

You will need to set up a network bridge for the nested VM (L2) to the first guest VM’s (L1) network adapter. If you do not do this you will get errors during the installation of the operating system.

In my particular case, I accomplished this through the virtual machine manager under Gnome on L1.

System Tools – Virtual Machine Manager - Edit – Connection Details – Virtual Networks

Click the + symbol in the bottom left corner.

By default the network used by my guest VM was 192.168.122.0/24.

I created an additional network with the following configuration.

Name: nested bridge

Device: virbr1

Active

On Boot

IPv4 Forwarding: NAT to eth0

IPv6 Forwarding: Isolated network, routing disabled

Network: 192.168.166.0/24

DHCP start: 192.168.166.128

DHCP end: 192.168.166.254

When installing your Nested VM you will want to choose the nested bridge for the network interface.

Setting Up L2 As An Image File

When you are installing the L2 Nested VM under L1 VM, if you want to install the L2 VM as an image file you can use the default options for the hard drive settings. You will of course still want to use the nested bridge for the network interface that you set up in the previous step.

Setting Up L2 As a Raw Partition

Setting Up a Raw Partition For Installation of L2

Before installing the VM to a raw partition you will need to make sure you have a partition created using fdisk. Let’s assume that the drive (/dev/sda) where L0 is installed has extra free space that could be used for this purpose.

L0 Command Line:fdisk /dev/sda

n (to setup a partition)

p (to make a primary partition)

When asked about where to start the first cylinder, just hit return to accept the default

On defining the Last cylinder, just hit return to accept the default (make sure that there is enough space for your VM)

w command to write the changes, then exit fdisk.

If L0 was installed on /dev/sda1 then the partition that is available for L2 will be /dev/sda2. Note we will not be formatting the volume at this time, this will be taken care of during the installation of the VM.

Modifying L1 Configuration File

Before doing the installation of L2 you will need to make sure that the raw partition is accessible to L1. To do this, modify L1’s configuration file to include the location that contains the partition that you want to install L2 on.

L0 Command Line:virsh edit /etc/libvirt/qemu/L1_guest.xml

<controller type='scsi' model='virtio-scsi'/>

<disk type='block' device='lun'>

      <driver name='qemu' type='raw' cache='none'/>

      <source dev='/dev/sda'/>

      <target dev='sda' bus='scsi'/>

</disk>

Save the file.

Now when you install the operating system for L2 you will be able to access /dev/sda2.

Installing L2 On a Raw Partition

Below is an example of creating a nested VM on a raw partition from the command line of L1.

L1 Command Line:virt-install -n nested-L2 -r 16000 --vcpus=4 -l /RHEL-7.0-Server-x86_64-dvd1.iso –disk path=/dev/sda2 -w network=nested_bridge

Where:

-n is the name of the nested vm

-r is the memory (16GB in this case)

-vcpus is how many cpus the nested VM will have

-l is the location of the ISO image for installing Redhat

-disk is the path location to install; in this case we are pointing to the direct access partition that we set up.

-w is the bridged network we previously set up. During this process you may get a message about the partition not having enough space and if you want to reclaim it. You’ll need to reclaim the space in order to do the installation. Just make sure you are pointing to the correct partition for your particular setup before proceeding!

Setting Up L2 As An Image But Adding a Partition For Direct Access

Under L0 you will want to use fdisk to setup an EXT3 partition to be used by L2. In this case let’s assume that you have a second drive (/dev/sdb) that will be used strictly for this purpose.

L0 Command Line:fdisk /dev/sdb

n (to setup a partition)

p (to make a primary partition)

Question on first cylinder, just hit return to accept the default

Question on the Last cylinder, just hit return to accept the default (make sure that there is enough space for your VM)

w command to write the changes, then exit fdisk.

Then use this command to format the disk

L0 Command Line:mkfs.ext3 /dev/sdb1

Once you have your formatted partition set up, do not mount it under the host OS (L0) or the first guest VM (L1). This is to avoid issues that can be created with permissions between the different levels.

In order to make sure that L2 can access the EXT3 partition that you have set up, you will need to make sure that both L1 and L2 can see the drive where the partition is located. To do this you will need to edit the configuration files. Edit your guest VM configuration file with virsh edit. This time you need to add the following.

L0 Command Line:virsh edit /etc/libvirt/qemu/L1_guest.xml

<controller type='scsi' model='virtio-scsi'/>

<disk type='block' device='lun'>

      <driver name='qemu' type='raw' cache='none'/>

      <source dev='/dev/sdb'/>

      <target dev='sdb' bus='scsi'/>

</disk>

Save the file.

L1 Command Line:virsh edit /etc/libvirt/qemu/L2_guest.xml

<controller type='scsi' model='virtio-scsi'/>

<disk type='block' device='lun'>

      <driver name='qemu' type='raw' cache='none'/>

      <source dev='/dev/sdb'/>

      <target dev='sdb' bus='scsi'/>

</disk>

Save the file.

L2 Command Line: mount /dev/sdb1 /mnt

You now have an EXT3 volume under the /mnt directory that you can directly access from the nested VM (L2).

Working With the VM’s

Here are some common tools that you may need to use when interacting with the different VM’s. They are included for your convenience.

Moving Files Between VM’s

You can use the scp command to move files back and forth between the VMs.

To move a file from the system you are currently logged into, to a different system (a push method) use:

scp ~/my_local_file.txt user@remote_host.com:/some/remote/directory

To move a file from a different system to the system you are currently logged into (a pull method) use:

scp user@192.168.1.3:/some/path/file.txt .

Registering From the Command Line

Registering your VMs can make it easier to get updates via yum. You may need to set the proxy server if you have one prior to registration

subscription-manager config --server.proxy_hostname=name_of_your_proxy_server--server.proxy_port=port_number_for_your_proxy_server

To register your VM

subscription-manager register --usernameyour_account_name--password your_password--auto-attach

Determining the IP Address Of Your VM

If you are using ssh to interact with a VM instead of the GUI interface and need to know the network address that was assigned to your VM (if you did not set a static IP) you can find out by using the following command.

L1 Command Line: nmap -v -sP 192.168.166.0/24

This will show you a list of the active connections and allow you to identify the IP address of L2.

Collecting Data On VM Exits

If you want to look at the data associated with VM exits you can do the following.

Find the 1^st Guest VM process ID Number

L0 Command Line:pgrep qemu-kvm

Run the perf tool and record the events from L0: perf kvm stat record -pqemu-kvm_ID_number

Important note: You must use Control-C to terminate the data collection process. Any other method will result in corrupt data.

The output file will be created in the directory where the perf command was run from. It will generate a perf.data.guest file. These files can get large, so be aware of that you might need to delete them after you are done with them or move them off the system if you need to archive.

Once you have the output file if you want to get the stats there are some of the different ways to do it depending on what you want to look at. Run this command from the same directory where the output file is located.

L0 Command Line:perf kvm stat report --event=vmexit

L0 Command Line:perf kvm stat report --event vmexit --key=time

L0 Command Line:perf kvm stat report --event vmexit --key=time --vcpu=0

There are other more general stats for the VM’s that be of interest in some cases. You can look at them by doing the following.

Find the 1^st Guest VM process ID Number

L0 Command Line:pgrep qemu-kvm

L0 Command Line:perf stat –e ‘kvm:*’–p qemu-kvm_ID_number

Conclusion

Creating a nested VM can have its challenges depending on what you are trying to accomplish and what type of configuration you desire. I hope if you are trying to do this that some of the information provided here is of benefit to you.

Resources

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-vmcs-shadowing-paper.pdf

The Author: David Mulnix is a software engineer and has been with Intel Corporation for over 15 years. His areas of focus have included software automation, server power and performance analysis, and cloud security.

*Other names and brands may be claimed as the property of others.

↧

Intel’s Cache Monitoring Technology Software-Visible Interfaces

December 11, 2014, 4:19 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel's Cache Monitoring Technology: Use Models and Data

≪ Previous: Enabling Virtual Machine Control Structure Shadowing On A Nested Virtual Machine With The Intel® Xeon® E5-2600 V3 Product Family

Introduction

Intel‘s Cache Monitoring Technology (CMT) feature has just launched with the Intel® Xeon™ E5 2600 v3 line of server processors. Initial blog posts here (https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family) and here (https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring) provided an introduction and overview of the feature. This blog discusses details of the software interfaces, while future blogs will provide example data and details on Operating System (OS) and Virtual Machine Monitor (VMM) / Hypervisor support.

Key details discussed in this installment include Resource Monitoring IDs (RMIDs), an abstraction used to track threads/applications/VMs, the CPUID enumeration process and the Model-Specific Register (MSR) based interface used to retrieve monitoring data.

Resource Monitoring IDs (RMIDs)

The CMT feature enables independent and simultaneous monitoring of many concurrently running threads on a multicore processor through the use of an abstraction known as a Resource Monitoring ID (RMID).

A per-thread architectural MSR (IA32_PQR_ASSOC at address 0xC8F) exists which allows each hardware thread to be associated with an RMID (specifying the RMID for the given hardware thread).

Figure 1. The per-thread IA32_PQR_ASSOC (PQR) MSR enables each thread to be associated with an RMID for resource monitoring. CLOS stands for Class of Service. The CLOS field is used for resource enforcement, which is beyond the scope of this article.

A plurality of independent RMIDs are provided, enabling multiple independent threads to be individually tracked. The number of available RMIDs per processor is one of the parameters enumerated in CPUID (see below).

Threads can be monitored individually or in groups, and multiple threads can be given the same RMID. This provides a flexible mapping (Figure 2) to span a wide variety of virtualized and non-virtualized usage models.

Figure 2. Threads, applications, VMs or any combination can be associated with an RMID, enabling very flexible monitoring. As an example, all threads in a VM could be given the same RMID for simple per-VM monitoring.

Since each application or VM running on the platform consists of one or more threads, each application or VM can be monitored. For instance, all threads in a given VM could be assigned the same RMID. Similarly, all threads in an application could be assigned the same RMID. If the RMID is used only for monitoring that application (not a group of applications) then the occupancy reported by the system for that RMID will include only the specified application.

It is expected that in typical cases where an OS or VMM is enabled to support CMT, the RMID will simply be added to each thread’s state structure (Figure 3). Then when a thread is swapped onto a core, the PQR can be updated with the proper RMID to enable per-application or per-VM tracking.

Figure 3. The PQR register (containing an RMID) stored as part of a thread or VCPU state is written onto the thread-specific registers when a software thread is scheduled on a hardware thread for execution.

Note that if a CMT-supported OS or VMM is not available, software may still make use of CMT by pinning RMIDs to cores, then carefully tracking which applications are allowed to run on which cores, which can be mapped to cache occupancy.

Additional details are available in [1].

The RMIDs described are a convenient resource tagging scheme which may be expanded in the future to encompass other resource types or functionality.

Cache Monitoring Technology: CPUID Enumeration

The CPUID instruction is used to enumerate all CMT parameters which may change across processor generations, including the number of RMIDs available.

Typically an enabled OS or VMM would enumerate these capabilities and provide a standardized interface to determine the capabilities of these features by software running on the platform. This section gives a high-level overview of the details provided in CPUID.

The enumeration of CMT is hierarchical (Figure 4). To detect the presence of monitoring features in general on the platform, check bit 12 within CPUID.0x7.0 (a vector which contains bits to indicate the presence of multiple different types of features on the processor).

Figure 4. Hierarchical CPUID enumeration of monitoring features.

Once the presence of monitoring has been confirmed, the resources on which monitoring is supported can be enumerated through a new CPUID 0xF leaf. General information about which resources are supported is enumerated within CPUID.0xF.0 (note – subleaf zero is a special case which gives details about all monitoring features on the platform).

Once support for a particular resource has been confirmed, various subleaves (CPUID.0xF.[ResourceID]) can be polled to determine the attributes of each level of monitoring. For instance, L3 CMT details are enumerated in CPUID.0xF.1. Details enumerated include the number of RMIDs supported for L3 CMT, and an upscaling factor to use in converting sampled values retrieved from the Model-Specific Register (MSR) interface into cache occupancy in bytes.

Specific details about the leaves, sub-leaves and field encodings and details provided in CPUID are provided in [1].

Cache Monitoring Technology: Model-Specific Register (MSR) Interface

Once support for CMT has been confirmed via CPUID and the number of RMIDs is known, each thread can be associated with an RMID via the PQR MSR RMID field (Figure 1).

After a period of time (as defined by the software) the occupancy data for a given RMID can be read back through a pair of keyhole MSRs which provide the ability to input an RMID and Event ID (EvtID) in a selection MSR, and the hardware retrieves and returns the occupancy in the data MSR.

The event selection MSR (IA32_QM_EVTSEL) is shown in Figure 5. The software programs an RMID and Event ID pair corresponding to the type of data to be retrieved (for instance, L3 cache occupancy data for RMID). Available event codes are enumerated via CPUID and documented in [1].

Figure 5. The IA32_QM_EVTSEL MSR is used to select an RMID+EventID pair for which data should be retrieved. The data is then returned in the IA32_QM_CTR MSR (Figure 6).

Once the software has specified a valid RMID+Event ID pair, the hardware looks up the specified data, which is returned in the data MSR (IA32_QM_CTR, Figure 6). A pair of bits are provided in this MSR (Error and Unavailable) to indicate whether the data is valid or not. The precise meanings of these fields are documented in [1], however for the purposes of software if both bits are not set then the data in bits 61:0 is valid and can be consumed by the software. The error bits should always be checked before assuming that that data returned is valid.

Figure 6. The IA32_QM_CTR MSR provides resource monitoring data for an RMID+EventID specified in the IA32_QM_EVTSEL MSR. If the E/U bits are not set then the data is valid.

In the case of the L3 CMT feature, the data returned from the IA32_QM_CTR MSR may be optionally multiplied by an upscaling factor from CPUID to convert to bytes before consumption in software. If software does not apply the upscaling factor the value returned is still useful for relative occupancy comparisons between applications/VMs however as the scale is linear.

Any monitoring features added in the future will make use of the same MSR interface, meaning that the software enabling effort is incremental, and would be limited to new CPUID leaves and monitoring event codes.

Conclusion

Through the use of RMIDs, the CMT feature enables threads, applications, VMs or any combination to be tracked simultaneously in a flexible manner to suit a wide variety of software usage models.

CPUID is used to enumerate all CMT parameters which may change across processor generations, including the number of RMIDs available, which is expected to increase over time.

One model-specific per-thread register is used to associate threads with RMIDs. A pair of MSRs is used to retrieve monitoring resource data to enable the usage models described in the next blog in this series, which focuses on example data and use models.

References

[1] Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B, chapter 17.14

Authors

Andrew Herdrich is a Research Scientist within Intel Labs, where he has served as an architect on Cache Monitoring Technology and future thread contention mitigation technologies since 2008, and more recently, advanced future architectures and IA optimizations for NFV. Prior to joining Intel Labs Andrew worked on the Merom, Nehalem and Westmere CPUs as well as the first-generation Knights products.

↧

Intel's Cache Monitoring Technology: Use Models and Data

December 11, 2014, 5:26 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel's Cache Monitoring Technology: Software Support and Tools

≪ Previous: Intel’s Cache Monitoring Technology Software-Visible Interfaces

Introduction

The Intel's Cache Monitoring Technology (CMT) feature was introduced with the Intel® Xeon™ E5 2600 v3 product family in 2014. CMT provides visibility into shared platform resource utilization (via L3 cache occupancy), which enables application profiling, better scheduling, improved determinism, and improved platform visibility to track down applications which may be over-utilizing shared resources and thus preventing applications from running properly. CMT exposes cache consumption details, which allows resource orchestration software to ensure better Service Level Agreement (SLA) management.

Additional blog posts as referenced below provide an overview of various aspects of the feature:

Intel® Xeon E5-2600 v3 product family Product page: https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family
Blog 1: Introduction to CMT: https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring
Blog 2: Discussion of RMIDs and CMT Software Interfaces: https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces

This blog, the third in the series, discusses software use models and provides sample data for selected use cases.

High-Level Monitoring Usage Models

Shared resources within a multiprocessor chip may be managed through a combination of monitoring and allocation control hooks (Figure 1) in a closed-loop fashion to enable resource-aware application management. Resource monitoring provides increased visibility so that resource utilization can be tracked, application sensitivity to available resources can be profiled, and performance/resource inversion cases can be detected.

Figure 1. Resource Monitoring is a critical component in any shared resource management system to enable informed resource allocation decisions in a dynamic datacenter environment.

The Intel's Cache Monitoring Technology feature provides Last-Level (L3) cache occupancy monitoring. A simplified view of the hardware/software interface model common to all usages is shown in Figure 2. Starting from the left of the diagram, the first step is to enumerate the presence of the features on the specific processor in question, which is accomplished via CPUID. Specific sub-leaves as discussed in a previous blog post [1] provide details on the level of CMT support and parameters which may change across generations including the number of Resource Monitoring IDs (RMIDs) supported.

An RMID allows threads, applications or VMs to be tracked using the CMT feature. As discussed in a previous post [1] each thread is associated with an RMID, and multiple threads may be associated with the same RMID. This means that all threads in an application, or all applications in a VM can be tracked with the same RMID, and the cache sensitivity of an entire application or VM can be determined (dynamically, in real-time, while running concurrently with other applications and/or VMs).

The association of a thread with an RMID is handled by the OS/VMM, which writes a per-thread register (IA32_PQR_ASSOC, or PQR for short) on swap-on to a thread (Step 2 in Figure 2).

Figure 2. The three-step process common across CMT use models.

After a software-determined period of time, cache occupancy can be read back per-RMID via a pair of Machine Specific Registers (MSRs), as described in detail in the second blog in the series [1].

The retrieved monitoring data is scaled and can be used for a variety of purposes as discussed in subsequent sections.

Specific Cache Monitoring Technology Usage Models

Across all usage models, CMT enables dynamic and simultaneous monitoring of many threads/apps/VMs through the use of RMIDs.

The primary set of use cases include:

Real-time profiling: Application performance vs. Cache Occupancy (Figure 3)
Detection of cache-starved applications (which can be migrated for better performance)
Advanced Cache-Aware Scheduling for better system throughput

Figure 3. Example Cache Monitoring Technology profiling deployment

Additional use cases possible with CMT include but are not limited to:

Long-term dynamic application profiling for aberration detection and software tuning/optimization
Precise and accurate cache sensitivity measurement without the need for simulators
Cache contention detection and measurement
Monitoring performance to SLAs
Finding cache-starved applications
Optimal insertion of new applications on a cluster
Charging/bill-back
Administration: data can be aggregated and provided back to datacenter administrators to gauge the level of efficiency within the datacenter

Adetailed example of a real-time profiling use case follows to illustrate the capabilities that CMT enables.

Cache Monitoring Technology Example Data: Application Profiling

Through the use of CMT, applications can be monitored simultaneously while running on a platform. In the typical non-virtualized case shown in Figure 4 below, a number of applications are run on a 14-core Intel® Xeon® E5-2600 v3 processor based system with RMIDs pinned to each core. As applications run, their cache occupancy can be sampled periodically. In Figure 4 periodic spikes in occupancy (green line) are visible. In the middle of the plot a new memory streaming application is invoked on core, which quickly consumes all of the L3 cache and then terminates. Using CMT this aggressor application can be detected, and if its behavior is found to interfere with more important applications, the aggressor application could be moved to another processor or another node. If the aggressor application is simply resource-hungry but high-priority then its true cache sensitivity can be measured over time using CMT (Figure 5, discussed later).

Figure 4. CMT using one RMID pinned to each core, shown as a time series. The large spike in occupancy in the middle of the plot was caused by a memory-streaming application which quickly consumed all of the cache then exited. This data was collected on an Intel® Xeon® E5 v3 processor-based server.

Through the use of Cache Monitoring Technology the sensitivity of applications, especially cache-hungry applications can be measured dynamically, and a history can be constructed (Figure 5).

Figure 5. Cache Sensitivity plotted using CMT for a variety of applications drawn from SPEC* CPU2006. This data was collected on an Intel Xeon E5 v3 processor based server.

Shown in Figure 5 is an aggregated set of data collected by sampling occupancy periodically for each application in the presence of a variety of other background applications running on a real system. The data was then aggregated by averaging all samples within a given range. In this case the ranges were selected by cache occupancy of 0-1MB, 1-2MB, 2-3MB, etc. (a simple “bucketing” scheme). The normalized application performance was then plotted vs. cache occupancy, providing direct visibility into the cache sensitivity of each application. As shown in the figure, povray, a ray-tracing application, is not cache sensitive (its performance is nearly constant across cache sizes). The bzip2 application is moderately cache-sensitive, showing sensitivity up to around 8MB of L3 cache (implying that applications should be scheduled to provide approximately this amount of cache to bzip2). The two remaining applications (mcf and bwaves) show significant cache sensitivity, and should be given as much cache as possible. There is some noise in the mcf data, caused by the highly variable and multi-phase behavior of mcf, however a good view of its cache sensitivity is clearly presented and if a more precise characterization is required, a longer sampling period could be used.

If an application’s cache sensitivity is curve-fit (typically using a logarithmic fit of the form y = coefficient * ln(x) + constant) and if the correlation coefficient is reasonably high then then the entire cache sensitivity of an application can be characterized and stored as a single pair of numbers, the coefficient and constant values.

More advanced analysis can also be conducted once the curve fit is in place. For instance, taking the derivative of the performance vs. occupancy (cache sensitivity) curve with respect to occupancy yields a curve providing cache sensitivity per unit of cache occupancy. By adding a threshold (Figure 6), cache sensitivity can be precisely derived for a given application, and a tunable optimal operating point can be determined. For instance, the optimal cache operating point might be defined as the point where adding an additional 1MB of L3 cache only increases application performance by 1%. Such thresholds can be used for dynamic scheduling, and for detecting cache-starved applications.

Figure 6. Curve fitting application cache sensitivity (left side) then taking the derivative with respect to occupancy can generate a view of cache sensitivity vs. occupancy (right side), which can be used with simple thresholds to algorithmically measure the optimal cache operating point of any application. Here Instructions per Cycle (IPC) is used as a proxy for application performance.

The ability to dynamically measure the optimal cache operating point of an application is not possible without CMT, and is one of many new use models enabled with this technology. Without CMT the only way to obtain this data would be via simulation (not practical in a very dynamic datacenter environment) or estimation techniques (typically not practical due to inaccuracies and workload mutual interference in the datacenter).

The occupancy curves collected for various applications could be used to build long-term histories of applications and schedule optimally across sockets. For instance as shown on the left side of Figure 7 if two compute-intensive applications are co-located on a processor with small working sets (e.g., 4MB ideal cache size on a processor with 35MB L3 cache) then applications could be rebalanced across sockets to optimize L3 cache utilization and potentially increase performance (or stored for the next time the applications are run rather than dynamically moving them, simplifying the NUMA memory image implications).

Figure 7. Rebalancing applications across processors for optimal cache utilization using CMT.

The example profiling data above shows only one use case of many.

Conclusion

The Cache Monitoring Technology (CMT) feature enables threads, applications, VMs or any combination to be tracked simultaneously in a flexible manner to suit a wide variety of software usage models. The monitoring data collected can be used for application profiling, cache sensitivity measurement, cache contention detection, monitoring performance to SLAs, finding cache-starved applications, advanced cache-aware scheduling, optimal insertion of new applications, charging/bill-back and a variety of other advanced resource-aware scheduling optimizations. Data can be aggregated and provided back to datacenter administrators for instance to gauge the level of efficiency within the datacenter and guide future software optimizations.

Previous blogs are linked at the top of this page. The next blog discusses software enabling, tools and OS/VMM support availability.

References

[1] https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces

Authors

↧

Intel's Cache Monitoring Technology: Software Support and Tools

December 11, 2014, 6:22 pm

Latest and popular articles on Intel Technologies

≫ Next: Best Known Methods for Setting Locked Memory Size

≪ Previous: Intel's Cache Monitoring Technology: Use Models and Data

Introduction

The Intel's Cache Monitoring Technology (CMT) feature launched with the Xeon E5 2600 v3 line of server processors.

Previous blog posts as referenced below provide an overview of various aspects of the feature:

Product page: https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family
Blog 1: Introduction to CMT: https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring
Blog 2: Discussion of RMIDs and CMT Software Interfaces: https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces
Blog 3: Use Models and Example Data: https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-use-models-and-data

This blog, the fourth in the series, discusses details of available Operating System (OS) support, and software packages which can be used to test the feature.

Key details discussed in this installment include Linux*, perf and a software package which can be used on POSIX Operating Systems to monitor the L3 cache usage of applications (or pinned VMs) on a per-app/VM basis by pinning apps/VMs to cores.

Standalone vs Scheduler based monitoring

Utilizing the CMT capabilities is simple from a code development perspective with model specific registers providing the interface to set up this new feature. All modern Operating Systems provide API’s that enable users with the appropriate privilege to read and write the MSR’s. Linux* provides the msr-tools package which integrates both the readmsr and writemsr commands. MS Windows has a similar interface. There are two approaches to Cache Monitoring:

Standalone Cache Monitoring looks at the Last Level Cache usage from a Core or Logical Thread (referred to as CPU hereafter) perspective, regardless of what task is executing. An RMID is statically assigned to a CPU and periodically the occupancy is read back. If the platform has been statically configured and applications have been pinned to resources then this method will yield appropriate results. If system administrators are interested in whether the platform is suitably balanced and there are no misbehaving applications, this approach is acceptable.
Scheduler based Cache Monitoring involves the operating system scheduler. Obviously when RMID’s are assigned statically, following the Standalone based method, they do not track the process or thread id and therefore occupancy cannot be reported on a per application basis. In order to track the applications occupancy scheduler changes are required. Software will be required to assign an RMID to a process, in turn the scheduler will need to associate the core with the appropriate RMID when the application of interest is scheduled to execute on a CPU. Also when the application de-scheduled or migrated to a different core or thread, the scheduler is required to update RMID assignment to make sure occupancy is only updated when the application is executing. Systems software is also responsible for any remapping across CMT settings which may be required across processor sockets. Since RMIDs are local per socket, for instance, if an application with a given RMID is moved to another processor the OS or VMM is responsible for finding an available RMID on the destination socket to track the migrated application (if monitoring is required).

To enable standalone and scheduler based monitoring several software development initiatives are in progress that are described in subsequent sections.

Scheduler Based Monitoring – Linux* Operating System Support Overview

As explained in the above section, Scheduler based Cache Monitoring makes sure that the application of interest will be tracked with appropriate core and RMID association. This is achieved by integrating CMT into perf and its kernel support which is tightly bound to Linux* scheduler functionality.

In supported platforms (where both the processor and OS have support for CMT) Perf is used to specify which process or thread is to be monitored and assigns it an RMID. All threads not being monitored will be assigned a default RMID used to capture the occupancy associated with those threads not specifically being monitored. Once perf configures the system for monitoring, context switches for the monitored threads result in a callback into the perf_events subsystem. When the CMT callback from the scheduler occurs (during ‘context_switch’ kernel function), the perf_events subsystem selects the RMID associated with the thread being scheduled and assigns it to the CPU. The associated RMID may be for explicit monitoring or the default RMID in the case where the scheduled thread has not been configured for monitoring. From this point until the next context switch, the memory read requests and their subsequent cache loads from this logical processor will be assigned to the RMID just set up.

When a process or thread that is tracked for Cache occupancy terminates or the sched_out function call occurs, the perf CMT callback functionality selects a new RMID. In this instance the default RMID will be selected so that cache loads are not counted towards any explicitly monitored thread. After the monitored process terminates the associated RMID will be returned to a pool of unused RMID and will be recycled for new monitoring request. Mainstream support for these capabilities is trending to kernel version 3.19

Perf Implementation

CMT (Cache Monitoring Technology) Perf Implementation: The perf Linux* application provides an interface into kernel based performance counters. An extension has been developed to support the Cache Monitoring Technology feature. This allows users to monitor last level cache occupancy on a per process or thread basis. The name of the new event is intel_cqm/llc_occupancy/. This new event returns the occupancy in bytes. The patches to perf and Linux* kernel are available in the mailing list here:

https://lkml.kernel.org/r/1415999712-5850-1-git-send-email-matt@console-pimps.org.

The Perf driver module will check for CMT hardware availability using the CPUID instruction (see blog #2). If CMT has been detected a number of function calls will be registered with Perf. Below mentioned are some of the registered events and their functionality:

.event_init:
- Event Handled – Start perf monitoring on a PID or TID
- CMT callback –Allocates and Sets an unique RMID per PID/TID
.start:
- Event Handled – Start perf monitoring on a PID or TID after event_init
- CMT callback – Starts the monitoring capability
.add:
- Event Handled – on Schedule in of monitoring PID/TID
- CMT callback – Sets the monitoring capability on the scheduled core
.del:
- Event Handled – on Schedule out of monitoring PID/TID
- CMT callback – Resets the monitoring capability on the scheduled out core
.read
- Event Handled – Read monitoring counters for the PID/TID
- CMT callback – Read CMT occupancy value from MSR with RMID associated with the PID/TID
.stop
- Event Handled – End of perf monitoring on a PID or TID
- CMT callback – Resets the monitoring capability and clears all the allocated RMID

To make sure that the occupancy associated with CPU is accurate the Perf kernel component associates the RMID only with the specific application thread while it is running on the CPU. As explained in the previous section, when the Linux scheduler swaps the process the RMID will no longer be associated with the core. In addition to RMID tracking, Perf also has process or thread inheritance support (any child process will inherit the RMID of its parent).

Basic operation of perf with CMT:

User Space CMT APIs

The motivation for proposing to use a limited set of User Space CMT APIs is to provide easier usage and integration of CMT into applications. This enables developers to use a small subset of API to retrieve cache occupancy information in their applications. Such unified access API implementation methodology would provide better management of shared level platform resources like RMIDs, access to MSRs etc.

Below are proposed functions which would wrap around the perf_event system calls. It will help tracking cache occupancy for task/pids.

pqos_register_cmt(taskid, cpu) : This API provides pid/tid along with cpuid which needs to be tracked for CMT. Internally, perf will internally take care of RMID assignment, RMID recycling with scheduler implementation.
pqos_get_cmt_occupancy(taskid, cpu) : This API reports last level cache occupancy for registered task.
pqos_unregister_cmt(taskid, cpu): This API provides way to unregister tasks and release all associated RMID which were tracking this task to monitor level cache occupancy.

Research is ongoing to provide a user space library that allow developers or system administrators to take advantage of CMT without the need to worry about RMID management to track number of applications.

Propose design for above implementation in below diagram:

Virtual Machine Monitor Support (KVM & Xen)

Since KVM is a type two hypervisor it inherits the scheduler enhancements discussed in the previous section. Administrator or developers can utilize perf to track the last level cache occupancy of a virtual machine. The process or thread id’s of the virtual machines can be retrieved from the operating system through top or the Qemu monitor.

Since Xen is type 1 hypervisor scheduler enhancement will have to be made to track the last level cache occupancy. Xen 4.5 will be the first version of Xen supporting CMT. The hypervisor implementation associates an RMID with each Domain (DomU or guest VM). Those that have been specified for monitoring will be associated with their own RMID while those not specified will be associated with the default RMID used to collect all non-monitored occupancy data. As the hypervisor schedules each domain on to a CPU and performs the context switch it also writes the RMID to the CPU specific MSR thus associating this CPU with the RMID and its associated domain. As long as the domain continues to run on the CPU the LLC cache loads caused by domain memory reads from the CPU will be tracked in the RMID specific location. When the next domain is schedule for this CPU and the current monitored domain is switched out, its associated RMID is replaced on the CPU so no further association exists.

Xen’s xl command tool has a few additions to support CMT. The additions allow users to attach monitoring to a domain, detach monitoring and to show the LLC occupancy information. They have the following form:

$ xl psr-cmt-attach domid

$ xl psr-cmt-detach domid

$ xl psr-cmt-show cache_occupancy

where domid is the id number of the domain (guest VM) of interest.

Multi-OS Support via the Standalone Cache Monitoring Technology Library

This standalone library enables developers to monitor the last level cache occupancy on per CPU basis. When the library / application initially comes up it will check for the Cache Monitoring support. Once initialization is complete the monitoring functionality provides a “top” like interface listing the last level cache occupancy on a per CPU basis. The library implements a number of API’s that enable developers to take advantage of CMT without the need to setup the MSR’s that configure the RMID assignment or retrieval of the last level cache occupancy data. Developers can also utilize the library from within the virtual machine however either PV or MSR’s bitmaps might be required to gain access to the CMT Model Specific Registers.

Other Operating Systems and VMMs

Additional OSes and VMMs will be enabled over time. Check the documentation or feature list for your preferred OS/VMM to determine if CMT is supported on a particular version.

If your preferred OS/VMM doesn’t yet support CMT their customer support organization may be able to track the feature request and provide an estimated time when support will be ready.

Conclusion

Several mainstream OSes and VMMs now include support for Intel's Cache Monitoring Technology (CMT), and for non-enabled OSes a software library will be available soon to enable experimentation, prototyping of resource management heuristics and deployment of the features.

Authors

Andrew Herdrich

Edwin Verplanke

Priya Autee

Will Auld

↧

Best Known Methods for Setting Locked Memory Size

December 16, 2014, 5:07 pm

Latest and popular articles on Intel Technologies

≫ Next: Experimenting with OpenStack* Sahara* on Docker* Containers

≪ Previous: Intel's Cache Monitoring Technology: Software Support and Tools

If you use the Direct Access Programming Library (DAPL) fabric when running your Message Passing Interface (MPI) application, and your application fails with an error message like this:

………………..
[0] MPI startup(): RLIMIT_MEMLOCK too small
[0] MPI startup(): RLIMIT_MEMLOCK too small
libibverbs: Warning: RLIMIT_MEMLOCK is 0 bytes.
………………..

The warning “RLIMIT_MEMLOCK too small” coming from the OpenFabrics Enterprise Distribution (OFED) driver, indicates that the permitted size of the locked memory in your system is too small. When running an MPI in symmetric mode, that warning may indicate the locked memory is insufficient in either the Intel ®Xeon® host or the Intel® Xeon Phi™ coprocessors. So you need to increase the maximum permitted memory in order to run your MPI program successfully.

For example, if I run the command below on the host, it will display user’s current limits on the host:

$ ulimit –a

core file size               (blocks, -c) 0
data seg size                (kbytes, -d) unlimited
scheduling priority                     (-e) 0
file size                         (blocks, -f) unlimited
pending signals                           (-i) 514546
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                                 (-n) 1024
pipe size                   (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority                          (-r) 0
stack size                      (kbytes, -s) 10240
cpu time                     (seconds, -t) unlimited
max user processes                   (-u) 1024
virtual memory               (kbytes, -v) unlimited
file locks                                    (-x) unlimited

As shown in the above example, the maximum size of locked memory in this system is only 64 K bytes, I need to increase it as suggested by the program. In this blog, I would like to share with you some Best Known Methods (BKMs) on how to set the locked memory size properly on the Intel® Xeon® host and the Intel® Xeon Phi™ coprocessor.

There are multiple methods available to set the locked memory size. In this blog, I discuss two of them.

1. Method 1: Changing Locked Memory Permanently

This method shows how to alter a configuration limit to allow a user to change the locked memory size.

1.1 Change locked memory size in the host:

Edit the file /etc/security/limits.conf

$ sudo vi /etc/security/limits.conf

Add the following line at the end of the file to set an unlimited locked memory for the user named “user1”. The second field value can be “hard” or “soft”: “hard” enforces the resource limit, and “soft” defines the default value within the maximum value specified by “hard”.

user1 hard memlock unlimited
user1 soft memlock unlimited

By adding this, user1 can raise the locked memory size without limit. To allow all users to set unlimited memlock, simply replace user1 with the wildcard *

* hard memlock unlimited
* soft memlock unlimited

Reboot the system. Login and verify that the locked memory is now set to unlimited:

$ ulimit –l
unlimited

1.2 Change locked memory size in the coprocessor:

If the locked memory problem actual happens in the coprocessor, then you need to increase the permitted size in the coprocessor. You can change the default setting using micctrl, a multi-purpose toolbox for the system administrator. Depending on the Intel® Manycore Platform Software Stack (Intel®MPSS) version that you installed, the way for setting the locked memory limit differs depending upon whether you are using a Yocto-based (Intel® MPSS 3.x) or pre-Yocto-based (Intel® MPSS 2.x) distribution.

1.2.1 Intel® MPSS 3.x and later:

For Intel® MPSS 3.x, you can use the standard process for setting the parameters in the configuration file limits.conf.

Create a new directory if it was not created previously:

$ sudo mkdir /var/mpss/mic0/etc/security

Copy the standard configuration limits.conf:

$ sudo /etc/security/limits.conf /var/mpss/mic0/etc/security/.
$ sudo vi /var/mpss/mic0/etc/security/limits.conf

Add the following line at the end of the limits.conf file to set an unlimited locked memory for the user named “user1”:

user1 hard memlock unlimited
user1 soft memlock unlimited

Restart the MPSS service (for RHEL 6.x and SUSE):

$ sudo service mpss restart

For RHEL 7.x, type the following command instead:

$ sudo systemctl restart mpss

$ ssh mic0 ulimit -l
unlimited

1.2.2 Intel® MPSS 2.x:

For Intel® MPSS 2.x, you need to get the permission to run programs with security privileges:

$ sudo su

Create a new file: /opt/intel/mic/filesystem/base/etc/limits.sh and add the following line to that file:

ulimit -l unlimited

Change the permissions of this file:

# chmod 744 limits.sh

Edit the existing file /opt/intel/mic/filesystem/base/etc/rc.d/rc.sysinit, and then add this one line just above the line that reads: echo_info "Start runlevel 3 services":

[ -x /etc/limits.sh ] && . /etc/limits.sh

Edit the existing file /opt/intel/mic/filesystem/base.filelist, and append the following line to the end of the file:

file /etc/limits.sh etc/limits.sh 0744 0 0

Restart the MPSS service:

# service mpss restart

$ ssh mic0 ulimit –l
unlimited

2. Method 2: Changing Locked Memory by Using Scripts

In this method, I create a shell script to change the locked memory size. Inside the script, I set the locked memory properly and then specify the program to be executed. Instead of running the MPI program directly, now I pass the scripts instead of the program. Inside the script, the shell sets the locked memory size accordingly and then runs the application.

The example below shows how I create two scripts, one for the host (hostscript.sh) and one for the coprocessor (micscript.sh). Inside each script, I use the command “ulimit” to set the locked memory to unlimited before launching the application, IMB-MPI1 in this case.

$ cat hostscript.sh
echo "Current max locked memory in host: "
ulimit -l
echo "Set max locked memory to unlimited in host"
ulimit -l unlimited
echo "New max locked memory: "
ulimit -l
# MPI application
/opt/intel/impi/5.0.1.035/bin64/IMB-MPI1 PingPong

$ cat micscript.sh
echo "Current max locked memory in coprocessor: "
ulimit -l
echo "Set max locked memory to unlimited in coprocessor"
ulimit -l unlimited
echo "New max locked memory: "
ulimit -l
# MPI application
/tmp/IMB-MPI1

Transfer the coprocessor script to the home directory on the coprocessor and the application:

$ scp micscript.sh mic0:/home/user1
$ scp /opt/intel/impi/5.0.1.035/mic/bin/IMB-MPI1 mic0:/tmp

Finally, run the MPI program by passing the scripts:

$ export I_MPI_MIC=enable
$ mpirun -host localhost -n 1 ./hostscript.sh : -host mic0 -n 1 ./home/user1/micscript.sh

In summary, the above methods can be used to change the locked memory size. Method 1 is preferred when one needs to reboot the system many times during the test since the change is permanent in the system. Method 2 is used when a user wants the change only occurs in the running session, thus all the default setting are reserved after the running session is done.

↧

Experimenting with OpenStack* Sahara* on Docker* Containers

December 28, 2014, 11:55 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel's baremetal provisioning patch for DevStack

≪ Previous: Best Known Methods for Setting Locked Memory Size

PREFACE

Docker* is an emerging technology that has become very popular recently in the market. It provides a flexible architecture to deploy applications. OpenStack* is another hot technology on the market. It has been available for several years, became more stable and also added more features support in recent releases. Sahara is a project to bring Big Data technology (Hadoop*, Spark*, …etc) into OpenStack*. It would be a perfect match when we consider the use of Docker instead of Hypervisors in the OpenStack. Docker + OpenStack can provide a better resource utilization and also may have a better performance compare with Hypervisors like KVM*, VMware*,etc. On the other hand, when we consider Big Data solutions in the cloud, people always have performance concerns about Hypervisor and Bare Metal. Docker is a good solution to solve this performance concern. This blog is a tutorial to help people to enable Docker in OpenStack Sahara*. During our installation, we suffered several issues and had assistance from Docker support or Nova Docker driver support. I also listed them in this blog as tips if you are interested in Docker and OpenStack as well.

PREREQUISITES

Hardware Configuration

We used 6 Intel based Servers with the same hardware configuration to build up OpenStack environment. Listed below is the machine details as a reference.

Machine Numbers x 6: Controller w/ Computing Node x 1, Computing Nodes x 5

Hardware Details
Items	Details
CPU	Intel Xeon X5670 2.93Ghz
Memory	64GB Memory(1333Mhz 8GB x 8)
Storage	1TB SATA HDD

The commands we used to check the hardware configuration.

The Commands check the configuration
CPU Information	# cat /proc/cpuinfo
Memory Information	# cat /proc/meminfo
More Momory Information	# dmidecode
Disk Information	# fdisk -l

Software Configuration

Below is the software configuration we used in our experimental environment. At the beginning, we used an older version of Docker (v0.9), but the support functionality from Docker is not very robust since Docker is an incubation project. We would like to recommend the latest Docker to be used in your environment. A newer release can provide more features with OpenStack and also more stability. In our environment, we use Docker v1.3, it works well with OpenStack Juno. The operating system is CentOS, you can also use Ubuntu. The support from Docker and OpenStack should be the same.

Software Details
Software	Version
Operation System	CentOS 7.0
Docker	v1.3
OpenStack	Juno

Network Topology

The network is an important issue if you would like to run in a complex or production environment. In our experiment, there is no need to run in a complex environment. We choose to use a simple network topology to support Docker in OpenStack. We only used one network interface in each machine. It’s a minimal solution but also enough to support OpenStack. If you are considering having complete OpenStack feature support, you may need more networks.

Network Name	Subnet	Details
Public Network	192.168.0.0/16	Used for Floating IP Assigning
Private Network	10.0.0.0/16	Used for Private IP Assigning

Logical Architecture and Working Flow

The graph below describes the logical architecture and the working flow in Sahara and the Nova Docker Driver. In Sahara, we use a CDH Plugin as an example to build up a Hadoop cluster and run Hadoop jobs in it. CDH Plugins use the existing OpenStack project, Heat, Nova, Neutron, Glance, etc., to provision a cluster and use Cloudera Manager to install the rest of the Hadoop services.

OpenStack Sahara Working Flow with Docker

Step1: Call Heat to provision a cluster by using Nova API and other OpenStack services.

Step2: Enable Cloudera Manager Service

Step3: Call CM API to install the other services in the cluster

OpenStack ENVIRONMENT

We use OpenStack Juno as our experimental platform. OpenStack Juno has enabled Sahara project, but it remove novadocker driver support from Nova project to Stackforge. We installed the novadocker driver and modified it to support Docker v1.3. The detail configuration will be described in below.

Step 1: Software Repositories

Update the current packages.

# sudo yum update -y

Setup RDO repositories.

# sudo yum install –y https://rdo.fedorapeople.org/rdo-release.rpm

Please see https://repos.fedorapeople.org/repos/openstack/openstack-juno/ to download different OpenStack distributions.

In this case, we use https://repos.fedorapeople.org/repos/openstack/openstack-juno/rdo-release-juno-1.noarch.rpm for our experiments.

Step 2: Install Packstack Installer

Install Packstack installer from RDO repo.

# sudo yum install -y openstack-openstack

Step 3: Edit Packstack Configuration File (Optional)

Generate a configuration file.

# packstack --gen-answer-file=$answer_file_template

Customize the answer file for your needs.

Please go to the reference chapter for an answer file example.

Step 4: Run Packstack to install OpenStack

Run below command to install OpenStack via Packstack with an answer file.

packstack --anser-file=$answer_file_template

Step 5: Install Sahara

Install Sahara package
Edit Sahara Configuration files in
For more information, please follow http://docs.openstack.org/developer/sahara/userdoc/configuration.guide.html
Create Database Schema
Start sahara service
Set sahara on run level

Step 6: Wait for the installation finish

After the installation has been done, enjoy your OpenStack environment.

CONFIGURE WITH DOCKER

The Docker driver is a hypervisor driver for OpenStack Nova. It was introduced with OpenStack Havana. Although it has been removed in Juno, we can still use it in Juno, with some modification. It is also expected the driver will return to mainline Nova in Kilo release.

Nova Docker driver Working Flow

The Nova driver embeds an HTTP client to talk with Docker’s internal REST API thru a Unix socket. The driver will fetch images from OpenStack Glance and load them into the Docker file system. Images can use the ‘docker save’ command to export a Docker Image to Glance and build a Docker container in Docker registry.

Configure OpenStack to enable Docker

Install Docker at first.
Option 1 - Automatically Install from repo
# sudo yum install docker
Option 2 - Manually Install the latest Docker
# wget https://get.docker.com/builds/Linux/x86_64/docker-latest -O docker
# chmod +x docker
# sudo ./docker -d &
For more information, please refer to https://docs.docker.com/installation/binaries/
For RHEL6, you will need RHEL 6.5 or higher, with a RHEL 6 kernel version 2.6.32-431 or higher as this has specific kernel fixes to allow Docker to work. For more details, just refer to link above. In order for Nova to communicate with Docker over its local socket, add nova to the docker group and restart the compute service to pick up the change:
# usermod -G docker nova
# service openstack-nova-computerestart
Install Nova Docker Driver:
# pip install -e git+https://github.com/stackforge/nova-docker#egg=novadocker
Install the required modules:
# cd src/novadocker/
# python setup.py install
Nova Configuration
Nova must to be configured to use nova docker driver. Edit “/etc/nova/nova.conf” to configure below options:
[DEFAULT]
compute_driver = novadocker.virt.docker.DockerDriver
Create the directory /etc/nova/rootwrap.d(consistent with the “filters_path” in the file /etc/nova/rootwrap.conf), inside the directory create a file “docker.filters” with the following content:
[Filters]
# nova/virt/docker/driver.py:'ln', '-sf', '/var/run/netns/.*'
ln: CommandFilter, /bin/ln, root
Glance Configuration
Configure the options below in glance conf.
[DEFAULT]
container_formats = ami,ari,aki,bare,ovf,docker

How to use Docker in OpenStack

Below is an example in how to use the Docker image in OpenStack. You can create a custom Docker image and upload the image by using below commands.

Search a docker image available in Docker public registry
# docker search $image_name
Pull the image
# docker pull $tags/$image_name
Save the image and register it in Glance
# docker save #tags/#image_name | glance image-create --is-public=True --container-format=docker --disk-format=raw --name $tags/$image_name
Boot the instance using Docker image
# nova boot --image “samalba/hipache” --flavor m1.tiny test
Check the instance is booted
# nova list
Check the instance in Docker
# docker ps

BUILD A CUSTOM DOCKER IMAGE

You may want to build your own Docker image for OpenStack. Docker can build an image automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands you would like to execute in the image. By calling “docker build” command, you can build your own image.

How to Build Docker Image

# sudo docker build -t $tags/$image_name /path/to/dockerfiledir

Note: Docker will re-use the intermediate images, accelerating the building of an image significantly. Please make sure all the required commands are at the top of the Dockerfile. For those changeable commands, please put them at the bottom of the Dockerfile. This method can help in building the image by using cache to run the usual commands.

For a Dockerfile usage, please refer to http://docs.docker.com/reference/builder/

For a complete Dockerfile example, please check the reference.

TROUBLE SHOOTING

The number of vCPU is always “1” for every computing node.
Solution:
There is a different concept between Docker and Hypervisor. Please update the source code below in the nova docker driver.
stats = {
    'vcpus' : 1, # Change the number to apply the number of vCPU
    'vcpus_used': 0,
    ...
}
When Remotely running the command “ls .ssh/authorized_keys” -fails during starting the instances.
This command is for cloud-init to generating authorized_keys in the instance. But Docker cannot support this feature. For this situation, please MARK the command in “/usr/lib/python2.7/site-packages/sahara/service/engine.py”
Remote login failed by using private key.
Modify _ssh.connect(host, username=username, password=”xxx”, sock=proxy) in “/usr/lib/python2.7/site-packages/sahara/utils/ssh_remote.py”
Please make sure the password is also set for Docker image.
When you remotely run the command “sudo hostname” it fails.
Docker cannot support modifying the hostname before Docker v1.2. Please upgrade Docker to v1.2 or later.
Workaround: There is a workaround to customize the host file in Docker manually. Please refer to: http://jasonincode.com/customizing-hosts-file-in-docker/#.VFl1DPmUdZu
When you remotely run the command “sudo mv etc-host /etc/hosts” it fails.
Docker cannot support modifying /etc/hosts before DOcker v1.2. Please upgrade Docker to v1.2 or later.
After upgrade to Docker v1.2, “sudo mv etc-host /etc/hosts” responds the device is busy.
Please use “sudo cp etc-hosts /etc/hosts” to replace the command “sudo mv etc-hosts /etc/hosts” in sahara/service/engine.py of Sahara source code. Another way to do is to set the hosts manually when all the instances launched.
My instances cannot reach each other.
Please make sure all the settings are fine in /etc/hosts and proxy variables: http_proxy, https_proxy, and no_proxy
Could not open session when run the command “service cloudera-scm-server-db”
By default Docker cannot allow the authorization to create the Database using the command. To fix, add the “Privileged” parameter in nova docker driver and set it to True to solve the issue.
Workaround: A workaround is to modify “/etc/security/limits.d/xxx.conf” after the instance launched and set the value from “hard”/”-” to “soft” to avoid the issue.
How to check if CM can be reachable?
Use the command:
# curl -X GET -u "admin:admin" -l http://$cm_host:7180/api/v7/tools/echo?message=hello
CM response “Connection refused”.
Please check the firewall has been passed for Docker.
# iptables -t nat -A OUTPUT -j DOCKER
The log response ApiException:{}(error 500)
Add extra time in sahara/plugins/cdh/deploy.py of Sahara source code, default is 300 secs for timeout. Another root cause could be proxy issue. Please make sure you have correct settings for http_proxy, https_proxy, and no_proxy in the environment.
There are several ports that cannot be accessed when starting the Cluster.
Please expose all the necessary ports in Dockerfile.
Or you can add “Publish-all-ports” parameter in client.py of nova docker driver source code, please make sure set it to true.
There is no storage space in Data Node.
By default, Docker uses a 10GB Root Disk and a reserved space also needs to be set for non-DFS usage in CM.
Please change the parameter when you launch Docker binary. For more information, please refer to https://github.com/snitm/docker/blob/master/daemon/graphdriver/devmapper/README.md
Another way to do is to set reserved space to smaller in CM HDFS Configuration.
My instance cannot access files for swift package in Sahara.
Please make sure your instances can reach the internet at first.
Or you can set up your own site (like ftp or nfs) to get the necessary files. Please also remember to change the swift package url in node group templates of Sahara.
There is no cloudera-scm-agent running on the host.
Sometimes cloudera-scm-agent may not started automatically or get an error when starting. Please restart the service manually by using the command: “sudo service cloudera-scm-agent restart”
Docker cannot connect to proxy.
Please use HTTP_PROXY when you launch Docker process in the background, for example: “sudo HTTP_PROXY=xxx ./docker -d &”. Then you create the Docker image by using this process with HTTP_PROXY.
Container cannot start in other Computing Nodes except the Controller?
Docker Image must be copied to individual Computing Nodes manually. Please also don’t forget to register the image to Docker register and using “docker images” to confirm the image is existing in the Docker.

↧

Intel's baremetal provisioning patch for DevStack

December 22, 2014, 5:53 pm

Latest and popular articles on Intel Technologies

≫ Next: DPDK Community Meetup

≪ Previous: Experimenting with OpenStack* Sahara* on Docker* Containers

OpenStack employs DevStack for integration testing and development purposes. In previous iterations of DevStack, baremetal provisioning was only simulated via Ironic by having physical machines replaced with virtual machines.

With recent patches supplied by Intel, the ability to test/tryout baremetal provisioning is more simple and easy. Previously DevStack could only support Ironic configuration using the pxe_ssh driver. Patches have now made the agent_ipmitool driver available as well. Rapid baremetal setup is now much easier with this additional DevStack support.

Specifically, the DevStack patches Intel contributed enable two things: they allow the use of the agent_ipmitool driver in Ironic for provisioning physical servers and create a flat provider network environment to let the physical machines have access to the DHCP service which is hosted in a virtual network environment. The agent_ipmitool driver manages nodes by deploying the operating system using Ironic Python Agent and controlling the power status using IPMI commands sent via the ipmitool utility. DevStack will also create a flat provider network for these physical machines in order to communicate with other services provided by OpenStack such as the PXE server and the dnsmasq service.

For more on OpenStack baremetal provisioning with exploratory exercises, see this article: https://software.intel.com/en-us/articles/physical-server-provisioning-with-openstack

↧

DPDK Community Meetup

January 27, 2015, 2:12 pm

Latest and popular articles on Intel Technologies

≫ Next: ADVANCED COMPUTER CONCEPTS FOR THE (NOT SO) COMMON CHEF: INTRODUCTION

≪ Previous: Intel's baremetal provisioning patch for DevStack

Silicon Valley DPDK Meetup

This is a group for engineers who enjoy developing applications for high network performance, it is all about plumbing... but for fat pipes!

This is a casual setting to collaborate, discuss, and learn more about DPDK.
Let's meetup and have fun with the Silicon Valley DPDK community, every 2nd Thursday of the month at 6:00 pm.

See you there!

Signup for the next meeting here >>>

↧

ADVANCED COMPUTER CONCEPTS FOR THE (NOT SO) COMMON CHEF: INTRODUCTION

February 20, 2015, 1:09 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® Xeon Phi™ Coprocessor Developer Training Coming to a City Near You in 2015

≪ Previous: DPDK Community Meetup

TITLE:
INTRODUCTION
ADVANCED COMPUTER CONCEPTS FOR THE (NOT SO) COMMON CHEF

While talking to a very intelligent but non-engineer colleague, I found myself needing to explain the threading and other components of the Intel® Xeon Phi™ ⅹ100 and ⅹ200 architectures. The first topic that came up was hyper-threading, and more specifically, the coprocessor’s version of hyper-threading. Wracking my brain, I finally hit upon an analogy that seemed to suit: the common kitchen.

After grasping hyper-threading in the processor, she asked follow-on questions, extending our discussion. As the conversation continued to evolve, I realized that the kitchen was turning out to be an excellent way of explaining, in an intuitive and (relatively) non-technical way, how the processor worked in a general sense, and what the new innovative features the Knights family of processors brought to the table. The Knights family includes the former Knights Corner, now known as the Intel® Xeon Phi™ ⅹ100 Product Family, the next generation Knights Landing, known as the Intel® Xeon Phi™ ⅹ200 Product Family, and the generation beyond that, Knights Hill.

This series of blogs should be a lot of fun for me, and hopefully for you as well. We’ll discuss processor pipeline, memory hierarchy, types of memory, hyper-threading, and lots more. At some point, I may even dare to look at some other more abstract concepts, such as Amdahl’s Law.

In and of itself, this isn’t a unique analogy. If you plumb the vast depths of the web, different forms of the kitchen analogy appear relatively frequently. So what am I bringing to the table? Well, I’m going further and will not just explain the basics of how a computer works but details of the underlying technology. I’m also going to use, as a case example, the workings of the Intel® Xeon Phi™ product families. By product family, I’m referring not just to the Intel® Xeon Phi™ ⅹ100 coprocessor (a.k.a. Knights Corner) but also the ⅹ200 processor and coprocessor families (a.k.a. Knights Landing). Though I won’t be able to discuss KNL specifics that aren't public, there’s a lot of good information out there that is public.

I know what many of you are thinking. “Hey, I’m not a novice but a technical sophisticate. Is this series going to have anything for me?” Personally, I think so. Reading and taking tests about TLBs, interconnects, and the like in an architecture class is one thing; developing an intuitive understanding is quite another. I certainly know that I’ve been learning things by writing this series. If you get excessively bored, I give you permission to not read the rest of my kitchen blogs. I promise not to be offended.

Here’s what the series is looking like at this point. Things will change as the series progresses, as they always do.

An aside on threads and the CPU
Kitchen computing: mapping a kitchen and the chef to a computer system
Hyper-threading and multiple chefs in the kitchen
Memory hierarchy and the well-stocked pantry
Adaptive threading and the adaptive kitchen
The high-bandwidth pantry
Prefetching and caching in the kitchen
The restaurant kitchen and interleaving
So many pantries and the TLB

My hope is that this series will be just as useful to the techie, as to the casual reader who wishes to know more about what makes a computer tick. It will briefly cover topics that are broad and general, e.g., what the CPU and memory are, before diving deeper in to more advanced (and interesting) topics that even the casual programmer may not be that familiar with, e.g., TLBs. Ultimately, the series will drop into topics that introduce the advanced and unique features of the Intel® Xeon Phi™ ⅹ200 product series.

The first blog in the series (outside of this introduction) will be about clarifying what a thread and CPU actually are. After that, we will start building our kitchen.

↧