Channel: Intel Developer Zone Blogs

Intel® Xeon® Processor E7 V2 Family New Reliability Features


 

The following article covers the reliability features at a glance. A very comprehensive whitepaper on MCA recovery and how to change applications to be Recovery Aware is available here: https://software.intel.com/en-us/articles/intel-xeon-processor-e7-880048002800-v2-product-family-based-platform-reliability

 

1) Introduction

 

In today’s world, datacenters consisting of many servers are used to run mission-critical and enterprise applications like stock trading or corporate finance and billing.  Server failures can cause potential data loss and downtime, resulting in increased service costs and potentially compromised data integrity.  To minimize these effects, Intel introduced advanced Reliability, Availability and Serviceability (RAS) features in the Xeon® processor E7 product family.  More information about advanced RAS can be found in previously authored whitepapers [1][2].  The purpose of this article is to describe new advanced RAS features added to the Intel® Xeon® Processor E7 v2 family in 2014 and marketed as part of the “Intel® Run Sure Technology”.  This product family is a 2, 4 or 8-socket platform based on Intel® Core™ microarchitecture (formerly codenamed Ivy Bridge) and manufactured on 22-nanometer process technology.

2) New RAS Features

 

Many of the new advanced Reliability, Availability and Serviceability features introduced here are implemented in hardware and firmware and require no changes to software programs. However, some do require Operating System or Virtual Machine Manager (VMM) support, as well as recovery mechanisms from a software perspective.

 

  1. PCIe Live Error Recovery (LER)

    This feature allows the system to bring down the PCIe [3] link associated with the PCIe root port where an uncorrected (fatal or non-fatal) fault is detected, in either an incoming or outgoing transaction, without resetting the entire system. It also allows firmware/software-assisted link retraining and recovery. LER also protects against the transfer of potentially corrupt data to the disk.

  2. Enhanced Machine Check Architecture Gen 1 (eMCA1)

    This feature enhances the existing Machine Check Architecture (MCA) by implementing the Firmware First Model [2] (FFM) of error reporting (logging and signaling).  FFM is a server RAS paradigm in which all error events are first signaled to platform-specific firmware.  The firmware processes the error logs and decides if and when to notify the Operating System or application software layers of an error.  eMCA1 [4] can be configured to provide enhanced error log information to the OS and VMM that can be used to implement advanced diagnostic and predictive failure analysis [7] (PFA) for the system. Legacy MCA provides the physical address of the memory location when a corrected fault occurs, but it is challenging for PFA software to map it to an actual physical DIMM. eMCA1 can provide such additional error logs to the PFA software.

  3. Machine Check Architecture (MCA) recovery for I/O

    MCA recovery for I/O allows uncorrected I/O errors, both fatal and non-fatal, to be reported through the MCA mechanism.  Intel® Xeon® Processor E7 families incorporate the PCI Express* Advanced Error Reporting [5] (AER) architecture to report (log and signal) uncorrected and corrected I/O errors.  Normally, uncorrected I/O errors are signaled to the system software either as an AER Message Signaled Interrupt (MSI) or via platform-specific mechanisms such as a System Management Interrupt (SMI) and/or physical error pins.  The signaling mechanism is controlled by the BIOS and/or platform firmware. As part of this new feature, the processor adds a new Machine Check bank called IOMCA and allows logging and signaling of uncorrected I/O errors through the standard Machine Check Architecture. It logs the Bus, Device, and Function information associated with the PCI Express port, thus allowing error handling software to identify the source of the error faster.  By using this feature to signal uncorrected I/O errors through the MCA mechanism, the errors can be communicated to the software layer (OS, VMM and DBMS) to improve error identification and recovery.

  4. Machine Check Architecture (MCA) recovery – Execution Path

    The MCA recovery - Execution Path feature offers the capability for a system to continue to operate even when the processor is unable to correct data errors within the memory sub-system, and allows software layers (OS, VMM, DBMS, and applications) to participate in system recovery. This feature can handle hardware uncorrected errors occurring within the memory sub-system, including main memory, last level caches, and mid-level caches.  When the processor detects a fault within the memory sub-system, it will attempt to correct the fault, and in most cases memory faults are corrected by the processor.  However, if the error cannot be corrected, the processor notifies the operating system (OS) using a Machine Check Exception [6] (MCE) and logs the error as an uncorrected recoverable error (UCR). The OS analyzes the log and verifies that recovery is possible. If recovery is possible, the OS un-maps the affected page(s) and triggers a SIGBUS event to the application.  If the error is detected in instruction code, then the instruction fetch unit (IFU) is notified and the MCE is triggered by the IFU. In this case, the OS will reload the affected page containing the instruction to a new physical page and resume normal execution.  If the error is detected within the data space, then the Data Cache Unit (DCU) is notified and the MCE is triggered by the DCU. In this case, the OS will notify the application through the SIGBUS event, and it is up to the application to take further action.  The affected application is then responsible for reloading the data.  If the data was already modified and the application cannot reload the data from the disk, the affected application will be terminated (i.e. a system reset will not be required, and other applications will continue to operate normally).  To take full advantage of the MCA recovery - Execution Path feature, applications are required to be ‘Recovery Aware’.
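    For illustration only (this sketch is not taken from the whitepaper referenced above), a recovery-aware application on Linux might install a SIGBUS handler along the following lines. The BUS_MCEERR_* codes are the ones Linux uses to report memory errors via SIGBUS; the recovery action shown is deliberately simplified.

/* Minimal sketch of a "recovery aware" application on Linux: register a
 * SIGBUS handler so that, when the OS signals an uncorrected memory error,
 * the application can discard/reload the affected data instead of simply
 * being killed.  Error handling is simplified for brevity. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
        /* info->si_addr is the virtual address of the poisoned page.  A real
         * application would invalidate the affected buffer and arrange to
         * reload it from disk; here we just report and exit cleanly. */
        fprintf(stderr, "Uncorrected memory error at %p\n", info->si_addr);
        _exit(EXIT_FAILURE);   /* placeholder for application-level recovery */
    }
    abort();                   /* any other SIGBUS is a genuine bug */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
    /* ... normal application work ... */
    return 0;
}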

3) Conclusion

 

Additional advanced RAS features allow the Intel® Xeon® processor E7 V2 family to increase resiliency within the memory sub-system and I/O sub-system, so that when hardware uncorrected errors occur, the system can detect them, recover, and continue to operate instead of suffering fatal events requiring a system reset. They also allow enhanced error reporting to expedite fault diagnosis.

 

4) References

 

[1] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-e7-family-ras-server-paper.pdf

[2] https://noggin.intel.com/content/autonomic-foundation-for-fault-diagnosis

[3] http://en.wikipedia.org/wiki/PCI_Express

[4] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/enhanced-mca-logging-xeon-paper.pdf

[5] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-297-304.pdf

[6] http://en.wikipedia.org/wiki/Machine-check_exception

[7] http://en.wikipedia.org/wiki/Predictive_failure_analysis

 


Go Parallel 3


This is the third part of the Go Parallel series (part 1, part 2) about parallel programming with the Go language. Today we will look at the sync and sync/atomic packages.

But first, the previous part had a serious drawback -- I forgot to post a nice picture of a gopher. Let's fix it:

[Image: golang gopher]

OK, now we can get back to parallel programming.

The main synchronization primitives in Go are channels. But Go is a shared memory programming system, and it provides all traditional synchronization primitives as well. Package sync contains Mutex, RWMutex (reader-writer mutex), Cond (condition variable), Once (for one-time actions) and WaitGroup (allows waiting for a group of subtasks). Package sync/atomic contains Load, Store, Add, Swap, CompareAndSwap atomic operations on numeric types and pointers.

Let's consider some examples where these synchronization primitives can be useful (to both simplify code and improve performance). Let's say we are doing a parallel graph traversal and want to execute some code only once per node. sync.Once is what we need here:

type Node struct {
    once sync.Once
    ...
}

func visit(n *Node) {
    n.once.Do(func() {
        ...
    })
    ...
}

You can see full code here.

If we need to do concurrent accesses to a hashmap, then sync.RWMutex is the simplest solution:

    m := make(map[int]int)
    var mu sync.RWMutex

    // update
    mu.Lock()
    m[k] = v
    mu.Unlock()

    // read
    mu.RLock()
    v := m[k]
    mu.RUnlock()

Full code is here.

If we are using a parallel branch-and-bound approach and the result fits into a single variable (e.g. the best packing of a knapsack), then we can use atomic operations to read/update it from several goroutines:

// checkUpperBound returns false if a branch with the given upper bound
// on result can't yield a result better than the current best result.
func checkUpperBound(best *uint32, bound uint32) bool {
    return bound > atomic.LoadUint32(best)
}

// recordResult updates the current best result with result res.
func recordResult(best *uint32, res uint32) {
    for {
        current := atomic.LoadUint32(best)
        if res < current {
            return
        }
        if atomic.CompareAndSwapUint32(best, current, res) {
            return
        }
    }
}

Full code is here.

The sync and sync/atomic packages also allow easy, direct porting of existing software from other languages to Go.


This is not particularly related to the sync package, but here are some interesting HPC/scientific computing libraries that flew by me recently: Go Infiniband libraries, a bioinformatics library for Go, a Graph package, a Go OpenCL binding, and Automatic Loop Parallelization of Go Programs.

So what are you waiting for? Go parallel!

Intel® graphics virtualization update


Traditional business models, built on graphics and visualization usages such as workstation remoting, VDI, DaaS, transcoding, media streaming, and on-line gaming, are beginning to draw open source attention worldwide. Employees are becoming mobile. They want the flexibility of working from any device, anywhere, anytime, with any data, without any compromise in quality due to access, latency or visualization. On the data center side, IT wants to protect enterprise data and IP in the most cost effective and scalable manner, while delivering a great user experience to mobile users.

To satisfy both client and server sides, Intel has developed a comprehensive portfolio of graphics virtualization technologies trademarked as Intel® Graphics Virtualization Technology™ (Intel GVT). This portfolio currently covers three distinct flavors of graphics virtualization approaches, namely: Intel GVT-d, Intel GVT-s and Intel GVT-g. Developers can pick one or more techniques from the Intel GVT portfolio to best suit their respective solutions and business models. Additional innovative techniques can be expected to be added to the portfolio as Intel GVT adoption grows, especially in open source.

Further, true to the spirit of Moore’s law, Intel is integrating the CPU and GPU in both client and server products, which results in improved energy efficiency, reliability, and density, and lower engineering complexity. Combining Intel’s platform integration advantages with smart techniques for sharing graphics amongst many concurrent users, IT can now deliver workstation-quality, high-end performance at a low total cost of ownership. Additionally, using Intel® Media SDK™ tools and libraries, developers can write software that scales very well across all Intel platforms, and can last over multiple generations.

The Intel GVT portfolio can be summarized as follows:

Intel GVT-d

This flavor allows direct assignment of an entire GPU’s prowess to a single user, passing the native driver capabilities through the hypervisor without any limitations (fig-1). The assignment of the GPU is accomplished using Intel’s foundational hardware virtualization features, namely VT-d or DPIO. For Xen developers, Intel GVT-d has been upstreamed as Qemu Traditional with VT-d.

Common nomenclature used in the industry for this flavor of graphics virtualization is ‘Direct Graphics Adaptor’ (vDGA). A large number of commercial desktop and workstation virtualization products in the market use this approach. From a user experience viewpoint, there is practically no difference between having a local desktop machine with a dedicated GPU and having a Virtual Desktop with Intel GVT-d direct-assigned Intel processor graphics somewhere in the enterprise server or in the cloud.

 

Intel GVT-s

Commercially available as Virtual Shared Graphics Adaptor (vSGA, VMware) and RemoteFX (Microsoft), this graphics virtualization approach requires a virtual graphics driver in each virtual machine, which uses an API forwarding technique to interface with Intel’s graphics hardware (fig-2). Single GPU hardware can be shared amongst many concurrent users, while the graphics hardware remains abstracted from the applications. Specific sharing algorithms remain proprietary to the virtual graphics driver. Many commercial desktop and workstation remoting products in the market use this approach.

 

Intel GVT-g

This approach of sharing a GPU amongst many concurrent users is the latest addition to Intel’s graphics virtualization technologies portfolio (fig-3). Each virtual desktop machine keeps a copy of Intel’s native graphics driver. On a time-sliced basis, an agent in the hypervisor directly assigns the full GPU resource to each virtual machine. Thus, during its time slice the virtual machine gets a full dedicated GPU, while from an overall system viewpoint several virtual machines share a single GPU. Intel has been developing GVT-g under the code name “XenGT” for Xen. Up-streaming of GVT-g to KVM is also in the works. More recently, Intel has been disclosing this solution to select partners, and making the source available for a variety of processor graphics configurations.

Major ISVs and OEMs are aligning with Intel to productize Intel GVT based solutions. Open source developers might also find the Intel GVT portfolio with Intel processor graphics products equally enticing. Comments welcome!

Power Management: So what is this policy thing?


Unlike many of my recent blogs, this series is about power management in general. At the very end of the series, I’ll write specifically about the Intel® Xeon Phi™ coprocessor.

I have talked incessantly over the years about power states (e.g. P-states and C-states), and how the processor transitions from one state to another. For a list of previous blogs in this series, as well as other related blogs on power and power management, see the article at [List0]. But I have left out an important component of power management, namely the policy.

A policy is a collection of rules used for guidance, for example, a security policy. A power management policy contains the rules / logic that guide power management state transitions. The implementation of that policy is done by the power management (PM) manager or module.

One way to divide power management functions is among five domains: hardware, BIOS or nearly BIOS-level drivers, kernel level drivers (ring 0), system power management controls (ring 3), and user power management controls (ring 3). This arrangement can differ depending upon the OS and technology being used (e.g. mobile vs. server). See Figure PWRMNGR.

Latencies drive this distribution of power management functionality. Power Management can only work if its impact on executing applications is trivial. Latency is not so important for transitions into an idle state – the processor is not doing anything or it would not be transitioning into the idle state in the first place. In contrast, transitions out of an idle state and into the run state must take place as quickly as possible. So the designers of the power management infrastructure distribute its functionality across the OS, hardware, and user levels. The lowest layers must be simple and react as quickly as possible when transitioning from the idle state to the run state (e.g. from C1 to C0). As an example, transitions from C1 to C0 are less than a microsecond for the Intel® Xeon Phi™ coprocessor. As we look at higher layers of the power management stack, the transitions they govern are more latency tolerant and can involve more complex decision making logic.

As an interesting aside, the entire power management stack does not have to be running on the system being managed. The current generation (as of 2014) Intel® Xeon Phi™ coprocessor necessarily has part of the power management logic implemented on the host. I will discuss this further below. (This will likely change in future generations of the coprocessor.)

Power Management Stack

 

Figure PWRMNGR. The power management module and the power management policy

 

In the Hardware and BIOS: At these very lowest levels, power management is limited to mapping power management instructions to the underlying hardware, such as calls to invoke different P- and C-states. See Alex Hung’s power management blogs, listed in the reference section below, for a good description of the BIOS mapping of HW power management functionality to ACPI definitions. Given its simplicity, this level introduces no perceptible latency to an executing user application.

In the Kernel (ring 0): Ultimately, power management decisions involve transitions between run and different idle states, and such decisions introduce latencies. For example, if a processor is in C3 and an interrupt occurs, it must transition from C3 to C0, run the interrupt routine, and then transition back to C3. But as in all things, it is not this simple. These transitions also involve software logic and decision making, such as determining whether the processor should instead use a higher idle state with less latency, such as C1. It does not make any sense to have this decision making logic at the BIOS level, as many repeated transitions can result in non-trivial cumulative latency (as well as violating good programming practice).

Typical kernel level power management involves functionality where latency is critical but involves some computation and decision making. This decision making takes place in ring 0 (kernel) which can avoid the latencies inherent in ring 3 context switches and other OS overhead. At this level, statistics are also collected to help the power management software better predict transitions, such as when future interrupts will occur.
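As a side note, on Linux you can watch part of this kernel-level bookkeeping yourself: the cpuidle subsystem exports per-C-state entry counts and residency through sysfs. Below is a small sketch that reads them (it assumes the standard /sys/devices/system/cpu/.../cpuidle layout and is not specific to the Intel® Xeon Phi™ coprocessor).

/* Sketch: print the Linux cpuidle statistics the kernel keeps per idle state
 * for cpu0 (entry count and total residency in microseconds). */
#include <stdio.h>

int main(void)
{
    char path[128], name[64];
    for (int s = 0; s < 16; s++) {
        FILE *f;
        long usage = 0, time_us = 0;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
        f = fopen(path, "r");
        if (!f) break;                          /* no more idle states */
        if (fscanf(f, "%63s", name) != 1) name[0] = '\0';
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/usage", s);
        if ((f = fopen(path, "r"))) { fscanf(f, "%ld", &usage); fclose(f); }

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", s);
        if ((f = fopen(path, "r"))) { fscanf(f, "%ld", &time_us); fclose(f); }

        printf("%-8s entered %ld times, %ld us total residency\n",
               name, usage, time_us);
    }
    return 0;
}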

In the OS (ring 3): Power management functionality at this level takes more time and becomes involved only when necessary or when minimizing latency is not as critical. An example might be adjusting policy based upon collected interrupt frequency and duration statistics. Another example might be the decisions involving P-state transitions. Such transitions do not involve any state saving and restoration. As such, its decision making can take place at a higher level and at a more leisurely pace in the power management stack.

In User Space (ring 3): This is where policy is set and initialized. At this high level, latency is much less of an issue with some rare exceptions.

One such rare exception is seen in the Intel® Xeon Phi™ coprocessor where the host necessarily becomes involved in some power state transitions. This is because when the coprocessor is in a package C-state, it is all but powered down; no power management software can run on the coprocessor when it is in a package idle state (PC-3 and PC-6). The host must wake the coprocessor up, essentially performing a fast boot up. This means that part of the coprocessor’s power management stack is executing on the host (i.e. remotely). As such, transitions from the deepest package idle state (PC6) to C0 can get close to 500 milliseconds+. See my article on power states referenced below.

NEXT

In the next blog, we will look briefly at different power management policies.

 

REFERENCES

NOTE: As previously in my blogs, any illustrations can be blamed solely on me as no copyright has been infringed or artistic ability shown.

[List0] Kidd, Taylor (10/23/2013), List of Useful Power and Power Management Articles, Blogs and References, http://software.intel.com/en-us/articles/list-of-useful-power-and-power-management-articles-blogs-and-references. Retrieved February 21st, 2014.

+ There are state diagrams that detail these changes and the conditions for them. Introducing these diagrams, as well as the kernel level power management APIs, is at a level of depth that is inappropriate for this article. If you have an unquenchable desire to know, they can often be found in processor data sheets or software developer’s guides.

Compare features among memory leak detection, memory growth detection and on-demand memory leak detection


    Intel® Inspector can detect regular memory leaks:
a block of memory is allocated but never de-allocated by the time the application exits. The usual restriction of this feature is that leaks are only known once the application exits.

However, some users’ applications never terminate; they act as server applications (processes). Such users want to know two critical things that Inspector can detect:
    Detect memory growth
A block of memory is allocated (whether it will later be de-allocated is not yet known) and Inspector collects memory growth in a specific time range while the application is running. The benefit of this feature is that it tells users the memory consumption while the application is running. Please note that memory growth is reported only as a warning message, not an error message.

    Detect on-demand memory leak
This is similar to memory growth; the only difference is that Inspector already knows the allocated block of memory will NOT be de-allocated later, because no pointer is available to de-allocate the memory.

There are three ways to set the specific time range while the application is running:
1.    Use the “Set Transaction Start” / “Set Transaction End” buttons in the GUI. This pair of buttons can be used for both memory growth detection and on-demand leak detection.
2.    Use the “memory-growth-start”/“memory-growth-end” options, or the “reset-leak-tracking”/“find-leaks” options, with the inspxe-cl command.

Please refer to this article for more detail.

3.    Use the APIs in the ittnotify library to control the specific time range in which to detect memory growth and/or on-demand memory leaks. Here is an example to show you how.

Please see the attached example code. It tests memory growth for 30 seconds, then tests on-demand memory leak detection for 20 minutes. You can use another console to terminate Inspector; partial results will still be generated, including the memory growth info and the on-demand leak info.
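The attachment itself is not reproduced here, so below is a minimal sketch of the idea (not the original attachment). It assumes the __itt_heap_record_memory_growth_begin/end entry points declared in the ittnotify.h that ships with Inspector; verify the exact prototypes against your installation.

/* Minimal sketch (not the original attachment): bracket a region with the
 * ittnotify heap memory-growth APIs so Intel Inspector reports allocations
 * made inside the region that are still live when the region ends.
 * Build it the same way as step 1 below. */
#include <stdlib.h>
#include <unistd.h>
#include "ittnotify.h"

int main(void)
{
    int i;

    __itt_heap_record_memory_growth_begin();   /* start of the watched time range */
    for (i = 0; i < 30; i++) {
        void *p = malloc(1024);                /* never freed: reported as memory growth */
        (void)p;
        sleep(1);                              /* roughly 30 seconds of activity */
    }
    __itt_heap_record_memory_growth_end();     /* end of the watched time range */

    /* ... the attached example then exercises on-demand leak detection for 20 minutes ... */
    return 0;
}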

Steps:
1.    Build. # gcc -g memory.c /opt/intel/inspector_xe_2013/lib64/libittnotify.a -I/opt/intel/inspector_xe_2013/include/ -o memory -lpthread  -ldl
2.    # inspxe-cl -collect mi3 -- ./memory
3.    In other console: (after 30 seconds…) # inspxe-cl -command stop -r r000mi3/
4.    Inspector will display:

.... 
Run terminated abnormally. Some data may be lost. 
  
2 new problem(s) found 
    1 Memory growth problem(s) detected 
    1 Memory leak problem(s) detected

Submissions open: High Performance Parallelism Gems


Hi everyone,

We have all had our little discoveries and triumphs in identifying new and innovative approaches that increased the performance of our applications. Usually, they are small, though important, but occasionally we find something more, something that could also help others, an innovative gem. Perhaps it is a method of analysis, or an unconventional use of the memory hierarchy, or simply the dogged application of techniques that achieves remarkable speedups. Yet, we rarely have a means of making these innovations available outside of our immediate colleagues.

You now have an opportunity to broadcast your successes more widely to the benefit of our community.

And we’re not referring only to triumphs specific to pure processor performance. Perhaps your innovation solves an I/O bottleneck issue, answers a particularly important multi-body problem, or succeeds in reducing the energy footprint of a suite of applications. These are all important to the community at large.

Of course, the editors and I are from Intel, so we are focusing on the use of Intel® Xeon® and Intel® Xeon Phi™ processors. But this focus isn’t too limiting as Intel® architectures are everywhere.

So here is your chance to share your triumphs. Do you know a unique way of exploiting multicore caches? An innovative algorithm that allows scaling to greater than 200 cores? Or a unique application of OpenMP* in conjunction with MPI in an Intel Xeon cluster? Consider letting the broader community know by submitting a proposal to the editors.

=============

PLEASE PASS AROUND TO ANYONE WHO MAY BE INTERESTED

You are invited to submit a proposal to a contribution-based book, working title, “High Performance Parallelism Gems – Successful Approaches for Multicore and Many-core Programming” that will focus on practical techniques for Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor parallel computing.

Submissions are due by May 29, 2014 in order to be guaranteed for consideration for publication in the first (2014) volume.

Please submit your proposal now. We'll work with you to refine it as needed.

If you would like to contribute, please fill out the form completely and click SUBMIT.

Visit http://lotsofcores.com/gems to send us your ideas now.

You may email us at hppg2014@easychair.org with questions (please read http://lotsofcores.com/gems first). Please submit by May 29.

Thank you,

James Reinders and Jim Jeffers

P.S. Many of you will think “Intel Xeon Phi gems,” but we actually expect “the gems” will show great ways to scale on both Intel Xeon Phi coprocessors and Intel Xeon processors, hence the working title for the book.

How to use Perf and import its result into VTune(TM) Amplifier XE?


Perf is a performance tool built into the Linux* operating system; its usage is very similar to OProfile and GProf. It uses the Performance Monitoring Unit (PMU) to set up performance counters before profiling the target application, and after profiling it reports elapsed CPU cycles, instructions retired, cache misses, branch mispredictions, etc.

For customers who need to use Perf within VTune(TM) Amplifier XE to collect an application's performance data, VTune Amplifier XE 2013 Update 17 integrates Perf’s functionality into the product. The command is “amplxe--perf”, and the original VTune Amplifier command amplxe-cl can be used to import the trace file into a VTune Amplifier result. Here is an example:
1.    amplxe--perf record -o peter.perf -T --force-per-cpu -e cpu-cycles,instructions -- ./primes.icc
Determining primes from 1 - 100000 
Found 9592 primes
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.924 MB peter.perf (~40350 samples) ]

2.    amplxe-cl -import peter.perf -r r0001

Notes:
1.    Perf has been integrated in VTune Amplifier U17; it supports the application launch mode as well as attach mode. For example, “amplxe--perf record -o peter1.perf -T --force-per-cpu -e cpu-cycles,instructions  -p <PID> sleep 10"
2.    Perf is PMU event-based sampling, so it cannot co-work with VTune’s EBS collector in one session. (Other system/OS profiling tools and custom collectors can co-work with VTune’s EBS collector – see this article.)
3.    Perf’s results can only be imported into a new VTune result directory; the reason is point 2.
4.    When a Perf result has been imported into VTune, the VTune GUI can open and display it. The result can also be reported via the VTune command line, but this use is restricted: only performance counters can be displayed – for example:
a)    “amplxe-cl -report hw-events -r r0001” works, but
b)    “amplxe-cl -report hotspots -r r0001” does not work.

Benefits of Intel® Enterprise class SSD


In this blog, I want to share with you the benefits of the Intel® Enterprise Class Solid-State Drive (SSD).  I have compiled a list of articles, white papers, solution briefs, and blogs and provided links below.  After reading through the information, I found that it would be useful for you, as developers wanting to use the Intel Enterprise Class SSD, to get the essential information quickly by grouping it into the questions below.  The questions are:  What is the Intel Enterprise Class SSD? What is the workload characterization in RAID configurations? Where are the SSDs being used? What are some methods to increase the performance?

What is the Intel Enterprise Class SSD?  

You will gain an in-depth understanding of the Intel® SSD DC S3700 SSD series and  Intel® SSD DC S3500 series by reviewing the production qualification, workload requirements, power-loss protection, consistent performance, and data integrity blogs. These series of blogs were written by the Intel IT experts Christian Black and James Myers.  

Where are the SSDs being used? 

Intel works with various developers across different applications such as database, virtualization, Hadoop, big data, and cloud computing to show the different usage models for Intel SSD.  If you are interested in database use cases, the scale up performance with NoSQL* and database in a SAN-Free environment papers can provide you with more insights.  For virtualization using SSDs, a VMWare*  use case and Intel demo use case at the Cloud Expo in NY’13 are two examples of using the SSDs in a virtualized environment.  If considering using the Intel SSD in a  cloud environment or to solve a  big data challenge, you should see how one of the largest Italian  banks is using Intel SSD with Apache Hadoop software and the guide for building a high-IOPS/low-cost storage area network (SAN).  

What is the workload characterization in RAID configurations?  

There many different workloads in the data center/cloud environment that can take advantage of the performance of the Intel® Enterprise Class SSD.  You might be wondering what the best use cases are. To learn more about different performance characteristics of the Intel SSD DC S3500 in a RAID configuration across multiple workloads, you should review the “Intel® SSD DC S3500 Series RAID Workload Characterization” guide.

What are some methods to increase the performance for your environment?

For developers who like coding and want to make some minor changes to take greater advantage of the speed of SSDs, you might want to read about characterizing disk I/O-bound applications and how you can improve performance on Linux.  You might also want to review the Intel SSD performance characteristics to help you improve workload performance.

After reading the answers to the questions above, you should have a good understanding of the Intel Enterprise Class SSD benefits and capabilities.  Seeing a wide usage of real world applications should help you characterize your workloads and tune them.  With this wealth of knowledge, I hope that you can use and apply them to your cloud and data center environment. For developers who influence purchasing decisions in their companies, the ROI (Return on Investment) and TCO (Total Cost of Ownership) of the SSDs article is a great resource. If design computing is part of your scope of work, the adoption of SSDs in your cloud and data center environment may provide additional insights.

 

 

 


Power Management Policy: You Mean There’s More Than One?


Power management policy has evolved over the years. The earliest policies consisted of little more than some critical temperature sensors and an interrupt routine that attempted (often unsuccessfully) to cleanly shut down the system before something really bad happened. Today’s sophisticated power management policies do such things as progressively shutting down parts of processor circuitry during idle with almost no impact upon performance, rapidly alternating between idle and active states, reducing processor frequencies, exploiting thermal lows to temporarily overclock the processor, and a host of other things.

EXAMPLE POLICY #1: This is one of the simplest policies. It was used in a real-time system I worked on so long ago that its existence has faded from human memory. A few well-placed temperature sensors and some hardware logic were placed on the processor’s boards. When the sensors reached certain thresholds, the hardware logic generated a high priority hardware interrupt. The interrupt routine did its best to save system state and shut down the power before anything really unpleasant occurred. To say it a different way, the policy was to save system state and cleanly shut down the system if the temperature of the hardware exceeded a certain preset threshold. I recall that it was successful only 50% of the time.

EXAMPLE POLICY #2: I wrote briefly about this policy in my previous blog on T-states; see reference [TSTATES] below. This policy uses a technique that is a precursor to P-states to give the processor a chance to cool while not interfering with the execution of most applications. When the temperature of the processor exceeds a certain threshold, the processor’s clock will start and stop with a certain duty cycle. The periods where the clock stops (i.e. is gated) allow the processor to cool. Though this slows down a running application (it ceases running when the clock is stopped), the impact for most applications is minimal outside of taking longer to execute. The exception is when the application depends upon time sensitive external events, such as externally triggered interrupts.

EXAMPLE POLICY #3: P-states. I’ve written about this quite a bit. See Power Management States: P-States, C-States, and Package C-States, reference [CPSTATES] below. Like T-states, it allows the processor to cool by slowing down applications. Unlike T-states, it is far less disruptive as the chip temporarily operates as if it has a slower oscillator, something that the design of most general purpose digital devices can accommodate. Check out the Intel® Xeon Phi™ Coprocessor System Software Developers Guide [SDG], June 2013, Figure 3-2 “Power Reduction Flow”, for an example of P-state power transition logic. As the processor is always running, slowing down the clock doesn’t affect the processing of most external events/interrupts.

EXAMPLE POLICY #4: C-States. I’ve talked about this so much that even I’m a little bored. Saying anything else will serve no purpose except to put the reader to sleep. See reference [CPSTATES] below.

EXAMPLE POLICY #5: Remote power management. In the case of the Intel® Xeon Phi™ coprocessor, part of the power management is remote. See my discussion in Power Management States: P-States, C-States, and Package C-States, reference [REMOTE]. The processor shuts down to such an extent that it is no longer capable of responding to waking events. Shutting down provides you with the ultimate in power savings as your usage is, for all intents and purposes, 0 Watts. Unfortunately, the disadvantage is significant; once you remove all the power, you can no longer respond to waking events, say from the PCIe bus. To say it another way, you cannot leave the “off” state without some external intervention, a.k.a., the host.

You can see the advantage of this in that power usage can theoretically be zero Watts. This is quite a power savings. Unfortunately, it comes at a cost, namely that this deepest power state will last for a very long time, actually forever, unless someone flips the power switch of the processor back on.

What’s up next? We’ll wrap up the general part of this discussion with a summary and a look at future possible policies.

NEXT: SUMMARY AND FUTURE POLICIES

 

REFERENCES:

[TSTATES] Kidd, Taylor (2013) “C-States, P-States, where the heck are those T-States?” https://software.intel.com/en-us/blogs/2013/10/15/c-states-p-states-where-the-heck-are-those-t-states. (Downloaded May 14th, 2014)

[CPSTATES] Kidd, Taylor (2014) “Power Management States: P-States, C-States, and Package C-States” https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. (Downloaded May 14th, 2014)

[SDG] “Intel® Xeon Phi™ Coprocessor System Software Developers Guide [SDG], June 2013,” https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide. (Downloaded May 14th, 2014)

[REMOTE] Kidd, Taylor (2014), Power Management States: P-States, C-States, and Package C-States, https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. (Downloaded May 14th, 2014)

 

Power Management Policy: Summary and Future Policies


How about the future? Have we reached the pinnacle of power management?

Hardware and software are still evolving to be even more energy efficient. An example is the “tickless” OS. In the old days, OSs had to periodically wake up the processor (i.e., perform an interrupt) around a hundred times a second and check to see if anything needed to be done, such as task switching or handling incoming data from some device. OSs haven’t needed to do this for decades, but this legacy periodic “tick” has been part of every OS until the last few years. Every wake-up meant the processor was entering a runtime state, which can potentially prevent it from dropping into the lowest power C-states. The impact is that energy is unnecessarily wasted due to a requirement that no longer exists. Thankfully, most common OSs are now tickless to one extent or another.

As devices and application domains evolve, the pressure to conserve even more energy is very strong, not only for mobile devices but for huge data centers. Mobile devices have the effrontery to get smaller and smaller; data centers need to service more and more people with more and more data; applications keep putting greater demands on processing power; and consumers demand longer and longer battery life.

These trends have resulted in a nearly 3000-fold increase in the performance / power ratio+ over the last 30 to 35 years++. And the evolution of power management hasn’t stopped. Given the strong driving forces of data center and hand-held devices, I can imagine that tomorrow’s power management will eke out even more savings as well as minimize some of the negative situations that can prevent the effective adoption of power management in certain corner cases, e.g., cases where OS jitter can’t be tolerated and precise periodic interrupts are needed.

Can you think of anything that the processor and SW can do to save even more energy (using existing hardware)? Does the processor or OS do something that isn’t really necessary anymore? Does technology have a new, more power-efficient feature that existing software still doesn’t exploit? Are there power hotspots that should be looked at? Are there areas where the processor could save energy, but the cost trade-off (e.g., latency or reliability) is too great? Can the cost trade-off be mitigated allowing the processor to save more energy? These are some of the questions that very creative architects and engineers are asking in their pursuit of improving the performance / power ratio even further.

NEXT: ADDENDUM: A QUICK REVISIT OF THE INTEL® XEON PHI™ COPROCESSOR

+You cannot simply look at energy usage as it is a moving target: scales get smaller, silicon area gets bigger, new materials and gate technologies appear, etc.

++This estimate is derived for Intel® general-purpose processors only starting with the 80286. It is a very rough ballpark estimate obtained from general Internet sources.

Performance BKMs: There’s more than one hammer


I don’t know if any of you have noticed but Intel® has a tendency to emphasize its own homegrown tools. This isn’t bad as Intel has some of the best. Still, if someone has a favorite hammer, there’s a tendency to use that hammer for just about everything.

What I want to do here is to talk about some of those other tools in your proverbial toolbox; very powerful tools, like VTune™, tend to be complex, both in use and analysis. Yes, I know that VTune is a lot more user friendly than before, but it still has a steep learning curve. In this series, I want to talk about those other tools, the ones you might want to use first before you bring out the big VTune-like guns.

Here are some of the basic tools I’m looking at. This list will no doubt expand as we go along. Please feel free to suggest more.

  • Using a stop watch (e.g. /usr/bin/time)
  • Programmatic timers (e.g. gettimeofday(); see the short sketch after this list)
  • System wide overview (e.g. /usr/bin/top)
  • Basic profilers (e.g. gprof)
  • Simple to use profilers (e.g. loopprofiler)
  • And so on
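To give a quick taste of the "programmatic timers" item above, here is a minimal sketch: a trivial stand-in workload timed with the POSIX gettimeofday() call.

/* Minimal sketch of a programmatic timer: bracket the region of interest
 * with gettimeofday() and report the elapsed wall-clock time. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;
    double elapsed;
    volatile double sum = 0.0;               /* stand-in for the real workload */
    long i;

    gettimeofday(&t0, NULL);
    for (i = 1; i < 50000000; i++)
        sum += 1.0 / (double)i;
    gettimeofday(&t1, NULL);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("workload took %.3f s (sum = %f)\n", elapsed, sum);
    return 0;
}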

Yes, I do realize that we work in a multi-processor, multi-threaded environment and will look at these tools in that regard.

I will not be talking about commercially available tools. Those are being sold and as such, have some very effective tutorials. VTune is a case in point.

NEXT: I like ‘free’. Where do I find them?

VTune is a trademark of Intel Corporation in the U.S. and/or other countries.

 

 

Meshcentral.com - Intel® AMT IDE Redirect support

$
0
0

It’s been a while since the last Meshcentral announcement, but we are making it up today with a powerful new feature. We are announcing Meshcentral cloud Intel® AMT IDE Redirect, allowing administrators to remotely reboot a mesh-enabled Intel AMT computer anywhere on the Internet with a recovery OS that is located on the MeshCentral server. This opens the door for powerful remote computer recovery, OS check, backup, re-install and more. Along with hardware KVM, it’s one of the most powerful features of Intel AMT and a significant value to administrators… but there is more, much more.

As with any Intel AMT feature implemented in Meshcentral, we take it to the next level. When the administrator decides to launch a recovery boot using mesh, the mesh server generates on the fly a new remote boot OS image. That is right, each boot is made using a fully constructed single-use OS image. A tiny Linux based image is built for the target platform with the right settings, hostname, check hashes, mesh policy, MEI driver, network drivers and more. Once booted, the recovery OS makes a set of HTTP calls to the Mesh server to stop the IDE-R session, download & check the latest mesh agent and launch it. A recovery mesh agent then connects back to the server for full control over the session, built-in LMS support in the mesh agent is used to provide local Intel AMT access.

In addition to all this, the new Meshcentral server now supports recovery agents that show up on the devices page only for the duration of the connection, along with improvements in CIRA connection handling and much more. This new feature makes it easier than ever to use IDE-R: just select and click a button, and Meshcentral does the rest.

This new feature is one of the easiest and most powerful ways to make use of Intel® vPro. We are testing the feature on Meshcentral.com and will be rolling out cloud IDE-R to mesh customers and other mesh server instances in the next few weeks. We posted on Youtube a full demonstration of Meshcentral performing IDE-R.

Questions and feedback appreciated,
Ylian Saint-Hilaire
meshcentral.com/info

The inner workings of this new feature are quite complex, involving updates to every major component of Meshcentral.

From the administrator’s perspective, it could not be easier: a few clicks trigger the IDE-R feature.


Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer


The latest Intel® Xeon® Processor E7 v2 Family lets you make an interesting choice. If you are willing to give up some of the high-availability features, you can further increase the already outstanding memory bandwidth. In this blog, I’ll explain how and why this works, and how you can observe this effect on your system using the Intel Performance Counter Monitor.

A processor in the latest Intel® Xeon® Processor E7 v2 Family comes with 4 Intel® Scalable Memory Interconnect 2 (Intel® SMI2) links. Each Intel® SMI2 link is connected to an Intel® C102/C104 Scalable Memory Buffer, which in turn provides two memory channels. Since up to three DIMMs per channel are supported, a system with 4 sockets can support up to 3*2*4*4 = 96 DIMMs.

Lockstep Mode

In lockstep mode, the scalable memory buffer distributes cache lines between the two channels: half of a cache line is then located on a DIMM of one channel and the other half is located on the other channel. In particular, only one memory channel is driving an Intel® SMI2 link, as opposed to two in independent channel mode. Furthermore, the Intel® SMI2 link operates at the same frequency as the memory.

Running the memory channels in lockstep mode has the advantage that you can apply an interesting trick to increase the system availability. Normally, you have 16 memory devices on a DIMM plus 1 device for CRC and 1 device for parity. If one of the devices fails, its data can be reconstructed. This is called single-device data correction (SDDC).

[Figure: DIMM]

For DDDC (double device data correction), you combine these 2 devices from 2 DIMMs, i.e. 4 devices per pair of DIMMs. This results in 32 “data” devices, 2 devices for CRC, 1 device for parity, and 1 spare device. If one of the devices fails, the spare device can replace it. After the failure of one device, you still have the benefit of SDDC. In summary, DDDC allows recovery from two sequential DRAM failures on the memory DIMMs, as well as recovery from a subsequent single-bit soft error on the DIMM.

Independent Channel Mode

A new feature of the Intel® Xeon® Processor E7 v2 is the ability to run the Intel® SMI2 link at twice the frequency of the memory channels. It is therefore possible for each memory channel to have its own memory controller and operate the memory channel at full speed. The Intel® SMI2 link interleaves the data from the two channels, which is then separated again by the scalable memory buffer.

Interestingly, you can also observe this when you run pcm-memory from the Intel Performance Counter Monitor package. The program pcm-memory allows you to display the memory traffic per memory channel. In lockstep mode, pcm-memory consequently displays only 4 memory channels per socket instead of the 8 memory channels shown in independent channel mode.

---------------------------------------||---------------------------------------
--   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
---------------------------------------||---------------------------------------
--  Mem Ch 0: Reads (MB/s):   26.17  --||--  Mem Ch 0: Reads (MB/s):   24.85  --
--            Writes(MB/s):   24.49  --||--            Writes(MB/s):   24.09  --
--  Mem Ch 2: Reads (MB/s):    4.91  --||--  Mem Ch 2: Reads (MB/s):    2.18  --
--            Writes(MB/s):    2.37  --||--            Writes(MB/s):    1.39  --
--  Mem Ch 4: Reads (MB/s):   25.33  --||--  Mem Ch 4: Reads (MB/s):   22.79  --
--            Writes(MB/s):   24.28  --||--            Writes(MB/s):   22.77  --
--  Mem Ch 6: Reads (MB/s):    3.14  --||--  Mem Ch 6: Reads (MB/s):    2.09  --
--            Writes(MB/s):    1.66  --||--            Writes(MB/s):    1.34  --
-- NODE0 Mem Read (MB/s):     59.55  --||-- NODE1 Mem Read (MB/s):     51.91  --
-- NODE0 Mem Write (MB/s):    52.81  --||-- NODE1 Mem Write (MB/s):    49.58  --
-- NODE0 P. Write (T/s) :    711390  --||-- NODE1 P. Write (T/s):     711008  --
-- NODE0 Memory (MB/s):      112.36  --||-- NODE1 Memory (MB/s):      101.49  --
---------------------------------------||---------------------------------------
--             Socket 2              --||--             Socket 3              --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
--   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
---------------------------------------||---------------------------------------
--  Mem Ch 0: Reads (MB/s):  112.24  --||--  Mem Ch 0: Reads (MB/s):   24.94  --
--            Writes(MB/s):   24.07  --||--            Writes(MB/s):   24.11  --
--  Mem Ch 2: Reads (MB/s):   89.65  --||--  Mem Ch 2: Reads (MB/s):    2.25  --
--            Writes(MB/s):    1.44  --||--            Writes(MB/s):    1.38  --
--  Mem Ch 4: Reads (MB/s):  110.17  --||--  Mem Ch 4: Reads (MB/s):   22.82  --
--            Writes(MB/s):   22.74  --||--            Writes(MB/s):   22.79  --
--  Mem Ch 6: Reads (MB/s):   89.60  --||--  Mem Ch 6: Reads (MB/s):    2.18  --
--            Writes(MB/s):    1.42  --||--            Writes(MB/s):    1.35  --
-- NODE2 Mem Read (MB/s):    401.66  --||-- NODE3 Mem Read (MB/s):     52.18  --
-- NODE2 Mem Write (MB/s):    49.67  --||-- NODE3 Mem Write (MB/s):    49.63  --
-- NODE2 P. Write (T/s) :    711031  --||-- NODE3 P. Write (T/s):     711011  --
-- NODE2 Memory (MB/s):      451.33  --||-- NODE3 Memory (MB/s):      101.81  --
---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    565.30                  --
--                  System Write Throughput(MB/s):    201.68                  --
--                 System Memory Throughput(MB/s):    766.98                  --
---------------------------------------||---------------------------------------

This system runs in lockstep mode. Therefore only the even channels 0, 2, 4, and 6 are used.

Tips and Tricks when working with Intel® TXT


I've recently started learning about Intel® Trusted Execution Technology (Intel® TXT).

Most important learning: Server and Client TXT are NOT the same and ACM files and TPMs differ by generation and system class. For current Intel® TXT purposes,

  • Clients are the Intel® Core i5/i7 and Xeon® E3 processors.
  • Servers are the Xeon® E5/7 processors. Only on Linux.
  • TPMs are usually either for client or server. Intel maintains a list of server-platforms (May 2014) that have the chipset, processor, TPM, and enabled BIOS to run Intel TXT.  For Intel Server Boards, the TPM is listed in the product TPS on support.intel.com (usually AXXTPME3 for clients including single socket servers and AXXTPME5 for dual socket servers).

TPMs (usually physical although there are virtual iTPMs) come from multiple vendors and you must use the specific model(s) specified by the motherboard manufacturer.  It is the OEM's responsibility to design TPM/TXT into their system, regardless of whether the TPM is already attached to the motherboard or can be added. Note: Intel TXT is just one function that relies on the TPM. The TPM can be used for drive encryption, authentication, and as a crypto provider, as well as for the measured/verified launch function. There is additional software provided by the TPM vendor (TPM SW stack) and by the OS, including interfaces into the TPM, e.g. Microsoft's TBS (TPM Base Services) or the open source tboot/tcsd/TCG software stack.  It is with these software stacks that TPM 2.0 is not backward compatible with 1.2.
Note: Intel Server TXT supports TPMs physically connected to the chipset via the LPC bus, not the i2c bus. With 2.0 there will be some support for TPMs on the SPI bus on servers.

Watch the Versions:  

  • TPM 1.x (mostly 1.2) was the standard for a long time. In 2014 vendors are starting to ship TPM 2.0, which is not backward compatible. You MUST match the TPM to the vendor's system requirements.
  • LCP (Launch Control policies) have a v1 and v2 and come in signed and unsigned. There's information on LCP in the tboot package under docs.
  • Intel's AXXTPME3 comes in two versions - the v3 boards use the second (MM#922115). 

SINIT ACM files: The SINIT binary is the unencrypted, Intel-signed ACM (Authenticated Code Module) built for a specific chipset/CPU combination. Intel's naming format is (platform)_SINIT_(v#).bin and most files can be found at SINIT ACM kits. These kits contain the bin and usually a changelog and the error decoder. Both client and server kits are on this page.
BIOS ACM kits are available from the Intel Business Link (IBL) but require an NDA. These kits usually contain provisioning tools, including .bat or EFI files to read PS and AUX and their capabilities.

On Linux, there are NO kernel changes required for TXT other than making sure tboot is included.  Since Intel TXT doesn't trust the Linux driver's security, the TXT authenticated code module interfaces directly to the TPM.  The system's OS/VMM vendor can give specific advice on what additional drivers, if any, are needed for a particular TPM or system.

Both client and server TXT on Linux use the Linux open source Trusted Boot (tboot) software, a pre-kernel/VMM module that executes GetSec(Senter). There are also calls to launch policies (from TPM NVRAM) to verify the kernel. A discussion forum is also available there.
Instructions are available at multiple Linux sites including the Fedora Wiki. 

Reading error codes - Error codes vary between client and server, as well as (to some degree) between generations of the processor.
The SINIT ACM kits released by Intel include a SINIT Error Code Document (PDF or TXT) that decodes the error codes thrown by the Intel components. If the error is thrown by the TPM, the error code can be decoded using the industry specifications, the "Error Code Cheat Sheet for the TPM 1.2" (search on the web), or a list from your TPM vendor.

For example, the error code 0xC03d0441 on a CLIENT Intel Core i5-3470 processor (so the 3rd generation i5/i7 ACM kit) first decodes, using the ACM kit document, to an error that is further defined in bits 23:15; those bits (3D) then decode, using the TPM document, to "locality incorrect for the attempted operation."
Note: if this occurs on a single socket server system, check that it has the correct TPM.

Troubleshooting and Installation

  • Especially on Linux servers, ensure that the system boots into the OS before the TPM/TXT are enabled.
  • Verify that the PCRs (platform configuration registers) are populating and that Measured launch equals True. The Fedora Wiki, bottom of page, lists the PCRs.
  • tboot produces a log that generally includes the error code (decode as above).
  • For server installation see the Intel TXT Server Enabling Guide and How to Enable

Documentation: Intel publishes the Intel® TXT MLE Software Dev Guide and Intel employees have written books/ebooks on Intel TXT. (Check out Intel® TXT Books at Apress or other book/e-book vendors.)
There is additional documentation available under NDA. Contact your Intel field representative.

 

Additional AVX-512 instructions


Additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

The Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions.

As I discussed in my first blog about Intel® AVX-512 last year, Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing.

We had committed that Intel® AVX-512 would also be supported by some future Intel® Xeon® processors scheduled to be introduced after Knights Landing. These additional instructions we have documented now will appear in such processors together with most Intel® AVX-512 instructions published previously.

These new instructions enrich the operations available as part of Intel® AVX-512. They are provided in two groups. A group of byte and word (8 and 16-bit) operations known as Byte and Word Instructions, indicated by the AVX512BW CPUID flag, enhances integer operations. It is notable that these do make use of all 64 bits in the mask registers. A group of doubleword and quadword (32 and 64-bit) operations known as Doubleword and Quadword Instructions, indicated by the AVX512DQ CPUID flag, enhances integer and floating-point operations.

An additional orthogonal capability known as Vector Length Extensions provides for most AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. Vector Length Extensions can currently be applied to most Foundation Instructions, the Conflict Detection Instructions as well as the new Byte, Word, Doubleword and Quadword instructions. These AVX-512 Vector Length Extensions are indicated by the AVX512VL CPUID flag. The use of Vector Length Extensions extends most AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, AVX) registers. The use of Vector Length Extensions allows the capabilities of EVEX encodings, including the use of mask registers and access to registers 16..31, to be applied to XMM and YMM registers instead of only to ZMM registers.
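As a small illustrative sketch (not taken from the Programming Reference), the following uses AVX512VL/AVX512F compiler intrinsics to apply an EVEX write-mask to a 256-bit (YMM) integer add; it assumes a compiler and a target, or the Intel® Software Development Emulator mentioned below, that support AVX512VL:

/* Illustrative sketch: with AVX-512 Vector Length Extensions (AVX512VL),
 * EVEX features such as write-masking apply to 256-bit YMM registers,
 * not only to 512-bit ZMM registers.
 * Example build: gcc -O2 -mavx512vl vl_demo.c (or run under the Intel SDE). */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i a   = _mm256_set1_epi32(10);
    __m256i b   = _mm256_set1_epi32(1);
    __m256i src = _mm256_setzero_si256();
    __mmask8 k  = 0x0F;                    /* write only the low four 32-bit lanes */

    /* Masked 32-bit add on a YMM register: lanes whose mask bit is 0 keep
     * the value from 'src' instead of receiving the sum. */
    __m256i r = _mm256_mask_add_epi32(src, k, a, b);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);             /* expected: 11 11 11 11 0 0 0 0 */
    printf("\n");
    return 0;
}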

Emulation for Testing, Prior to Product

In order to help with testing of support, the Intel® Software Development Emulator has been extended to include these new Intel AVX-512 instructions and will be available very soon at http://www.intel.com/software/sde.

Intel AVX-512 Family of instructions

Intel AVX-512 Foundation Instructions will be included in all implementations of Intel AVX-512. While the Intel AVX-512 Conflict Detection Instructions are documented as optional extensions, their value for compiler vectorization has proven strong enough that they will be included in Intel Xeon processors that support Intel AVX-512. This makes Foundation Instructions and Conflict Detection Instructions part of all Intel® AVX-512 support, for both the future Intel Xeon Phi coprocessors and processors and future Intel Xeon processors.

Knights Landing will support Intel AVX-512 Exponential & Reciprocal Instructions and Intel AVX-512 Prefetch Instructions, while the first Intel Xeon processors with Intel AVX-512 will support Intel AVX-512 Doubleword and Quadword Instructions, Intel AVX-512 Byte and Word Instructions and Intel AVX-512 Vector Length Extensions. Future Intel® Xeon Phi™ coprocessors and processors, after Knights Landing, may offer additional Intel AVX-512 instructions but should maintain a level of support at least that of Knights Landing (Foundation Instructions, Conflict Detection Instructions, Exponential & Reciprocal Instructions, and Prefetch Instructions). Likewise, the level of Intel AVX-512 support in the Intel Xeon processor family should include at least Foundation Instructions, Conflict Detection Instructions, Byte and Word Instructions, Doubleword and Quadword Instructions and Vector Length Extensions whenever Intel AVX-512 instructions are supported. Assuming these baselines in each family simplifies compiler design and is recommended.
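
As a rough sketch of how software might check for these baselines at run time (my own illustration, not from the instruction-set reference): recent GCC and Clang releases expose the CPUID flags named above through __builtin_cpu_supports, so a dispatch routine could look approximately like this, assuming a compiler new enough to recognize these feature strings.

// Hypothetical run-time dispatch helpers; the feature strings depend on compiler support.
int has_avx512_xeon_baseline(void)
{
    return __builtin_cpu_supports("avx512f")  && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512dq")
        && __builtin_cpu_supports("avx512vl");
}

int has_avx512_knl_baseline(void)
{
    return __builtin_cpu_supports("avx512f")  && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512er") && __builtin_cpu_supports("avx512pf");
}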

Intel AVX-512 support

Release of detailed information on these additional Intel AVX-512 instructions helps enable support in tools, applications and operating systems by the time products appear. We are working with open source projects, application providers and tool vendors to help incorporate support. The Intel compilers, libraries, and analysis tools have strong support for Intel AVX-512 today and updates, planned for November 2014, will provide support for these additional instructions as well.

Intel AVX-512 documentation

The Intel AVX-512 instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference. Intel AVX-512 is detailed in Chapters 2-7.

 


Intel PCM Column Names Decoder Ring


When Intel Performance Counter Monitor (Intel PCM) is generating csv files as output, short names are used as column headers. This helps to keep the table width at a manageable size if the data is loaded in a spreadsheet program. However, it makes it rather hard to guess what exactly is hiding behind these abbreviations. Since I'm getting a lot of questions on how to interpret these column names, I've put together a decoder ring:

Field | Explanation | Example
The following metrics are available on all levels:
Date | Day-Month-Year | 05-02-14
Time | Time of day | 13:38:04
EXEC | Instructions per nominal CPU cycle, i.e. with respect to the CPU frequency, ignoring turbo and power saving | 0.182
IPC | Instructions per cycle. This measures how effectively you are using the core. | 0.159
FREQ | Frequency relative to nominal CPU frequency ("clockticks"/"invariant timer ticks") | 1.143
AFREQ | Frequency relative to nominal CPU frequency, excluding the time when the CPU is sleeping | 1.143
L3MISS | L3 cache line misses in millions | 182.879
L2MISS | L2 cache line misses in millions | 356.3
L3HIT | L3 cache hit ratio (hits/reference) | 0.487
L2HIT | L2 cache hit ratio (hits/reference) | 0.233
L3CLK | Rough estimate of cycles lost to L3 cache misses vs. clockticks | 0.044
L2CLK | Rough estimate of cycles lost to L2 cache misses vs. clockticks | 0.008
The following metrics are only available on socket and system level:
READ | Memory read traffic on this socket in GB | 23.108
WRITE | Memory write traffic on this socket in GB | 10.782
The following metrics are only available on a socket level:
Proc Energy (Joules) | The energy consumed by the processor, in Joules. Divide by the time to get the power consumption in watts. | 122.457
DRAM Energy (Joules) | The energy consumed by the DRAM attached to this socket, in Joules. Divide by the time to get the power consumption in watts. | 115.747
TEMP | Thermal headroom in Kelvin | 32
The following metrics are only available on a system level:
INST | Number of instructions retired | 119706
ACYC | Number of clockticks. This takes turbo and power saving modes into account. | 750640.8
TIME(ticks) | Number of invariant clockticks. This is invariant to turbo and power saving modes. | 2817.883
PhysIPC | Instructions per cycle (IPC) multiplied by number of threads per core | 0.319
PhysIPC% | Instructions per cycle (IPC) multiplied by number of threads per core, relative to maximum IPC | 7.974
INSTnom | Instructions per nominal cycle multiplied by number of threads per core | 0.365
INSTnom% | Instructions per nominal cycle multiplied by number of threads per core, relative to maximum IPC. The maximum IPC is 2 for Atom and 4 for all other supported processors. | 9.113
TotalQPIin | QPI data traffic estimation (data traffic coming to the CPU/socket through QPI links) in MB (1024*1024) | 21937.96
QPItoMC | Ratio of QPI traffic to memory traffic | 0.632
TotalQPIout | QPI traffic estimation (data and non-data traffic outgoing from the CPU/socket through QPI links) in MB (1024*1024) | 38443.3

Please also note that PCM reports absolute values for the measured time interval. For example, if you use a time interval of 5 seconds, memory traffic and instructions retired are reported for the whole 5 seconds. Only if you run PCM with a 1-second time interval will you get memory traffic directly in GB/s.
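
If you want per-second rates from a longer interval, dividing the reported totals by the interval length is enough; a trivial helper (purely illustrative, not part of Intel PCM) would be:

// Convert a per-interval total from the PCM csv into a rate (illustrative only).
double to_rate(double total_for_interval, double interval_seconds)
{
    return total_for_interval / interval_seconds;   // e.g. READ = 23.108 GB over 5 s gives ~4.62 GB/s
}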

 

Sorting the Words in the Mahabharata


 

by N. Shamsundar

 

Introduction

In an earlier article, “Indexing the Mahabharata,” I wrote about building a sorted index for one of the epics of India, the Mahabharata. The Mahabharata may be as old as six thousand years, according to some accounts. One edition of the book, with the original Sanskrit verses and Hindi translation, is in six quarto-size volumes with a total of 6,511 pages. A Unicode version of just the Sanskrit text, with no commentary, is about 20 to 25 MB. The words extracted from this text number about 700,000 (many words occur more than once, of course). In this follow-up article, I discuss the details of sorting those words while retaining the associated book information: chapter, subchapter, and verse numbers.

A user may be only interested in one subset of the words -- for example, the names of the important characters. Another user may be interested in place names, yet another in activities performed, etc. In this article, I examine how to sort the entire set of words, but please note that any method and software that can do this correctly and efficiently will also be able to sort and index any desired subset.

Required attributes of the sorting/collating method

The Unicode representation of complex scripts such as Devanagari, which is the most appropriate script for Sanskrit (and also for the current national language of India, Hindi), results in the occurrence of Unicode sequences of one to four (or more) code-points (a code-point takes 16, 24 or 32 bits, depending on the particular variety of Unicode representation used such as UTF-8, UTF-16 or UTF-32). In contrast to West-European languages (English, French, German, etc.), which generally have a one-to-one relation between “letter” and code-point, a single Sanskrit “letter” (some people use the word “grapheme”) may be represented with up to four Unicode characters, each of which represents only a piece of the Sanskrit compound letter. It is not surprising, therefore, that sorting by code-point order gives correct sorting only about 75 percent of the time, and collation that is based on mere code-point order is not adequate for the task. Another consequence of this aspect of Sanskrit and Indic scripts and languages is that sorting methods such as radix-sort and burst-sort, which are highly effective for sorting ASCII/ANSI text strings, are not applicable.

Therefore, I focused on sorting methods that depend on string comparison. It has been mathematically proved that such sorting methods can, at best, have a "complexity" on the order of N log N when sorting N strings. When selecting a sorting algorithm, we have to consider additional factors such as stability (see below), the magnitude of the constant factor multiplying N log N, and the availability of efficient library routines that implement the atomic operations of the algorithm.

Comparing two Unicode strings

The Unicode standards specify an algorithm for sorting, called the UCA (Unicode Collation Algorithm), and include a provision for “tailoring” the collation sequence by specifying additional rules. Such tailoring is necessary because languages such as Marathi and Hindi that use the same script (Devanagari) have a tradition of using different collation orders.

Standard libraries such as the MSVCRT (Microsoft Visual C run-time library) and the ICU (International Components for Unicode) library provide two methods for comparing Unicode strings. The first method is based on one or more routines that replace the ubiquitous strcmp()/strncmp() of the C standard library. Examples are wcscoll()/_wcsncoll() from MSVCRT and ucol_strcoll() from ICU.

The second method of comparing strings involves two steps. The original Unicode strings are transformed to corresponding “keys” that are strings of 8-bit or 16-bit characters, depending on the library, and the keys are then compared using the same routines as those used for ANSI strings. For comparing just one pair of strings, both methods are equivalent. In fact, the invocation of wcscoll(s1, s2) may result in two calls to wcsxfrm() to obtain the keys k1 and k2, followed by the call wcscmp(k1, k2) to compute the result of the comparison. However, explicitly computing, storing and working with the keys has advantages when a single string, such as कृष्ण (“Krishna”) occurs many times in the text being sorted (because the same word occurs in many chapters and verses).

I found that, on average, each string, whether appearing once or many times, was used in about twenty-five comparisons. Therefore, explicitly building up the keys involves about 700,000 calls to wcsxfrm(), whereas calling wcscoll() from user code results in 16 million background calls to wcsxfrm(). As in many branches of computing, we encounter a trade-off between simplicity of code and performance.
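
A minimal sketch of the key-based approach (my own illustration; the locale name and its availability to setlocale are platform dependent, and the original code is not reproduced here):

#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Precompute a collation key for one wide-character string; keys can then be
// compared with plain wcscmp instead of repeated calls to wcscoll.
std::vector<wchar_t> make_key(const std::wstring& s)
{
    std::setlocale(LC_COLLATE, "sa-IN");               // hypothetical locale name; set once in real code
    size_t n = std::wcsxfrm(nullptr, s.c_str(), 0);    // query the required key length
    std::vector<wchar_t> key(n + 1);
    std::wcsxfrm(key.data(), s.c_str(), key.size());
    return key;
}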

Sorting algorithms and software tested

I selected two existing third-party software packages and adapted two sorting utilities that I had available in source code form.

The third-party software packages were Bill Poser’s MSORT from http://billposer.org/Software/msort.html and Chris Nyberg’s NSORT from http://ordinal.com. The two packages are quite different. MSORT is designed for linguistic work and has many options and features to control the nature of the collation. It runs on a single core and is rather slow because of the intricate scanning and processing of the input text that is performed before the sorting and output phases. NSORT, in contrast, is high-performance commercial software having no awareness of non-ANSI text, but it so happens that NSORT can be tricked into sorting Unicode text. What I did was to change all line-feed characters in a UTF-16 version of the Mahabharata text from 0x000A to 0x7E0A, tell NSORT that the input consisted of ANSI text records with 0x7E as the record separator, and specify that NSORT should call a user-supplied comparison routine. In this comparison routine, I call the MSVCRT library routine CompareStringEx() to compare the strings, with the environment variable LC_COLLATE set to “sa-IN”.
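
The user-supplied comparison routine could be sketched roughly as follows (the exact callback signature NSORT expects is defined by its documentation; here the locale name is passed to CompareStringEx directly rather than through LC_COLLATE):

#include <windows.h>

// Illustrative comparison of two UTF-16 records using the Windows collation for Sanskrit.
// Returns a negative, zero or positive value in the usual strcmp convention.
int compare_records(const wchar_t* s1, int len1, const wchar_t* s2, int len2)
{
    int r = CompareStringEx(L"sa-IN", 0, s1, len1, s2, len2, NULL, NULL, 0);
    return r - CSTR_EQUAL;   // CSTR_LESS_THAN / CSTR_EQUAL / CSTR_GREATER map to -1 / 0 / +1
}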

The two sorting utilities (in source code form) that I adapted use (i) quicksort and (ii) LCP (Longest Common Prefix) mergesort. From work that I had performed many years ago, I had source code (in C) for reading ANSI text files and sorting them with the sort key being a chosen section of the input line (e.g., columns 10 to 19). I modified the sources to suit Unicode sorting by (i) changing the input file processing from ANSI to UTF-16, (ii) computing a collation key as described above, and (iii) writing a comparison function to compare collation keys using MSVCRT routines or ICU, as described above.

I took care to keep the sorting algorithm “stable,” which means that lines with identical keys should occur in the same relative order in the output as they did in the input. For example, consider four lines in the output that contain the word “Krishna.”

 3  82   18 कृष्ण

 3 119    5 कृष्ण

 4  39   20 कृष्ण

 5  70    6 कृष्ण

 

Note that the chapter, subchapter, and verse numbers, which are prefixed to the Sanskrit word, are in proper order. In the comparison function, when two keys are equal, instead of returning 0, which would cause equal keys to appear in indeterminate order, I compare the key pointers and return -1 when the first pointer is less than the second and +1 otherwise (the pointers cannot be equal in a properly written sorting routine).
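
In code, the tie-break described above amounts to something like this sketch (the function and variable names are mine, and the pointer comparison assumes all keys live in one buffer filled in input order):

#include <cwchar>

// Stable comparison of two collation keys: never return 0, so records with equal
// keys keep their relative input order.
int compare_keys_stable(const wchar_t* k1, const wchar_t* k2)
{
    int r = std::wcscmp(k1, k2);
    if (r != 0)
        return r;
    return (k1 < k2) ? -1 : +1;   // lower address means the record appeared earlier in the input
}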

Performance and quality of sorted output

With the exception of MSORT, which I built from source and ran on openSUSE* Linux* 12.3, all the runs were made under Windows* 8.1 X64. MSORT took over 30 seconds to read and digest the input file on an Intel® Core™2 Duo processor E8400. The subsequent sorting and the writing of the sorted output took under 5 seconds.

NSORT and my adaptations of quicksort and mergesort, all of which used the MSVCRT key comparison routines, produced identical results. This is expected, but the agreement is a check on the correctness of the adaptations. These results differ, however, from the MSORT output. To my layman’s eyes, the collation order imposed by the MSVCRT routines appeared to be better in the sense that the order was in close agreement with what I could observe in some Sanskrit dictionaries available to me.

All the runs consumed a second or less to read the input and compute the collation keys on an Intel® Core™ i7 processor 2720QM, and less than a second each to sort the keys and output the sorted records. The fastest run came from LCP mergesort, which took 0.31, 0.15, and 0.11 seconds for the input, sort, and output phases, respectively. Note that the sort phase was fast because I used two Intel® Cilk™ Plus directives to parallelize the sorting phase using eight threads. Without the Intel Cilk Plus parallelization, the sorting time was about 0.6 second[1].

In addition to the runs just described, I made some runs with versions of quicksort and mergesort in which I used ICU routines to compute the collation keys instead of the MSVCRT routines. Note that the ICU keys are strings of bytes, whereas the MSVCRT keys are strings of concatenated two-byte characters. These versions ran a bit faster than the MSVCRT versions did. However, the quality of the collation in the output, while acceptable, was slightly lower (again, to my layman’s eyes) than the results described in the preceding paragraph.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel logo, Cilk, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.

Copyright © 2014 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.

 

[1] Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Dell XPS17 with 8G RAM, Intel® Core™ i7 processor 2720QM, Windows 8.1 Pro X64, tests run by author on BORI text of the Mahabharata (http://sanskritdocuments.org/mirrors/mahabharata/mahabharata-bori.html). For more information go to http://www.intel.com/performance



Developer API Documentation for Intel® Performance Counter Monitor


 

The Intel® Performance Counter Monitor (Intel® PCM: www.intel.com/software/pcm) is an open-source tool set based on an API. This API can be used directly by developers in their software. Besides the API usage example in the article, other samples of code using the API can be found in pcm.cpp, pcm-tsx.cpp, pcm-power.cpp, pcm-memory.cpp and other sample tools contained in Intel PCM package.
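
For orientation, a very small usage sketch of the API follows; it is based on functions declared in cpucounters.h of the Intel PCM package, but treat the exact names and signatures as something to verify against the Doxygen documentation described below for your PCM version.

#include "cpucounters.h"   // from the Intel PCM package
#include <iostream>

int main()
{
    PCM* m = PCM::getInstance();
    if (m->program() != PCM::Success) return 1;          // program the performance counters

    SystemCounterState before = getSystemCounterState();
    // ... run the code you want to measure ...
    SystemCounterState after = getSystemCounterState();

    std::cout << "IPC: "          << getIPC(before, after)             << "\n"
              << "L3 hit ratio: " << getL3CacheHitRatio(before, after) << "\n";

    m->cleanup();                                        // release the counters
    return 0;
}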

An important resource for learning about the Intel PCM API is the embedded Doxygen documentation. For example, it lists all the functions for extracting the processor metrics supported by Intel PCM. Generating browsable HTML documentation from the source code with Doxygen is very easy: a Doxygen project file is already included in the Intel PCM package, and most of the source code is annotated with Doxygen tags (descriptions of function parameters, return values, etc.).

Here are the steps to generate the documentation:

  1. Download the Doxygen tool for your operating system from www.doxygen.org (Doxygen is available for many operating systems, including Windows, Linux, Mac OS X, etc.)
  2. Run doxygen in the Intel PCM directory
  3. Open generated html/classPCM.html in your favourite browser
  4. Click on the classes and structure of your interest, browse class hierarchies, functions implementing access to processor metrics, etc

For the current Intel PCM 2.6, there is already documentation made available here.

Best regards,

Roman

Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)


As announced last week by James, future Intel Xeon processors will add support for byte and word processing in AVX-512. It is therefore time to revisit my blog from last year, where I showed how to use Intel AVX2 to check whether a bit is set in an array of bits. This time, however, I will assume that the input consists of bytes, which allows a really nice trick: replacing the gather instruction with a permutation.

Recall that, given an array of bits B and a list of integers i0,…,in, we want to check whether the bits B[i0],…,B[in] are set. The result is an array of bits where each bit represents the result of the corresponding look-up.

The C++ code stays the same as before except for the type of the input array, which is now vector<unsigned char>:

void check_bits(vector<bool> const& B,
                vector<unsigned char> const& Input,
                vector<bool>& Output)
{
    for (int i = 0; i < Input.size(); ++i)
        if (B[Input[i]])
            Output[i] = true;
}

In the C version of the code, I'll make a few more modifications. We use the 16-bit type unsigned short for reading from the bit-vector, which will make it easier to convert the code to use word operations in AVX-512. The inner loop therefore processes 32 entries. Consequently, the result is stored as an array of unsigned ints:

void check_bits(unsigned short B[],
                unsigned char Input[],
                unsigned int Output[],
                const int Length)
{
    for (int i = 0; i < Length/32; ++i) {
        unsigned int Result = 0;
        for (int j = 0; j < 32; ++j) {
            int Pos = Input[i*32 + j];
            unsigned short Bits = B[Pos >> 4];                  // extract one word
            unsigned int SingleBit = (Bits >> (Pos & 15)) & 1;  // extract one bit
            Result |= SingleBit << j;                           // accumulate result in double-word
        }
        Output[i] = Result;
    }
}

As before, the variable Pos holds the index. We first read the word in B that contains bit number Pos. In other words, we read 16 bits, one of which is the bit we are interested in. This word, named Bits, has the index Pos/16 = Pos>>4. The subsequent line extracts the bit of interest. Inside the word Bits, this bit is located at position Pos%16 = Pos&15. The bit is therefore shifted to the right by Pos&15 and then masked out with 1. As in the previous version, we collect the bits of the inner loop in the Result variable, which is written out at the end of the inner loop.
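
Before moving on to the intrinsics, here is a quick check of that index arithmetic with a hypothetical input value:

// Worked example of the index arithmetic above (hypothetical value Pos = 37).
void index_arithmetic_example(void)
{
    unsigned short B[16] = { 0 };                            // 256-bit bit-vector stored as 16 words
    int Pos = 37;
    B[Pos >> 4] = (unsigned short)(1u << (Pos & 15));        // set bit 37: word 37>>4 == 2, bit 37&15 == 5
    unsigned SingleBit = (B[Pos >> 4] >> (Pos & 15)) & 1;    // SingleBit == 1
    (void)SingleBit;
}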

This code can be translated almost directly to Intel AVX-512 instructions. The only roadblock is the array lookup, which normally would be implemented by a gather instruction:

unsigned short Bits = B[Pos >> 4]; // extract one word

Now recall that our input is a list of bytes. Therefore, the bit-vector B is at most 2^8 = 256 bits long, which nicely fits in a vector register. The gather instruction can therefore be replaced by the permute instruction for words that is part of AVX-512BW. This is very attractive, as the latency of a permutation is much lower than that of a gather instruction. The intrinsic for VPERMW is:

__m512i _mm512_permutexvar_epi16( __m512i idx, __m512i a);

A precise definition as well as further permutation instructions including variations for byte and masks can be found in the Intel® Architecture Instruction Set Extensions Programming Reference.

void check_bits(unsigned short B[],
                unsigned char Input[],
                unsigned int Output[],
                const int Length)
{
    __m256i* Input256 = (__m256i*) Input;
    __m256i Bitvector256 = *((__m256i*) B);                   // the whole 256-bit bit-vector
    __m512i Bitvector512 = _mm512_castsi256_si512(Bitvector256);

    for (int i = 0; i < Length/32; ++i) {
        __m512i Offsets = _mm512_cvtepu8_epi16(Input256[i]);                          // widen 32 bytes to 32 words
        __m512i OffsetsGather = _mm512_srli_epi16(Offsets, 4);                        // word index: Pos >> 4
        __m512i OffsetsShifts = _mm512_and_epi32(Offsets, _mm512_set1_epi16(15));     // bit offset: Pos & 15
        __m512i BGathered = _mm512_permutexvar_epi16(OffsetsGather, Bitvector512);    // "gather" via permute
        __m512i BitPos = _mm512_sllv_epi16(_mm512_set1_epi16(1), OffsetsShifts);
        __m512i Masked = _mm512_and_epi32(BGathered, BitPos);
        Output[i] = _mm512_cmpeq_epi16_mask(Masked, BitPos);                          // 32-bit result mask
    }
}

The input values are first converted to 16-bit values, then the position of the words in the bit-vector and their shift offset are computed. As discussed above, the whole bit-vector is stored in Bitvector512 outside of the loop and serves as the input to the permutation.

The comparison at the end reveals another novelty of Intel AVX-512. Comparisons can now produce their results in the new mask registers:

__mmask32 _mm512_cmpeq_epi16_mask(__m512i a, __m512i b);

The result can therefore be written directly to memory as a bit-vector, and there is no longer any need for a movmsk instruction, as was the case in the AVX2 version.

As a last optimization, we take advantage of another new instruction. VPTESTM{B,W,D,Q} ANDs the contents of two vector registers <a1,…,an> and <b1,…,bn>. The instruction then returns, for each element of the vector, whether this operation results in a non-zero value: <(a1 & b1)!=0,…, (an & bn)!=0>. The return type is also a mask register:

__mmask32 _mm512_test_epi16_mask( __m512i a, __m512i b);

This VPTESTMW instruction can therefore replace the AND and the comparison in our example:

void check_bits(unsigned short B[],
                unsigned char Input[],
                unsigned int Output[],
                const int Length)
{
    __m256i* Input256 = (__m256i*) Input;
    __m256i Bitvector256 = *((__m256i*) B);
    __m512i Bitvector512 = _mm512_castsi256_si512(Bitvector256);

    for (int i = 0; i < Length/32; ++i) {
        __m512i Offsets = _mm512_cvtepu8_epi16(Input256[i]);
        __m512i OffsetsGather = _mm512_srli_epi16(Offsets, 4);
        __m512i OffsetsShifts = _mm512_and_epi32(Offsets, _mm512_set1_epi16(15));
        __m512i BGathered = _mm512_permutexvar_epi16(OffsetsGather, Bitvector512);
        __m512i BitPos = _mm512_sllv_epi16(_mm512_set1_epi16(1), OffsetsShifts);
        Output[i] = _mm512_test_epi16_mask(BGathered, BitPos);   // AND + non-zero test in one instruction
    }
}

This finishes our little journey into the Intel AVX-512 Byte and Word instruction set. There are many more new instructions, which open up new opportunities for faster code, and I hope that you enjoy exploring them as much as I do.

 

One-Sided Communication


In this continuation of the blog, Hybrid MPI and OpenMP* Model, I will discuss the use of MPI one-sided communication and demonstrate running a one-sided application in symmetric mode on an Intel® Xeon® host and two coprocessors connected via PCIe.

The standard Message Passing Interface (MPI) has two-sided communication and collective communication models. In these communication models, both sender and receiver have to participate in data exchange operations explicitly, which requires synchronization between the processes.

In two-sided communication, memory is private to each process. When the sender calls the MPI_Send operation and the receiver calls the MPI_Recv operation, data in the sender memory is copied to a buffer then sent over the network, where it is copied to the receiver memory. One drawback of this approach is that the sender has to wait for the receiver to be ready to receive the data before it can send the data. This may cause a delay in sending data. Figure 1 illustrates this situation.

 

Figure 1 A simplified diagram of MPI two-sided communication send/receive. The sender calls MPI_Send but has to wait until the receiver calls MPI_Recv before data can be sent.

To overcome this drawback, the MPI 2 standard introduced Remote Memory Access (RMA), also called one-sided communication because it requires only one process to transfer data. One-sided communication decouples data transfer from system synchronization. The MPI 3.0 standard revised one-sided communication and added extensions, providing new functionality to improve the performance of MPI 2 RMA. The Intel® MPI Library 5.0 supports one-sided communication, where a process can directly access the memory address space of a remote process (MPI_Get/MPI_Put/MPI_Accumulate) without the intervention of that remote process. One-sided communication operations are non-blocking. They benefit many applications because, while a process sends data to a remote process, the remote process can continue to compute (useful work) instead of waiting for the data.

In order to allow other processes access to its memory, a process has to explicitly expose its own memory to others. It does this (MPI_Win_create) by declaring a shared memory region, also called a window. Synchronization in MPI one-sided communication can be achieved with MPI_Win_fence. Put simply, between two MPI_Win_fence calls all RMA operations are completed.

To illustrate MPI one-sided communication, the below sample program shows the use of MPI_Get and MPI_Put using a memory window. Note that error checking is not implemented, since the program is intended only to show how one-sided communication works.

/*
// Copyright 2003-2014 Intel Corporation. All Rights Reserved.
//
// The source code contained or described herein and all documents related
// to the source code ("Material") are owned by Intel Corporation or its
// suppliers or licensors.  Title to the Material remains with Intel Corporation
// or its suppliers and licensors.  The Material is protected by worldwide
// copyright and trade secret laws and treaty provisions.  No part of the
// Material may be used, copied, reproduced, modified, published, uploaded,
// posted, transmitted, distributed, or disclosed in any way without Intel's
// prior express written permission.
//
// No license under any patent, copyright, trade secret or other intellectual
// property right is granted to or conferred upon you by disclosure or delivery
// of the Materials, either expressly, by implication, inducement, estoppel
// or otherwise.  Any license under such intellectual property rights must
// be express and approved by Intel in writing.
*/
#include <stdio.h>
#include <mpi.h>

#define NUM_ELEMENT  4

int main(int argc, char** argv)
{
   int i, id, num_procs, len, localbuffer[NUM_ELEMENT], sharedbuffer[NUM_ELEMENT];
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Win win;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
   MPI_Get_processor_name(name, &len);

   printf("Rank %d running on %s\n", id, name);

   MPI_Win_create(sharedbuffer, NUM_ELEMENT, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   for (i = 0; i < NUM_ELEMENT; i++)
   {
      sharedbuffer[i] = 10*id + i;
      localbuffer[i] = 0;
   }

   printf("Rank %d sets data in the shared memory:", id);

   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", sharedbuffer[i]);

   printf("\n");
 
   MPI_Win_fence(0, win);

   if (id != 0)
      MPI_Get(&localbuffer[0], NUM_ELEMENT, MPI_INT, id-1, 0, NUM_ELEMENT, MPI_INT, win);
   else
      MPI_Get(&localbuffer[0], NUM_ELEMENT, MPI_INT, num_procs-1, 0, NUM_ELEMENT, MPI_INT, win);

   MPI_Win_fence(0, win);

   printf("Rank %d gets data from the shared memory:", id);

   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", localbuffer[i]);

   printf("\n");

   MPI_Win_fence(0, win);

   if (id < num_procs-1)
      MPI_Put(&localbuffer[0], NUM_ELEMENT, MPI_INT, id+1, 0, NUM_ELEMENT, MPI_INT, win);
   else
      MPI_Put(&localbuffer[0], NUM_ELEMENT, MPI_INT, 0, 0, NUM_ELEMENT, MPI_INT, win);

   MPI_Win_fence(0, win);

   printf("Rank %d has new data in the shared memory:", id);
 
   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", sharedbuffer[i]);

   printf("\n");

   MPI_Win_free(&win);
   MPI_Finalize();
   return 0;
}

In the sample code, MPI_Init initializes the MPI environment, which allows the parallel code. MPI_Comm_rank and MPI_Comm_size return the MPI process identification and the number of MPI processes (or ranks), respectively. MPI_Get_processor_name returns the name and length of the processor name.

Each process in the system has a shared memory region, called sharedbuffer, which is an array of 4 integers. MPI_Win_create is called by all processes to create a window of shared memory; the window specifies the process memory that is available for remote operations. Each process then initializes its portion of the memory window, allowing remote processes read/write access to the pre-defined memory.

MPI_Put writes data into a memory window on a remote process. MPI_Get reads data from a memory window on a remote process. Between the first two MPI_Win_fence calls, each process reads data from the window on the preceding process and copies it to its local memory; between the last two MPI_Win_fence calls, each process copies data from its local memory to the window of the succeeding process. Each process finally checks its new data (which has been modified by a remote process). MPI_Win_free is called to free the memory window and MPI_Finalize to clean up all MPI state and the MPI environment; the parallel code ends here.

In the following section, the sample code is compiled and run on an Intel Xeon host system equipped with two Intel® Xeon Phi™ coprocessors. First, we establish a proper environment for the Intel compiler (in this case, Intel® Composer XE 2013 Service Pack 1) and the Intel® MPI Library (in this case, Intel® MPI Library 5.0), then we copy the MPI libraries to the two coprocessors. Next the executables for host and coprocessor are built and the coprocessor executable copied into the directory /tmp on the coprocessors.

% source /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64

% source /opt/intel/impi/5.0.0.028/bin64/mpivars.sh

% scp /opt/intel/impi/5.0.0.028/mic/bin/* mic0:/bin/

% scp /opt/intel/impi/5.0.0.028/mic/lib/* mic0:/lib64/

% scp /opt/intel/impi/5.0.0.028/mic/bin/* mic1:/bin/

% scp /opt/intel/impi/5.0.0.028/mic/lib/* mic1:/lib64/

% mpiicc mpi_one_sided.c -o mpi_one_sided.host

% mpiicc -mmic mpi_one_sided.c -o mpi_one_sided.mic

% sudo scp ./mpi_one_sided.mic mic0:/tmp/mpi_one_sided.mic

% sudo scp ./mpi_one_sided.mic mic1:/tmp/mpi_one_sided.mic

 

Then we enable MPI communication between host and coprocessors, and activate coprocessor-coprocessor communication:

 

% export I_MPI_MIC=enable

% sudo /sbin/sysctl -w net.ipv4.ip_forward=1

At this point we are ready to launch the application in symmetric mode with one rank on the Intel Xeon host, two ranks on the first coprocessor mic0 and three ranks on the second coprocessor mic1:

% mpirun -host localhost -n 1 ./mpi_one_sided.host : -host mic0 -n 2 -wdir /tmp ./mpi_one_sided.mic : -host mic1 -n 3 -wdir /tmp ./mpi_one_sided.mic

The following figures show the data movement as the program runs. After the window of shared memory is created, each rank initializes its portion of the shared memory. The array is filled with integer numbers that combine the rank number and the element number. The tens digit is set to the originated rank number; the ones digit is set to the element number in the array. Thus, rank 0 places the values of 00, 01, 02, 03 for the first, second, third and fourth entries in its shared array; likewise, rank 3  places the values of 30, 31, 32, 33 for the first, second, third and fourth entries in its shared array, etc. Figure 2 shows the values of the shared memory after each of the 6 ranks fills its shared array sharedbuffer:

Figure 2 Each rank initializes its portion in the shared memory.

Since each process can now access the shared memory of the others, each rank calls MPI_Get to copy data from the preceding rank's shared memory to its local array. The local array of each rank now contains the values of the preceding rank. Thus, since the preceding rank of rank 0 is rank 5, rank 0 gets the values of 50, 51, 52, 53 for the first, second, third and fourth entries in its local array; similarly, since the preceding rank of rank 3 is rank 2, rank 3 gets the values of 20, 21, 22, 23 for the first, second, third and fourth entries in its local array, etc. Figure 3 shows the values of the local array localbuffer after each rank gets the values from shared memory sharedbuffer:

Figure 3 Each rank gets the data of the preceding rank from the shared memory.

Next, each rank calls MPI_Put to copy its local array to the succeeding rank's shared memory. That is, the shared memory of the succeeding rank now contains the values of the local array of its preceding rank. Thus, since the succeeding rank of rank 0 is rank 1, rank 0 copies the values of 50, 51, 52, 53 to the first, second, third and fourth entries in rank 1's shared array; similarly, since the succeeding rank of rank 3 is rank 4, rank 3 copies the values of 20, 21, 22, 23 to the first, second, third and fourth entries in rank 4's shared array, etc. Figure 4 shows the values of the shared memory sharedbuffer after each rank copies the values from its local array to the shared memory of its successor:

Figure 4 Each rank writes data (previously read) to the shared memory of the succeeding rank.

Finally, each rank reads its shared array. Thus, rank 0 reads the values of 40, 41, 42, 43 for the first, second, third and fourth entries in its shared array; rank 3  reads the values of 10, 11, 12, 13 for the first, second, third and fourth entries in its shared array, etc. Figure 5 shows the values of the shared memory after each of the 6 ranks reads its shared array sharedbuffer:

Figure 5 Data of each rank in the shared memory changes. 

The output generated by the application is shown below:

Rank 0 running on knightscorner1

Rank 3 running on knightscorner1-mic1

Rank 5 running on knightscorner1-mic1

Rank 1 running on knightscorner1-mic0

Rank 4 running on knightscorner1-mic1

Rank 2 running on knightscorner1-mic0

Rank 0 sets data in the shared memory: 00 01 02 03

Rank 2 sets data in the shared memory: 20 21 22 23

Rank 4 sets data in the shared memory: 40 41 42 43

Rank 1 sets data in the shared memory: 10 11 12 13

Rank 3 sets data in the shared memory: 30 31 32 33

Rank 5 sets data in the shared memory: 50 51 52 53

Rank 4 gets data from the shared memory: 30 31 32 33

Rank 0 gets data from the shared memory: 50 51 52 53

Rank 5 gets data from the shared memory: 40 41 42 43

Rank 3 gets data from the shared memory: 20 21 22 23

Rank 1 gets data from the shared memory: 00 01 02 03

Rank 2 gets data from the shared memory: 10 11 12 13

Rank 0 has new data in the shared memory: 40 41 42 43

Rank 4 has new data in the shared memory: 20 21 22 23

Rank 5 has new data in the shared memory: 30 31 32 33

Rank 1 has new data in the shared memory: 50 51 52 53

Rank 2 has new data in the shared memory: 00 01 02 03

Rank 3 has new data in the shared memory: 10 11 12 13

 

Besides MPI one-sided communication, there are other programming approaches that also support one-sided communication. For example, SHMEM (Symmetric Hierarchical Memory access) is another approach where one process can have access to a global shared memory. SHMEM also takes advantage of hardware RDMA (Remote Direct Memory Access) to allow a local Processing Element (PE) to access a remote PE's memory without interrupting the remote PE (the remote PE's CPU is not involved). SHMEM resembles MPI one-sided communication: each host has the SHMEM library installed; each host runs a copy of the SHMEM application, called a PE; each PE accesses shared memory through APIs such as shmem_get(), shmem_put(), shmalloc() and shmem_barrier() to transfer data, allocate memory and synchronize the processes. Like MPI, SHMEM makes it easy to take advantage of direct memory access; both MPI and SHMEM offer point-to-point communication and collective communication. Unlike MPI, SHMEM is not yet standardized, although there is an effort by many leading companies in the HPC area to support the OpenSHMEM standard.
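
For flavor, a minimal OpenSHMEM-style sketch is shown below (my own illustration, using API names from the OpenSHMEM specification; it has not been tested against any particular SHMEM implementation):

#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();                                   // older SHMEM libraries use start_pes(0)
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int *buf = (int *)shmem_malloc(sizeof(int));    // symmetric allocation, visible to all PEs
    *buf = me;
    shmem_barrier_all();

    int neighbor = (me + 1) % npes;
    int remote;
    shmem_int_get(&remote, buf, 1, neighbor);       // one-sided read from the neighbor's symmetric memory
    printf("PE %d read %d from PE %d\n", me, remote, neighbor);

    shmem_barrier_all();
    shmem_free(buf);
    shmem_finalize();
    return 0;
}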

CONCLUSION:

In summary, one-sided communication in MPI enables users to take advantage of RMA to access the data of remote processes, benefiting applications in which synchronization can be relaxed and data movement reduced. An example using one-sided communication (MPI_Get/MPI_Put) was shown and run in symmetric mode between an Intel Xeon host and Intel Xeon Phi coprocessors using the Intel® MPI Library 5.0. SHMEM programming resembles MPI one-sided communication, but SHMEM is not yet standardized; there are ongoing efforts to standardize it (OpenSHMEM) in order to support multiple hardware vendors.

Note: By installing or copying all or any part of the software components in this page, you agree to the terms of the Intel Sample Source Code License Agreement.

 

 

 

 

 

 

 

 

 

 

 

 
