
Intel® Xeon Phi™ Coprocessor Developer Training Coming to a City Near You in 2015


Intel is offering an updated and expanded series of software developer trainings in parallel programming using the Intel® Xeon Phi™ coprocessor.

This series of offerings provides software developers the foundation needed for modernizing their codes to extract more of the parallel compute performance potential found in both Intel® Xeon® processors and Intel Xeon Phi coprocessors.

The courses contain materials and practical exercises appropriate for developers beginning their journey into parallel programming, and also provide cutting-edge detail for HPC experts on best practices for Intel's multicore and many-core architectures and software development tools. The offerings include a one-day introductory lecture (free), a one-day hands-on laboratory (free), and a four-day immersive workshop.

The training targets software engineers and architects, and covers the following topics:

  • Intel Xeon Phi architecture: purpose, organization, pre-requisites for good performance, future technology
  • Programming models: native, offload, heterogeneous clustering
  • Parallel frameworks: automatic vectorization, OpenMP, MPI
  • Optimization methods: general, scalar math, vectorization, multithreading, memory access, communication and special topics

Click here for more details and training venues.

 


Interview with James Reinders: future of Intel MIC architecture, parallel programming, education


During the conversation between James Reinders, the Director and Chief Evangelist at Intel Corporation, and Vadim Karpusenko, Principal HPC Research Engineer at Colfax International, recorded on January 30, 2015 at Colfax International in Sunnyvale, CA, we discussed the future of parallel programming and Intel MIC architecture products: Intel Xeon Phi coprocessors, Knights Landing (KNL), and the forthcoming third generation, Knights Hill (KNH). We also talked about how students can learn parallel programming and optimization of high-performance applications.

Check out this YouTube video

 

Webinar for ANSYS and Intel - get your HPC on!


Save Money & Maximize Performance with ANSYS Mechanical 16.0 on Intel Platforms

Tuesday, March 17, 2015
9 AM EDT, 1 PM GMT

Duration: 60 minutes

ANSYS and Intel teamed up to optimize ANSYS Mechanical for the latest Intel processor and coprocessor innovations. Based on these efforts, both the Windows and Linux versions of ANSYS Mechanical 16.0 (including the SMP and DMP versions of each) can now utilize one or more Intel Xeon Phi coprocessors. With Intel's special promotion and ANSYS HPC Workgroups, you can take advantage of Intel coprocessor performance within your qualified existing hardware at a significantly discounted price.

We will show you how ANSYS Mechanical 16.0 can benefit from the latest generation of Intel Xeon E5 v3 processors and also how to upgrade to these processors, along with the Xeon Phi coprocessors, for spectacular sparse solver performance.

Register for the March 17 session

Configuring the Apache Web server to use RDRAND in SSL sessions


Starting with the 1.0.2 release of OpenSSL*, RDRAND has been temporarily removed as a random number source. Future releases of OpenSSL will re-incorporate RDRAND, but will employ cryptographic mixing with OpenSSL's own software-based PRNG. While OpenSSL's random numbers will benefit from the quality of RDRAND, they will not match the performance of RDRAND alone.

If you are running a high-volume SSL web server, the speed advantages of RDRAND are probably desirable. An earlier case study on OpenSSL performance, with RDRAND as the sole RNG source, showed that speeding up the SSL handshake can yield up to a 1% increase in the number of connections per second that an SSL concentrator can handle. Internal testing on the Xeon v3 family of processors shows that RDRAND can give an additional boost to AES bulk encryption as well, since random numbers are used to generate IVs.

Fortunately, OpenSSL still provides access to RDRAND as a sole random number source via its engine API: you just have to turn it on. If you are running an Apache* 2.4 web server with mod_ssl, this is very easy to do. The configuration directive, SSLCryptoDevice, tells mod_ssl which engines to initialize inside of OpenSSL. To enable RDRAND as a sole random number source, you would use the following directive:

SSLCryptoDevice rdrand

Another advantage of doing this is that the digital random number generator that feeds RDRAND is autonomous and self-seeding, so you do not have to supply entropy to OpenSSL. This means you can use the 'builtin' entropy method in mod_ssl, which is the simplest and least CPU-intensive method, as the entropy generated by the sources is simply going to be ignored.

SSLRandomSeed startup builtin
SSLRandomSeed connect builtin

Depending on your system architecture, you might even see slightly higher performance from one of the special device files such as /dev/zero.
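Putting the directives together, a minimal mod_ssl configuration fragment might look like the following. This is a sketch, not a complete SSL configuration; it assumes mod_ssl is loaded and the processor supports the RDRAND instruction (Intel Ivy Bridge or later):

```apacheconf
# Use the RDRAND engine as the sole random number source inside OpenSSL.
SSLCryptoDevice rdrand

# The DRNG behind RDRAND is autonomous and self-seeding, so the cheap
# 'builtin' entropy method is sufficient; its output is effectively ignored.
SSLRandomSeed startup builtin
SSLRandomSeed connect builtin
```

After restarting the server (e.g., apachectl restart), mod_ssl initializes the rdrand engine in OpenSSL for its SSL sessions.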

 


 

Check out the Parallel Universe e-publication


The Parallel Universe is a quarterly publication devoted to exploring inroads and innovations in the field of software development, from high performance computing to threading hybrid applications.

Issue #20 - Cover story: From Knights Corner to Knights Landing: Prepare for the Next Generation of Intel® Xeon Phi™ Technology, by James Reinders, Director of Parallel Programming Evangelism, Intel

The Parallel Universe Archive

Sign-up for future issues

Advanced Computer Concepts For The (Not So) Common Chef: Terminology Pt 1


Before we start, I will use the next two blogs to clear up some terminology. If you are familiar with these concepts, I give you permission to jump to the next section. I suggest, though, that software readers still check out the other blog about threads; there is a lot of confusion, even among us software professionals.

We will first look at what a processor, CPU, core and package are. The general media, meaning TV and the like, use these terms pretty loosely. Then we will look at threads, specifically the differences between hardware and software threads. The distinction between these different types of threads is confusing, even to the computer programmer.

THE CORE? CPU? PACKAGE? SILICON? HUH?

Let us look at the left hand side of Figure CPU. Back in the Pentium® days, people referred to the component of a computer that executes a program’s instructions (i.e. the brains of a computer) as either the ‘CPU’ or ‘processor’. There really was not a distinction between the two. The ‘computer chip’ was the silicon upon which an integrated circuit was etched, e.g. our CPU. The ‘package’ was the stylish plastic and metal case that wrapped and protected the silicon, and from which the multitude of pins/connections protruded.

In today’s world, we have processors with multiple CPUs that run multiple threads each, along with multiple chips (silicon) in the same package. Terminology has been updated to reflect this modern world. Look at the right hand side of Figure CPU. What was once a CPU we now call a ‘core’. A processor can contain many cores on the same piece of silicon; a modern laptop now typically contains 2 cores in its processor; a desktop can contain 4 to 6 cores; and a server can contain upwards of 18 cores per processor. The package can now hold not just one silicon integrated circuit but several. It contains the processor silicon, of course. It might also hold flash memory, other specialized processors, and more.

Pentium cores vs Xeon multi-core

                                       1995                                                                    2015

Figure CPU: Processors then and now.

Let us look at Figure SILICON. On the left is the original Pentium circa 1993. On the right is the current generation Intel® Xeon® processor E5-2600 v3 circa 2014. The Pentium processor on the left has one CPU on one silicon chip in a package. The Xeon processor on the right has 18 cores on one silicon chip, each core equivalent to one (very fast and enhanced) old style Pentium CPU. (Can you locate each of the cores?)

Pentium processor circa 1993

Image of Xeon E5-2600 siliconImage of Pentium die relative to Xeon E3

Figure SILICON. Pentium vs Xeon E5-2600 v3+

My point is that in the blogs that follow, when talking about the ‘processor’, I refer to the hunk of silicon that contains all the cores and their support circuitry. By a core, I refer to a single processing unit that does computation (formerly known as a CPU), of which there can be many (and each of which can execute two or more threads simultaneously). And by package, I refer to the flat, rectangular, metal and plastic container that can hold multiple special-purpose processors, memory, and other supporting circuitry, each on separate chips of silicon.
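These distinctions can be probed programmatically. Here is a small sketch, in Python for illustration, that reports what the operating system sees; the /proc/cpuinfo parsing is Linux-specific, and the field names ('physical id', 'core id') are as found on typical Linux kernels:

```python
import os

# The OS reports one "CPU" per hardware thread, not per core or per package.
logical_cpus = os.cpu_count()

# On Linux, /proc/cpuinfo distinguishes packages ("physical id") and
# cores ("core id"); fall back gracefully on other systems.
physical_ids = set()
core_ids = set()
try:
    with open("/proc/cpuinfo") as f:
        pkg = None
        for line in f:
            if line.startswith("physical id"):
                pkg = line.split(":")[1].strip()
                physical_ids.add(pkg)
            elif line.startswith("core id"):
                core_ids.add((pkg, line.split(":")[1].strip()))
except OSError:
    pass

print("packages:", len(physical_ids) or "unknown")
print("cores:", len(core_ids) or "unknown")
print("hardware threads:", logical_cpus)
```

On a hyperthreaded 2-core laptop this would typically report 1 package, 2 cores, and 4 hardware threads.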

Now that we have that settled, in my next installment, we look at something that confuses programmers perhaps more than it does everyone else.

NEXT: OF COURSE, I KNOW WHAT A THREAD IS….DON’T I?

+Just for grins, to the right of the Xeon processor in Figure SILICON, I scaled the Pentium (800nm) to show how large it would be using today’s manufacturing technology (22nm). This is a very rough representation as the size varies depending upon whether you go by # of transistors (1.4 billion / 7.5 million = x187) or feature size ((800nm)^2 / (22nm)^2 = x1322). What is shown is the more conservative x187. Yes, I know that I am not factoring in the actual die size.

 

Intel Omni-Path Webinar - March 31 2015


The upcoming next-generation Intel Omni-Path Architecture addresses lessons learned, good and bad, from Intel True Scale Architecture and standard InfiniBand*. In an effort to avoid observed pitfalls, Intel approached the architecture of an HPC fabric from a different perspective. The architectures for current products and Intel Omni-Path systems were explicitly developed from the ground up for MPI HPC clusters to bring out the best possible performance.

This webinar is the first in a series meant to familiarize customers and partners alike with innovations created by Intel to better support end-user MPI applications. Over time, we will delve into each aspect of the Intel Omni-Path design. We will begin with innovations available in both generations of HPC fabric, starting at the host, and the steps taken to ensure a worry-free transition.

When: Tuesday, March 31, 2015
Time: 12:00 noon to 1:00 p.m. EDT

Please join us to explore all of the advances made in host architecture, then continue to join as we roll out newer host and fabric innovations meant to enable both large and small scale fabric performance.

Register Now

Videos - Parallel Programming and Optimization with Intel Xeon Phi Coprocessors


Here is a set of introductory videos from Colfax International on Parallel Programming and Optimization with Intel(R) Xeon Phi(TM) Coprocessors.

In this video episode we will introduce Intel Xeon Phi coprocessors based on the Intel Many Integrated Core, or MIC, architecture and will cover some of the specifics of hardware implementation.

In this video we will discuss the general properties of the Intel MIC architecture in detail, and then focus on vector instruction support.

In this episode we will introduce vector instructions and discuss instructions that are supported on Intel Xeon Phi coprocessors.

Here is a link to more videos on Parallel Programming with Intel Xeon Phi Coprocessor from Colfax International.


Videos - Parallel Programming with Intel Xeon Phi Coprocessors


Here is a list of recently published videos from Colfax International on Intel(R) Xeon Phi(TM) Coprocessors.

In this video we will discuss software tools needed and recommended for developing applications for Intel Xeon Phi coprocessors. We will begin with software that is necessary to boot coprocessors and to run pre-compiled executables on them.

In this video episode we will discuss the types of applications that perform well on Intel Xeon Phi coprocessors. I hope that this discussion will help you to answer the question “Will my application benefit from the MIC architecture?”

In this video we will discuss the next generation MIC architecture, based on a chip codenamed Knights Landing, or KNL for short.

Here is a link to other videos on Parallel Programming and Optimization with Intel Xeon Phi Coprocessors from Colfax International.

 

ANSYS® and Intel Team Up to Shrink Simulation Timelines


By Mike Pearce, Ph.D., IDZ Server Community Manager

ANSYS, a world leader in simulation software, announced on March 12 that its premier engineering simulation software product, ANSYS* Mechanical APDL 16.0 (ANSYS Mechanical 16.0), will ship with built-in, optimized support for Intel® Xeon Phi™ coprocessors.

Structural engineers and designers using ANSYS Mechanical 16.0 will be able to tap into the power and performance of highly-parallelized, multicore processing to speed engineering workloads, and at an affordable price.

ANSYS Mechanical 16.0 is a leading commercial Finite Element Analysis (FEA) software solution that enables engineers to test and validate a broad range of mechanical and structural design options using simulations. Precise modeling and simulations in structural engineering can greatly improve design productivity and shrink development costs and timelines by eliminating the need to build costly prototypes or perform physical testing in initial stages of design. The quality and flexibility of ANSYS Mechanical 16.0 simulations make it possible to predict end-product behavior and reliability by using iterative testing across a range of real-world scenarios.

When it comes to engineering simulations, speed, accuracy, power, and performance are what matter most. Engineers using ANSYS Mechanical 16.0 with Intel® Xeon® processors and coprocessors can now run much larger and more complex simulations faster, which enables engineers to test for detailed design variables without pushing product development timelines.

ANSYS Mechanical 16.0, optimized with the Intel Xeon Phi coprocessor, enables designers and engineers to get more simulation power out of their existing hardware investments. Using the multi-core Intel® Xeon® processor E5-2600 v3 product family and the Intel Xeon Phi coprocessor, based on Intel® Many Integrated Core Architecture (Intel® MIC Architecture), engineers can gain up to a 2.2X performance improvement for ANSYS Mechanical 16.0 in both Windows* and Linux* environments.[1], [2], [3]

Even higher performance gains are possible by upgrading from previous generation Intel hardware and adding a coprocessor. Gains of up to 3.1X were achieved for ANSYS Mechanical 16.0 DMP when upgrading from the Intel® Xeon® processor E5-2600 v2 family to the Intel® Xeon® processor E5-2600 v3 family and adding one or more Intel Xeon Phi coprocessors.3

These are game-changing performance gains, with the ability to transform structural design and other High Performance Computing (HPC) workflows. The added power allows engineers to run more and larger simulations in less time to improve innovation, quality, safety, reliability, and time to market.

Tweet this: "#ANSYS Mechanical 16.0 co-engineered with #Intel #XeonPhi coprocessors: More than 2X performance gain intel.ly/1G5s7Vg"

Performance gains of this magnitude can only be realized through collaboration

Intel and ANSYS have worked together for many years, testing, optimizing and tuning code and hardware to deliver the highest possible performance for optimized bandwidth simulation workloads.

With ANSYS Mechanical 16.0, ANSYS has further parallelized and modernized its code and physics algorithms to take advantage of the latest generation of multi-core Intel Xeon processors and Intel Xeon Phi coprocessors. The result is a simulation platform that delivers significantly improved performance and efficiency for structural mechanics modeling, without complicating the IT environment.

Intel® Architecture Improves HPC Performance

The Intel® Architecture delivers improved HPC performance cost effectively, without requiring extensive code development, because Intel Xeon processors and Intel Xeon Phi coprocessors are designed to share computational workloads, optimizing performance.

The coprocessor also delivers substantially higher performance for highly parallel software, such as that found in ANSYS simulation applications. When a coprocessor is present, the optimized ANSYS Mechanical 16.0 software (which incorporates math processing routines via the Intel® Math Kernel Library) automatically offloads computationally intensive workloads from the CPU cores to the coprocessor. In turn, the Xeon Phi coprocessor performs these calculations quickly and efficiently and returns the results to the CPU. This reduces overall calculation time and accelerates the total simulation timeline.

Intel Xeon processors and coprocessors also share the same programming model, from languages, to tools and applications. Now engineers and designers can focus on “reuse” rather than “recode.”

Developers interested in programming Intel Xeon Phi cores can use standard C, C++, and FORTRAN source code. The same program source code written for Intel MIC Products can be compiled and run on a standard Intel Xeon processor. Familiar programming models remove training barriers, allowing developers to focus on the problems instead of software engineering.

Intel and ANSYS: A Tradition of Co-Engineering

ANSYS and Intel will continue to work together to deliver HPC simulation solutions that will help provide the highest performance and best value solutions.

You can make sure you get the full value of the latest ANSYS software advances by refreshing your server platform with the latest Intel Xeon processors and Intel Xeon Phi coprocessors.

As an added incentive to ANSYS customers, Intel is offering for a limited time (through June 27, 2015), special pricing for Intel® Xeon Phi™ coprocessors 3120A and 7120A. Quantities are limited. Click here to find out more.

For more information about combining ANSYS Mechanical with Intel Xeon Phi coprocessors, download this solution brief.

Tweet this: ".@ANSYS_Inc and #Intel collaborate to deliver simulation performance intel.ly/1G5s7Vg #HPC #ANSYS"


[1] 2-socket Intel® Xeon® processor E5-2697 v2 vs. 2 socket Intel® Xeon® processor E5-2697 v3 + 1x Intel® Xeon Phi™ 7120 coprocessor running V15ln-2 DMP mode and 2 MPI processes with ANSYS Mechanical 16.0

[2] Source: Intel and ANSYS performance tests, October 2014. Previous-generation server configuration: 2 x Intel® Xeon® processor E5-2697 v2 (12 cores, 2.7 GHz, 8GT/s, 130W), 64 GB DDR3-1600 MHz memory (8 x 8GB), 1 x 600 GB SAS HDD storage, Red Hat* Enterprise Linux* Server release 6.5, ANSYS* Mechanical 16.0 DMP. New server configuration: 2 x Intel® Xeon® processor E5-2697 v3 (14 cores, 2.6 GHz, 9.6GT/s, 145W), 64 GB DDR4-2133 MHz memory (8 x 8GB), Intel® SSD DC S3500 Series 800 GB, Red Hat Enterprise Linux Server release 6.5, ANSYS Mechanical 16.0 DMP. Performance measured on each system with and without one Intel® Xeon Phi™ coprocessor.

[3] Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Advanced Computer Concepts for The (Not So) Common Chef: First Some Terminology Part 2


OF COURSE, I KNOW WHAT A THREAD IS….DON’T I?

Now that we know what a core is, let’s dive into another source of confusion.

This section gets a little deeper into techno babble than I wanted for this series of blogs. If you are so inclined, my gourmet readers, you can either skip or read on. I believe the rest of the blogs can be understood with or without this little aside. But for those of you who are already familiar with threading, it may give you more insight than would be the case otherwise.

Before getting into the guts of the analogy, let’s discuss what a ‘thread’ is. You see, the normal old thread that we computer scientists know and love, is not the same as a thread when talking about the execution of a program in a core. For example, hyperthreading does not really refer to the ability of one core to execute the two threads our program spawned off (say, MyProgram.ThreadA and MyProgram.ThreadB). (I’ll get to hyperthreading a little later in this series.) Don’t get me wrong; they are related. When a computer architect says thread, he’s referring to a “thread of execution” and not a program. At the hardware level, there are no programs, only a sequence of instructions the hardware executes, thus the “thread of execution.”

(I’ll be using “HW-thread” as a short hand for “hardware-thread” and “SW-thread” for “software thread.”)

Hopefully, some of my less-than-professional illustrations will help.

Figure SWTHREAD shows two threads (ThreadA and ThreadB) spawned by ProgramOne, and a single thread that corresponds to a second program, ProgramTwo.

Image of software threads

Figure SWTHREAD. How programmers think two programs execute on a single core machine with hyperthreading. (IP is the Instruction Pointer, a register that tells the computer which instruction to execute next.)

This is a pretty familiar situation. You have a program running and an Instruction Pointer (IP) which points to the instruction that will be executed next. Looking more closely, you’ll see that some strange things are going on with the IPs that some might think are typos. For example, there are three IPs even though our hardware can execute only two threads at a time. In addition, the IPs are offset, whereas we usually think of them as being lined up. That’s because each SW-thread has its own IP, associated with the context it is executing in. In this case, a “context” is an operating system (OS) context, including registers and other things associated with passivating and activating a SW-thread. (‘Passivating’ is to put to sleep. ‘Activating’ is to wake up.) This means that the IPs are associated with each SW-thread’s context and not the actual hardware. There can be, and are, as many IPs as there are SW-threads executing on the processor.

Now let’s look at Figure HWTHREAD1. In the top left corner is a miniature version of Figure SWTHREAD, allowing us to easily compare how a hardware thread differs from a software thread. How the hardware executes those SW-threads is shown in Figure HWTHREAD1. Notice that slices of the SW-threads are distributed across the two HW-threads by the operating system on an opportunistic basis. This points out the difference between a SW-thread and a HW-thread. SW-threads are the parts of the program that the programmer intends to execute in parallel, which is why the number of SW-threads can be anything from one to hundreds. In contrast, a HW-thread is the execution of a sequential series of instructions by a processing unit. The number of HW-threads per core is fixed, though some modern processors have a switch that can change this value slightly. (Hyperthreading is a mode whereby one core can execute two threads instead of just one, but we’ll talk about that later.) These instructions can be, and often are, from different SW-thread contexts, including that of the kernel. The kernel is what does the actual switching between various SW-threads by storing one SW-thread’s context and loading another.

image hardware-threading

Figure HWTHREAD1. Mapping software threads to hardware threads

Ah, I know what you are thinking: What does this have to do with our (not so) common chef and his kitchen? Well, it only does if our chef is familiar with the concept of SW-threading from his Intel® Edison IoT (Internet of Things) hobby or his small, on-the-side Android* cooking app business. This whole thing about SW-threads versus HW-threads only has the potential to confuse computer programmers.
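For the programmers in the audience, the asymmetry is easy to demonstrate in code: the number of SW-threads is whatever the program chooses, while the number of HW-threads is fixed by the machine. A quick sketch, in Python for illustration:

```python
import os
import threading

# The number of HW-threads is a property of the machine...
hw_threads = os.cpu_count()

# ...while the number of SW-threads is entirely up to the programmer.
# Here we spawn 32 of them; the OS kernel time-slices them onto
# however many HW-threads actually exist.
results = []
lock = threading.Lock()

def worker(i):
    with lock:  # serialize access to the shared list
        results.append(i)

sw_threads = [threading.Thread(target=worker, args=(i,)) for i in range(32)]
for t in sw_threads:
    t.start()
for t in sw_threads:
    t.join()

print(f"{len(sw_threads)} software threads ran on {hw_threads} hardware threads")
```

All 32 SW-threads complete even on a machine with only 2 or 4 HW-threads, because the kernel keeps swapping SW-thread contexts onto the fixed set of HW-threads, exactly as Figure HWTHREAD1 depicts.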

ADDENDUM: ANOTHER WAY OF MAPPING A HARDWARE THREAD TO A SOFTWARE THREAD

Here’s a little addendum for those who want to really have their mind blown. Figure SWTHREAD shows SW-threads from a programmer’s perspective. Figure HWTHREAD1 shows how the program’s SW-threads map onto the processor’s HW-threads.

Figure HWTHREAD2 shows how one of the processor’s HW-threads maps to the software threads (versus SW-threads mapping to HW-threads). Though the difference seems simply a matter of semantics, you can see that in reality the difference is dramatic. The HW-thread is the wavy blue line that jumps back and forth between the different processes. Don’t worry, I’m not going to explain it. If the explanation really interests you, just leave a comment and I’ll dazzle and delight.

image mapping threads

Figure HWTHREAD2. Tracing a hardware thread through software threads (plus the operating system)

NEXT: Back to the kitchen analogy and how a modern processor works!

FOOTNOTE:

+The operating system (OS) runs the jobs that must be done so programs (e.g., spreadsheets) can run. Think of those jobs as managers and support staff, people who don’t sell or make stuff, but are nevertheless necessary for the business to run.

Similarly, the kernel is the very core of the operating system that performs the key functions that must be there for anything to work. For example, you can probably do without many of the managers, but not the guys who actually keep the manufacturing equipment working (but are not involved with creating the product itself).

 

 



 

Working with Intel® Code Builder for OpenCL™ API on Systems Equipped with Intel® Xeon Phi™ x100 Coprocessor


Due to the recent restructuring in OpenCL™ Technology (see https://software.intel.com/en-us/intel-opencl ), the Intel® Code Builder for OpenCL™ replaces the Intel® SDK for OpenCL™ Applications. For the Intel® Xeon Phi™ x100 coprocessor, that means we need a specific OpenCL™ Runtime and the Code Builder for the coprocessor. So I decided to install the new OpenCL development environment and test an application on the Intel® Xeon Phi™ x100 coprocessor (codename Knights Corner). I’d like to share this experience with you in case you want to build and run an OpenCL application on your system equipped with Intel® Xeon Phi™ x100 cards.

My host system runs 64-bit RHEL 6.5 (kernel 2.6.32-431). It has two Intel® Xeon Phi™ coprocessors, and has Intel® C++ Composer XE 2015.0.2 (Composer XE 2015.2.164) installed.

According to the OpenCL™ Runtime release notes (see https://software.intel.com/en-us/articles/opencl-runtime-release-notes/ ), in order to work with the coprocessor I need to install the MPSS 3.3 (not MPSS 3.4 and later).

First I installed the latest MPSS 3.3 version followed by the OpenCL runtime 14.2 (CPU + Xeon Phi) and then the Code Builder:

  1. Download the MPSS 3.3.4 from https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss-archive . Install it according to the provided readme file. Finally bring the MPSS service up:

    # service mpss start

  2. Download the OpenCL™ Runtime 14.2 (for Intel® CPU and Intel® Xeon Phi™ coprocessors for Linux* 64-bit) available at https://software.intel.com/en-us/articles/opencl-drivers . The OpenCL™ Runtime allows me to run an OpenCL™ application.

    # tar zxvf opencl_runtime_14.2_x64_4.5.0.8.tgz

    This creates a sub-directory called pset_opencl_runtime_14.1_x64_4.5.0.8

    # cd pset_opencl_runtime_14.1_x64_4.5.0.8
    # ./install.sh

  3. Download the Intel® Code Builder for OpenCL™ API 2014 R3 for Linux* (with support for Intel® Xeon Phi™ coprocessors) available at https://software.intel.com/en-us/articles/intel-code-builder-for-opencl-api . The Code Builder allows me to build OpenCL binaries.

    # tar zxvf intel_code_builder_for_opencl_2014_4.6.0.178_x64.tgz

    This creates a subdirectory called intel_code_builder_for_opencl_2014_4.6.0.178_x64

    # cd intel_code_builder_for_opencl_2014_4.6.0.178_x64
    # ./install.sh

For testing purposes, I downloaded OpenCL source code from the article https://software.intel.com/en-us/articles/using-the-intel-sdk-for-opencl-applications-xe-2013 . The sample package is named opencl-sample.zip. I unzipped it in Windows* and transferred all the files to my Linux system. They included ocl_sample.cpp, kernel ocl_sample.cl and the input image called input.pgm.

# source <installed directory>/bin/compilervars.sh intel64
# icc ocl_sample.cpp -lOpenCL -o ocl_sample.out

# ./ocl_sample.out

Platform: Intel(R) OpenCL
Number of accelerators found: 2

DEVICE #0:
NAME:Intel(R) Many Integrated Core Acceleration Card
#COMPUTE UNITS:240

DEVICE #1:
NAME:Intel(R) Many Integrated Core Acceleration Card
#COMPUTE UNITS:240
OpenCL Initialization Completed
Completed reading Input Image: 'input.pgm'
Transferring Data from Host to Device
Executing Kernel on selected devices
Transferring data from Device to Host
Completed writing Output Image: 'output.pgm'
Completed execution! Cleaning Up.

This generates an output image called output.pgm. The image is the result of applying a Gaussian blur to the input image.

 

For completeness, I retested the above application with all MPSS 3.3 versions, namely MPSS 3.3.1, MPSS 3.3.2, and MPSS 3.3.3. With the above OpenCL™ Runtime and Code Builder for OpenCL™ API, I was able to run the test successfully with MPSS 3.3.4, MPSS 3.3.3, MPSS 3.3.2, and MPSS 3.3.1. Note that whenever you uninstall or install another version of MPSS 3.3, you need to repair the OpenCL™ Runtime (or simply uninstall and re-install it) before rerunning the application. To repair the OpenCL™ Runtime, go to its installation directory, run ./install.sh, and choose option 2, “Repair the installation.”

 

Accelerating Business Intelligence and Insights with Software Optimized for the Intel® Xeon® Processor E7 v3 Family


By Mike Pearce, Ph.D. Intel Developer Evangelist for the IDZ Server Community.

On May 5, 2015, Intel Corporation announced the release of its highly anticipated Intel® Xeon® processor E7 v3 family. The new processor family is designed to accelerate business insight and optimize business operations—in healthcare, financial, enterprise data center, and telecommunications environments—through real-time analytics. The new Xeon processor is a game-changer for organizations seeking better decision-making, improved operational efficiency, and a competitive edge.

The Intel Xeon processor E7 v3 family’s performance, memory capacity, and advanced reliability now make mainstream adoption of real-time analytics possible. The rise of the digital service economy and the recognized potential of "big data" open new opportunities for organizations to process, analyze, and extract real-time insights. The Intel Xeon processor E7 v3 family tames the large volumes of data accumulated by cloud-based services, social media networks, and intelligent sensors, and, aided by optimized software solutions, enables data analytics insights.

A key enhancement to the new processor family is its increased memory capacity (the industry’s largest per socket1), enabling entire datasets to be analyzed directly in high-performance, low-latency memory rather than traditional disk-based storage. For software solutions running on and/or optimized for the new Xeon processor family, this means businesses can now obtain real-time analytics to accelerate decision-making—such as analyzing and reacting to complex global sales data in minutes, not hours. Retailers can personalize a customer’s shopping experience based on real-time activity, so they can capitalize on opportunities to up-sell and cross-sell. Healthcare organizations can instantly monitor clinical data from electronic health records and other medical systems to improve treatment plans and patient outcomes.

By automatically analyzing very large amounts of data streaming in from various sources (e.g., utility monitors, global weather readings, and transportation systems data, among others), organizations can deliver real-time, business-critical services to optimize operations and unleash new business opportunities. With the latest Xeon processors, businesses can expect improved performance from their applications, and realize greater ROI from their software investments. 

Real Time Analytics: Intelligence Begins with Intel

Today, organizations like IBM, SAS, and Software AG are placing increased emphasis on business-intelligence (BI) strategies. The ability to extract insights from data is something customers expect from their software to maintain a competitive edge. Below are just a few examples of how these firms are able to use the new Intel Xeon processor E7 v3 family to meet and exceed customer expectations.

Intel and IBM have collaborated closely on a hardware/software big data analytics combination that can accommodate any size workload. IBM DB2* with BLU Acceleration is a next-generation database technology and a game-changer for in-memory computing. When run on servers with Intel’s latest processors, IBM DB2 with BLU Acceleration optimizes CPU cache and system memory to deliver breakthrough performance for speed-of-thought analytics. Notably, the same workload can be processed 246 times faster3 running on the latest processor than the previous version of IBM DB2 10.1 running on the Intel Xeon processor E7-4870.

By running IBM DB2 with BLU Acceleration on servers powered by the new generation of Intel processors, users can quickly and easily transform a torrent of data into valuable, contextualized business insights. Complex queries that once took hours or days to yield insights can now be analyzed as fast as the data is gathered.  See how to capture and capitalize on business intelligence with Intel and IBM.

Tweet this: Want faster access to data analytics? It's #NowPossible with @IBMDataWH and #XeonE7 http://intel.ly/1DWH5qd #IntelDC @TryDB2

From a performance perspective, Apama* streaming analytics has proven to be equally impressive. Apama (a division of Software AG) is a complex event processing engine that looks at streams of incoming data, then filters, analyzes, and takes automated action on that fast-moving big data. Benchmarking tests have shown huge performance gains with the newest Intel Xeon processors. Test results show 59 percent higher throughput with Apama running on a server powered by the Intel Xeon processor E7 v3 family compared to the previous-generation processor.4

Drawing on this level of processing power, the Apama platform can tap the value hidden in streaming data to uncover critical events and trends in real time. Users can take real-time action on customer behaviors, instantly identify unusual behavior or possible fraud, and rapidly detect faulty market trades, among other real-world applications. For more information, watch the video: Driving Big Data Insight: High-Speed, Streaming Analytics from Software AG. This infographic shows Apama performance gains achieved when running its software on the newest Intel Xeon processors.

Tweet this: Split-second decision making #NowPossible with @SoftwareAG and Intel #XeonE7 http://intel.ly/1IobLqR #IntelDC

SAS applications provide a unified and scalable platform for predictive modeling, data mining, text analytics, forecasting, and other advanced analytics and business intelligence solutions. Running SAS applications on the latest Xeon processors provides an advanced platform that can help increase performance and headroom, while dramatically reducing infrastructure cost and complexity. It also helps make analytics more approachable for end customers. This video illustrates how the combination of SAS and Intel® technologies delivers the performance and scale to enable self-service tools for analytics, with optimized support for new, transformative applications. Further, by combining SAS* Analytics 9.4 with the Intel Xeon processor E7 v3 family and the Intel® Solid-State Drive Data Center Family for PCIe*, customers can experience throughput gains of up to 72 percent.5

Tweet this: Up to 72 percent higher throughput #NowPossible with @SASsoftware and Intel #XeonE7 http://intel.ly/1Io8lEA #IntelDC

The new Intel Xeon processor E7 v3 family’s ability to drive new levels of application performance also extends to healthcare. To accelerate Epic* EMR’s data-driven healthcare workloads and deliver reliable, affordable performance and scalability for other healthcare applications, the company needed a very robust, high-throughput foundation for data-intensive computing. Epic’s engineers benchmark-tested a new generation of key technologies, including a high-performance data platform from InterSystems*, new virtualization tools from VMware*, and the Intel Xeon processor E7 v3 family. The result was an increase in database scalability of 60 percent,6,7 a level of performance that can keep pace with the rising data access demands in the healthcare enterprise while creating a more reliable, cost-effective, and agile data center. With this kind of performance improvement, healthcare organizations can deliver increasingly sophisticated analytics and turn clinical data into actionable insight to improve treatment plans and, ultimately, patient outcomes.

Tweet this: Clinical efficiency is #NowPossible thanks to Epic, @InterSystems, @VMwareHIT & Intel #XeonE7 http://intel.ly/1DWHIQF #IntelDC

These are only a handful of the optimized software solutions that, when powered by the latest generation of Intel processors, are enabling tremendous business benefits and competitive advantage. With highly improved performance, memory capacity, and scalability, the Intel Xeon E7 v3 processor family delivers more sockets, heightened security, increased data center efficiency, and the critical reliability to handle any workload across a range of industries, so that your data center can bring your business’s best ideas to life. To learn more, visit our software solutions page and take a look at our Enabled Applications Marketing Guide.

Footnotes:

1Intel Xeon processor E7 v3 family provides the largest memory footprint of 1.5 TB per socket compared to up to 1TB per socket delivered by alternative architectures, based on published specs.

2 Up to 6x business processing application performance improvement claim based on SAP* OLTP internal in-memory workload measuring transactions per minute (tpm) on SuSE* Linux* Enterprise Server 11 SP3. Configurations: 1) Baseline 1.0: 4S Intel® Xeon® processor E7-4890 v2, 512 GB memory, SAP HANA* 1 SPS08. 2) Up to 6x more tpm: 4S Intel® Xeon® processor E7-8890 v3, 512 GB memory, SAP HANA* 1 SPS09, which includes 1.8x improvement from general software tuning, 1.5x generational scaling, and an additional boost of 2.2x for enabling Intel TSX.

3Software and workloads used in the performance test may have been optimized for performance only on Intel® microprocessors. Previous generation baseline configuration: SuSE Linux Enterprise Server 11 SP3 x86-64, IBM DB2* 10.1 + 4-socket Intel® Xeon® processor E7-4870 using IBM Gen3 XIV FC SAN solution completing the queries in about 3.58 hours.  ‘New Generation’ new configuration: Red Hat* Enterprise LINUX 6.5, IBM DB2 10.5 with BLU Acceleration + 4-socket Intel® Xeon® processor E7-8890 v3 using tables in-memory (1 TB total) completing the same queries in about 52.3 seconds.  For more complete information visit http://www.intel.com/performance/datacenter

4 One server was powered by a four-socket Intel® Xeon® processor E7-8890 v3 and another server with a four-socket Intel Xeon processor E7-4890 v2. Each server was configured with 512 GB DDR4 DRAM, Red Hat Enterprise Linux 6.5*, and Apama 5.2*. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

5 Up to 1.72x generational claim based on SAS* Mixed Analytics workload measuring sessions per hour using SAS* Business Analytics 9.4 M2 on Red Hat* Enterprise Linux* 7. Configurations: 1) Baseline: 4S Intel® Xeon® processor E7-4890 v2, 512 GB DDR3-1066 memory, 16x 800 GB Intel® Solid-State Drive Data Center S3700, scoring 0.11 sessions/hour. 2) Up to 1.72x more sessions per hour: 4S Intel® Xeon® processor E7-8890 v3, 512 GB DDR4-1600 memory, 4x 2.0 TB Intel® Solid-State Drive Data Center P3700 + 8x 800 GB Intel® Solid-State Drive Data Center S3700, scoring 0.19 sessions/hour.

6Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/performance

7Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

 

Making programmers more productive


One of the themes that ran through this year’s Intel Software Conference, in EMEA, was programmer productivity. The event took place in Seville in April and gave invited resellers and journalists an opportunity to learn more about Intel’s tools for high-performance computing (HPC), parallel programming, cross-platform development, and video processing.

“Scaling is a big deal and power consumption is talked about a lot,” said James Reinders, chief evangelist of Intel Software products, opening the day. “But one challenge that isn’t talked about enough is programmer productivity.”

This means not only making it easier for programmers to get things done, but also preserving their investment in skills and knowledge as the technology evolves. The Intel® Xeon Phi™ product family, for example, offers up to 61 processor cores and is designed to run only parallel programs well. Yet it still uses the same programming tools and models as the Intel® Xeon® products, avoiding the need for programmers to learn a whole new technology. This is also why Intel works hard at standards compliance, together with other companies and standards bodies, to ensure that code is portable between architectures.

Throughout the course of the event, there were opportunities to hear about several tools that can help to increase productivity. Laurent Duhem, senior application engineer, presented the new Intel® Advisor XE 2016 Beta for vectorization. This helps to identify where programs can use single instruction multiple data (SIMD) code, which can run the same calculation across a number of different data items simultaneously. The tool helps to ensure correctness by simulating vectorized loops and checking for any memory conflicts, and enables developers to more quickly identify where the program is spending most of its time (including the number of times a loop is called), so that these hot sections can be optimized. The tool offers hints for improving vectorization, and advice on where vectorization might be inefficient because of non-contiguous memory accesses. Vectorization is a difficult challenge, but this new tool provides guidance at each step to make it as easy as possible. You can download the beta now.

Tackling the multiplatform challenge

In the case of vectorization, productivity challenges might be said to arise from hardware complexity. In consumer software development, productivity is more likely to be challenged by the diverse range of operating systems, form factors and processor architectures that make up the device landscape. Intel® Integrated Native Developer Experience (INDE) is a suite of tools that enables programmers to write fast C++ code that targets multiple operating systems and architectures, making it easier to ship applications more quickly. Alex Weggerle, technical consulting engineer, explained how it integrates with your existing developer environment and introduced its key features. For example, it includes Intel® Hardware Accelerated Execution Manager (Intel HAXM), which uses virtualization technology to run a full-speed Android emulation. That enables developers to more quickly test a wide range of device sizes and types. The Graphics Frame Debugger eliminates the need to push updated OpenGL code to the target device for testing each time a change is made (a process of 5 to 10 minutes), so you can instead take a screenshot and instantly see any code changes applied to that screenshot. Alex also presented the Intel® XDK, a free HTML5 cross-platform development tool, that includes templates to help you get started quickly, and the Apache Cordova* APIs to enable cross-platform access to phone hardware features.

Parallel programming more effectively

Intel® Parallel Studio XE 2016 Composer Edition is available now as an open beta. Heinz Bast, technical consulting engineer, introduced some of the new features in this tool suite, which is designed to support programmers as they develop parallel programs to make optimal use of the hardware. It offers improved vectorization using Intel® Cilk™ Plus and OpenMP* 4.0, with some features from the upcoming OpenMP* 4.1 already implemented. Reinders said that one of the things that excites him about OpenMP is that it helps obtain vectorization while leaving the actual code relatively intact, making it an efficient way to improve performance while keeping the code looking like the original science of the application. The new Intel Parallel Studio XE 2016 tool suite introduces loop blocking, so that data can be chunked for processing to avoid cache misses, and array reductions to avoid the bottleneck of turning off SIMD where there are data dependencies within a loop. The new annotated source listing inserts compiler diagnostics after the corresponding lines, making it easier to see what the compiler has done.
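To make the OpenMP SIMD style concrete, here is a minimal sketch of my own (not code shown at the event): the `#pragma omp simd` reduction asks the compiler to vectorize the loop and combine the per-lane partial sums, while the loop body stays ordinary C.

```c
#include <stddef.h>

/* Hypothetical example: a dot product vectorized with an OpenMP 4.0
   SIMD reduction. The pragma gives each vector lane a private partial
   sum and combines them at the end; the source remains plain C, and
   compilers that do not understand OpenMP simply ignore the pragma. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Built with an OpenMP-SIMD-aware compiler flag (e.g. `gcc -fopenmp-simd`), the loop can compile to packed SIMD instructions, yet the code stays portable scalar C, which is exactly the "leave the science intact" property Reinders described.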

Offloading to GPUs

As more and more sophisticated graphics capabilities have been added to Intel® processors, they have become a key compute resource, with performance exceeding that of the CPU cores by up to 8 times. Heinz explained how work can be offloaded to the Intel® HD Graphics execution units using annotations in the Intel® C/C++ Compiler. He said a number of customers have been asking for this capability, and that Intel had chosen to support standards rather than building its own proprietary language extensions.

Faster compilation

There are some changes that make the compilers more time-efficient too. Intel® Fortran Compiler XE 2016 has been improved with the implementation of submodules. Previously, if you made changes to a module you had to recompile not just that module but also any other modules that call it. In a project of three million lines of code, that could cause a significant delay. With submodules, that’s no longer necessary as long as the interface between the submodule and other modules is unchanged. Intel® C/C++ Compiler 16.0 implements a number of compile time improvements, including disabling intrinsics for prototypes (which were rarely used, Bast said) by default.

Accelerating video processing

The conference’s final session considered a different challenge: the rise of video streaming and download. Starting with the 5th Generation Intel® Core™ processor, Intel has included hardware acceleration for video, with built-in functions that enable accelerated encoding and decoding. The Intel® Media SDK enables application developers to use those capabilities in their software, making it easier to build applications for visual analysis, media transcoding, and graphics in the cloud (including hosted desktops and cloud gaming). Intel® Media Server Studio can be used to generate random Intel® Stress Bitstreams for testing the architecture, and also includes tools for analyzing, encoding and decoding video. As the resolution of video increases (4K is expected to be widespread by the time of the next World Cup in 2018), hardware-accelerated encoding and decoding will become increasingly important to deliver a good user experience.

To find out more about Intel tools for software developers, visit the Intel Developer Zone.

 

Advanced Computer Concepts for the (Not So) Common Chef: The Home Kitchen


Now that that brief aside on terminology is out of the way, let us continue with the kitchen analogy.

For the Intel® Xeon Phi™ family of products, and indeed for any processor, one of its cores is like a kitchen. The components of the processor pipeline (ALU, Instruction Decoder, Memory Cluster, etc) are the different pieces of cooking equipment in the kitchen (range, oven, food processors, etc). The sequence of instructions being executed, i.e. the program, is the recipe. And the actual execution of those instructions is done by the CPU, or, to say it more succinctly, the micro-architecture’s electronics performing each instruction in the executing thread. (I will refer to the micro-architecture’s electronics as the uArchitecture for short.) See Figure KITCHEN.

Figure KITCHEN. The common kitchen.

The pantry is the computer’s memory, its contents (broth, dried noodles, flour, sugar, etc) are the data used by the computer (variables, data structures, etc). A pantry is generally in a separate room, adjacent to but not in the kitchen. Likewise, the computer’s primary memory is generally outside of the processor but connected via a fast bus. By the way, I am assuming the refrigerator and freezer are also in the pantry. For a summary, see Table COMPARE.

THE COMPUTER SYSTEM                     THE KITCHEN

Core / pipeline / cache                 The kitchen
Execution engine / micro-architecture   The chef
The program                             Recipe
An instruction                          Recipe step
Cache                                   Working space (e.g. a counter top)
Memory                                  Pantry
Result of the program                   The meal, a small plate of food

Table COMPARE. The parts of a kitchen compared to the components of a computer.

The counter top and other preparation surfaces of our kitchen are the computer’s cache. When preparing a meal, the chef places frequently used ingredients on the surfaces, such as salt and flour, as well as frequently used utensils and containers (spoons, knives, pots and pans). Similarly, the processor will place frequently used data into its data-cache, and frequently executed instructions into its instruction cache.

The recipe the chef is using is the computer program. The recipe tells the chef step by step, what ingredients he needs (salt, flour, cheese, etc), and what he needs to do with them (mix, bake, sauté, etc). The program tells the computer what data (numbers, images, etc.) it needs and how it should process that data (add them together, sort them, etc). See Table RECIPE.

To belabor the point, let us briefly trace the preparation of a recipe (i.e., the execution of our program) through our kitchen.

Scenario: Our gourmet home chef has found a positively wonderful recipe, and he proceeds to create an exotic and tasty dish for a good friend: pizza.

Kitchen: Chef removes the cook book from the shelf, opens it up, and finds the pizza recipe of his dreams.
Computer: The operating system loads a program and starts executing the PZZAH.EXE application found under the GOURMET directory.

Kitchen: Going to the pantry, the chef gathers up the different ingredients he needs to create the mystic pizza. He cannot gather all the ingredients, as he has limited counter space, so he instead gathers the ingredients he needs immediately.
Computer: The computer, following the instructions in the program PZZAH.EXE, fetches from memory the data FLOUR, SALT, SUGAR, OIL and YEAST. As it executes the program, the uArchitecture places data and instructions that the program uses frequently in the data and instruction caches, respectively.

Kitchen: The chef places flour, salt, sugar, oil and yeast in a bowl, mixes thoroughly, and then kneads the dough.
Computer: The computer executes instructions from the subroutines MIX() and KNEAD(). Since FLOUR and SALT are used often, the uArchitecture places them in cache for fast access. It also uses MIX() often, and so places its code into the instruction cache.

Kitchen: The chef seasons the dough, rolls it out, and places it onto a pizza pie sheet.
Computer: The computer continues to execute instructions, including the ROLLOUT() and PLACE-IN-PIZZH-PIE-SHEET() subroutines. It continues to fetch data that it needs from memory, such as THYME and GARLIC.

Kitchen: Going back to the pantry, the chef gathers vegetables, meats, cheeses, herbs and spices, and carries them over to his preparation area, placing no-longer-needed ingredients back into the pantry.
Computer: The uArchitecture accesses memory for a variety of new data, including TOMATOES, PEPPERONI, MOZZARELLA and OREGANO. Data that is no longer needed often is moved back into memory, making room in the cache for new frequently used data.

Kitchen: The chef sautés vegetables, meat, cheeses, and his secret blend of fresh herbs and spices, and then layers them on top of the pizza dough.
Computer: The computer fetches the data MEAT, CHEESES, HERBS and SPICES, and executes the subroutines BLEND(), SAUTÉ() and LAYER().

Kitchen: Layering exotic cheeses along with his mystery pizza sauce, he places the pizza in the oven for 20 minutes until golden brown.
Computer: The computer fetches EXOTIC-CHEESES and SECRET-SAUCE, and executes the subroutines LAYER() and COOK().

Kitchen: He removes the pizza from the oven and, voilà! He has gourmet pizza perfection.
Computer: The computer stores the result into the PERFECTION memory location, and completes the execution of the PZZAH.EXE program.

Kitchen: The chef cleans up the countertops, putting unused food stuff back into the pantry, and washes and returns pots, pans and dishes to the shelves and cabinets where they are stored.
Computer: The computer completes executing PZZAH.EXE, and starts executing an OS subroutine that releases memory, resets execution tables, and does other clean-up in preparation for the next program.

Table RECIPE. Making gourmet pizza in the common computer.

 

AN ASIDE FOR OUR COMPUTER PROGRAMMING CHEFS

I have taken some liberties with mapping the kitchen to a computer. This is mostly because I have not built up the knowledge base of many of the readers, and partly because digging in too far requires too much explanation.

An example of this is defining the computer part that is analogous to our chef. What actually coordinates and drives forward the use of the pipeline to execute an instruction? Is it the microcode? Only partially as microcode is in part just the series of sub-instructions used to execute instructions. Is it some abstract unknown controlling and hidden master CPU executing the microcode? If so, then how does the controlling master CPU work? Is it a series of pulses created by the master oscillator and masked by what we think of as a microcode sub-instruction, and used to activate and deactivate digital circuits?

I will do my best to add an addendum to each blog looking more closely at the computer technology being discussed. 


Elusive Algorithms - Parallel Scan


Last month there was a query on the IDZ MIC forum, "how to perform inclusive scan in C cilk", to which my initial reply was:

Parallelizing this is problematic due to the next result being dependent upon the prior result. While this is not impossible, it is rather difficult and it introduces some redundant additions.

As any well-seasoned programmer knows, the term "can be" is seldom used lightly, and "problematic" means you have to do a little more work to attain your goal. The inclusive scan is a loop with a single statement with a temporal dependency:

    float sum = 0;
    for ( int i = 0; i < N; i++ )
    {
        Res[ i ] = (sum += A[ i ]);
    }

After making my "problematic" statement, I thought it only sporting of me to demonstrate just how problematic this was, so that you can ascertain whether the benefits of attacking your problematic issues exceed the work necessary to overcome them. I think you will be pleasantly surprised by following this link and reading the attached PDF.
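For a flavor of what that extra work looks like, here is a sketch of my own (not the method in the attached PDF) of one common chunked two-pass approach: each chunk is scanned independently, a small serial exclusive scan of the chunk totals produces per-chunk offsets, and a second pass over the chunks adds those offsets in. That fix-up pass is exactly the "redundant additions" mentioned in the reply above.

```c
#include <assert.h>
#include <stddef.h>

/* Chunked inclusive scan sketch. Passes 1 and 3 iterate over
   independent chunks, so each chunk could be handed to a separate
   worker (e.g. a cilk_for or an OpenMP "parallel for"); they are
   shown serially here to keep the structure clear. */
void chunked_inclusive_scan(const float A[], float Res[], size_t n, size_t nchunks)
{
    float partial[64];                 /* per-chunk totals; assumes nchunks <= 64 */
    size_t chunk = (n + nchunks - 1) / nchunks;
    assert(nchunks >= 1 && nchunks <= 64);

    /* Pass 1: independent local inclusive scans of each chunk. */
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        float sum = 0;
        for (size_t i = lo; i < hi; i++)
            Res[i] = (sum += A[i]);
        partial[c] = sum;
    }

    /* Serial step (tiny): exclusive scan of the chunk totals. */
    float offset = 0;
    for (size_t c = 0; c < nchunks; c++) {
        float t = partial[c];
        partial[c] = offset;
        offset += t;
    }

    /* Pass 2: independent fix-up adding each chunk's offset
       (these are the redundant additions). */
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)
            Res[i] += partial[c];
    }
}
```

Note that every element is touched twice, so the parallel version does roughly 2N additions where the serial loop does N; the chunks must therefore be large enough that the extra work is paid for by the concurrency.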

Jim Dempsey

Advanced Computer Concepts for the (Not So) Common Chef: Memory Hierarchy: Of Registers, Cache & Memory


After introducing this series of blogs, we established some basic processor and threading terminology. In the last blog, we laid the foundation of our kitchen analogy. We noted that a program is equivalent to a recipe, and that the different architectural features of a modern processor, e.g., pipeline, memory and microcode, function much like the components of our gourmet kitchen, e.g., the pantry, appliances and counter tops. Indeed, even our Chef is equivalent to some components in our modern processor, such as the microcode execution unit.

In this blog, we look at memory hierarchy. Modern computers have 3 types of memory: registers, cache and bulk addressable memory. Bulk addressable memory is also commonly referred to as RAM (Random Access Memory), DDR (Double Data Rate)++, main memory, or just plain “memory”. It’s called bulk because, well, there’s a lot of it compared to the other types, i.e. registers and cache.

‘Data’ in computer science terms are the variables, data structures, buffers, and so on that programs use to do what they do. In more concrete terms, data includes the internal representation of what you see on the computer screen, including your spread sheets, the documents that you read, even the programs themselves such as your word processor and browser.

In our kitchen analogy, ‘data’ are the ingredients that the recipe calls for, intermediate products such as sauces, and even our final gourmet dish. For example, salt, pepper, sugar, beef cutlets, lettuce and chives are all ‘data’.


Figure SCALE. Relative sizes and access times to different types of memory.

From the standpoint of programs we write, where we keep our data depends upon how frequently we use that data and how big it is. If we use it all the time, we keep it as close as we can to the CPU, i.e., in the registers if possible. The size and number of registers available is very small, but accessing them is very fast. See Figure SCALE and Table SPEED. If we use the data frequently (or all the time, but it is too large to hold in the registers), it is automatically placed into our cache memory. Data access to cache is quite a bit slower than to registers, but it is still significantly faster than to DDR. And there are several layers of cache, generally 2 and sometimes 3 (called L1, L2 and L3). The program leaves data that it uses infrequently in DDR, i.e., main memory. Access is slow, see Figure SCALE, but the program doesn’t need the data all that often.

MEMORY TYPE       LATENCY

Register file     1 cycle (0.5 nsec)
Cache L1          4 cycles (2 nsec)
Cache L2          65 cycles (33 nsec)
Memory            60 nsec (120 cycles)+

Table SPEED. Access times for memory hierarchy.

 

Analogously, in our kitchen, how frequently we need an ingredient and how quickly determines where we place the ingredient. See Figure HIERARCHY.


Figure HIERARCHY. Kitchen analogy to the modern computer memory hierarchy.

You can think of the bank of registers as the ingredients you are actively using. A chef may be adding salt, sugar and diced carrots at a certain stage in a recipe, and will place them within reach of the stove. He simply has to reach out and use the ingredient. Similarly, a program may actively need your loan balance, interest rate and bank fees to do some forecasting, and will place that data in registers. Analogously to the limited number of registers, there is not that much space next to the stove so the Chef needs to be very selective in what he decides to place there.

If you use the ingredient often but not all the time, you leave it where it is accessible but not right next to you, e.g., somewhere on one of the counter tops. This is equivalent to a computer’s cache. Our chef needs to add thyme, a sauce he prepared earlier, and crumbled feta cheese at several points in his preparation of the entree. He may have to take a few steps to an island countertop but it is still within easy reach. In our computer example, our program may need to often update and reuse various bank and other financial data in its preparation for an investment portfolio. This data will be (automatically) placed in the processor cache. There is also a lot more space on the counter tops, as there is in cache. (See Figure SCALE.)

Most ingredients a chef uses in the preparation of a meal aren’t needed except at certain steps in the recipe. For example, if the Chef is preparing the entrée, he does not need to have at hand the cream cheese, graham cracker crumbs and blueberry compote he’ll later need for the dessert. He’ll have them stored in the pantry or refrigerator until needed. The pantry is several steps away, perhaps even in an adjacent room, but then he doesn’t go there often. Similarly, our program places in bulk addressable memory, a.k.a. RAM or DDR, the sections of a spreadsheet that it doesn’t need until a later stage in the program. Similarly, pantries are large and generally not full, with places to put ingredients and other equipment that is not needed frequently. Of course, pantries can fill up but there are almost always unneeded and expired items which need to be periodically collected and thrown in the garbage. (Ominous foreshadowing: In a later blog, I will cover something called “garbage collection”.)

Register / cooking surface
  Kitchen: The Chef uses salt, sugar and finely ground fresh parmesan cheese continuously to season to taste.
  Computer: The accounting program uses interest and exchange rates continuously to convert dollars to euros.

Cache / counter tops
  Kitchen: Spices and sauces that are used often, but not all the time.
  Computer: Spreadsheet records of transactions over the last and next few days.

DDR / pantry or refrigerator
  Kitchen: Meats, earlier prepared dishes, etc., that either are rarely used or will be used in a later step.
  Computer: The previous month of transactions that have been finished, and the next month that will be processed.

Table ANALOGY. Summary of how a computer’s memory hierarchy and data can be mapped to the ingredients, sauces and other prepared items in our gourmet kitchen.

ADDENDUM:

“Automatically placed in the processor cache”

I’m sure you noticed that I used “automatically” along with cache memory a few times. This is because the processor pipeline (i.e., our chef) doesn’t really decide explicitly what to put into the cache. What goes into the cache is decided for him by the cache management circuitry. This management circuitry automatically guesses what data the processor needs frequently and places that data into cache. For example, it will notice that some lines in a bank rate table are used frequently and will then move them into the cache for faster access. This guessing is imperfect in that the manager will sometimes get it wrong, ejecting a data item that is frequently used and replacing it with another item that seems to be used often but is not. This is also why organizing your data to work with a processor’s cache structure is often very important for your program’s performance.

In our kitchen example, think of this as the Chef having an assistant that will automatically place ingredients that the Chef will need next within easy reach. The assistant is the cache management circuitry. Just like the cache manager, and to the great irritation of the Chef, the assistant can make mistakes as he isn’t perfect.

Footnotes:

+I am sure you noticed that instead of “cycles (seconds),” memory is expressed in “seconds (cycles)”. There is a reason for this madness. Memory access is generally given in nsec (nanoseconds) instead of cycles because memory is external to the processor and controlled by different circuitry running on a different clock. Coordinating these two clocks so memory and processor can talk to each other requires timing to be specified using a well-defined and absolute external reference unit, e.g., nanoseconds.

++DDR is shorthand for the oh-so-much-more informative name, “Double Data Rate Synchronous Dynamic Random-Access Memory,” or DDR SDRAM. It is called double data rate because it transfers data on both the rising and falling edges of the clock, giving it nearly twice the data rate (how much data can be moved per second) of the previous technology, SDR (Single Data Rate) SDRAM.

Intel Software Day 2015 - Salvador (Conteúdo)


On June 12–13, 2015, Intel Software Brasil held Intel Software Day 2015. This edition moved to a new setting, Salvador, hosted by Senai - Cimatec and its excellent infrastructure.
The event featured five tracks of talks focused on Android, RealSense/Windows, IoT, HPC, and Startups, delivered by top-level professionals, along with demo presentations and plenty of networking among attendees.
For those who attended and want to review the content, or for those who could not be there and want to learn more, here are the presentations from this year’s expert speakers.

 

Android Track

Coordinator: Eduardo Carrara - Intel

Android has become one of the most widely used mobile platforms in the world, with hundreds of millions of active users. Intel has invested consistently to enter this market, creating processors that deliver performance, efficiency, and innovative experiences. Our Android track brought presentations on Intel tools, along with news and tips for Android developers.

Below are the details and materials for each talk:

Build Your First Android App with the Intel XDK

Speaker: Reinaldo Silotto
Summary: The Intel XDK is a complete development environment for building hybrid applications using open technologies such as HTML5, CSS3, and JavaScript, along with frameworks like Bootstrap 3 and jQuery Mobile. This talk demonstrated how to build a hybrid app using the XDK’s main features and presented everything the tool makes possible.
Material: Slides, Video (coming soon), Code

Discover Your User with the Context Sensing SDK

Speaker: Ubiratan Soares
Summary: In this talk, we saw how Intel Context Sensing can help you learn more about your user from how they use their phone/tablet and from the associated context information!
Material: Slides, Video (coming soon), Code

Material Design

Speaker: Ricardo Lecheta
Summary: This talk presented news and tips on developing applications with Material Design features, exploring some of the new visual components and animations characteristic of Material Design.
Material: Slides, Video (coming soon), Code

Game Interfaces for Different Screens

Speaker: Pedro Monteiro Kayatt
Summary: Even with great graphics engines trying to predict how a game will run on different screens, it is important to keep the various screen aspect ratios in mind when designing your game’s layout!
Material: Slides, Video (coming soon)

Boosting Productivity with Android Libs

Speaker: Nelson Glauber
Summary: In this presentation, we saw how to gain productivity and speed up the development, testing, and distribution of Android applications using the most “famous” libraries on the market.
Material: Slides, Video (coming soon), Code

 

HPC (High Performance Computing) Track

Coordinator: Igor Freitas - Intel

Big Data and High Performance Computing technologies now play a fundamental role in many sectors of science and industry, bringing innovations that benefit all of society.

The discovery of new drugs and advances in verticals such as agriculture, energy, manufacturing, and finance are a few examples of where high performance computing is being used. This track presented various code modernization techniques so that your application can extract the maximum from Xeon® processors and Xeon Phi™ coprocessors, improving performance and energy efficiency in large datacenters.

Below are the details and materials for each talk:

High Performance Computing: Hardware Overview from a Software Perspective

Speaker: Leonardo Borges - Intel
Summary: This talk walks through concepts in CPU and system architecture that are highly relevant to High Performance Computing. A closer understanding of the core and uncore hardware components empowers developers 1) to leverage the best of each system component and 2) to apply these concepts to improve application performance. Topics include memory hierarchy, vector registers, and NUMA, and how to exploit them from a software development perspective.
Material: Slides, Video (coming soon)

Code Modernization in Software Used by High Energy Physics

Speaker: Calebe de Paula Bianchini - UNESP/NCC
Summary: The UNESP-IPCC project is involved in one of the international efforts to modernize software widely used in High Energy Physics, such as Geant. The goal is to exploit modern computer architectures that enable a new degree of parallelism, mainly through vectorization. This talk presented the design decisions and partial results obtained by analyzing the prototypes built to validate the modernization process, running on modern hybrid architectures such as the Intel Xeon Phi.
Material: Slides, Video (coming soon)

Intel Compiler Overview with Emphasis on Vectorization and Threading

Speaker: Kenneth Craft (Intel) and Leo Borges (Intel)
Summary: An introduction to the feature set of the Intel Composer XE tool suite, which consists of the C++ and Fortran compilers, domain-specific performance libraries, and multithreading libraries. The focus of this module is on integrating Composer into development environments on Windows and Linux and demonstrating the features and performance benefits of the Intel compilers.
Material: Slides, Video (coming soon)

Developing high performance software for the future

Speaker: Gerard Gorman - Imperial College London
Summary:
Material: Slides, Video (coming soon)

A Day in the Life of a Supercomputer: A Quick HPC Guide for Beginners

Speaker: Renato Miceli - SENAI CIMATEC
Summary: For many, supercomputing is that utopia, something found only at NASA and in science fiction movies. For us, supercomputing is the air we breathe, our day-to-day reality. Now, with the recent inauguration of the CIMATEC Yemoja supercomputer, the most powerful in Latin America, supercomputing is within everyone’s reach. In this talk we break the taboos and preconceptions. We discuss what a supercomputing center like SENAI CIMATEC’s is and does, and how a large machine is operated and used. We demonstrate live, directly on the most powerful supercomputer in Latin America, some of the capabilities that make this tool so coveted around the world. We show how your programs can benefit from supercomputing and what opportunities startups and small and medium businesses can find at the SENAI CIMATEC Supercomputing Center.
Material: Slides, Video (coming soon)

Developing Code: 50 Years into Moore’s Law

Speaker: Paul Butler - Intel
Summary: 2015 marks the 50th anniversary of Moore’s Law. In 2008, James Reinders (Director and Chief Evangelist, Intel Software) published “Parallel or Perish!! Are you ready?”, outlining the software revolution underway, triggered by the shift to multi-core hardware architectures, and how software applications can best take advantage of it. Here we are seven years later: multi-core / many-core hardware is ubiquitous, yet many software developers are still working to acquire the skills to adapt to parallelism. This talk covers tips and tricks for parallel programming models, starting with a level-setting overview and moving on to vectorization, multi-threading, multi-processing, and hybrid programming. The presentation concludes with pointers to more information (white papers, case studies, online classes, and upcoming “Code Modernization Workshops”) for developers in Brazil.
Material: Slides, Video (coming soon)

 

IoT (Internet of Things) Track

Coordinator: Jomar Silva - Intel

A track of talks on the main trends and technologies for IoT development.

Below are the details and materials for each talk:

What Is the Internet Missing for Things?

Speaker: Tiago Barros
Summary: In recent years, the internet has evolved from a repository of interconnected documents into a dynamic environment that links people, applications, and devices. For this to take hold, the internet needs an architecture that can handle these new challenges of intercommunication between things. In this presentation we discuss the platforms, standards, communication protocols, and infrastructure needed for the Internet of Things.
Material: Slides, Video (coming soon), Code

The Industry of the Future: What Automation Already Makes Possible

Speaker: Herman Augusto Lepikson
Summary: Smart factories, the Internet of Things, autonomous robotics, virtual manufacturing.
Material: Slides, Video (coming soon), Code

How to Use the Intel Edison in Your Next IoT Project

Speaker: Jomar Silva - Intel
Summary: The Intel Edison is an excellent platform for your next IoT project, but before deciding to use it, it is important to know all of its main technical characteristics, its typical applications, and its limitations, as well as the software support available for the platform.
Material: Slides, Video (coming soon), Code

Intel UPM: Contributing and Sharing

Speaker: Rafael Neri
Summary: Along the road of project development, the challenges that arise are, in some cases, great opportunities to build solutions that can be shared with others. Intel has made a great effort to build UPM, a repository of high-level sensor drivers on top of libmraa, reducing the effort needed to use those sensors and providing a central place where the community around the project can find information about them. In this context, we look at how to spot opportunities to contribute to UPM, and how we are sharing what we have learned on these platforms with the students of our technical courses through case studies.
Material: Slides, Video (coming soon), Code

Building Amazing IoT Projects with the Intel Platform

Speaker: Vinicius Senger
Summary: In this talk, Vinicius presents several hands-on projects built with Galileo, Edison, and RealSense that connect robots, lamps, and home appliances to the Internet, controlled by gestures and voice commands. You will see code, circuits, and unusual gadgets, like the wooden laptop Vinicius built with the Edison!
Material: Slides, Video (coming soon), Code

 

Intel RealSense and Windows Track

Coordinator: Felipe Pedroso - Intel

This track presented how RealSense technology can revolutionize the way users interact with their devices, plus the latest Windows platform news straight from the most recent Build conference, presented by Microsoft.

Below are the details and materials for each talk:

Taking Your Apps to Another Dimension with the RealSense SDK

Speaker: Felipe Pedroso - Intel
Summary: If the world isn’t flat, why do our interaction interfaces have to be? This talk covers how to use RealSense technology to implement Natural User Interfaces (NUI) in a more practical way, letting you focus on what matters most to the user: content. I also demonstrate how to build Unity games that use the technology as an interface without typing a single line of code.
Material: Slides, Video (coming soon)

How to Use the SharpSenses Framework and C# 6 in Your Apps

Speaker: Renato Haddad
Summary: This talk shows how to use the SharpSenses framework, created to boost productivity when coding with the Intel RealSense SDK, in apps that use facial expressions, speech recognition, gestures, and poses. At the end, we look at the main new features of the C# 6 language.
Material: Slides, Video (coming soon)

Made in Brazil! The Road to a Winning App

Speaker: Alexandre Ribeiro
Summary: Winner of the Grand Prize in the latest Intel RealSense App Challenge, Alexandre tells what it was like to enter the contest and shares tips on working with new technologies, along with a bit of Animagames’ history and the challenges of earning a place in the games market.
Material: Video (coming soon)

Windows 10 and the Universal Windows Platform: Developing for 1 Billion Devices

Speaker: Caio Garcez - Microsoft
Summary: Windows 10 brings a unified development platform: the Universal Windows Platform (UWP). This platform is an evolution of the universal apps introduced with Windows 8.1 and Windows Phone 8.1 and consolidates the convergence of Microsoft’s operating systems. UWP lets the same app run on smartphones, tablets, laptops, desktops, and other devices such as HoloLens, Surface Hub, and Raspberry Pi 2 boards. This talk presents the platform’s main concepts, including intelligent UIs that adapt to different resolutions and layouts, the app lifecycle, and app-to-app communication.
Material: Slides, Video (coming soon)

Universal Windows Platform Bridges: Your Code Running on Windows 10

Speaker: Caio Garcez - Microsoft
Summary: One of Microsoft’s strategies for UWP is to let many developer profiles build and publish apps in the Windows Store, the unified store covering all devices running Windows 10. The “Bridges” are a combination of development tools and runtime technologies that allow applications written in other languages and frameworks to be adapted to UWP. This talk covers the main concepts involved in porting .NET and Win32 applications, websites, Android apps written in Java, and iOS apps written in Objective-C.
Material: Slides, Video (coming soon)

ModernCode Project - Intel and Partners Helping You Make Today’s Software Be Ready for Tomorrow’s Computers


Today, we introduced the Intel® Modern Code Developer Community, which focuses on the pursuit of parallel programming.  The community includes our very successful series of Modern Code Live Workshops taught around the world and our upcoming Intel® HPC Developer Conferences. Our community includes the Intel® Parallel Computer Centers (IPCCs) located at institutions around the world with the goal to modernize key technical codes, and experts from around the world including the Intel® Black Belt Software Developers. In addition to the online community, we have an exciting contest for a very special and worthy cause, coming this fall.

Encouraging and Educating Parallel Programming

The end of rising clock rates a decade ago ushered in an era of parallelism, driven by the continued rise in transistor counts in keeping with half a century of Moore’s Law. Today’s multicore and many-core processors offer amazing capabilities that are maximized by parallel programming.

Modern Code – architecting and optimizing for today and the future

“Modern Code” is code that has been re-architected and optimized for parallelism to run on today’s and tomorrow’s computers, including supercomputers, thereby increasing application performance. These efforts benefit from the fruits of the Intel® Parallel Computer Centers (IPCCs) that we established with universities and other institutions around the world with the goal of modernizing key technical codes. Many examples of successful techniques, including many from the IPCCs, are captured in content on the web site (Code Modernization Library), with more to come. You will also find excellent material on modernizing code in the series of “Pearls” books edited by Jim Jeffers and me.

The Intel Modern Code Community hosts a growing collection of tools, training and support. We proudly feature an elite group of experts in parallelism and HPC, from Intel and the industry worldwide, that we call Intel® Black Belt Software Developers. Intel is partnering with these experts to train and support the broader community on modern code techniques.

Intel has been helping educate and encourage parallel programming. We have our very successful series of Modern Code Live Workshops taught around the world in conjunction with our training partners. Later this year, we will hold Intel® HPC Developer Conferences. We will keep you updated through the Intel® Modern Code Developer Community online (see "Upcoming Events").

Online Community - find us in person too!

To join the Intel Modern Code Community or find out more, visit the Intel® Modern Code Developer Community online, or find us here at the International Supercomputing Conference (July 13-15, 2015). There will be many more opportunities in the future to engage us in person, including the Intel® Developer Forum (IDF) in San Francisco, August 18-20, 2015 as well as the Supercomputing Conference 2015 in Austin, November 14-20, 2015.  Personally, I’ll be at all these and I would be very interested in discussing code modernization with you. I’ll also be at SIGGRAPH to teach a tutorial “Multithreading for Visual Effects” with five experts on Visual Effects on August 12 in Los Angeles, and I’m speaking at ATPESC 2015 the week prior.

Intel® Modern Code Challenge 2015
Coding for Science to Build a Better Tomorrow

As a way to test out newly acquired Modern Code skills and techniques, while contributing to a social cause, developers can participate in the Intel® Modern Code Challenge 2015.

Parallel computing plays a central role in advancing scientific research in key areas like cancer research, physics, and climate modeling, which rely on it to push the envelope of performance. Developers who participate in the Intel Modern Code Challenge will have an opportunity to help the industry make the best use of today’s computers to enable scientific breakthroughs.

Prizes will include a trip to SC15 in Austin, Texas in November 2015, and to visit CERN, in Switzerland, in 2016. The top student participants will be eligible for scholarships. To receive additional details sign up at software.intel.com/moderncode/challenge.

Intel® HPC Developer Conferences

We will hold three Intel HPC Developer Conferences this year – one in the U.S., one in China and one in India.  We will announce more details on the Modern Code website soon.  I am the overall technical committee chair, and I’m very excited by the speakers and content we already have lined up. I expect to be able to announce complete details late this summer. We will keep you updated through the Intel® Modern Code Developer Community online (see "Upcoming Events").

Modern Code Live Workshops

In conjunction with partners, we have been holding hands-on training around the world for developers and partners. The classes are enabled with remote access to Intel® Xeon® processor and Xeon Phi™ coprocessor-based clusters. You can learn more at the Modern Code Live Workshops website. Training and resources cover architecture overviews, memory optimizations, multithreading, vectorization, Intel® Math Kernel Library, Intel® Threading Building Blocks, Intel® Parallel Studio XE and much more.

Join Us

Parallelism has long been embraced in High Performance Computing (HPC) for programming the world’s most powerful computers, often called supercomputers. It is fitting that, today, at the International Supercomputing Conference, we launched the Intel® Modern Code Developer Community with many resources to help HPC developers get the most out of their applications on modern hardware.

I encourage you to come take advantage of one or more of the many benefits of the Intel® Modern Code Developer Community.

Webinar: IDF LIVE - Parallel Programming Pearls


Unable to join us at the Intel Developer Forum in San Francisco this August? We have you covered. This session dives into real-world parallel programming optimization examples, from around the world, through the eyes and wit of enthusiast, author, editor and evangelist James Reinders.

When: Wed, Aug 19, 2015 11:00 AM - 12:00 PM PDT


This session will explore:
• Concrete real-world examples of how Intel® Xeon Phi™ products extend parallel programming seamlessly from a few cores to many cores (over 60) with the same programming models
• Examples from the newly released book, “High Performance Parallelism Pearls Volume 2,” highlighted with source code for all examples freely available.

Click to register
