OpenGL* ES 3.0 Precompiled Shaders

http://en.wikipedia.org/wiki/Portable_Network_Graphics

Exemplo de código

↧

Android Texture Compression - a comparison study with code sample

May 29, 2014, 7:21 pm

Latest and popular articles on Intel Technologies

≫ Next: Data Plane Development Kit Overview

≪ Previous: OpenGL* ES 3.0 Precompiled Shaders

This is a code sample written by Cristiano Ferreira, Graphics Software Applications Engineer at Intel Corporation. Source code for this sample is available here.

Introduction

The application of an image, or texture, to a 2D or 3D model to enhance graphical detail is a very common technique in the field of computer graphics. Android* allows the usage of a variety of texture compression file formats, each of which has its own set of advantages and disadvantages. The Android Texture Compression sample allows developers to easily compare textures of five different texture compression file formats: Portable Network Graphics* (PNG), Ericsson Texture Compression* (ETC), Ericsson Texture Compression 2* (ETC2), PowerVR Texture Compression* (PVRTC), and S3 Texture Compression* (S3TC), which is also known as DirectX Texture Compression* (DXTC). This sample demonstrates how to load and use these different formats with OpenGL ES* on Android. All supported texture formats are shown side-by-side so the relative size and quality can be observed. Choosing the right compression allows the developer to balance app size, visual quality, and performance.

The sample loads each texture format, determines the mapping coordinates of the texture, and displays a portion of each of the textures. The final composition will display the full image/texture, but as four separate, format-specific textures. They are individually labeled at the top of the screen and the file sizes are provided on a small bar at the bottom of the screen.

Background on Principles/Related Terms and Texture Formats

Texture mapping is a method by which an image is applied to the surface of a shape or polygon. A helpful analogy to keep in mind is picturing the texture as wrapping paper, and the 3D model as a gift box to be wrapped. This is why this process is also called “texture wrapping.”

Figure 1 - 1 is solely the polygon/shape tank model and 2 is the texture-mapped model

Mipmaps are an optimized group of images that are generated with the primary texture. They are typically created for the purpose of improving rendering speed and reducing aliasing artifacts. Each mip (bitmap image of the collection of mipmap) is a lower resolution version of the primary texture, utilized when viewing the original texture from a distance or a downsized version of it. The creation and implementation of mipmaps comes from the basic concept that we cannot pick up as much detail in an object when it is located far away from us or when the object is miniscule. Based on this idea, different mips can be used to represent different parts of the texture/image based on the size of the objects. This increases rendering speed because the simplified mips have a much lower texel (overall number of texture pixels) count—less pixels to be processed. Additionally, since mipmaps are essentially anti-aliased, the number of noticeable artifacts is also greatly reduced. The support for mipmaps in PNG, ETC (KTX), ETC2 (KTX), PVRTC, and S3TC are included in the sample.

Portable Network Graphics (PNG)

PNG is a bitmapped image format primarily noted for its lossless data file compression. The image format is equipped with support for palette-based images (24 bit RGB or 32 bit RGBA) or grayscale images (with and without an alpha channel).

Advantages:

Has a lossless compression scheme and high visual quality
Handles both 8-bit and 16-bit transparency

Disadvantages:

Large file size; this will increase app size and memory bandwidth requirements
Highest GPU cost (i.e. worst performance)

Ericsson Texture Compression (ETC)

Ericsson Texture Compression is a texture compression format that operates on 4x4 blocks of pixels. Originally Khronos used Ericsson Texture Compression as the standard for OpenGL ES 2.0 (this version is also known as ETC1). Therefore, this texture compression format is available on nearly all Android devices. Recently with the release of OpenGL ES 3.0, a reworked version of ETC1, known as ETC2, was implemented as the new standard. The main differences between the two schemes is the algorithm which operates on each pixel group. The improvements in the algorithm result in higher fidelity output when it comes to finer details. The finer quality of the image comes without the cost of additional space.

ETC1 and ETC2 both support compression of 24-bit RGB data, but they do not support the compression of any images/textures containing alpha components. In addition, there are two different file formats that both fall under the category of ETC texture compression: KTX and PKM. KTX is the standard Khronos Group compression format, and it provides a container for multiple images/textures. When mipmaps are generated with KTX, only one KTX file is created. On the other hand, PKM is a much simpler file format used mainly to contain single compressed images. Generating mipmaps in this case would create multiple PKM files instead of a single file, and so as a consequence, this format is not recommended for that purpose.

Advantages:

File size is considerably smaller in comparison to the PNG texture compression format
GPU hardware acceleration supported on nearly all Android devices

Disadvantages:

Quality is not as high as PNG texture compression (ETC is a lossy compression format)
Does not support alpha channels/components

Example of Tools Used For Compression:

Mali* Texture Compression Tool (Developer Center)
ETC-Pack Tool

PowerVR Texture Compression (PVRTC)

PowerVR Texture Compression is a lossy, fixed-rate texture compression format utilized primarily in Imagination Technologies’ PowerVR* MBX, SGX, and Rogue technologies. It is currently being employed in all iPhone*, iPod*, and iPad* devices as the standard compression format. Unlike ETC and S3TC, PVRTC is not block-based but rather involves the bilinear upscaling and low-precision blending of two low-resolution images. In addition to the unique process of compression by the PVRTC format, it also supports RGBA (alpha channel supported) for both the 2-bpp (2 bits per pixel) and 4-bpp (4 bits per pixel) options.

Advantages:

Supports alpha channels/components
Supports RGBA data for both 2-bpp and 4-bpp modes
File size is much smaller than one using PNG texture compression
GPU hardware acceleration on PowerVR GPUs

Disadvantages:

Quality is not as high as PNG texture compression (PVRTC is a lossy compression format)

PVRTC is only supported on PowerVR hardware
Only square (power-of-two) dimension textures are determined to work consistently, although in some cases, rectangular support is provided for the compressed texture
Compressing textures into this format can be slow

Tool Used For Compression:

PVRTexTool* v3.33 from Imagination Technologies

S3 Texture Compression (S3TC) or DirectX Texture Compression (DXTC)

S3 Texture Compression is a lossy, fixed-rate, texture compression format. This style of compression makes S3TC an ideal texture compression format for textures used in hardware-accelerated 3D computer graphics. Following the integration of S3TC with Microsoft’s DirectX* 6.0 and OpenGL 1.3, the compression format became much more widespread. There are at least 5 different variations of the S3TC format (including DXT1 through DXT5). The sample supports the commonly used variations (DXT1, DXT3, and DXT5).

DXT1: DXT1 is the smallest mode of S3TC compression; it converts each block of 16 pixels into 64 bits. Additionally, it is composed of two different 16-bit RGB 5:8:5 color values and a 4x4 2-bit lookup table. DXT1 does not support alpha channels.

DXT3: DXT3 converts each block of 16 pixels into 128 bits and is composed of 64 bits of alpha channel data and 64 bits of color data. DXT3 is a good format choice for images or textures with sharp alpha transitions (opaque versus translucent).

DXT5: DXT5 converts each block of 16 pixels into 128 bits and is composed of 64 bits of alpha channel data and 64 bits of color data. DXT5 is a good format choice for images or textures with gradient alpha transitions.

Advantages:

File size is considerably smaller in comparison to the PNG texture compression format.
Decent quality, low banding (artifacts not too visible)
Good speed for compression/decompression
GPU hardware acceleration for a wide range of video chip parts: support is almost universal on desktop, but it still needs to grow on Android devices

Disadvantages:

Quality is not as high as PNG texture compression (S3TC is a lossy compression format)
Not supported on all Android devices

Tool Used for Compression:

DirectX Texture Tool from DirectX (included in the DX SDK)

Accessing the Texture Data

Most texture compression file formats have a header that is placed before the actual texture data. The header usually contains data regarding the name of the texture compression format, the width of the texture, the height of the texture, the depth of the texture, the size of the data, the internal format, and other specific properties of the file.

Our goal is to load and map texture data from each of the different texture compression files to a 2D model for comparison of quality and file size. The headers that come before the texture data is not to be included as part of the texture to be mapped, as doing so would distort the image/texture representation. The file headers all vary depending on the file compression format that is being considered, and as such, each texture compression file format needs individual support in order to load the textures and map them properly.

IMPORTANT:

The PVRTC header is packed due to the presence of the 64-bit pixel format data member (mPixelFormat in the sample). ARM attempts to align the header by padding it with 4 additional bytes, making the header a total of 56 bytes instead of the raw 52 bytes. This in turn causes the image to be distorted when displayed on an ARM device. Intel devices do not pad the header, and so isn’t an issue. Packing the header solves the ARM padding issue, and the texture displays correctly on both ARM and Intel devices.

Figure 3 – Image of ARM Padding Issue with PVRTC in Previous Sample

Loading and Supporting the Texture Formats:

Loading PNG

Mipmaps are taken care of in PNG by a simple function: glGenerateMipmap – a predefined function from Khronos OpenGL designed for this specific purpose. Sean Barrett’s public domain stb_image.c was utilized in the reading and loading of the PNG files (as well as locating and pinpointing the texture data to be processed). The following is a piece of code that initializes the texture and provides mipmap support for the PNG compression format.

// Initialize the texture

glTexImage2D( GL_TEXTURE_2D, 0, format, width, height, 0, format, GL_UNSIGNED_BYTE, pData );

// Mipmap support

glGenerateMipmap( GL_TEXTURE_2D );

Loading ETC / ETC2

As mentioned earlier, ETC is composed of two different format types—KTX and PKM. KTX is the standard compression format, used as a container for multiple images/textures, and is ideal for mipmapping. PKM, on the other hand, is designed for simple single texture compression, and so generating mipmaps gives rise to multiple PKM files, which is inefficient. For this reason, mipmap support for ETC texture compression in the sample app is restricted to KTX file compression only. Khronos provides an open source C library (libktx) in which KTX texture loading with mipmaps is supported. We took advantage of this tool and implemented the code in a texture loading function called LoadTextureETC_KTX. The function used to actually load the KTX texture compression format file is ktxLoadTextureM (loads the desired texture from data in memory). This function (ktxLoadTextureM) was provided in the library (libktx) and is documented at the Khronos site (in “Resources” below).

The following is a piece of code that initializes the texture and provides mipmap support for the ETC (KTX) compression format:

// Generate handle & Load Texture

GLuint handle = 0;

GLenum target;

GLboolean mipmapped;

KTX_error_code result = ktxLoadTextureM( pData, fileSize, &handle, &target, NULL, &mipmapped, NULL, NULL, NULL );

if( result != KTX_SUCCESS )

{

LOGI( "KTXLib couldn't load texture %s. Error: %d", TextureFileName, result );

return 0;

}

// Bind the texture

glBindTexture( target, handle );

Loading PVRTC

Providing mipmap support for PVRTC textures was a bit trickier. After reading through the header, the offset is defined as the size of the header plus the metadata size (metadata follows the header and is also not part of the actual texture data). For each of the mips generated, pixels are grouped into blocks (different depending on if it is 4 bits per pixel or 2 bits per pixel—both valid PVRTC formats). Next, clamping occurs, and so height and width of the blocks are limited to certain boundaries. Then, the function glCompressedTexImage() is called to identify a two-dimensional image in the PVRTC compressed format. Following that, the pixel data size is calculated and then is added to the offset in order to group the set of pixels in the next mip. This process is repeated until there are no more mips to operate on.

// Initialize the texture

unsigned int offset = sizeof(PVRHeaderV3) + pHeader->mMetaDataSize;

unsigned int mipWidth = pHeader->mWidth;

unsigned int mipHeight = pHeader->mHeight;

unsigned int mip = 0;

{

// Determine size (width * height * bbp/8), min size is 32

unsigned int pixelDataSize = ( mipWidth * mipHeight * bitsPerPixel ) >> 3;

pixelDataSize = (pixelDataSize < 32) ? 32 : pixelDataSize;

// Upload texture data for this mip

glCompressedTexImage2D(GL_TEXTURE_2D, mip, format, mipWidth, mipHeight, 0, pixelDataSize, pData + offset);

checkGlError("glCompressedTexImage2D");

// Next mips is half the size (divide by 2) with a min of 1

mipWidth = ( mipWidth >> 1 == 0 ) ? 1 : mipWidth >> 1;

mipHeight = ( mipHeight >> 1 == 0 ) ? 1 : mipHeight >> 1;

// Move to next mip

offset += pixelDataSize;

mip++;

} while(mip < pHeader->mMipmapCount);

Loading S3TC

After loading an S3TC texture file, determining the format, and reading past the header, mipmap support takes place. Each mip is looped through and pixels are grouped into blocks. Then, the function glCompressedTexImage is called to identify a two-dimensional image in the S3TC compressed format. The aggregate data size of the blocks is then added to the offset in order to move to the next mip and perform the same actions. The process repeats until there are no more mips to operate on. The following is a piece of code that initializes the texture and provides mipmap support for the S3TC compression format.

// Initialize the texture

// Uploading mipmaps

unsigned int offset = 0;

unsigned int width = pHeader->mWidth;

unsigned int height = pHeader->mHeight;

unsigned int mip = 0;

{

// Determine size

// As defined in extension: size = ceil(<w>/4) * ceil(<h>/4) * blockSize

unsigned int Size = ((width + 3) >> 2) * ((height + 3) >> 2) * blockSize;

glCompressedTexImage2D( GL_TEXTURE_2D, mip, format, width, height, 0, Size, (pData + sizeof(DDSHeader)) + offset );

checkGlError( "glCompressedTexImage2D" );

offset += Size;

if( ( width <<= 1 ) == 0) width = 1;

if( ( height <<= 1 ) == 0) height = 1;

mip++;

} while( mip < pHeader->mMipMapCount );

Conclusion

Depending on the situation it is used in, proper texture compression may improve visual quality, decrease the size of an app considerably, and greatly enhance performance. Optimal texture compression provides substantial advantages to developers and their applications. The Android Texture Compression sample app demonstrates how to load and access the most popular texture formats that can be found on Android. Go download the source code and incorporate the best texture compression in your next project.

About the Author:

William Guo created this sample while he was an intern with Intel’s Personal Form Factor team while working on Intel phones and tablets. He is currently attending the University of California, Berkeley as an up & rising sophomore with an expected graduation date of May 2015. He intends to major in Electrical Engineering and Computer Science with a possible minor in psychology.

Sample and article updated to include ETC2 format by Cristiano Ferreira who is currently working for Intel in developer relations. Cristiano can be contacted at Cristiano.ferreira@intel.com for questions regarding the sample.

Updated artwork for the ETC2 sample was provided by Jeffery A. Williams, Lead Digital Content Designer in the Game Development Experience group under Developer Relations at Intel Corporation.

Resources:

Texture Mapping Figure 1:
http://upload.wikimedia.org/wikipedia/commons/3/30/Texturedm1a2.png
Mipmapping Image Figure 2:
http://en.wikipedia.org/wiki/File:MipMap_Example_STS101.jpg
PNG* Info:

http://www.libpng.org/pub/png/

ETC* (KTX* and PKM*) Info:
PVRTC* Info:
S3TC* Info:
Source Code:

*Other names and brands may be claimed as the property of others

vcsource_index

vcsource_type_techsample

vcsource_os_windows

vcsource_domain_graphics

vcsource_type_productsample

Intel Sample Source Code License Agreement

Gráficos

Telefone

Tablet

Contrato de licença:

Intel® Integrated Native Developer Experience (INDE)

Exemplo de código

Para começar

↧

Data Plane Development Kit Overview

May 30, 2014, 5:06 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® System Studio 2014 Update 1 - What's New

≪ Previous: Android Texture Compression - a comparison study with code sample

The Data Plane Development Kit (DPDK) is a key ingredient addressing the data plane needs of Telecom and Networking applications implemented on general purpose processors. It is an optimized library in Linux User Space offering a higher level of packet processing throughput than standard Linux network interfaces.

DPDK fundamentals:

‒Implements a run to completion model

‒Accesses all devices by polling without a scheduler

‒Accesses all devices directly from Linux User Space

‒Runs in 32-bit and 64-bit mode with/without NUMA

‒Scales from Intel Atom processors to Intel Xeon processors

‒Supports an unlimited number of processors and processor cores

‒Optimizes packet allocation across DRAM channels

‒Allocates memory from the local node where possible

‒Ensures data structures & objects are cache-aligned; improves performance

‒Examples to show capabilities

The DPDK was initially defined and developed by Intel, but it now includes contributions from many individuals and companies as an open-source (BSD license) community project.

Download the poster below, and be sure to visit http://dpdk.org/.

↧

Intel® System Studio 2014 Update 1 - What's New

May 31, 2014, 4:44 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® INDE Media Pack for Android* Tutorials - Video Streaming from device to YouTube*

≪ Previous: Data Plane Development Kit Overview

Intel® System Studio 2014 provides deep hardware and software insights to speed-up development, testing and optimization of Intel-based IoT, Intelligent Systems, mobile systems and embedded systems. Intel® System Studio 2014 has added exciting new features such as Tizen* IVI and Android* target support, Windows* host support, enhanced Eclipse* integration & cross-build, system-wide analysis, and more.

Product Contents Update 1

The product contains the following components. For a comprehensive explanation of each one of these components, please go to our Solutions, Tips, and Tricks Page.

Intel® C++ Compiler 14.0 Update 1 for Embedded OS Linux*

What's New Intel® System Studio 2014

1. Intel® Math Kernel Library Compatibility library for 32-bit Android* NDK has been added.
This component adds support for static linking of MKL under 32-bit Android* NDK.
2. Intel® IPP support of the Intel® Quark SoC has been added
3. Intel® C++ Compiler for Android* and embedded OS Linux* updated
4. Intel® VTune™ Amplifier
4.1. Updated version of Intel® VTune™ Amplifier with support for remote softwarebased algorithm analysis (Basic Hotspots, Concurrency, Locks and Waits) on
embedded Linux target systems
4.2. New analysis type “TSX Exploration” for 4th generation Intel® Core™ processors.
4.3. Support for external data collection launched from the VTune Amplifier with the
Custom collector target configuration option or -custom-collector command line option.
4.4. Android 64 bit kernel support (not 64-bit user space)

What's New Intel® System Studio 2014 Update 1

Broader Host and Target OS coverage
New Tizen* IVI 2.0, 3.0 New Yocto Project* 1.5 New Wind River* Linux* 4.0 - 6.0 New Android* 4.0.x - 4.4.x (Android* NDK R9a, R9b, R9c) New Windows* 7 & 8 Host development for Linux*-based Targets
Enhanced Eclipse integration & cross-build
Automated Eclipse* IDE Integration on Linux* and Windows* hosts Enhanced cross-build sysroot support and Wind River* Linux* cross-build environment integration Yocto Project* Compatible OpenEmbedded* 3rd party toolchain layer recipes
New features across all components
Intel® C++ Compiler and libraries generated code compatible with Wind River Simics* Intel® JTAG Debugger 2014 - Now supports Intel® Quark SoC, Intel® Core™ and Intel® Xeon™ processors. GNU* GDB - Branch Trace Store (btrace) for Intel® Atom™ or 4th generation Intel® Core™ processors Intel® VTune™ Amplifier 2014 for Systems - Adds system-wide event-based sampling of uncore and SoC memory bandwidth	Intel® C++ Compiler 14.0– Optimizations for the latest Intel® processor generation including Intel® Quark processor Intel® Integrated Performance Primitives 8.1 - Adds new signal processing features for communications, MMSE MIMO support, and optimization for the latest Intel® processor generation Intel® Math Kernel Library 11.1 - Optimizations for the latest Intel® processor generation Intel® C++ Compiler for Android* compatibility with GCC 4.8 and integration into the latest Android* NDK R9
New Intel® Architecture platforms
Latest Intel® Atom™ processor E3xxx and Z3xxx generation (code-named Bay Trail) Latest Intel® Atom™ processor C2xxx generation (code-named Avoton, Rangeley) Intel® Quark SoC X1xxx (code-named Clanton) 4th generation Intel® Core™ processor (code-named Haswell)

Get Help or Advice

Getting Started?
Click the Learn tab for guides and links that will quickly get you started.
Support Articles and White Papers – Solutions, Tips and Tricks

Resources
Documentation
Training Material

Support

We are looking forward to your questions and feedback. Please don't hesitate to escalate any questions you have or issues you run into. We thank you for helpig us to continuously improve Intel® System Studio

Intel® Premier Support– (registration is required) - For secure, web-based, engineer-to-engineer support, visit our Intel® Premier Support web site. Intel Premier Support registration is required. Once logged in search for the product name Intel® System Studio for Linux*.

Please provide feedback at any time:

↧

Intel® INDE Media Pack for Android* Tutorials - Video Streaming from device to YouTube*

May 30, 2014, 1:20 am

Latest and popular articles on Intel Technologies

≫ Next: Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement

≪ Previous: Intel® System Studio 2014 Update 1 - What's New

This tutorial explains how to use Intel® INDE Media Pack for Android* for streaming from your device to YouTube*.

Prerequisites:

Camera Streaming sample from Intel® INDE Media Pack for Android:
Wowza* Streaming Engine.

Installing Wowza Streaming Engine:

You can use Wowza on any platform (Microsoft* Windows*, Mac OS X* and Linux*/Unix*). Please check the official installation guide.

Configuring Wowza Streaming Engine:

In the Streaming Engine Manager Welcome page, click the Applications tab at the top of the:
In the Applications contents pane, click live.
In the contents pane, click Incoming Security, and then click Edit.
The Incoming Security page is displayed. Configure the following options, and then click Save:
Restart the application.

Setting up the YouTube live event:

Sign in to the YouTube Video Manager Live Events webpage (http://www.youtube.com/my_live_events).
Click Enable live streaming and verify your account.
Click Create live event.
In the Create a new event page, under Basic info, enter the relevant information about the stream (title, description, date/time, location, and so on) into the fields.
Under Type, select Custom (more encoding options).
Click Advanced settings to configure additional options such as enabling comments and recording the stream.
Click the Go live now button.
On the Ingestion Settings page, under Choose maximum sustained bitrate of your encoder, select the options that best represent your network and encoding capabilities.
Under Select your encoder, select Other Encoders. You'll then see stream name and server URL information similar to this:

Copy this information into a text document, for reference.
Click the Save changes button.

Installing Push Publishing AddOn:

Starting with Wowza Streaming Engine 4, no additional installation steps are necessary. The Push Publishing AddOn has been updated and is now built in to Wowza Streaming Engine.

Configuring and testing Push Publishing AddOn:

Access to the Modules tab is limited to administrators with advanced permissions:
In Wowza Streaming Engine Manager, click the desired live application in the contents pane.
On the live application page, click the Modules tab.
On the Modules tab, do the following:
Publishing the live stream:
1. Click Edit.
2. Click Add Module, and then add the following entry:
```
Name: ModulePushPublish
Description: ModulePushPublish
Fully Qualified Class Name: com.wowza.wms.pushpublish.module.ModulePushPublish
```
3. Click Add, and then Save, and then click the application Restart when prompted to do so.
  The ModulePushPublish module listens for incoming live streams to be published to the server (source streams). Push Publishing AddOn requires the following information to be formatted in a specific syntax in the [install-dir]/conf/PushPublishMap.txt file. At this time, editing the PushPublishMap.txt file isn't supported by the Streaming Engine Manager UI. Using the above example information as a reference, the following unique stream elements are required when configuring the Push Publishing map file:
```
Profile: rtmp
Host(1): rtmp://a.rtmp.youtube.com
Host(2): rtmp://b.rtmp.youtube.com
Application(1): live2
Application(2): live2?backup=1
StreamName: ilya.aleshkov.hpw0-zadr-d849-4pbj
```
Using a text editor, edit the [install-dir]/conf/PushPublishMap.txt file and create a publish stream for both the primary and backup servers. Refer back to the stream name and server URL information that's displayed on the YouTube Video Manager Ingestion Settings page. Using the above example information as a reference, the updated PushPublishMap.txt file would look like this:
```
test={profile:”rtmp”, streamName:”ilya.aleshkov.hpw0-zadr-d849-4pbj”, host:”a.rtmp.youtube.com”, application:”live2”}
test={profile:”rtmp”, streamName:”ilya.aleshkov.hpw0-zadr-d849-4pbj”, host:”b.rtmp.youtube.com”, application:”live2?backup=1”}
```
The PushPublishMap.txt file now supports quotation (") marks around the keys and value strings, in compliance with the JSON specification. Older files without quotation marks will continue to work for a limited time. Update your files as you edit them to retain compatibility with future versions of Push Publishing.
Save the [install-dir]/conf/PushPublishMap.txt file.
Restart Wowza Streaming Engine.

Publishing the live stream:

Launch Camera Streaming sample. Make sure you have proper stream settings:
Check your stream inside live application Test Players:
Go to the YouTube Live Control Room page for your event and click the Preview button to enable the YouTube CDN (Content Delivery Network) to process the incoming stream. When the Stream Status is GOOD, scroll down the page to find the Preview test video player and click Play. If you see your live video start to play, your Push Publishing workflow is correct and you're ready to stream live.
When you're ready to release your live stream for public viewing, go to the YouTube Live Control Room page for your event and click the Start Streaming button. This will release the stream that's pushed from Wowza Media Server for public viewing. When you see the live stream in the Public View test video player in the Live Control Room page, you're successfully push publishing live to YouTube.

Processamento de mídia

URL:

Intel® INDE Media Pack for Android* Tutorials - Video Capturing for Unity3d* Applications

Intel® INDE Media Pack for Android* Tutorials - Advanced Video Capturing for Unity3d* Applications

Intel® INDE Media Pack for Android* Tutorials - Video Capturing for Qt* Quick Applications

Zona do tema:

Android

Android*

↧

Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement

June 4, 2014, 1:56 am

Latest and popular articles on Intel Technologies

≫ Next: Intel Software Conference 2014

≪ Previous: Intel® INDE Media Pack for Android* Tutorials - Video Streaming from device to YouTube*

Introduction

Random sampling is often used when pre- or post-processing of all records of the entire data set is expensive, as in the following examples. When the file of records or database is too large, retrieval cost for one record is too high. In further physical examination of the real-world entity described by a record, fiscal audit of financial records, or medical examinations of sampled patients for epidemiological studies, post-processing of one data record is too time-consuming. Random sampling is typically used to support statistical analysis of an entire data set and some aggregate statistic estimation (such as average), to estimate parameters of interest, or to perform hypothesis testing. Typical applications of random sampling are financial audit, fissile materials audit, epidemiology, exploratory data analysis and graphics, statistical quality control, polling and marketing research, official surveys and censuses, statistical database security and privacy, etc.

Problem statement

Definitions:

The population to be sampled is assumed to be a set of records (tuples) of a known size N.
A fixed-size random sample is a random sample for which the sample size is a specified constant M.
A simple random sample without replacement (SRSWOR) is a subset of the elements of a population where each element is equally likely to be included in the sample and no duplicates are allowed.

We need to generate multiple fixed size simple random samples without replacement. Each sample is unbiased, i.e., item (record) in each sample was chosen from the whole population with equal probability 1/N, independently of others. All samples are independent.

Note: We consider a special case of problems where all records are numbered using natural numbers from 1 to N, so we do not need access to population items themselves (or we have array of indexes of population items).

In other words, we need to conduct a series of experiments, each generating a sequence of M unique random natural numbers from 1 to N (1≤M≤N).

The attached program uses M=6 and N=49, conducts 119 696 640 experiments, generates a large number of result samples (sequences of length M) in the single array RESULTS_ARRAY, and uses all available parallel threads. In the program, we call each experiment a “lottery M of N”.

Considered approaches to simulate one experiment

Algorithm 1

A straightforward algorithm to simulate one experiment is as follows:

           A1.1: let RESULTS_ARRAY be empty

            A1.2: for i from 1 to M do:

                A1.3: generate random natural number X from {1,...,N}

                A1.4: if X is already present in RESULTS_ARRAY (loop), then go to A1.3

                A1.5: put X at the end of RESULTS_ARRAY

            End.

In more detail, step A1.4 is the “for” loop of length i-1:

            A1.4.1: for k from 1 to i-1:

            A1.4.2: if RESULTS_ARRAY[i]==X, then go to A1.3

Algorithm 2

This algorithm uses the partial “Fisher-Yates shuffle” algorithm. Each experiment is treated as a partial length-M random shuffle of the whole population of N elements. It needs M random numbers. The algorithm is as follows:

            A2.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

            A2.2: for i from 1 to M do:

                A2.3: generate random integer X uniform on {i,...,N}

                A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X]

            A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i]=PERMUT_BUF[i]

            End.

Explanation: each iteration of the loop A2.2 works as a real lottery step. Namely, in each step, we extract random item X from remaining items in the bin PERMUT_BUF[i], ..., PERMUT_BUF[N] and put it at the end of the results row PERMUT_BUF[1],...,PERMUT_BUF[i]. The algorithm is partial because we do not generate full permutation of length N, but only a part of length M.

At the cost of more memory and extra Initialization and Copy steps (loops), Algorithm 2 needs fewer random numbers than Algorithm 1, and does not have the second nested loop A1.4 with “if” branching. Therefore, we chose to use Algorithm 2.

In the case of simulating many experiments, Initialization step is needed only once because at the beginning of each experiment, the order of natural numbers 1...N in the PERMUT_BUF array does not matter (like in real lottery).

Note that in our C program (attached), zero-based arrays are used.

Optimization

We use Intel® C++ Compiler, with its OpenMP* implementation, and Intel® MKL shipped with Intel® Composer XE 2013 SP1.

Parallelization

We exploit all CPUs with all available processor cores by using OpenMP* (see “#pragma parallel for” in the code, and see [4] for more details about OpenMP usage).

We use Intel® MKL MT2203 BRNG since it easily supports a parallel independent stream in each thread (see [3] for details).

     #pragma omp parallel for num_threads(THREADS_NUM)

     for( thr=0; thr<THREADS_NUM; thr++ ) { // thr is thread index

         VSLStreamStatePtr stream;

         // RNG initialization

        vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );

         ... // Generation of experiment samples (in thread number thr)

         vslDeleteStream( &stream );

     }

Generation of experiment samples

In each thread, we generate EXPERIM_NUM/THREADS_NUM experiment results. For each experiment we call Fisher_Yates_shuffle function that implements steps A2.2, A2.3, and A2.4 of the core algorithm to generate the next results sample. After that we copy the generated sample to RESULTS_ARRAY (step A2.5) as shown below:

     //  A3.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

     for(i=0; i<N; i++) PERMUT_BUF[i]=i+1; // we will use the set {1,...,N}

     for(sample_num=0;sample_num<EXPERIM_NUM/THREADS_NUM;sample_num++) {

         Fisher_Yates_shuffle(...);
     

         for(i=0; i<M; i++)

             RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];

     }

Fisher_Yates_shuffle function

The function implements steps A2.2, A2.3, and A2.4 of the core algorithm (chooses a random item from the remaining part of PERMUT_BUF and places this item at the end of the output row, namely, to PERMUT_BUF[i]):

            for(i=0; i<M; i++) {

                j = Next_Uniform_Int(...);            

                tmp = PERMUT_BUF[i];

                PERMUT_BUF[i] = PERMUT_BUF[j];

                PERMUT_BUF[j] = tmp;

            }

Next_Uniform_Int function

In step A2.3 of the core algorithm, our program calls the Next_Uniform_Int function to generate the next random integer X, uniform on {i,...,N-1}.

To exploit the full power of vectorized RNGs from Intel MKL, but to hide vectorization overheads, the generator must be called to generate a sufficiently large vector D_UNIFORM01_BUF of size RNGBUFSIZE that fits the L1 cache. Each thread uses its own buffer D_UNIFORM01_BUF and the index D_UNIFORM01_IDX pointing to after the random number from that buffer used last. In the first call to Next_Uniform_Int function (or in the case all random numbers from the buffer have been used), we regenerate the full buffer of random numbers again by calling to vdRngUniform function with the length RNGBUFSIZE and set the index D_UNIFORM01_IDX to zero (in fact, the index was already set to zero a while before):

 vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );

Because Intel MKL provides only generators of random values with same distribution, but in step A2.3 we need random integers on different intervals, we fill our buffer with double-precision random numbers uniformly distributed on [0;1) and then, in the “Integer scaling step”, we convert these double-precision values to the needed integer intervals. Fortunately, we know that our algorithm in step A2.3 will need this sequence of numbers, distributed as follows:

            number 0   distributed on {0,...,N-1}   = 0   + {0,...,N-1}
            number 1   distributed on {1,...,N-1}   = 1   + {0,...,N-2}

            ...

            number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}

            (then repeat previous M steps)

            number M     distributed on: see (0)
            number M+1   distributed on: see (1)

            ...

            number 2*M-1 distributed on: see (M-1)
            (then again repeat previous M steps)
            ...

            etc.

Hence, “Integer scaling step” looks like this:

            // Integer scaling step

            for(i=0;i<RNGBUFSIZE/M;i++)

                for(k=0;k<M;k++)

                    I_RNG_BUF[i*M+k] =

                        k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-1-k));

Notes:

RNGBUFSIZE must be a multiple of M;
This double-nested loop is not suitable for good vectorization because M=6 is not a multiple of 8 (8 is the number of integers in the Intel® Advanced Vector Extensions (Intel® AVX) vector register);
Even if we interchange loops “for i” and “for k” and choose RNGBUFSIZE/M to be multiple of 8, this double-nested loop is not suitable for good vectorization, because we will store the results not contiguously in memory;
We put scaled integers I_RNG_BUF[i*M+k] into the same buffer where we put double-precision random values D_UNIFORM01_BUF[i*M+k]. Although depending on the CPU type, it may be preferable to have a separate buffer for integers, so that both buffers together fit L1 cache. Separate buffers allow to avoid store-after-load forwarding penalty stalls that might occur because the size of loaded double-precision values is not equal to the size of stored integers.

Conclusions

The attached, Intel C++ Composer XE based, implementation of the algorithm presented in this article for the case of 119 696 640 experiments of “lottery 6 of 49” runs ~24*13 times faster than the sequential algorithm based on the sequential scalar version using GNU* Scientific Library (GSL)+GNU Compiler Collection (GCC).

Measured work time is:

0.216 sec (algorithm presented in this article);
69.321 sec (sequential scalar algorithm, based on GSL+GCC, i.e., using gsl_ran_choose function, sequential RNG gsl_rng_mt19937 from GSL, gcc 4.4.6 20110731 with options -O2 -mavx -I$GSL_ROOT/include -L$GSL_ROOT/lib -lgsl -lgslcblas).

The measurements were done on the following platform:

CPU: 2 x 3d-Generation Intel® Core™ i7 processor 2.5GHz, 2*12 cores, 30MB L3 cache size, hyper-threading off;
OS: Red Hat Enterprise Linux* Server release 6.2, x86_64;
Software: Intel® C++ Composer XE 2013 SP1 (with Intel C++ Compiler 13.1.1 and Intel MKL 11.0.3).

Program code attached (see lottery6of49.c file).

References

[1] D. Knuth. The Art of Computer Programming. Volume 2. Section 3.4.2 Random Sampling and Shuffling. Algorithm S, Algorithm P;

[2] Intel® Math Kernel Library Reference Manual, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Statistical Functions”, subsection “Random Number Generators”;

[3] Intel® MKL Vector Statistical Library Notes, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Independent Streams. Block-Splitting and Leapfrogging” about usage of several independent streams of VSL_BRNG_MT2203;

[4] User and Reference Guide for the Intel® C++ Compiler, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Key Features”, subsection “OpenMP support”;

[5] GNU Scientific Library (GSL), available at http://www.gnu.org/software/gsl, documentation section “18 Random Number Generation” about gsl_rng_alloc() and gsl_rng_mt19937 and subsection “20.38 Shuffling and Sampling” about gsl_ran_choose() function.

Simple Random Sampling without Replacement

Unique Random Numbers Generation

Compilador C++ Intel®

Intel® Composer XE

Biblioteca kernel de matemática Intel®

OpenMP*

Serviços financeiros

Desenvolvimento de multithread

Tópicos de compilador

Aumento de desempenho

Bibliotecas

↧

Intel Software Conference 2014

June 4, 2014, 7:16 am

Latest and popular articles on Intel Technologies

≫ Next: Java vs C vs IPP vs TBB: test di performance su dispositivi Intel

≪ Previous: Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement

A Intel Software Brasil realizou neste mês o Intel Software Conference 2014, que aconteceu na Universidade Estácio de Sá (Rio de Janeiro) nos dias 26 e 27 de Maio e no IMAM (São Paulo) nos dias 28,29 e 30 de Maio.

Foram realizadas palestras e "round tables" ministradas por profissionais da Intel do Brasil, Estados Unidos e Alemanha cobrindo dois temas: Computação Paralela e de Alto Desempenho nos primeiros quatro dias (RJ e SP) e Desenvolvimento Android no último dia (somente em São Paulo).

Confira logo abaixo os slides das apresentações realizadas.

Getting the maximum performance in distributed clusters Intel Cluster Studio XE

http://www.slideshare.net/IntelSoftwareBR/getting-the-maximum-performance-in-distributed-clusters-intel-cluster-studio-xe

Intel tools to optimize HPC systems

http://www.slideshare.net/IntelSoftwareBR/intel-tools-to-optimize-hpc-systems

Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE

http://www.slideshare.net/IntelSoftwareBR/v-tune-istep2014

Principais conceitos técnicas e modelos de programação paralela

http://www.slideshare.net/IntelSoftwareBR/principais-conceitos-tcnicas-e-modelos-de-programao-paralela

Principais conceitos e técnicas em vetorização

http://www.slideshare.net/IntelSoftwareBR/principais-conceitos-e-tcnicas-em-vetorizao

Notes on NUMA architecture

http://www.slideshare.net/IntelSoftwareBR/numa-i-step2014

Intel Technologies for High Performance Computing

http://www.slideshare.net/IntelSoftwareBR/hpc-update-istep2014

Benchmarking para sistemas de alto desempenho

http://www.slideshare.net/IntelSoftwareBR/benchmarking-para-sistemas-de-alto-desempenho

Ferramentas para cluster

Intel® Parallel Studio XE

Intel® Advanced Vector Extensions

Extensões Intel® Streaming SIMD

Interface de transferência de mensagens

Tópicos de compilador

Verificação de erros

Aumento de desempenho

Bibliotecas

Erros de memória

Desenvolvimento de multithread

Erros de thread

Intel Hardware Accelerated Execution Manager (HAXM)

↧

Java vs C vs IPP vs TBB: test di performance su dispositivi Intel

June 6, 2014, 4:10 am

Latest and popular articles on Intel Technologies

≫ Next: NoSQL Software Configuration & Deployment

≪ Previous: Intel Software Conference 2014

Di recente ci siamo trovati nella situazione di voler ottimizzare una nostra applicazione per la realta’ aumentata (http://picshare.jooink.com) su dispositivi mobile. Picshare e' interamente scritto in javascript ed essendo il nostro target quello di ottimizzarlo per dispositivi mobile la strada più naturale ci è sembrata quella di riscrivere nativamente parte degli algoritmi computazionalmente rilevanti e, con l’occasione, mettere a confronto diverse implementazioni ‘native’ al fine di capire quale strategia fosse preferibile.

Sfruttando il fatto che da sempre Intel mette a disposizione numerose librerie ottimizzate per i suoi processori e che adesso sono disponibili dispositivi mobile con processori intel abbiamo voluto capire anche quanto l’utilizzo di librerie ottimizzate per una specifica architettura potesse aumentare le performance.

In questo post mostreremo i risultati ottenuti implementando il medesimo algoritmo con ‘tecniche’ (linguaggi e libererie) diverse: java, C, C utilizzando le Intel Performance Primitives, C utilizzando le IPP e parallelizzando l’esecuzione utilizzando Intel Threading Building Blocks.

Tutti i test sono stati effettuatu su 3 dispositivi con processore Intel:

Samsung Galaxy Tab 3, tablet 10’’, Intel Atom Z2560, Dual-core
Dell Venue 8/3830, tablet 8’’, Intel Atom Z2580, Dual-core
Lenovo k900, mobile 5’’, Intel Atom Z2580, Dual-core

L’algoritmo:

Non potendo certamente riscrivere l’intera applicazione 4 volte abbiamo ritenuto opportuno fare il confronto su un algoritmo semplice da implementare ma computazionalmente rilevante per il nostro caso d’uso.

La maggior parte degli algoritmi utilizzati nelle nostre applicazioni opera su immagini in toni di grigio e la conversione da rgb a grayscale è quindi un candidato ideale per i test; abbiamo quindi deciso di implementare con le diverse ‘tecniche’ il calcolo di una media pesata da un array 3*SIZE a un array SIZE:

gray ← 0.299 * red + 0.587 * green + 0.114 * blue

Allo scopo di minimizzare le complicazioni implementative abbiamo scritto l’algritmo di media in modo che operi su float.

Java version

L’implementazione in java dell’algoritmo di conversione è naturalmente estremamente semplice:

public void compute(float[] in, float[] out) {
    for(int i=0, j=0; i<out.length; i++, j+=3)
       out[i] = 0.299f * in[j] + 0.587f * in[j+1] + 0.114f * in[j+2];
}

Per eseguire ripetutamente la trasformazione (in maniera tutto sommato simile a come facciamo nel codice javascript attualmente utilizzato in produzione), abbiamo inserito le chiamate al metodo ‘compute’ in un AsyncTask che a sua volta viene creato ed eseguito da un Timer ma al fine di evitare che la creazione degli async task ed il timer falsassero i nostri test, abbiamo deciso di effettuare le misurazioni del tempo di esecuzione all’interno dell’async task stesso:

    ...
    long st = System.nanoTime();
    compute(in,out);
    long et = System.nanoTime();
    ...

La pressione sul gargbage collector è mantenuta bassa riutilizzando per tutte le iterazioni i medesimi array di input ed output (2 array bidimensionali di float uno di dimensione 3*1024*1024, interpretati come valori rgb, l’altro 1024*1024, interpretati come valori di una immaginaria scala di grigi).

Come si vede nella figura seguente, sui dispositivi che avevamo a disposizione la performance di questa prima implementazione è assolutamente soddisfacente ed in tutti i casi superiore ai 100Hz.

Native C

Fatta l’implementazione in java ci siamo concentrati su quella ‘nativa’ (in C).

L’implementazione nativa richiede l’istallazione del Native Development Kit (NDK).

Seguendo la documentazione che si trova su Android NDK for Intel Architecture, si capisce rapidamente che scrivere una applicazione android/ndk non è in linea di principio complicato (a patto di conoscere il C o il C++ naturalmente) ma, come sempre, il setup del progetto è la cosa che richiede più tempo.

Il setup del primo progetto con ndk è documentato completamente su Creating and Porting NDK based Android Apps for ia, ma se non volete perderci tempo in questo momento ne abbiamo predisposto uno pronto per l’uso su github github.com/jooink/ndk-cpuid.

Clonato il repo per compilarlo ed usarlo non avete da fare altro che

entrare nella directory CPUIDApp/jni folder ed eseguire "ndk-build";
tornare nella root directory (CPUIDApp) ed eseguire "ant debug";
installare l’applicazione e provare su un dispositivo "adb install -r bin/CPUIdApp-debug.apk".

Per cimentarsi nello sviluppo di applicazioni ndk è indispensabile, oltre a comprendere bene la struttura del progetto e l’uso di ‘ndk-build’, una certa pratica con jni (Java Native Interface) e javah (C Header and Stub File Generator) in quanto ci troviamo nella situazione di voler chiamare codice C da java e quindi jni è sostanzialmente l’unica strada.

Il codice C, fatta eccezione per le ‘stranezze’ dovute appunto a JNI, è sostanzialmente identico a quello in java:

#include <stdlib.h>
#include <math.h>

#include "com..ToGrayscaleTaskNDK.h"

JNIEXPORT void JNICALL
Java_com...ToGrayscaleTaskNDK_grayscale(
JNIEnv *env, jclass c, jfloatArray in,  jfloatArray out) {

  jsize len_in = (*env)->GetArrayLength(env, in);
  jsize len_out = (*env)->GetArrayLength(env, out);

  jfloat *body_in = (*env)->GetFloatArrayElements(env, in, 0);
  jfloat *body_out = (*env)->GetFloatArrayElements(env, out, 0);

  int i,j;
  for(i=0, j=0; i< len_out; i++, j+=3)
    body_out[i] =
         0.299f * body_in[j] +
                 0.587f * body_in[j+1] +
                           0.114f * body_in[j+2];


  (*env)->ReleaseFloatArrayElements(env, in, body_in, 0);
  (*env)->ReleaseFloatArrayElements(env, out, body_out, 0);

  return;
}

Il codice, come spiegato ad esempio su Creating and Porting NDK based Android Apps for ia, deve essere inserito in un file collocato nella directory jni del progetto e compilato con ndk-build dopo aver predisposto il file Andoid.ndk e quello Application.mk dei quali potete trovare una copia sempre facendo riferimento al progetto d’esempio su github.

Il passaggio da Java a C regala all’algoritmo una crescita della performance dell’ordine del 35% su tutti i dispositivi in esame.

Native C/IPP

Per guadagnare ancora performance non abbiamo altra strada a questo punto che cercare sfruttare meglio l’hardware dei dispositivi abbandonando la strada della portabilità e utilizzando librerie ‘specifiche’ per i processori utilizzati.

Fino ad ora il codice che abbiamo scritto era interamente portabile ed era quindi possibile eseguire l’applicazione su qualsiasi piattaforma. In quello che aggiungeremo a partire da questa sezione utilizzeremo invece librerie specifiche per intel ed anche solo per testare l’applicazione durante lo sviluppo è assai più comodo utilizzare anche un emulatore basato su processore x86.
Dunque, se sviluppate su una macchina con processore intel, consigliamo di scaricare ed installare l’emulatore x86 e HAXM, l’acceleratore che rende estremamente più rapido l’emulatore e quindi più piacevole la vita degli sviluppatori (per dettagli si veda ad esempio HAXM speeds up the Android emulator).
Naturalmente benché l’emulatore x86 con HAXM sia uno strumento molto comodo per lo sviluppo, non ci possiamo aspettare di poter fare analisi di performance usando l’emulatore e quindi un dispositivo ‘vero’ è indispensabile.

Preparato l’ambiente di sviluppo per programmare x86 e fatto il setup per l’ndk siamo a pronti per utilizzare le librerie Intel ottimizzate per Android.

Le Intel PP sono una collezione di librerie per “multimedia processing, data processing, communications applications” per Windows, Linux, Android, and OS X, tra le quali troviamo anche una versione ottimizzata dell’algoritmo di trasformazione da RGB a grayscale che ci permette di capire quanto il nostro codice ‘vanilla’ possa trarre vantaggio da una implementazione dedicata ad una precisa architettura.

La preview delle IPP per Android è inclusa in Beacon Mountain e su Building Android NDK Applications with Intel IPP viene descritto in modo dettagliato come fare il setup di un progetto che le utilizzi.

Il problema a questo punto pero’ è che che le IPP in Beacon Mountain sono davvero solo una ‘preview’: uno scheletro delle librerie in cui sono implementati solo un paio di metodi. Servono per provare che funzionano, non per usarle e di sicuro nemmeno per fare i nostri test.

Per avere una versione utilizzabile delle IPP occorre scaricare ed installare la versione trial per Linux (che è utilizzabile per 30 giorni, dopo bisogna acquistare quella commerciale); se come noi sviluppate su OSX la cosa davvero più facile è farsi una macchina virtuale Linux e usare quella per procedere all’istallazione.
Una volta installate le librerie sulla macchina Linux è sufficiente copiare gli include file da /opt/intel/ipp/include le librerie a 32 bit da /opt/intel/ipp/lib/ia32.
IPP è divisa in diversi moduli ma per le funzioni di “image conversion” basta copiare libippcore.a libippcc.a, mentre gli header tanto vale copiarli tutti.
Noi abbiamo messo sia le librerie che gli include direttamente dentro la directory jni.

Fatte le copie è il momento di preparare l’opportuno Android.mk che faccia riferimento alle librerie statiche e provveda al linking corretto:

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)
LOCAL_MODULE := grayscale
LOCAL_STATIC_LIBRARIES := libippcc ippcore
LOCAL_C_INCLUDES := .
LOCAL_LDLIBS := -llog  -lc -landroid -lm -ljnigraphics
LOCAL_SRC_FILES := com_jooink_experiments_android_preformance_java_ToGrayscaleTaskNDK.c com_jooink_experiments_android_preformance_java_ToGrayscaleTaskIPP.c
include $(BUILD_SHARED_LIBRARY)

include $(CLEAR_VARS)
LOCAL_MODULE    := ippcore
LOCAL_SRC_FILES := libippcore.a
include $(PREBUILT_STATIC_LIBRARY)

include $(CLEAR_VARS)
LOCAL_MODULE    := ippcc
LOCAL_SRC_FILES := libippcc.a
include $(PREBUILT_STATIC_LIBRARY)

Si osserva che le 2 librerie sono citate sia nei 2 blocchi PREBUILT_STATIC_LIBRARY che nella variabile LOCAL_STATIC_LIBRARIES.

Il codice per la conversione in grayscale utilizzando IPP è quasi più semplice di quello scritto direttamente in C:

#include <stdlib.h>
#include <math.h>
#include <jni.h>
#include "ipp.h"

#include "com_jooink_experiments_android_preformance_java_ToGrayscaleTaskIPP.h"

JNIEXPORT void JNICALL Java_com_jooink_experiments_android_preformance_java_ToGrayscaleTaskIPP_grayscaleIPP(JNIEnv *env, jclass c, jfloatArray in,  jfloatArray out) {


  jsize len_in = (*env)->GetArrayLength(env, in);
  jsize len_out = (*env)->GetArrayLength(env, out);

  jfloat *body_in = (*env)->GetFloatArrayElements(env, in, 0);
  jfloat *body_out = (*env)->GetFloatArrayElements(env, out, 0);

  IppiSize srcRoi = { 1024, 1024 };
  Ipp32f* pSrc = (jfloat*)body_in;
  Ipp32f* pDst = (jfloat*)body_out;
  ippiRGBToGray_32f_C3C1R(pSrc ,1024*4*3, pDst, 1024*4, srcRoi);

  (*env)->ReleaseFloatArrayElements(env, in, body_in, 0);
  (*env)->ReleaseFloatArrayElements(env, out, body_out, 0);

  return;
}

che è sostanzialmente identica alla versione in C tranne per le righe in cui si effettua la chiamata a ippiRGBToGray_32f_C3C1R che opera la conversione.

Dopo aver messo, come al solito, il codice nella directory jni ed averlo compilato (ndk-build) possiamo generare l’apk (ad esempio attraverso eclipse).
Installiamolo sull’emulatore (adb install …) o su un dispositivo con processore Intel ed eseguiamolo.

Sull’emulatore questa volta nei nostri test si guadagna oltre un ordine di grandezza ma la cosa è meno interessante del fatto che consistentemente su tutti i dispositivi abbiamo un incremento del 20-25% ! visto che l’algoritmo è banale ci viene tutto sommato da domandarsi come sia fatta l’implementazione e quanto sia possibile guadagnare utilizzando gli algoritmi più sofisticati che si trovano in IPP.

Native C/IPP/TBB

Resta da capire a questo punto se con il solo utilizzo delle librerie IPP la nostra ricerca di performance debba considerarsi conclusa. Tra tutte le questioni quella che ci premeva di più capire era se effettivamente stessimo utilizzando tutti i core dei processori.

Per la verità leggendo le Release Notes si scopre, con nostro dolore, che le Performance Primitives da qualche versione hanno deprecato la gestione della parallelizzazione degli algoritmi su processori multicore e quindi si ‘limitano’ a fornire algoritmi ottimizzati per ogni singolo core lasciando agli sviluppatori il compito di trovare la giusta strategia di parallelizzazione.

Il perché di questa scelta risulta più chiaro se si osserva che nel vasto parco di librerie che Intel mette a disposizione ce ne è una dal nome Threading Building Blocks (Intel TBB) che specificatamente mira a fornire le primitive per la parallelizzazione multithread degli algoritmi.

La compilazione ed installazione di TBB esula dallo scopo di questo post ma è interessante osservare che TBB e IPP sembrano fatte apposta per essere usate insieme ed infatti è banale estendere l’algoritmo per diventare parallelo:

JNIEXPORT void JNICALL Java_com...grayscaleIPPTBB
(JNIEnv *env, jclass c, jfloatArray in,  jfloatArray out) {

  jsize len_in = env->GetArrayLength(in);
  jsize len_out = env->GetArrayLength(out);

  jfloat *body_in = env->GetFloatArrayElements(in, 0);
  jfloat *body_out = env->GetFloatArrayElements(out, 0);

  Ipp32f* pSrc = (jfloat*)body_in;
  Ipp32f* pDst = (jfloat*)body_out;

  tbb::parallel_invoke(
          [pSrc,pDst] {
                  IppiSize srcRoi = { 1024, 512 };
                  ippiRGBToGray_32f_C3C1R(pSrc ,1024*4*3, pDst, 1024*4, srcRoi);
              },
              [pSrc,pDst] {
                  IppiSize srcRoi = { 1024, 512 };
                  Ipp32f* pSrcShifted = pSrc+3*(1024*512);
                  Ipp32f* pDstShifted = pDst+(1024*512);
                  ippiRGBToGray_32f_C3C1R(pSrcShifted ,1024*4*3,
pDstShifted, 1024*4, srcRoi);
              });


  env->ReleaseFloatArrayElements(in, body_in, 0);
  env->ReleaseFloatArrayElements(out, body_out, 0);


  return;
}

dove si vede quanto risulti apprezzabile il fatto che tutti gli algoritmi di IPP che lavorano su array bidimensionali usino il concetto di Regions of Interest permettendoci di splittare in 2 parti (eseguite in parallelo) il calcolo della conversione in grayscale.

Abbiamo provato l’algoritmo parallelo basato su parallel_invoke (probabilmente la primitiva più semplice di TBB) sia usando 2 thread che usandone 4 ed i risultati sono quelli attesi: quasi un fattore 2 tra single thread e dual thread (tutti i dispositivi hanno processori a due core) ed un altro, sorprendente, 25-30% di guadagno passando alla versione a 4 thread.

Vale decisamente la pena di indagare meglio le TBB, partendo magari dai loro sorgenti che sono disponibili a http://www.threadingbuildingblocks.org.

Intel® Integrated Native Developer Experience (INDE)

↧

NoSQL Software Configuration & Deployment

June 6, 2014, 9:36 am

Latest and popular articles on Intel Technologies

≫ Next: Game Optimization for Ultrabook™ Devices

≪ Previous: Java vs C vs IPP vs TBB: test di performance su dispositivi Intel

The driving force behind the development of NoSQL databases is the need to rapidly store and manage ever larger, dynamically changing Big Data information sets.

Configuring Cassandra NoSQL column store

Cassandra workloads can rapidly become CPU-bound, and as such, we recommend systems with as many CPU cores as possible.
On RHEL and CentOS, system limits should be changed from 1024 to 10240 in /etc/security/limits.d/90-nproc.conf like so: soft nproc 10240
Tune the JVM Heap size on each node of the cluster, depending on the amount of memory on that specific node. Too big a Heap size can impair Cassandra’s efficiency.
Cassandra through the JVM only makes use of 8GB RAM. For best read performance, allow for enough nodes to host the dataset in RAM, at about 8GB per server.
Fast I/O storage, such as RAID or SSDs, becomes crucial for out of memory read-dominant workloads. Reads tap secondary storage far more than writes.

Configuring MongoDB NoSQL document store

MongoDB uses a scale-out architecture based on “sharding”. Each instance stores its data on disk, while also maintaining a memory-mapped cache in RAM.
MongoDB can serve “web-scale” request rates, at millions of end-user requests per second, while providing sub-second responses—requires that servers have sufficient RAM to hold the “working set” from which most requests can be served.
The storage must then perform fast enough to accommodate the rate at which the working set changes, occasional requests beyond the working set, and rates of write and update traffic.
On NUMA hardware, run MongoDB with an interleave policy. Start the sharding servers like so: numactl--interleave=all ./mongod … Failure to disable NUMA will result in sporadic slowdowns.

Download our informative poster, below:

↧

Game Optimization for Ultrabook™ Devices

June 6, 2014, 2:43 pm

Latest and popular articles on Intel Technologies

≫ Next: WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

≪ Previous: NoSQL Software Configuration & Deployment

Download as PDF

By Lee Bamber

1. Introduction

Recently, I had the task of preparing a game engine I’ve been working on for the Games Developer Conference. Given the importance of the event, I needed my game to run fast on the three devices I had in hand, which ranged from the current Ultrabook™ technology to a system two generations old.

In this paper you will learn how to improve the speed of your 3D game and understand what to look out for when porting your application to Ultrabook systems. Whether you are an experienced game developer or a hobby coder getting into the industry, you will no doubt appreciate the importance of performance. A game that runs at a super smooth frame rate will feel polished and professional compared to a game staggering along at a measly five frames per second (FPS). No amount of gorgeous graphics will disguise the fact your game lurches along, tears the screen as it continually misses the monitor’s vertical sync step, and sends your game physics into pure pandemonium. With this case study of an actual game project port, I hope you will gain insight into the real-world problems you may encounter and possible solutions.

Figure 1:You may gain sales with your screen shot, but you’ll pay the price online if your FPS is low!

This article highlights a few of the common causes of performance loss and specifically helps game developers move a typical high-end AAA 3D game title to the Ultrabook device with the performance demanded by modern audiences. Such titles often require a high-end discrete graphics card to work well and put extremely high demands on the GPU. Understanding the architectural differences between a dedicated and integrated GPU can help, but the very best method of improving graphics performance is by analyzing the pipeline for bottlenecks and optimizing those areas without adversely affecting visual quality.

You should have a basic understanding of graphics API calls in general, a familiarity with the components that make up a typical 3D game, and some knowledge or use of an Ultrabook.

2. Why Is Performance Important?

As the market for applications and games becomes increasingly crowded, the unique selling points for your product become ever more crucial for commercial success, and performance today is not just desirable but absolutely essential. Many users would not even consider your game as finished until it ran smoothly and consistently on their device, and would not bother to play the game beyond an initial negative experience.

Given the crucial importance of this requirement and the fact that mobile, tablet, and portable computing is rapidly growing, you can appreciate that performance is critical. You might be complacent when adapting your game to the Ultrabook given its exceptional power over these other devices, but users will demand the highest standard and expect a high-end gaming experience.

Figure 2:Ultrabook™ Systems pack a powerful punch in the right hands

From a skill development point of view, everything you can do to optimize and improve your game code now becomes a vital lesson that can be applied to future projects, making you a better game developer.

3. Why Optimize?

Many developers use a desktop PC system to create and test their 3D games, and the presence of a dedicated graphics card can sometimes create a sense of abundance, resulting in algorithms and shaders that push the very limits of what is possible on the GPU. When you run this game on a more limited platform, it may not perform as expected and result in a dramatic reduction in performance. Ultrabooks are amazingly powerful mobile devices, but they do not provide the same level of brute force rendering available on next-gen, high-end GPUs. In addition, Ultrabooks are designed to be used on the go, so your game may very well find itself running on battery power, requiring an efficient rendering pipeline to prevent rapid power loss. Your approach to creating in-game visuals must respect these facts.

Figure 3:The many destinations of a successful app

When developing an application, developers traditionally start at the top, and trim their way down to run on as many devices as is practical in the available time.

Developing on the Ultrabook and porting your game to a desktop powered by a dedicated graphics card would be the easiest route to take, as this virtually eliminates the need to port. However, you may find yourself competing with games that have set the quality bar substantially higher. This approach does have one advantage: you are conscious of battery life from the very beginning, and therefore, you are more likely to develop a 3D game that dials down intensive activity at specific moments in the game such as title screens and HUD pages. Developing on a desktop and optimizing down to the Ultrabook is more common and generally yields a higher level of quality as your original development philosophy aims high and then works out how to deliver it on more form factors.

4. Desktop to Ultrabook – A Case Study in Performance

My story begins many weeks before the big GDC event, running my game on a relatively modern PCI Express* 3.0 graphics card worth about $200 and getting 60 FPS with visual settings set to the highest quality. It was by no means a high-end gaming rig, but it was capable of running any 3D game at the highest settings with no noticeable lag and packed a mean punch with its six cores, 6 GB of system memory, and an array of super-fast SSD drives. I knew there would be no desktop systems waiting for me at the event, and I did not want to lug a huge PC system half way around the world with me. Naturally, the solution was to take my Ultrabook, the next most powerful device I owned and more than capable of putting on a good show.

Figure 4:GDC 2014 – One of the biggest developer conferences…but no pressure

My Ultrabook has a 4th generation Intel® Core™ processor with Intel® HD Graphics 4000™, and is my device of choice when away from the office. My initial test was painful, dropping so many frames that the whole endeavor seemed far too ambitious. The current build of the 3D game engine relied heavily on shaders and multiple targets for rendering, gobbling up CPU cycles like candy and running everything as fast and as loud as it could. As you can imagine, such a beast was a million miles away from the power-conscious and friendly apps you want on a portable device.

Despite the audaciousness of the plan, I also knew that modern Ultrabooks are very capable gaming systems and when used correctly could match the desktop for productivity and hands down beat it for convenience. I also played many games that ran great on Ultrabooks, and the mission was not impossible, so I set to work to get the FPS up to the needed 60—my goal for the GDC event.

As an old-school coder, I learned to program long before the arrival of performance analyzers and graphics debuggers, so my primary method of detecting bottlenecks is to remove huge chunks of the engine until the performance improved. By selectively re-introducing vital chunks of code back in, I could determine which parts of the engine were slowest. Once the bottlenecks are identified, and as it was not an option to simply remove them altogether, the careful process of reducing the intensity of the component could begin. Typical examples are skipping normal map calculations in the shader for pixels beyond a certain range from the player, or skipping A.I update calls every other cycle to reduce the overhead of these processes. Cumulatively, these small improvements start to add up and before long the game engine is running at full speed again with hardly any loss in visual quality.

For coders new to the world of performance tuning, I would heartily recommend you avoid this method of detecting bottlenecks. Numerous tools are available to help you identify performance problems in your application, which not only provide the location of the bottleneck but the nature of the issue. One such set of free tools is the Intel® Graphics Performance Analyzers, which profiles your application as it runs and gives you a snapshot of what your program is doing and how long it’s taking to do it. While demonstrating the game at the event, I found a few issues that I later fixed to improve performance and smoothness of the final result.

Figure 5:Before & After – Screen Shots of the game before and after optimizations

As you can see in figure 5, I went from 20 fps to 62 fps with only minor visual differences in the before and after scenes. The ‘after’ shot shows the removal of the strong dynamic lighting around the player and a less aggressive fragment shader.

Hungry Shaders

It did not take us long to realize that the biggest drain on our performance was in our graphics rendering step.

Figure 6:Performance Metrics Panel from the original low FPS version

As you can see in Figure 6 the horizontal bar marked in the panel as ‘Rendering’ consumed most of our available cycles, and when we drilled down to the fine detail, it was apparent that rendering the objects to the screen was very costly. From here, it was a short step to realize that a scene rendering hundreds of thousands of polygons, each one using a heavy-duty fragment shader, contributes greatly to a loss in performance. Just how much was it costing? By adding MEDIUM and LOWEST techniques to the shader and scaling back the visual eye candy per-pixel, we gained a factor of six in performance improvement.

To settle on what LOWEST and MEDIUM actually do, we first had to determine the lowest common denominator of features for the game. By figuring out which features where absolutely essential for playing the game and then disregarding whatever remained, I could create the new LOWEST technique within the shader. Early on, this technique was amazingly simple, with almost all elements removed including all shadows, normal mapping, dynamic lighting, texture overlays, specular mapping, and so on. By starting at near-zero, it was possible to run the game and see what the ‘best case’ scenario was for this shader running on the Ultrabook. When I compared a screen shot from the HIGHEST setting to one from the LOWEST setting, I saw the most important missing ingredient that would cause users distress when they reduced the setting. The least subtle elements in the shader were shadows and texture overlays, each of which created a dramatic reduction in quality when absent. Adding overlays back in was relatively inexpensive and I could test the cost by simply adding the shader code for this element back in and running the game again. Shadows on the other hand extolled a high price, both in their generation in another part of the engine and their use within the shader itself. Given the importance of this aspect to preserve visual quality, time was spent investigating various approaches until a faster solution was found, which I’ll detail below.

Producing the MEDIUM technique setting for the shader was a little easier and simply involved writing a shader between the highest and lowest settings, yet always preferring to err on the side of performance. The intent with this setting was to allow all the speed benefits of the lowest setting but include the less costly effects such as player flash light, dynamic lighting, and slightly better shadows.

Had I simply removed all visual quality from the lowest setting, I could have achieved almost all the performance improvement required in one go, but gamers dislike poor graphics almost as much as poor performance. By making an effort to preserve 90% of the visual fidelity of the highest setting, and prioritizing which aspects could be reduced or eliminated, I achieved a significant improvement with minimal loss in visual quality. Moving from 5 FPS to over 40 FPS was my single biggest improvement.

When investigating why your desktop game is running so slow on an Ultrabook, I highly recommend you dismantle your graphics rendering pipeline and ask some serious questions about where the time is being spent. You can try my method of butchery and remove whole slabs of functionality until your pipeline improves, or you can opt for a more sophisticated approach and use a performance analyzer tool. Whatever method you choose, once the issue has been located your next most critical task is to arrive at a solution that not only improves the speed of that element but does so without sacrificing visual quality.

To provide some inspiration for the work required to find these optimal solutions, here are a few of the techniques I devised to solve some of the bottlenecks I discovered.

Cheaper Shadows

To solve the shadow issue mentioned above, I had to look for alternatives to a technique called Cascade Shadow Mapping. The technique will not be discussed here in detail, but you can find more information here: http://msdn.microsoft.com/en-gb/library/windows/desktop/ee416307(v=vs.85).aspx. The basic premise is that four render targets are drawn with the shadows of all objects immediately within view of the player camera, each one at a different level of detail.

Figure 7:Cascade Shadow Mapping – a debug view from the game engine

A shader is then instructed to re-color a pixel on screen based on whether it falls within the shadows previously calculated. The problem is that this is an intense shader effect and requires a lot of video memory. You will notice in the ‘fragment shader’ code below, the IF branch statement is being used several times, and some GPU hardware will incur a penalty in performance for each IF branch used. In extreme cases, some systems will compute every permutation of pixel output meaning there is no benefit to branching over code.

fPercentLit = 0.0f;
if ( iCurrentCascadeIndex==0 )
{
fPercentLit += vShadowTexCoord.z > tex2D(DepthMap1,float2(vShadowTexCoord.x,vShadowTexCoord.y)).x ? 1.0f : 0.0f;
}
else
{
	if ( iCurrentCascadeIndex==1 )
	{
		fPercentLit += vShadowTexCoord.z > tex2D(DepthMap2,float2(vShadowTexCoord.x,vShadowTexCoord.y)).x ? 1.0f : 0.0f;
	}
	else
	{
		if ( iCurrentCascadeIndex==2 )
		{
			fPercentLit += vShadowTexCoord.z > tex2D(DepthMap3,float2(vShadowTexCoord.x,vShadowTexCoord.y)).x ? 1.0f : 0.0f;
		}
		else
		{
			if ( iCurrentCascadeIndex==3 && vShadowTexCoord.z<1.0 )
			{
				fPercentLit += vShadowTexCoord.z > tex2D(DepthMap4,float2(vShadowTexCoord.x,vShadowTexCoord.y)).x ? 1.0f : 0.0f;
			}
		}
	}
}

It’s important that the video memory requirement and the dependence on the IF branch statements be reduced. The solution (of which there are many) is to create a single large shadow mega-texture and deposit the results of the lowest level of detail shadow into this target.

A new cheaper shader technique was written to simply read from this shadow mega-texture without needing a single IF statement. Again the specifics of this technique go beyond the scope of this article, but the underlying practise of first identifying the cause of a performance drop and then creating a second technique to produce a similar visual look without the cost is a sound strategy.

Maintaining Visual Fidelity

One thing to keep in mind as you optimize your engine is to protect the visual quality of your game at every stage of development. It’s easy to simply hack away beautiful yet expensive effects for the sake of performance, but it’s more rewarding to treat each issue as an opportunity to gain better performance while retaining the visual quality your game needs. Not only will you achieve the results you are after, but your game will run even better on higher-end systems, which of course means you can add even more features as your game scales up.

Figure 8:Comparison of a game scene when you reduce the visual quality too much

When you are developing on a desktop, you will be tempted to use clever and sophisticated fragment shaders to create all manner of surface effects and simply removing them for a low-end technique would destroy the appearance of the final image to a point where it no longer resembles the original. Maintaining a consistent visual style across all shader techniques is vital if you want to retain the integrity of your game. New users, impressed with a stunning screen shot in an online magazine, will be mighty disappointed when they run your game and see something significantly different.

Where possible, look for techniques that reproduce the high-end shader effect using low-tech techniques such as pre-baked textures, or even better, limit the expensive pixel effects to an area close to the player.

Spend the Most on Those Closest To You

Sounds like good family advice, but it’s a good strategy when making shaders look great on Ultrabooks. With a single IF branch statement, you can determine if the pixel being calculated is close to the player or not. If so, you can use the expensive high-end shader pixel effect as before, and beyond that range you can revert to a cheaper baked or faked effect.

Figure 9:The blending effect in action, notice the normal map effects up close

A good technique to use in concert with the above is blending, and for the price of an extra IF branch, you can also check if the pixel is between two range points. At the closest two ranges, you use the expensive effect, and beyond the closest range point, you calculate the cheap effect. Between the first and second closest range points, you calculate a blended transition between the two results. It is important to note here that the range between these two points should be relatively narrow to avoid double computation costs. The blending range should only be sufficiently wide to allow the transition to go unnoticed by the player. In the code below, you can see how each pixel is treated based on the distance from the view camera, and between the range of 400 and 600 units, both code branches are computed.

float4 lighting = float4(0,0,0,0);
float4 viewspacePos = mul(IN.WPos, View);
if ( viewspacePos.z < 600.0f )
{
	// work out surface normal per pixel
	lighting = lit(pow(0.5*(dot(Ln,Nb))+0.5,2),dot(Hn,Nb),24);
}
if ( viewspacePos.z > 400.0f )
{
	// cheapest directional lighting
	lighting = lerp ( lighting, cheaplighting, min((viewspacePos.z-400.0f)/200.0f,1.0f) );
}

The result is alarmingly good and creates a soft almost unnoticeable transition when rendered. The upshot for the game is that around 90% of the scene is now using the cheap effect and thus accelerating the speed of the game.

In-Process to Pre-Process

Having spent a good deal of time on the graphics optimization side, we were still running a few FPS short of our target of 60. The balance of visual quality and achievable performance was struck, but other parts of the game engine beyond the shader system were causing processing overhead sufficient to degrade game speed.

The game engine already had an internal performance metrics system that crudely measured each major section of the overall game engine pipeline. In addition to the graphics metric, the engine also measures the time taken for A.I, Physics, Weapons, Debugging, and Occlusion among others. One of the metrics monitored the generation of real-time grass, which allows the engine to provide the game with the illusion of infinite grass. Once we had reduced the cost of graphics processing, we noticed that the relative cost of this process jumped up as the next hungriest element in the game engine pipeline. When you optimize, you should always watch out for these spikes in performance and if you determine that they are using an unreasonable amount of game cycles, then a closer examination is warranted. Knowing what is reasonable often comes down to experience and the intimate understanding of the whole engine, and in this case the grass should not be consuming over 10% of the overall game cycles, not with so many other vital services requiring game cycles. On the desktop PC this spike was not obvious, but on the Ultrabook, it was a substantial performance hit. In addition to the metric spike, it was apparent when playing the game that whenever new grass was generated ahead of the player, the frame rate would stutter as the spike interrupted the normally smooth running of the game.

Figure 10:A field of green – generating grass in real time can be extremely compute intensive

The solution, and another staple of the optimization coder, was to move the entire grass generation system to a pre-process step that happens before the game even starts. Instead of grass being generated on the fly, it was simply moved into place ahead of the player to create a near identical effect. Nothing needs to be generated, just moved, and the Ultrabook breathed a sigh of relief as precious CPU cycles were freed up for the rest of the game engine. I also sighed with relief as the magic 60 FPS was achieved and the game ran at the desired speed.

The Mysterious Case of the Strange Stutter

Having succeeded in achieving ideal gameplay velocity and travelling half way around the world to present the game and engine to the harsh gazes of the GDC attendees, I found that when installing the game on the show devices, a strange stutter effect emerged. The stutter did not exist on the desktop development machines, did not happen on the Ultrabook I used for pre-event testing but was happening on these show devices, and to make things more interesting, they were more powerful than the ones I had tested on.

After much debate and subsequent research back home, the issue was related to something called “internal timer resolution.” In short, all games that run at a machine-independent speed (that is, the player in your game will take the same amount of time to run from A to B, irrespective of the machine you are running the game on) require access to a GetTime() command. There are several to choose from but one of the most popular is the timeGetTime() command that returns the number of milliseconds that has passed since the machine was switched on. It implies that you will get the result in granularities of 1 millisecond, and indeed many desktop systems report the time at this resolution. It so happens that on Ultrabooks and other portable power-saving devices, this granularity is not fixed and can return a resolution in the 10-15 millisecond range. If you are using this timer to control physics, which was the case with our game engine, the result is a seemingly random and jagged stutter as the physics update calls are sporadically jumping from one reported time to another.

The reason the granularity can go from 1 ms to 10-15 ms is that some systems can save on-battery power if they step down the processor, and one of the side effects of this is that the frequency of the ticks can get unpredictable. There are a number of solutions, and the one we chose and recommend was to use the QueryPerformanceTimer() command, which guarantees the granularity of the time value returned by offering a second command that returns the frequency the timer operates under.

5. Tricks and Tips

Do’s

Augment shaders with additional techniques instead of replacing them when optimizing for Ultrabook. Your game still needs to run on desktops as well as Ultrabooks, and the process of distribution is much easier with a single game binary. Both DirectX* and OpenGL* shaders allow you to create techniques within a single shader. With additional techniques in place, your game code can detect the platform you are running on and select the best technique, whether it be for performance or graphical quality.
Offer your users an options screen so they can select the level of performance / quality they desire as this is expected by most games players today. It is always a good idea to detect and pre-select the best settings based on their system specification, but it should always be changeable and the default settings you select should always work on the user’s system.

Don’ts

Do not assume you have to run your game at 60 FPS. You can set the monitor refresh interval on most modern devices to skip one or even three vertical sync signals and gain the same smooth non-tearing screen display at 30 FPS. It’s not going to be as smooth as 60 of course, but if your game timings are adjusted, the game will still feel smooth and very playable.
Do not underestimate how costly fragment shaders are when developing your game, especially if you are running on low-scoring graphics hardware. If you find your game suffering low performance, switch off or downgrade all shader use as a process of elimination.
Do not pre-select a resolution for the user that may not be supported by the display device. Use the Windows* API to interrogate the display device for a compatible default resolution.
Do not assume timeGetTime() returns the time in intervals of 1 ms. When Ultrabook power-saving is enabled, it can be as infrequent as 10-15 ms!

6. A Brief Tour of Ultrabook Gotchas

It might seem an exercise in the obvious, but here is a quick and handy guide to testing, running, and exhibiting your games and 3D applications on an Ultrabook.

Power-Saving

If you are presenting to a large audience and want to show your game in its best light, it is vital you plug in the Ultrabook. Do not run on battery power as the system will protect itself by dialling down all manner of hardware settings that you want to keep on ‘red hot maximum’.

Figure 11:Power Management on the Ultrabook

As an extra precaution, find the Power Management settings through the control panel and double check that when using Plugged-In power, all saving settings are off, and that as many settings as possible are set to HIGH.

Graphics

The Control Panel has another settings panel that gives you access to your specific device’s graphics accelerator settings. You will find settings that control the GPU and driver when in power-savings mode. You must have this setting set to Performance, or the equivalent mode, to ensure your on-board GPU will run as fast as possible.

Figure 12: Graphic Acceleration Settings on the Ultrabook™

It might seem odd that you have to do these things, but the Ultrabook has been designed to conserve power at every turn, allowing you to use the device for hours on end. To achieve maximum performance on the Ultrabook, nothing beats plugging into a wall socket and turning every setting to 11.

Background Tasks

Old hands will nod sagely at this simple but crucial piece of advice, which involves a quick scan for any background tasks that may be running on the Ultrabook when Windows starts up. Originally intended as light-weight and helpful background tasks, when combined they have a propensity to slowly task the CPU with all manner of things.

As vital as some of these are, when you are demonstrating how fast your 3D game can run on an Ultrabook, it is prudent to cancel any tasks that you will not need for that session. Fear not, as they will reappear the next time you boot the Ultrabook, but for the remainder of the Windows session your device will be dedicated to running one application, yours!

7. Conclusions

The subject of game optimization is a broad one, and developers should consider the task of optimization part and parcel of their daily duties. The challenge is to enable your game to run on as wide a range of hardware as possible, and it’s at these times that experience and know-how come to the rescue. Using Intel® tools such as the VTune™ analyzer and the Intel Graphics Performance Analyzers accelerate the process of finding the problem. Articles such as this one may give you a few clues as to likely solutions, but it ultimately comes down to your ability to think laterally. How can you do this another way? Is there a faster way to do this? Is there a smarter way to do this? These are great questions to start the process, and the more you ask them, the better you will be at optimizing your games and applications. As I suggested at the start of this article, you will not only become a better coder, you will have expanded your reach into a market that’s growing at an incredible rate!

About The Author

When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and surrounding community of game makers are responsible for many popular brands including Dark Basic, FPS Creator, and most recently App Game Kit (AGK). Lee also chronicles his daily life as a coder, complete with screen shots and the occasional video here: http://fpscreloaded.blogspot.co.uk

Intel®Developer Zone offers tools and how-to information for cross-platform app development, platform and technology information, code samples, and peer expertise to help developers innovate and succeed. Join our communities for the Internet of Things, Android*, Intel® RealSense™ Technology and Windows* to download tools, access dev kits, share ideas with like-minded developers, and participate in hackathons, contests, roadshows, and local events.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

performance optimization

Laptop

Intel(R) Xeon Phi(TM) Coprocessor

↧

WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

June 11, 2014, 4:20 pm

Latest and popular articles on Intel Technologies

≫ Next: 为使用英特尔® 通用连接框架的 Windows* 8 开发数据传输应用

≪ Previous: Game Optimization for Ultrabook™ Devices

Overview

This document demonstrates the best methods to obtain, build and run the WRF model on multiple nodes in symmetric mode on Intel® Xeon Phi™ Coprocessors and Intel® Xeon processors. This document also describes the WRF software configuration and affinity settings to extract the best performance from multiple node symmetric mode operation when using Intel Xeon Phi Coprocessor and an Intel Xeon processor.

Introduction

The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, etc. Please see http://www.wrf-model.org/index.phpfor more details on this system. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus2.5km benchmark.

WRF is used by many private and public organizations across the world for weather and climate prediction.

WRF has a relatively flat profile on Intel Architecture over many functions for atmospheric dynamics and physics: advection, microphysics, etc.

Technology (Hardware/Software)

System	Xeon E5-2697 v2 @ 2.7GHz
Coprocessor	Intel Xeon Phi coprocessor 7120A @ 1.23GHz
Intel® MPI	4.1.1.036
Intel® Compiler	composer_xe_2013_sp1.1.106
Intel® MPSS	6720-21

We used the above hardware and software configuration for all of our testing.

Note: This Index card assumes that you are running the workload on the aforementioned hardware configuration. If you are using Intel Xeon Phi coprocessor model 7110 cards, please use the following instructions on 8 nodes instead of 4. To run the workload on 4 nodes, you need Intel Xeon Phi coprocessors with 16GB memory; since the 7110 model coprocessors have 8GB memory, you will need to use more than 4 Xeon Phi coprocessor Cards.

Note: Please use netcdf-3.6.3 and pnetcdf-1.3.0 for I/O.

Multi Node Symmetric Intel Xeon + Intel Xeon Phi coprocessor (4 Nodes)

Compile WRF for the Coprocessor

Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
Source the Intel MPI for intel64 and Intel Compiler
1. source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
On bash, export the path for the host netcdf and host pnetcdf. Having netcdf and pnetcdf built for Intel Xeon Phi coprocessor is a prerequisite.
1. export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/netcdf/mic/
2. export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/pnetcdf/mic/
Turn on Large file IO support
1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
Cd into the ../WRFV3/ directory and run ./configure and select the option to build with Xeon Phi (MIC architecture) (option 17). On the next prompt for nesting options, hit return for the default, which is 1.
In the configure.wrf that is created, remove delete -DUSE_NETCDF4_FEATURES and replace –O3 with –O2
Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
Run ./compile wrf >& build.mic
This will build a wrf.exe in the ../WRFV3/main folder.
For a new, clean build run ./clean –a and repeat the process.

Compile WRF for Intel Xeon processor-based host

Download and un-tar the WRF3.5 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
Source the latest Intel MPI for intel64 and latest Intel Compiler (as an example below)
1. source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
1. export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/netcdf/xeon/
2. export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/pnetcdf/xeon/
Turn on Large file IO support
1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
Cd into the WRFV3 directory created in step #1 and run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
In the configure.wrf that is created, remove delete -DUSE_NETCDF4_FEATURES and replace –O3 with –O2
Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
Run ./compile wrf >& build.snb.avx . This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compiles, set the environment variable J to "-j 4" or whatever number of parallel make tasks you wish to use.)
For a new, clean build run ./clean –a and repeat the process.

Run WRF Conus2.5km in Symmetric Mode

Download the CONUS2.5_rundir from http://www2.mmm.ucar.edu/WG2bench/conus_2.5_v3/
Follow the READ-ME.txt to build the wrf input files.
The namelist.input has to be altered. The changes are as follows:
1. In the &time_interval section, edit the values as below:
  1. restart_interval =360,
  2. io_form_history =2,
  3. io_form_restart =2,
  4. io_form_input =2,
  5. io_form_boundary =2,
2. Remove "perturb_input =.true." from the &domains section and replace with "nproc_x =8,"
3. Add "tile_strategy =2," under the &domains section.
4. Add "use_baseparam_fr_nml =.true." under the &dynamics section.
Create a new directory called CONUS2.5_rundir (../WRFV/CONUS_rundir) in the CONUS2.5_rundir, create 2 directories "mic" and "x86". Copy over contents of ../WRFV/run/ into “mic” and “x86” directories.
Copy the Intel Xeon Phi coprocessor binary into the CONUS2.5_rundir/mic directory and copy the Intel Xeon binary into the CONUS2.5_rundir/x64 directory.
Cd into the CONUS2.5_rundir and execute WRF as follows on 4 nodes (i.e 4 coprocessors + 4 Intel Xeon processors) in symmetric mode. To run conus2.5km, you need to have access to 4 nodes (example shown below)

Script to run on Xeon-Phi + Xeon (symmetric mode)

The nodes I am using are: node01 node02 node03 node04

When you request for nodes, make sure you have a large stack size MIC_ULIMIT_STACKSIZE=365536


source /opt/intel/impi/4.1.0.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64

export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto

./run.symmetric

Below is the run.symmetric to run the code in symmetric mode:

run.symmetric script


#!/bin/sh
mpiexec.hydra
 -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node02 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node03 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node04 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node02-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node03-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node04-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh

In ../CONUS2.5_rundir/mic, create a wrf.sh file as below.

Below is the wrf.sh that is needed for the Xeon Phi part of the runscript.

wrf.sh script


export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
/path/to/CONUS2.5_rundir/mic/wrf.exe

You will have 80 rsl.error.* and 80 rsl.out.* files in your CONUS2.5_rundir directory.
Do a 'tail –f rsl.error.0000' and when you see 'wrf: SUCCESS COMPLETE WRF' your run is successful.
After the run, compute the total time taken to simulate with the scripts below. The mean value (which indicates the Average Time Step (ATS)) is of interest for WRF (lower the better).

Parsing scripts

gettiming.sh – is the parsing script


grep 'Timing for main' rsl.out.0000 | sed '1d' | head -719 | awk '{print $9}' | awk -f stats.awk
bash-4.1$ cat stats.awk 
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
i ++
a += $1
if ( $1 > max ) max = $1
if ( $1 < min ) min = $1
}
END{ printf("---n%10s %8dn%10s %15fn%10s %15fn%10s %15fn%10s %15fn%10s %15fn","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }

Validation

To validate if the successful WRF run is correct or not, check the following:

It should generate a wrf_output file.
diffwrf your_output wrfout_reference > diffout_tag
'DIGITS' column should have high value (>3). If yes, the WRF run is considered valid.

Compiler Options

-mmic : build an application that natively runs on Intel® Xeon Phi™ Coprocessor
-openmp : enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
-O3 :enable aggressive optimizations by the compiler.
-opt-streaming-stores always : generate streaming stores
-fimf-precision=low : low precision for higher performance
-fimf-domain-exclusion=15 : gives lowest precision sequences for Single precision and Double precision.
-opt-streaming-cache-evict=0 : turn off all cache line evicts.

Conclusion

This document enables users to compile and run the WRF Conus2.5KM workload on an Intel-based cluster with Intel Xeon processor based systems and Intel Xeon Phi coprocessors and showcases the benefits of using Intel Xeon-Phi coprocessors over the use of a homogeneous Intel Xeon processor based installation in a symmetric mode run with 4 nodes.

About the Author

Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).

Weather Research and Forecasting

Arquitetura Intel® Many Integrated Core

Computação paralela

Intel® Common Connectivity Framework

↧

为使用英特尔® 通用连接框架的 Windows* 8 开发数据传输应用

June 16, 2014, 2:32 am

Latest and popular articles on Intel Technologies

≫ Next: 基于 Windows* 8.1 台式机的 Miracast*

≪ Previous: WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

Download PDF

英特尔® 通用连接框架（英特尔® CCF）是一款面向移动设备上运行的应用的连接软件。使用英特尔 CCF 的应用能够将所有用户连接在一起，无论其在世界的另一端使用者不同的防火墙，还是同处一室但没有连接互联网。英特尔 CFF 适用于 iOS*、Android*、Windows* Desktop 和 Windows 商店的所有应用，这些应用可在任意平台上使用任何外形的英特尔 CCF。使用英特尔 CCF，开发人员能够开发在手机、平板电脑、PC 和其他智能设备上使用的应用。

英特尔 CFF 的通信模型是对等的方式。它支持人们直接与彼此连接，并分享其所有移动计算设备间的信息。

本文中，我将介绍如何使用英特尔 CCF 3.0 为 Windows 8 设备开发应用。我曾负责开发能够在 Windows 8 和 Android 设备之间传输文件的应用的项目，在该项目中，我独立开发了一款 Windows 应用商店应用。在这里，我将与大家分享使用英特尔 CCF 的经验。

首先，你需要在 Microsoft Visual Studio* 中将 lib 文件与该项目相关联。英特尔 CCF SDK 包含两个 dll 文件：libMetroSTC 和 WinRTSTC，所有英特尔 CFF 应用都需要这两个文件。如要使用英特尔 CCF API，需要将 Intel.STC.winmd 添加至项目参考。 winmd 文件包含使用英特尔 CFF SDK 构建 Windows 应用商店应用的元数据。

身份设置

在将会话显示前，英特尔 CCF 用户必须先设置身份，包括用户名、设备名和 avatar。这是远程用户将看到的身份。在面向 Windows 应用商店应用的 SDK 中，InviteAssist 类支持用户设置英特尔 CCF 身份。

                string displayName = await UserInformation.GetDisplayNameAsync();
                _inviteAssist = InviteAssist.GetInstance();
                _inviteAssist.SetUserName(displayName);
                _inviteAssist.SetStatusText("Status");
                _inviteAssist.SetSessionName("Win 8");
                if (_AvatarStorageFile == null)
                {
                    Windows.Storage.StorageFile imgfile = await Windows.Storage.StorageFile.GetFileFromApplicationUriAsync(new Uri("ms-appx:///Assets/Device.png"));
                    _AvatarStorageFile = imgfile;
                }
                await _inviteAssist.SetAvatarAsync(_AvatarStorageFile);

实施 SetUserProfile() 函数获取 InviteAssist 类的实例，通过调用 InviteAssist.SetUserName()、InviteAssist.SetSessionName() 和 InviteAssist.SetAvatarAsync() API 设置配置文件，如下所示。我将用户名设置为 Windows 账户名，我的 Windows 8 设备将连接的所有设备上都能看到该指派。状态文本和会话名可由用户在 UI 中定义。在我的案例中，这些参数不能被用户更改，且一直使用我的符号。现在，配置文件已设置完成并可由远程用户发现。

发现

英特尔 CCF 远程用户发现可通过调用 Intel.STC.Api.STCSession.GetNeighborhood() API 来完成，它能够返回发现的所有远程 STCSessions 的 IObservableVector<object>。开发人员需要负责在 observable collection 与 UI 之间实施数据绑定（Data Bind），或从背后的代码更新 UI 以显示用户列表。我使用了 Grid APP (XAML)，它是 Visual Studio 中渲染 GUI 的标准模板。 GridList 结果中显示的所有用户。

为 NeighborhoodUsers 类对象创建一个 ObservableCollection。 _userList 是 STCSession 类对象的列表，并包括周围用户（neighborhood user）列表。

private static List<STCSession> _userList;
	IObservableVector<object> _hood;
	ObservableCollection<NeighborhoodUsers> _neighborhoodList = new ObservableCollection<NeighborhoodUsers>();
        ObservableCollection<NeighborhoodUsers> neighborhoodList
        {
            get { return _neighborhoodList; }
        }

通过调用 STCSession.GetNeighborhood()API 获取 IObservableVector，并为 IObservableVector 设置 VectorChanged 事件处理程序。

	async void GetNeighborhoodList()
        	{
            		await Task.Run(() =>
          		  {
               	 	_hood = STCSession.GetNeighborhood();
                		_hood.VectorChanged += _hood_VectorChanged;
                		STCSession.LockNeighborhood();
                		IEnumerator<object> en = _hood.GetEnumerator();
                		while (en.MoveNext())
                		{
                    			STCSession session = en.Current as STCSession;
                    			if (session != null)
                        				hood_DiscoveryEvent(session, CollectionChange.ItemInserted);
                		}
                		STCSession.UnlockNeighborhood();
            		});
        	}

        	void _hood_VectorChanged(IObservableVector<object> sender, IVectorChangedEventArgs evt)
        	{
            		STCSession session = null;
            		lock (sender)
            		{
                		if (sender.Count > evt.Index)
                    			session = sender[(int)evt.Index] as STCSession;
            		}
            		if (session != null)
                		hood_DiscoveryEvent(session, evt.CollectionChange);
        	}

我们添加 hood_DiscoveryEvent() 回调函数以捕获矢量变化事件。当远程会话可以进行连接，或无法在周围（neighborhood）继续使用时，该回调函数将会发出通知。当新会话可用时，将会接收到 CollectionChange.ItemInserted 事件；当远程 STCSession 离开周围（neighborhood）或 STCSession 参数发生改变时，将会接收到 CollectionChange.ItemRemoved 和 CollectionChange.ItemChanged 事件。添加 STCSession.ContainsGadget()，以查看远程设备上是否安装了相同的应用。

private async void hood_DiscoveryEvent(STCSession session, CollectionChange type)
        	{
            		switch (type)
            		{
                		case CollectionChange.ItemInserted:
                    			await Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
                    {
                        			_userList.Add(session);
                        			AddPeopleToList(session, "Not Connected");
                    });

                    		break;

                		case CollectionChange.ItemRemoved:
			// Handle this case to check if remote users have left the neighborhood.
                    			if (_neighborhoodList.Count > 0)
                    			{
                        				NeighborhoodUsers obj;
                        				try
                        				{
                            				obj = _neighborhoodList.First((x => x.Name == session.User.Name));
                            				await Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
                            {
                                	_neighborhoodList.Remove(obj);
                                	_userList.RemoveAll(x => x.Id.ToString() == session.Id.ToString());
                            });
                        				}
                        				catch
                        				{
                            				obj = null;
                        				}
                    			}
                   		 break;
                		case CollectionChange.ItemChanged:
			// Handle this case to check if any STCSession data is updated.
                   		 {
                        			STCSession item;
                        			try
                        			{
                            			item = _userList.First(x => x.Id == session.Id);
                        			}
                        			catch
                        			{
                            			item = null;
                        			}
                        			if (item != null)
                           			 item.Update(session);
                        			break;
                  		}
                		default:
                    		break;
            		}
       	 }

邀请（Invitation）

我们已经知道了如何发现远程用户（设备），现在应该了解发送连接的流程。在英特尔 CCF，该流程成为“邀请”。在英特尔 CCF 3.0，发送和接收“邀请”通过 STCInitiator 和 STCResponder 进行控制。 STCInitiator 用于向远程用户发送“邀请”，STCResponder 用于响应入站请求。远程用户接收请求后，将成功建立英特尔 CCF 连接。使用 STCInitiator 对象发送“邀请”没有任何限制。一个对象可以发送多个“邀请”。

我们将在以下函数中介绍向远程用户发送邀请。

private void InitializeInitiator(STCApplicationId appId)
        	{
            		initiator = new STCInitiator(appId, true);
            		initiator.InviteeResponded += initiator_InviteeResponded;
            		initiator.CommunicationStarted += initiator_CommunicationStarted;
            		initiator.Start();
        	}

设置完所有调用处理程序后，则调用 STCInitiator.Start() API。现在，如要向发现的远程用户发送“邀请”，则需要调用 STCInitiator.Invite() API。

initiator.Invite(_userList[itemListView.Items.IndexOf(e.ClickedItem)].Id);

如要查看所发送的“邀请”的状态，需要实施 STCInitiator.InviteeResponded() 回调。

async void initiator_InviteeResponded(STCSession session, InviteResponse response)
{
		await Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
		{
  			switch(response)
			{
				case InviteResponse.ACCEPTED:
                			// You are connected to the user.
				break;
				case InviteResponse.REJECTED:
                			//Invite was rejected by the remote user.
				break;
		        		case InviteResponse.TIMEDOUT:
                			//No response. Invite time-out.
				break;
                		}
        		});
}

“邀请”已发送，远程用户需要接收它。实施调用 InitializeResponder() 的函数，通过传递 STCApplicationId 对象初始化 STCResponder 对象。注册 STCResponder.InviteeResponded() 和 STCResponder.CommunicationStarted() 处理程序。当远程用户响应“邀请”，或当两个英特尔 CCF 用户之间的通信通道成功建立时，将会调用这些处理程序。

	private void InitializeResponder(STCApplicationId appId)
        	{
            		responder = new STCResponder(appId);
          		  responder.InviteReceived += responder_InviteReceived;
            		responder.CommunicationStarted += responder_CommunicationStarted;
            		responder.Start();
      	  }

远程用户发送的“邀请”可在 STCResponder.InviteReceived() 回调中接收。接收到“邀请”后，可以接受也可以拒绝。为响应该“邀请”，需要调用 STCResponder.RespondToInvite() API。

STCResponder.RespondToInvite() API is called.
	async void responder_InviteReceived(STCSession session, int inviteHandle)
        	{
            		if ((_initiatorDataStream == null) && (_responderDataStream == null))
            		{
                		try
                		{
                    			if (!checkPopUp)
                    			{
                        				_inviteHandle = inviteHandle;
                        				_session = session;
                        				Debug.WriteLine("Several windows " + _inviteHandle);
                        				checkPopUp = true;
                        				await Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
                        			{
                            				InviteeName.Text = session.User.Name + " wants to connect";
                            				InviteePopup.IsOpen = true;
                            				checkPopUp = true;
                        			});
                		    	}
                		}
                		catch (Exception ex)
                		{
                    			Debug.WriteLine(ex.Message);
                		}
            		}
            		else
            		{
                		responder.RespondToInvite(session, inviteHandle, false);
            		}
        	}

通信和数据传输

发送和接收“邀请”后，initiator_CommunicationStarted() 和 responder_CommunicationStarted() 回调将提供流处理（Stream handle）。我们将使用该 NetStream 处理在两个互联用户之间传输数据。如要获得数据流处理，则需要实施 initiator_CommunicationStarted() 和 responder_CommunicationStarted() callbacks 并创建 NetStream 对象。可以注册对 NetStream.StreamClosed 和 NetStream.StreamSuspended 事件的回调。当通信通道关闭或暂停时，将会接收到这些事件。

void initiator_CommunicationStarted(CommunicationStartedEventArgs args)
        	{
            		_initiatorDataStream = args.Stream;

            		objWrapper.SetStream(_initiatorDataStream);

            		_initiatorDataStream.StreamClosed += DataStream_StreamClosed;
            		_initiatorDataStream.StreamSuspended += DataStream_StreamSuspended;
            		_initiatorDataStream.DataReady += objWrapper.StreamListen;
       	 }
        	void responder_CommunicationStarted(CommunicationStartedEventArgs args)
        	{
            		_responderDataStream = args.Stream;

            		objWrapper.SetStream(_responderDataStream);

            		_responderDataStream.StreamClosed += DataStream_StreamClosed;
            		_responderDataStream.StreamSuspended += DataStream_StreamSuspended;
            		_responderDataStream.DataReady += objWrapper.StreamListen;
      	  }

	private async void DataStream_StreamClosed(int streamId, Guid sessionGuid)
        	{
            		await Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
            		{
               		 UpdateConnectionStatus(_session.User.Name, "Not Connected");
              			  if (_inviter)
               		 {
                   			 _initiatorDataStream.Dispose();
                    			_initiatorDataStream = null;
                    			_inviter = false;
                		}
                		else if (_responder)
                		{
                    			_responderDataStream.Dispose();
                    			_responderDataStream = null;
                    			_responder = false;
                		}
                		if (isTransferFrame)
               		{
                    			if (this.Frame != null && this.Frame.CanGoBack) this.Frame.GoBack();
                		}
                	ResetUIScreen();
            		});
        	}

现在，我们来看一下数据传输流程。首先，我们选择想要发送的文件。对于这一点，我开发了自己的文件管理器，但是选择文件更简单的方式是使用 FileOpenPicker。选择文件后，将数据编写至接收到的 NetStream 处理中。如要在通信通道上编写数据，则需要使用 NetStrem.Write()。

async void SendFileData()

async void SendFileData()
       	{
           		uint size = 1024 * 4;
           		byte[] buffer = new byte[size];
           		int totalbytesread = 0;
           		using (Stream sourceStream = await storedfile.OpenStreamForReadAsync())
           		{
               		do
               		{
                   			int bytesread = await sourceStream.ReadAsync(buffer, 0, buffer.Length);
                   			_objNetStream.Write(buffer, (uint)bytesread);
                   			totalbytesread += bytesread;
                   			TransferedBytes = totalbytesread;

                   			if (args.nState != TransferState.FT_SEND_PROGRESS)
                   			{
                       				args.nState = TransferState.FT_SEND_PROGRESS;
                       				args.FileName = storedfile.Name;
                       				args.FileSize = (int)storedfileProperties.Size;
                       				args.DataBuffer = null;
                       				args.readbytes = 0;
                       				OnTransferUpdates(args);
                   			}
               		} while (totalbytesread < sourceStream.Length);
           		}
       	}

如要接收文件，需要实施 NetStream.Read() 事件回调，这已在“通信”部分进行了介绍。

readBytes = _objNetStream.Read(receivebuffer, (uint)bytesToRead);

最后，我希望该信息能够帮助你了解英特尔 CCF SDK，并希望你能够使用该 SDK 为 Windows 8 和 Android 开发出色的应用。

英特尔和 Intel 标识是英特尔在美国和/或其他国家的商标。

* 其他的名称和品牌可能是其他所有者的资产。

Intel® CCF

Estrutura de conectividade comum Intel®

Design e experiência do usuário

Laptop

Tablet

Download PDF

Code Sample

↧

基于 Windows* 8.1 台式机的 Miracast*

June 16, 2014, 2:41 am

Latest and popular articles on Intel Technologies

≫ Next: 从英特尔媒体软件开发套件向 OpenGL 分享纹理

≪ Previous: 为使用英特尔® 通用连接框架的 Windows* 8 开发数据传输应用

要点综述

英特尔 WiDi 扩展库中的许多功能已迁移至微软实施的 Miracast 中，成为 Windows* 8.1 的一部分。本白皮书讨论了使用英特尔® 媒体软件开发套件和 OpenGL* 为 Miracast 提供 Windows 8.1 台式机应用支持的多种技术。本文不提供在 Windows 应用商店应用中支持 Miracast 的内容，因为它们需要完全不同的框架。

系统要求

示例代码使用 Visual Studio* 2013 编写，以便展示以下两项内容：使用英特尔媒体软件开发套件的(1) Miracast 和 (2) 英特尔® 媒体软件开发套件 / OpenGL* 纹理共享，其中解码平面无需进行复制即可与 OpenGL 纹理共享，进而显著提高了效率。 MJPEG 解码器面向第四代英特尔® 酷睿™ 处理器（代号：Haswell）及更高版本的处理器实施了硬件加速，以前版本的处理器中的英特尔媒体软件开发套件将会自动使用该软件解码器。在任何情况下，它都需要与 MJPEG 兼容的摄像头（无论是板载还是 USB 摄像头）。

除了辨认 Miracast 连接类型以外，该示例代码和白皮书中使用的大部分技术都适用于 Visual Studio 2012。该示例代码基于 Intel Media SDK 2014 for Client，可通过以下链接进行下载：http://software.intel.com/sites/default/files/MediaSDK2014Clients.zip。安装英特尔媒体软件开发套件可为 Visual Studio 自动设置一套环境变量，以便为头文件和库查找正确的路径。

应用概述

该应用以摄像头为 MJPEG 输入，并对 MJPEG 视频解码，将视频流编码至 H264，然后再对 H264 进行解码。 MJPEG 摄像头视频流（解码后）和最终以 H264 标准解码的视频流将会在基于 MFC 的 GUI 上显示。在 Haswell 系统上，2 个解码器和 1 个编码器（1080P 分辨率）将会按顺序运行以改进可读性，但是由于硬件加速，它们的速度非常快，这使得摄像头速度成为限制 fps 的因素。在实际情况下，编码器和解码器将会在不同的线程中运行，性能不会成为障碍。

在单个显示器配置上，摄像头馈入将会在基于 OpenGL 的 GUI 中以 H264 标准解码的视频上方的 PIP 中显示（图 1）。当 Miracast 连接时，软件将会自动辨认与 Miracast 相连的显示器，并使用全屏窗口播放以 H264 标准解码的视频，而 GUI 将显示原始摄像头视频，因此原始视频和加密视频之间的区别便清晰可见。最后，查看->监控拓扑（View->Monitor Topology）菜单不仅能够以单选按钮的形式显示当前的显示器拓扑，还能够用来更改拓扑。但是很可惜，它无法启动 Miracast 连接。这些操作只能在 OS charm 菜单上完成（从右侧滑入-> 设备->项目），目前尚且没有能够创建 Miracast 连接的 API 。有趣的是，通过将显示器拓扑设置为仅限内部使用可以将 Miracast 连接断开。如果通过线路连接了多台显示器，菜单可以随时更改拓扑。

图 1.单个显示器拓扑。 MJPEG 摄像头视频流在右下角显示。以 H264 标准加密的视频在 GUI 中播放。当启用多台显示器（如 Miracast）时，该软件可检测到变化，MJPEG 摄像头视频和以 H264 加密的视频将会自动分至不同的显示器。

检测显示器拓扑变化

当检测到显示器配置变化时，如添加/删除外置显示器（Miracast 连接/断开），OS 将会在顶层窗口显示 WM_DISPLAYCHANGE 消息。在样本代码中，顶层窗口是 CMainFrame 类，而且其 OnDisplayChange 类函数可处理该消息。鉴于多台显示器播放时会出现较短的延迟，OnDisplayChange 处理程序将会先禁用所有更新内部数据结构的活动，如摄像头馈入及所有后续流程，然后启用定时器以便提供充足的时间切换显示器配置。 QueryDisplayConfig API 可用于了解拓扑，它可以提供一组显示器信息（包括每台显示器的位置和尺寸，如果希望在某个显示器上显示全屏窗口，了解这一信息非常重要）以及拓扑类型（内置、复制、扩展、外置等）。这些函数可在 CDisplayHelper 类中封装， OnDisplayChange 处理程序发起的 OnTimer 函数使用该类进行封装。对拓扑进行重新配置后，处理程序将重启内置活动，恢复摄像头馈入。

更改显示器拓扑

如果想要更改显示器拓扑，你可以调用 SetDisplayConfig（不要调用 QueryDisplayConfig），这会生成一系列事件，如 WM_DISPLAYCHANGE，它由 WM_DISPLAYCHANGE 进行处理，如同显示器以物理方式连接/断开。该函数封装在 CDisplayHelper::SetCurrentTopology 中。比如，当用户点击单选菜单项目时，可以在 CMainFrame::OnMonitortopologyRange 处理程序中使用该函数。

多显示器拓扑变更注意事项

理论上，在外置显示器中显示另一个窗口并根据拓扑变化检测对其进行控制似乎相当简单。但是实际情况下，操作系统开始转换，完成显示器配置，以及显示内容都需要时间。当结合使用编码器 / 解码器 / D3D / OpenGL 及其优势时，调试内置处理和 GUI 的时间将会非常复杂。例如，如果摄像头进入视频播放、解码和编码管线，生成馈入，但是没有连接实际显示器，可能会致使系统崩溃，且难以修复。本示例尝试重新使用切换过程中的大部分管线，但是关闭整个管线并重新启动将会更简单，因为添加显示器的 10 多秒钟的时间内可能会出现任何故障 — 甚至 HDMI 或 VGA 连接故障。

未来展望

本白皮书能够很好地处理多显示器（包括 Miracast）上的视频。但是，当外置显示器自带扬声器时，它无法处理音频，一般情况下，Miracast 显示器是带有内置扬声器的大屏幕电视机。我们计划在未来添加音频切换。

英特尔、Intel logo、英特尔标识、Core 和酷睿是英特尔在美国和/或其他国家的商标。
* 其他的名称和品牌可能是其他所有者的资产
 英特尔公司© 2014 年版权所有。所有权保留。

↧

从英特尔媒体软件开发套件向 OpenGL 分享纹理

June 16, 2014, 2:53 am

Latest and popular articles on Intel Technologies

≫ Next: Guía instructiva para Android*: cómo escribir una aplicación multiproceso con Threading Building Blocks de Intel®

≪ Previous: 基于 Windows* 8.1 台式机的 Miracast*

Code Sample

要点综述

通常，在 Windows* OS，使用 Direct3D 进行视频处理。但是，也有许多应用使用 OpenGL*，因为它具备出色的跨平台能力，能够在不同的平台上提供相同的 GUI 和外观与体验。最新的英特尔图形驱动程序支持使用 NV_DX_interop 从 D3D 向 OpenGL 曲面共享，并可与英特尔® 媒体软件开发套件结合使用。英特尔® 媒体软件开发套件可配置使用 Direct3D 和加入 NV_DX_interop，OpenGL 可使用英特尔媒体软件开发套件的帧缓冲区，而不用执行将纹理从 GPU 复制至 CPU 再复制回 GPU 这一昂贵过程。本示例代码和白皮书演示了如何设置英特尔® 媒体软件开发套件以使用 D3D 进行编码和解码，从 NV12 色域（媒体软件开发套件的自然颜色格式）向 RGBA 色域（OpenGL 的自然颜色格式）进行颜色转换，然后将 D3D 曲面应设置 OpenGL 纹理。本管线完全绕过将纹理从 GPU 复制至 CPU 这一过程，该过程曾经是配合英特尔® 媒体软件开发套件使用 OpenGL 时的最大瓶颈。

系统要求

本示例代码使用 Visual Studio* 2013 编写，旨在 (1) 展示 Miracast 和 (2) 利用英特尔® 媒体软件开发套件实现英特尔® 媒体软件开发套件/ OpenGL 纹理共享，其中不用进行任何复制即可与 OpenGL 纹理分享解码曲面，这能够显著地提升效率。 MJPEG 解码器面向 Haswell 和更高版本的处理器执行了硬件加速，较低版本的处理器中的媒体软件开发套件可自动使用软件解码器。在任何情况下，它都需要与 MJPEG 兼容的摄像头（无论是板载还是 USB 摄像头）。
除了辨认 Miracast 连接类型以外，该示例代码和白皮书中使用的大部分技术都适用于 Visual Studio 2012。该示例代码基于 Intel Media SDK 2014 for Client，可通过以下链接进行下载：（https://software.intel.com/sites/default/files/MediaSDK2014Clients.zip。）安装媒体软件开发套件后，将会为 Visual Studio 创建一套环境变量，以便为头文件和库查找正确的路径。

应用概述

该应用将摄像头作为 MJPEG 输入，并对 MJPEG 视频解码，将视频流编码至 H264，然后再对 H264 进行解码。 MJPEG 摄像头视频流（解码后）和最终以 H264 标准解码的视频流将会在基于 MFC 的 GUI 上显示。在 Haswell 系统上，2 个解码器和 1 个编码器（1080P 分辨率）将会按顺序运行以提供出色的可读性，但是由于硬件加速，它们的速度非常快，这使得摄像头速度成为 fps 的唯一限制因素。在实际情况下，编码器和解码器将会在不同的线程中运行，性能不会成为障碍。

在单个显示器配置上，摄像头馈入将会在基于 OpenGL 的 GUI 中以 H264 标准解码的视频上方的 PIP 中显示（图 1）。当 Miracast 连接时，软件将会自动辨认与 Miracast 相连的显示器，并使用全屏窗口播放以 H264 标准解码的视频，而 GUI 将显示原始摄像头视频，因此原始视频和加密视频之间的区别便清晰可见。最后，查看->监控拓扑（View->Monitor Topology）菜单不仅能够检测到当前的显示器拓扑，还能够更改拓扑。但是很可惜，它无法启动 Miracast 连接。它只能在 OS charm 菜单上完成（从右侧滑入-> 设备->项目），目前尚且没有能够创建 Miracast 连接的 API 。有趣的是，通过将显示器拓扑设置为仅限内部使用可以将 Miracast 连接断开。如果通过线路连接了多台显示器，菜单可以随时更改拓扑。

图 1 单个显示器拓扑。 MJPEG 摄像头在右下角显示。以 H264 标准加密的视频在 GUI 中播放。当启用多台显示器（如 Miracast）时，该软件可检测到变化，MJPEG 摄像头视频和以 H264 加密的视频将会自动分至不同的显示器。

管线设置的主要接入点

本示例代码基于 MFC，设置管线的主要接入点是 CChildView::OnCreate ()，它能够启动摄像头，从 MJPEG 转码至 H264，使用 H264 解码，并将转码器和解码器上的纹理与 OpenGL 渲染器绑定。转码器仅为在基础解码器上添加编码器的解码器子类。最后，OnCreate 可启动持续获取序列化摄像头馈入的线程。在工作线程中读取摄像头馈入后，它可向 OnCamRead 函数发送消息，该函数可对 MJPEG 进行解码，编码至 H264，解码 H264，并将纹理更新至 OpenGL 渲染器。从最高层面而言，整个管线整洁、简单，易于执行。

启动解码器/转码器

解码器和转码器都是使用 D3D9Ex 进行启动。我们可对英特尔® 媒体软件开发套件进行配置以使用软件、D3D9 或 D3D11。在本示例中，D3D9 用于简化颜色转换。英特尔® 媒体软件开发套件的自然颜色格式是 NV12；IDirect3DDevice9::StretchRect 和 IDirectXVideoProcessor::VideoProcessBlt 都可用来将色域转换至 RGBA。简单起见，本白皮书使用的是 StretchRect，但是一般情况下，我们推荐使用 VideoProcessBlt，因为它在后期处理过程中有出色的附加功能。可惜的是，D3D11 不支持 StretchRect，因此颜色转换可能非常复杂。此外，本文中的解码器和转码器使用了独立的 D3D 设备进行各种实验，如混合软件和硬件，但是 D3D 设备可以在二者之间进行共享以节约内存。以该方式设置管线后，解码输出将设置为 (mfxFrameSurface1 *) 类型。这仅为面向 D3D9 的封装，mfxFrameSurface1-> Data.MemId 可换算为 (IDirect3DSurface9 *)，并在解码后用于 CDecodeD3d9::ColorConvert 函数中的 StretchRect 或 VideoProcessBlt。媒体软件开发套件的输出曲面不可共享，但是必须要进行颜色转换才能供 OpenGL 使用，并且需要创建一个共享曲面存储颜色转换的结果。

启动转码器

转码器的解码结果将会直接输入编码器，并确保在分配曲面时使用该 MFX_MEMTYPE_FROM_DECODE。

绑定 D3D 和 OpenGL 之间的纹理

绑定纹理的代码位于 CRenderOpenGL::BindTexture 函数中。请确保对 WGLEW_NV_DX_interop 进行了定义，然后使用 wglDxOpenDeviceNV、wglDXSetResourceShareHandleNV 和 wglDXRegisterObjectNV。这将会把 D3D 曲面绑定至 OpenGL 纹理。但是，它无法自动更新纹理，而且调用 wglDXLockObjectsNV / wglDXUnlockObjectsNV 将会更新纹理（CRenderOpenGL::UpdateCamTexture 和 CRenderOpenGL::UpdateDecoderTexture）。对纹理进行更新后，便可像使用 OpenGL 中的其他纹理一样使用它。

多显示器拓扑变更注意事项

理论上，在外置显示器中提供另一个窗口并根据拓扑变化检测对其进行控制似乎相当简单。但是实际情况下，从操作系统开始转换，到显示器配置完成，以及内容显示都需要时间。当结合使用编码器 / 解码器 / D3D / OpenGL 及其优势时，调试将会非常复杂。本示例尝试重新使用切换过程中的大部分管线，但是实际上，关闭整个管线并重新启动将会更简单，因为添加显示器需要 10 多秒钟的时间，在这期间可能会出现任何故障 — 甚至 HDMI 或 VGA 连接故障。

未来展望

本白皮书中的示例代码面向 D3D9，且不可为 D3D11 实施提供支持。目前我们尚不清楚，在没有 StretchRect 或 VideoProcessBlt 时，哪种方式能够最有效地将色域从 NV12 转换至 RGBA。 D3D11 实施退出后，白皮书/代码将进行更新。

贡献

感谢 Petter Larsson、Michel Jeronimo、Thomas Eaton 和 Piotr Bialecki 对本文的贡献。

英特尔、Intel 标识、Xeon 和至强是英特尔在美国和其他国家的商标。
* 其他的名称和品牌可能是其他所有者的资产
 英特尔公司© 2013 年版权所有。所有权保留。

↧

Guía instructiva para Android*: cómo escribir una aplicación multiproceso con Threading Building Blocks de Intel®

June 16, 2014, 3:47 pm

Latest and popular articles on Intel Technologies

≫ Next: WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

≪ Previous: 从英特尔媒体软件开发套件向 OpenGL 分享纹理

Recientemente publicamos la “Guía instructiva para Windows* 8: cómo escribir una aplicación multiproceso para Windows Store* con Threading Building Blocks de Intel®”. En esa guía dijimos que el motor de cálculo paralelo se puede portar fácilmente a otras plataformas móviles o de escritorio. Android es un buen ejemplo de ese tipo de plataforma móvil.

En una versión estable recientemente publicada de Intel Threading Building Blocks (Intel® TBB), hemos agregado de forma experimental compatibilidad con aplicaciones Android, lo que es decir, bibliotecas de Intel TBB para usar en aplicaciones Android mediante la interfaz JNI. Esta versión se puede descargar desde threadingbuildingblocks.org.

Para iniciar el proceso en un host Linux*, hay que desempaquetar la distribución de código fuente de Intel TBB, obtener el script <unpacked_dir>/build/android_setup.csh y compilar las bibliotecas. Es necesario compilar las bibliotecas porque las versiones para desarrollo solo se distribuyen en forma de código. El archivo <unpacked_dir>/build/index.android.html contiene instrucciones para configurar el entorno y compilar la biblioteca en Linux.

Si suponemos que gnu make 3.81 se encuentra en %PATH% (en una plataforma host Microsoft Windows*) y $PATH (en un host Linux), necesitamos emitir el siguiente comando en el entorno NDK para compilar las bibliotecas Intel TBB para Android:

gmake tbb tbbmalloc target=android

Eso es todo lo que se necesita para compilar la biblioteca; ahora podemos pasar a compilar el ejemplo con Eclipse*. Para el ejemplo de abajo, voy a usar Android SDK Tools Rev.21 y Android NDK Rev 8C en Windows* con el fin de ilustrar el proceso de desarrollo multiplataforma.

Creamos un proyecto con la plantilla predeterminada «New Android Application». Por simplicidad, lo llamamos “app1”, el mismo nombre que en la guía anterior:

Seleccionamos FullscreenActivity como Activity. Eso es todo para la plantilla. Se puede observar que com.example* no es un nombre de paquete aceptable para Google Play*, pero sirve para nuestro ejemplo.

Luego hay que agregar un par de botones al marco principal. Después de agregarlos, el archivo XML del marco principal (app1/res/layout/activity_fullscreen.xml) se verá así:

<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"

    xmlns:tools="http://schemas.android.com/tools"

    android:layout_width="match_parent"

    android:layout_height="match_parent"

    android:background="#0099cc"

    tools:context=".FullscreenActivity">

    <TextView

        android:id="@+id/fullscreen_content"

        android:layout_width="match_parent"

        android:layout_height="match_parent"

        android:gravity="center"

        android:keepScreenOn="true"

        android:text="@string/dummy_content"

        android:textColor="#33b5e5"

        android:textSize="50sp"

        android:textStyle="bold" />

    <FrameLayout

        android:layout_width="match_parent"

        android:layout_height="match_parent"

        android:fitsSystemWindows="true">

        <LinearLayout

            android:id="@+id/fullscreen_content_controls"

            style="?buttonBarStyle"

            android:layout_width="match_parent"

            android:layout_height="74dp"

            android:layout_gravity="bottom|center_horizontal"

            android:background="@color/black_overlay"

            android:orientation="horizontal"

            tools:ignore="UselessParent">

            <Button

                android:id="@+id/dummy_button1"

                style="?buttonBarButtonStyle"

                android:layout_width="0dp"

                android:layout_height="wrap_content"

                android:layout_weight="1"

                android:text="@string/dummy_button1"

                android:onClick="onClickSR" />

            <Button

                android:id="@+id/dummy_button2"

                style="?buttonBarButtonStyle"

                android:layout_width="0dp"

                android:layout_height="wrap_content"

                android:layout_weight="1"

                android:text="@string/dummy_button2"

                android:onClick="onClickDR" />

        </LinearLayout>

    </FrameLayout></FrameLayout>

Y el archivo string (app1/res/values/strings.xml) se verá así

<?xml version="1.0" encoding="utf-8"?><resources>

    <string name="app_name">Sample</string>

    <string name="dummy_content">Reduce sample</string>

    <string name="dummy_button1">Simple Reduce</string>

    <string name="dummy_button2">Deterministic Reduce</string></resources>

Luego hay que agregar los controladores de botones:

// JNI functions
private native float onClickDRCall();
private native float onClickSRCall();

      public void onClickDR(View myView) {
            TextView tv=(TextView)(this.findViewById(R.id.fullscreen_content));
            float res=onClickDRCall();
            tv.setText("Result DR is n" + res);
      }

      public void onClickSR(View myView) {
            TextView tv=(TextView)(this.findViewById(R.id.fullscreen_content));
            float res=onClickSRCall();
            tv.setText("Result SR is n" + res);
      }

y la biblioteca se carga al archivo FullscreenActivity.java:

@Override

      protected void onCreate(Bundle savedInstanceState) {

            super.onCreate(savedInstanceState);

…
            System.loadLibrary("gnustl_shared");

            System.loadLibrary("tbb");

            System.loadLibrary("jni-engine");

      }

En el caso de la biblioteca "tbb", todo debería estar claro; la biblioteca "gnustl_shared" es necesaria para la compatibilidad con las características de lenguaje C++ de TBB. Sin embargo, para la biblioteca "jni-engine" tenemos que ser más detallados.

"jni-engine" es una biblioteca de ?++ que implementa un motor de cálculo y exporta las interfaces C para llamadas a JNI de nombre onClickSRCall() y onClickSRCall().

De acuerdo con las reglas para desarrollo de NDK, hay que crear una carpeta “jni” dentro del espacio de trabajo y 3 archivos en ella específicos para nuestra biblioteca "jni-engine".

Estos archivos son:

Android.mk (el texto entre signos menor y mayor <> se debe reemplazar con valores reales)

LOCAL_PATH := $(call my-dir)

TBB_PATH := <path_to_the_package>



include $(CLEAR_VARS)

LOCAL_MODULE    := jni-engine

LOCAL_SRC_FILES := jni-engine.cpp

LOCAL_CFLAGS += -DTBB_USE_GCC_BUILTINS -std=c++11 -I$(TBB_PATH)/include

LOCAL_LDLIBS := -ltbb -L./ -L$(TBB_PATH)/<path_to_libtbb_so>

include $(BUILD_SHARED_LIBRARY)



include $(CLEAR_VARS)

LOCAL_MODULE    := libtbb

LOCAL_SRC_FILES := libtbb.so

include $(PREBUILT_SHARED_LIBRARY)

Application.mk

APP_ABI := x86

APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti

APP_STL := gnustl_shared

jni-engine.cpp:

#include <jni.h>



#include "tbb/parallel_reduce.h"

#include "tbb/blocked_range.h"

float SR_Click()

{

    int N=10000000;

    float fr = 1.0f/(float)N;

    float sum = tbb::parallel_reduce(

        tbb::blocked_range<int>(0,N), 0.0f,

        [=](const tbb::blocked_range<int>& r, float sum)->float

        {

            for( int i=r.begin(); i!=r.end(); ++i )

                sum += fr;

            return sum;

        },

        []( float x, float y )->float

        {

            return x+y;

        }

    );

    return sum;

}



float DR_Click()

{

    int N=10000000;

    float fr = 1.0f/(float)N;

    float sum = tbb::parallel_deterministic_reduce(

        tbb::blocked_range<int>(0,N), 0.0f,

        [=](const tbb::blocked_range<int>& r, float sum)->float

        {

            for( int i=r.begin(); i!=r.end(); ++i )

                sum += fr;

            return sum;

        },

        []( float x, float y )->float

        {

            return x+y;

        }

    );

    return sum;

}



 extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickDRCall(JNIEnv *env, jobject obj)

{

    return DR_Click();

}



extern "C" JNIEXPORT jfloat JNICALL Java_com_example_app1_FullscreenActivity_onClickSRCall(JNIEnv *env, jobject obj)

{

    return SR_Click();

}

Usamos los mismos algoritmos que en la guía anterior.

Cuando usamos el NDK para compilar, compila las bibliotecas a las carpetas correspondientes, incluidas nuestras bibliotecas libjni-engine.so, libgnustl_shared.so y libtbb.so.

A continuación, hay que volver a Eclipse y compilar el archivo app1.apk. Ahora la aplicación está lista para instalarse en el AVD o en hardware real. En el AVD se ve así:

¡Y terminamos! Esta aplicación sencilla está lista y debería ser un buen primer paso hacia la escritura de una aplicación paralela más compleja para Android. Y para aquellos que usaron código de la guía anterior, la aplicación se pudo portar con éxito a Android.

* Es posible que la propiedad de otros nombres y marcas corresponda a terceros.

Related Articles and Resources:

Intel Hardware Accelerated Execution Manager (HAXM)

Módulos de sub-rotinas Intel®

Intel® Parallel Studio

Educação

Processadores Intel® Core™

Eficiência energética

Aumento de desempenho

Bibliotecas

Desenvolvimento de multithread

Learning Lab

TBB-Learn

Zona do tema:

Android

↧

WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

June 17, 2014, 12:45 pm

Latest and popular articles on Intel Technologies

≫ Next: Optimizing an Augmented Reality Pipeline using Intel® IPP Asynchronous

≪ Previous: Guía instructiva para Android*: cómo escribir una aplicación multiproceso con Threading Building Blocks de Intel®

Overview

Introduction

WRF is used by many private and public organizations across the world for weather and climate prediction.

WRF has a relatively flat profile on Intel Architecture over many functions for atmospheric dynamics and physics: advection, microphysics, etc.

Technology (Hardware/Software)

System	Xeon E5-2697 v2 @ 2.7GHz
Coprocessor	Intel Xeon Phi coprocessor 7120A @ 1.23GHz
Intel® MPI	4.1.1.036
Intel® Compiler	composer_xe_2013_sp1.1.106
Intel® MPSS	6720-21

We used the above hardware and software configuration for all of our testing.

Note: Please use netcdf-3.6.3 and pnetcdf-1.3.0 for I/O.

Multi Node Symmetric Intel Xeon + Intel Xeon Phi coprocessor (4 Nodes)

Compile WRF for the Coprocessor

Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
Source the Intel MPI for intel64 and Intel Compiler
1. source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
On bash, export the path for the host netcdf and host pnetcdf. Having netcdf and pnetcdf built for Intel Xeon Phi coprocessor is a prerequisite.
1. export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/netcdf/mic/
2. export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/pnetcdf/mic/
Turn on Large file IO support
1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
Cd into the ../WRFV3/ directory and run ./configure and select the option to build with Xeon Phi (MIC architecture) (option 17). On the next prompt for nesting options, hit return for the default, which is 1.
In the configure.wrf that is created, remove delete -DUSE_NETCDF4_FEATURES and replace –O3 with –O2
Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
Run ./compile wrf >& build.mic
This will build a wrf.exe in the ../WRFV3/main folder.
For a new, clean build run ./clean –a and repeat the process.

Compile WRF for Intel Xeon processor-based host

Download and un-tar the WRF3.5 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
Source the latest Intel MPI for intel64 and latest Intel Compiler (as an example below)
1. source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
1. export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/netcdf/xeon/
2. export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/pnetcdf/xeon/
Turn on Large file IO support
1. export WRFIO_NCD_LARGE_FILE_SUPPORT=1
Cd into the WRFV3 directory created in step #1 and run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
In the configure.wrf that is created, remove delete -DUSE_NETCDF4_FEATURES and replace –O3 with –O2
Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
Run ./compile wrf >& build.snb.avx . This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compiles, set the environment variable J to "-j 4" or whatever number of parallel make tasks you wish to use.)
For a new, clean build run ./clean –a and repeat the process.

Run WRF Conus2.5km in Symmetric Mode

Download the CONUS2.5_rundir from http://www2.mmm.ucar.edu/WG2bench/conus_2.5_v3/
Follow the READ-ME.txt to build the wrf input files.
The namelist.input has to be altered. The changes are as follows:
1. In the &time_interval section, edit the values as below:
  1. restart_interval =360,
  2. io_form_history =2,
  3. io_form_restart =2,
  4. io_form_input =2,
  5. io_form_boundary =2,
2. Remove "perturb_input =.true." from the &domains section and replace with "nproc_x =8,"
3. Add "tile_strategy =2," under the &domains section.
4. Add "use_baseparam_fr_nml =.true." under the &dynamics section.
Create a new directory called CONUS2.5_rundir (../WRFV/CONUS_rundir) in the CONUS2.5_rundir, create 2 directories "mic" and "x86". Copy over contents of ../WRFV/run/ into “mic” and “x86” directories.
Copy the Intel Xeon Phi coprocessor binary into the CONUS2.5_rundir/mic directory and copy the Intel Xeon binary into the CONUS2.5_rundir/x64 directory.
Cd into the CONUS2.5_rundir and execute WRF as follows on 4 nodes (i.e 4 coprocessors + 4 Intel Xeon processors) in symmetric mode. To run conus2.5km, you need to have access to 4 nodes (example shown below)

Script to run on Xeon-Phi + Xeon (symmetric mode)

The nodes I am using are: node01 node02 node03 node04

When you request for nodes, make sure you have a large stack size MIC_ULIMIT_STACKSIZE=365536


source /opt/intel/impi/4.1.0.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64

export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto

./run.symmetric

Below is the run.symmetric to run the code in symmetric mode:

run.symmetric script


#!/bin/sh
mpiexec.hydra
 -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node02 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node03 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node04 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe
: -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node02-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node03-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
: -host node04-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh

In ../CONUS2.5_rundir/mic, create a wrf.sh file as below.

Below is the wrf.sh that is needed for the Xeon Phi part of the runscript.

wrf.sh script


export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
/path/to/CONUS2.5_rundir/mic/wrf.exe

You will have 80 rsl.error.* and 80 rsl.out.* files in your CONUS2.5_rundir directory.
Do a 'tail –f rsl.error.0000' and when you see 'wrf: SUCCESS COMPLETE WRF' your run is successful.
After the run, compute the total time taken to simulate with the scripts below. The mean value (which indicates the Average Time Step (ATS)) is of interest for WRF (lower the better).

Parsing scripts

gettiming.sh – is the parsing script


grep 'Timing for main' rsl.out.0000 | sed '1d' | head -719 | awk '{print $9}' | awk -f stats.awk
bash-4.1$ cat stats.awk 
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
i ++
a += $1
if ( $1 > max ) max = $1
if ( $1 < min ) min = $1
}
END{ printf("---n%10s %8dn%10s %15fn%10s %15fn%10s %15fn%10s %15fn%10s %15fn","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }

Validation

To validate if the successful WRF run is correct or not, check the following:

It should generate a wrf_output file.
diffwrf your_output wrfout_reference > diffout_tag
'DIGITS' column should have high value (>3). If yes, the WRF run is considered valid.

Compiler Options

-mmic : build an application that natively runs on Intel® Xeon Phi™ Coprocessor
-openmp : enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
-O3 :enable aggressive optimizations by the compiler.
-opt-streaming-stores always : generate streaming stores
-fimf-precision=low : low precision for higher performance
-fimf-domain-exclusion=15 : gives lowest precision sequences for Single precision and Double precision.
-opt-streaming-cache-evict=0 : turn off all cache line evicts.

Conclusion

About the Author

Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).

Intel(R) Xeon Phi(TM) Coprocessor

Xeon Phi

Phi

MIC

Weather Research and Forecasting

Arquitetura Intel® Many Integrated Core

Computação paralela

Intel® Xeon Phi™ Coprocessor Developer zone

URL:

Intel® MPSS download

Intel® Many Integrated Core Architecture Forum

↧

Optimizing an Augmented Reality Pipeline using Intel® IPP Asynchronous

June 17, 2014, 4:18 pm

Latest and popular articles on Intel Technologies

≫ Next: Optimizing Cyberlink PowerDVD 10* Improves Battery Life

≪ Previous: WRF Conus2.5km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors in Symmetric Mode

Using Intel® GPUs to Optimize the Performance and Power Consumption of Total Immersion's D'Fusion* Augmented Reality Pipeline

Michael Jeronimo, Intel (michael.jeronimo@intel.com)
Pascal Mobuchon, Total Immersion (pascal.mobuchon@t-immersion.com)

Executive Summary

This case study details the optimization of Total Immersion's D'Fusion* Augmented Reality pipeline, using the Intel® Integrated Performance Primitives (Intel® IPP) Asynchronous to execute key parts of the pipeline on the GPU. The paper explains the Total Immersion pipeline, the goals and strategy for the optimization, the results achieved, and the lessons learned.

Intel IPP Asynchronous

The Intel IPP Asynchronous (Intel IPP-A) library—available for Windows* 7, Windows 8, Linux*, and Android*—is a companion to the traditional CPU-based Intel IPP library. This library extends the successful Intel IPP acceleration library model to the GPU, providing a set of GPU-accelerated primitive functions that can be used to build visual computing algorithms. Intel IPP-A is a simple host-callable C API consisting of a set of functions that operate on matrix data, the basic data type used to represent image and video data. The functions provided by Intel IPP-A are low-, medium-, and high-level building blocks for video analysis algorithms. The library includes low-level functions such as basic math and Boolean logic operations; mid-level functions like filtering operations, morphological operations, edge detection algorithms; and high level functions including HAAR classification, optical flow, and Harris and Fast9 feature detection.

When a client application calls a function in the Intel IPP-A API, the library loads and executes the corresponding GPU kernel. The application does not explicitly manage GPU kernels; at application run time the library loads the correct highly optimized kernels for the specific processor. The Intel IPP-A library supports third generation Intel® Core™ processors (code named Ivy Bridge) and higher, and Intel® Atom™ processors, like the Bay Trail SoC, that include Intel® Processor Graphics. Allowing the library implementation to manage kernel selection, loading, dispatch, and synchronization simplifies the task of using the GPU for visual computing functionality. The Intel IPP-A library also includes a CPU-optimized implementation for fallback on legacy systems or application-level CPU/GPU balancing.

Like the traditional CPU-based Intel IPP library, when code is implemented using the Intel IPP-A API, the code does not need to be updated to take advantage of the additional resources provided by future Intel processors. For example, when a processor providing additional GPU execution units (EUs) is released, the existing Intel IPP-A kernels can automatically scale performance, taking advantage of the additional EUs. Or, if a future Intel processor provides new hardware acceleration blocks for video analysis operations, a new Intel IPP-A library implementation will use the accelerators while keeping the Intel IPP-A interface constant. Developers can simply recompile and relink with the new library implementation. Intel IPP-A provides a convenient abstraction layer for GPU-based visual computing that provides automatic performance scaling across processor generations.

It is easy to integrate Intel IPP-A code with the existing CPU-based code, so developers can take an incremental approach to optimization. They can identify key pixel processing hotspots and target those for offload to the GPU. But they must take care when offloading to the GPU so as not to introduce data transfer overhead. Instead, developers should create an algorithm pipeline that allows significant work to be performed on the GPU before the results are required by the CPU code, minimizing inter-processor data transfer.

Benefits of GPU Offload

Offloading time consuming pixel processing operations to the GPU can result in significant power and performance benefits. In particular, the GPU:

Has a lower operating frequency– the GPU runs at a lower clock frequency than the CPU, consuming less power for the same computation.
Has more hardware threads– the GPU has significantly more hardware threads, providing better performance for operations where performance scales with an increasing number of threads, such as the visual processing operations in Intel IPP-A.
Has the potential to run more complex algorithms– due to the better power and performance provided by the GPU, developers can use more computationally intensive algorithms to achieve improved results and/or process more pixels than they could otherwise using the CPU only.
Can free the CPU for other tasks – by moving processing to the GPU, developers can reduce CPU utilization, freeing up the CPU processing resources for other tasks.

The benefits offered by Intel IPP-A programming on the GPU can be applied in a variety of market segments to help ISVs reach specific goals. For example, in Digital Security and Surveillance (DSS), the primary metric is the number of channels of input video that a platform can process (the "channel density"), while in Augmented Reality, decreasing the time to acquire targets to track and increasing the number of objects that can be simultaneously tracked are key.

Augmented Reality

Augmented Reality (AR) enhances a user's perception with computer-generated input such as sound, video, or graphics data. AR merges the real world with computer-generated elements, either meta information or virtual objects, resulting in a composite that presents more information and capabilities than an un-augmented experience. AR applications usually overlay information about the environment and objects on a real-time video stream, making the virtual objects interactive. AR technology can be applied to many market segments including retail, medicine, entertainment, and education. For example:

Mobile augmented reality systems combine a mobile platform's camera, GPS, and compass sensors with its Internet connectivity to pinpoint the user's location, detect device orientation, and provide information about the scene, overlaying content on the screen.
Virtual dressing rooms allow customers to virtually try on clothes, shoes, jewelry, or watches, either in-store or at home, automatically sizing the item to the user in a 3D view on the device.
Construction managers can view and monitor work in progress, in real time, through Augmented Reality markers placed throughout a site.

Total Immersion

Total Immersion is an augmented reality company, founded in 1998, based in Suresnes, France. Through its patented D'Fusion software solution, Total Immersion combines the virtual world and the real world by integrating real-time interactive 3D graphics into a live video stream. The company maintains offices in Europe, North America, and Asia and supports the world's largest augmented reality partner network, with over 130 solution providers.

Today, mobile technology is everywhere. Total Immersion (TI) is developing compelling AR experiences for tablets and phones. Intel, recognizing Total Immersion as a leader in Augmented Reality, initiated a collaboration with TI to optimize the D'Fusion software for Intel processors, including GPU offloading. They aimed to improve the AR experience when running on Intel products that power mobile platforms, such as the Intel Atom SoC Z3680.

Optimization Goals and Strategy

Augmented Reality applications rely on computer vision algorithms to detect, recognize, and track objects in input video streams. While a large part of the AR processing doesn't deal directly with pixels, the pixel processing required is a computationally intensive, data parallel task appropriate for GPU offload. Intel and Total Immersion planned to offload the pixel processing to the GPU, using Intel IPP-A, so that the pipeline handled the pixel processing—from capture to rendering—and only the metadata about the pixel information would be returned to the CPU as input for higher-level AR operations. By offloading all of the pixel processing to the GPU, the application achieved better performance with less power consumption, making D'Fusion-based applications run efficiently on mobile platforms while conserving battery life.

The D'Fusion AR Pipeline

The core of the D'Fusion software is a processing pipeline that consists of the following stages:

Figure 1 – The Design of the PixelFlow Framework

Capture – The first step in the pipeline is capturing input video from the camera. The video can be captured in a variety of formats, such as RGB24, NV12, or YUY2, depending on the specific camera. Frames are captured at the full frame rate, typically 30 FPS, and passed to the next stage in the pipeline. Each captured frame has an associated time stamp that specifies the precise time of capture.
Preparation – Computer vision algorithms usually operate on grayscale images, and the TI AR pipeline is no exception. The first step after Capture is to convert the color format of the captured image to grayscale. Next, because computer vision algorithms often do not require the full frame size to operate effectively, input frames can be downscaled to a lower resolution. The reduced number of pixels to process saves computational resources. Then, depending on the orientation of the image, mirroring may also be required. Finally, in addition to the grayscale image required by the computer vision processing, a color image must also be sent down the pipeline so that the scene can eventually be rendered along with the AR-generated information. It is also necessary to obtain a second color format conversion from the camera input format, like NV12, to a format appropriate for display, such as ARGB. All of the operations in the Preparation stage are pixel-intensive operations appropriate to target for offload to the GPU.
Detection – Once a frame is prepared, the pipeline applies a feature detection algorithm, either Harris or Fast9, to the reduced-size grayscale input image. The algorithm returns a list of feature points detected in the image. The feature detection algorithm can be controlled by various parameters, including the threshold level. These parameters continuously adjust the feature point detection to return an optimal number of feature points and to adapt to changing ambient conditions, such as the brightness of the input scene. Non-maximal suppression is applied to the feature point calculation to get a better distribution of feature points, avoiding local "clustering." Both feature detection and non-maximal suppression are targeted for offload to the GPU.
Recognition – Once the features are generated by the Detection stage of the pipeline, the FERNS algorithm is used to match the features against a database of known objects. Instead of operating on the feature points directly, the FERNS algorithm uses a patch, a square region of pixels centered on the feature point. The patches are taken from a filtered version of the frame that has been convolved with a smoothing filter. Each of the patches is associated with a timestamp of the frame from which they were derived. Since the processing of each patch by the FERNS algorithm is an independent operation, it is easily parallelizable and a candidate for GPU offload. The frame smoothing can also happen on the GPU.
Tracking - Many image processing algorithms operate on multi-resolution images called image pyramids, where each level of the pyramid is a further downscaled version of the original input frame. The Tracking stage of the pipeline provides the image pyramid to the Lucas-Kanade optical flow algorithm to track the objects in the scene. Both the image pyramid generation and the optical flow are good candidates to run on the GPU.
Rendering – Rendering is the final stage of the pipeline. In this stage, the AR results are combined with the color video and rendered on the output, in this case using OpenGL*. The application renders the color video as an OpenGL texture and uses OpenGL functions to draw the graphics output, based on the video analysis, on top of the video frame.

Optimization Strategy

Initial profiling of the TI application confirmed that the pixel processing operations mentioned in the prior section were the primary bottlenecks in the AR pipeline. However, other bottlenecks existed, including a CPU-based copy of the color image data to an OpenGL texture.

To simplify collaboration, Intel delivered the optimizations to Total Immersion as a library to be incorporated into the TI software. The library, dubbed PixelFlow, encapsulates the pixel processing required by the TI AR pipeline and is implemented using Intel IPP-A library. Intel and Total Immersion decided that PixelFlow would target the Preparation, Detection, and Rendering bottlenecks first, while also providing information required for the Recognition and Tracking stages. Moving the first stages of the pipeline to the GPU would be a milestone towards the eventual goal of handling all pixel processing operations on the GPU.

To implement the Preparation and Detection stages, the operations performed by PixelFlow on the GPU included color format conversion, resizing, mirroring, Fast9 and Harris feature point detection, and non-maximal suppression. To support the Recognition and Tracking stages, the library provides a smoothed frame to be used by the FERNS algorithm and an image pyramid of the input to be used by the optical flow algorithm. Finally, PixelFlow also provides a GPU texture of the color input frame suitable for use in OpenGL.

Implementation

The PixelFlow framework was conceived as a flexible framework for analysis of multiple video input streams derived from a single video capture source. The PixelFlow pipeline runs on the GPU, operating asynchronously with the CPU. Each video capture source serves frames to one or more logical video streams, where the color format and resolution of each stream is independently configurable. Each stream runs on a separate thread and can use Intel IPP-A to analyze the video frames, producing meta information. The following diagram shows the general design of the framework.

Figure 2 – The Design of the PixelFlow Framework

The TI Augmented Reality pipeline is comprised of two video streams: the Analytics Stream and the Graphics Stream. The Analytics Stream processes a grayscale input frame, performing feature detection with non-maximal suppression, image pyramid generation, and smoothing of the input frame. The Graphics Stream converts the color camera input to ARGB for display. In both cases, the resulting data is placed in a queue for access by the CPU-based code. The following diagram shows the basic organization of the pipeline and the functions targeted for offload to the GPU.

Figure 3 – The PixelFlow implementation for the TI AR pipeline

The information on each queue has a timestamp of the original frame capture, allowing the CPU software to correlate each frame with the corresponding data produced by the analytics stream.

Implementation Challenges

Several challenges were encountered during the implementation of the PixelFlow framework:

Separate kernels for frame preparation– The initial PixelFlow implementation used separate Intel IPP-A functions for resizing, color format conversion, and mirroring. Because the functions didn't support multi-channel images to prepare the ARGB output for the Analytics Stream, the implementation used one Intel IPP-A function to split the input image into separate channels, then called other functions to resize and mirror each of the channels individually before combining them back into an interleaved format. To minimize the kernel overhead and simplify programming, the Intel IPP-A team developed a single hppiAdvancedResize function to combine the resize, color format conversion, and mirroring into a single GPU kernel, allowing the frame to be prepared for the Analytics Stream or the Graphics Stream with a single function call.
Direct-to-GPU-memory video input – The intention of the PixelFlow pipeline was to have the entire pipeline, from video capture to graphics rendering, on the GPU. However, the graphics drivers for the targeted platforms did not yet support direct-to-GPU-memory video capture. Instead, each frame was captured to system memory and then copied to GPU memory. To minimize the impact of the copy, the PixelFlow implementation took advantage of the Fast Copy feature supported by the Intel IPP-A library. Using a 4K-aligned system memory buffer, the GPU kernel is able to use shared physical memory to access the data, thus avoiding a copy.
NMS, weights, and orientation for Fast9 – The results produced by the Intel IPP-A Fast9 algorithm did not initially match the CPU-based function that it replaced. An investigation revealed that the TI code was also applying non-maximal suppression to the results of the Fast9 calculation. In addition, the TI code also calculated a weight and orientation value for each detected feature point. The team updated the Intel IPP-A Fast9 function to add NMS as an option and to return the weight and orientation values.
OpenGL surface sharing and DX9 surface import/export– OpenGL is used for rendering in this pipeline. The video frame is rendered as an OpenGL texture and other virtual elements are added by calling OpenGL drawing primitives. In the Frame Preparation stage of the pipeline, Intel IPP-A's AdvancedResize function converts the video frame from the input format (NV12, YUY2, etc.) to ARGB. A CPU-based copy of this image into an OpenGL texture was one of the top bottlenecks. The Intel IPP-A team added an import/export capability so that a DX9 surface handle could be extracted from an existing Intel IPP-A matrix, or an Intel IPP-A matrix could be created from an existing DX9 surface. This enabled the use of the OpenGL surface sharing capability in the Intel OpenGL driver. With is functionality, a DX9 surface could be shared with OpenGL as a texture, avoiding the CPU-based copy and keeping the data on the GPU.

Additional Non-PixelFlow Optimizations

After implementing the optimizations described in the previous section, a trace performed in the VTune™ analyzer showed that when tracking nine targets, with input video and analytics resolution at 1024x768, several hotspots remained in the computer vision module:

Remaining Hotspots – Ivy Bridge
Function	% of CV	Description
dcvGroupFernsRecognizer::RecognizeAll	18.95	Using x87 floating point. Should try using SIMD floating point instructions such as Intel® SSE3 or Intel® AVX.
dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim	16.76	General code generation issues. Expect these would be improved by using the Intel® compiler.
dcvPolynomSolver::solve_deg3	10.20	General code generation issues. Expect these would be improved by using the Intel compiler.

After building the computer vision module with the Intel® compiler with Intel® AVX instructions enabled, the hotspots were eliminated.

Remaining Hotspots – Ivy Bridge
Function	% of CV	Description
dcvGaussianPyramid3x3::ConstructFirstPyramidLevelOptim	33.56	Image pyramid generation.
dcvCorrelationsDetectorLite::ComputerIntegralImage	16.83	Integral image computation.
dcvKtlOptim::__CalcOpticalFlowPyrLK_Optim_ResizeNN_levels	13.0	LK optical flow.

The second trace uncovered an instance in the code that still used the old CPU-based image pyramid calculation. The instance was updated to use the image pyramid calculated by PixelFlow. The remaining hotspots were additional operations that were not yet included in PixelFlow, integral image, and LK optical flow. The team will target these functions first when extending the PixelFlow functionality.

Results – Performance and Power

The resulting AR pipeline offloads its initial stages to the GPU and provides data for subsequent stages of AR processing. To analyze the PixelFlow implementation of the AR pipeline, the team used a test application from Total Immersion, the "AR Player." This configurable test application allows the user to set operating parameters like the number of targets to track, the video capture resolution and format, the analytics processing resolution, and so on. In addition to the power and performance statistics, the team was interested in the feasibility and impact of increasing the analytics resolution. For the pre-optimized CPU-based flow, the TI AR software used a 320x240 analytics resolution. The additional performance provided by the GPU offload allowed us to experiment with higher resolutions and the resulting impact on responsiveness and quality. The team tested PixelFlow implementation on Ivy Bridge and Bay Trail platforms.

Results: Ivy Bridge

We tested the software on the following Ivy Bridge platform:

Ivy Bridge Platform Details
Item	Description
Computer	HP EliteBook* 8470p
Processor	Intel® Core™ I7 processor 3720QM
Clock Speed	2.6 GHz (3.6 GHz Max Turbo Frequency)
# Cores, Threads	4, 8
L1, L2, L3 Cache	256 KB, 1 MB, 6 MB
RAM	8 GB
Graphics	Intel® HD Graphics 4000
# of Execution Units	16
Graphics Driver	Igdumdim64, 9.18.10.3257, Win7 64-bit
OS	Windows* 7 Pro (Build 7601), 64-bit, SP1

The first test scenario tracked nine targets simultaneously, with both a video capture resolution and an analytics resolution of 640x480.

Test Scenario #1
Metric	Value
Number of targets	9
Capture resolution	640x480
Analytics resolution	640x480

Performance Results – Ivy Bridge, Test Scenario #1
Processor Number	Software (ms)	PixelFlow (ms)	Difference (ms)	Difference (%)
Rendering FPS	60	60
Analytics FPS	30	30
Tracking FPS	30	30
Frame Preprocessing	0.399	0.088	-0.311	-77.83
Tracking	1.412	1.355	-0.057	-4.03
Construct Pyramid	0.548	0.025	-0.523	-95.44
Recognition	3.322	1.477	-1.846	-55.55
Compute Interest Points	1.358	0.035	-1.323	-97.43
Smooth Image	0.693	0.001	-0.692	-99.89

The second test scenario also tracks nine targets, but increases the video capture resolution to 1024x768 with an analytics resolution of 640x480.

Test Scenario #2
Metric	Value
Number of targets	9
Capture resolution	1024x768
Analytics resolution	640x480

Performance Results – Ivy Bridge, Test Scenario #2
Processor Number	Software (ms)	PixelFlow (ms)	Difference (ms)	Difference (%)
Rendering FPS	60	60
Analytics FPS	30	30
Tracking FPS	30	30
Frame Preprocessing	0.391	0.094	-0.297	-75.99
Tracking	1.355	0.900	-0.455	-33.58
Construct Pyramid	0.532	0.024	-0.508	-95.58
Recognition	2.844	0.917	-1.927	-67.77
Compute Interest Points	1.225	0.027	-1.199	-97.83
Smooth Image	0.708	0.001	-0.7070	-99.93

Results: Bay Trail

Similar tests were run on the following Bay Trail platform:

Bay Trail Platform Details
Item	Description
Computer	Intel® Atom™ (Bay Trail) Tablet PR1.1B
Processor	Intel® Atom™ processor Z3770
Clock Speed	1.46 GHz
# Cores, Threads	4, 4
L1, L2, L3 Cache	128 KB, 2048 KB
RAM	2 GB
Graphics	Intel® HD Graphics
# of Execution Units	4
Graphics Driver	Igdumdim32.dll, 10.18.10.3341, Win8 32-bit
OS	Windows* 8 (Build 9431), 32-bit

The test scenario is slightly different than the first test scenario run on the Ivy Bridge platform due to the different resolutions supported by the camera on the Bay Trail system.

Test Scenario #1
Metric	Value
Number of targets	9
Capture resolution	640x360
Analytics resolution	640x360

Performance Results – Bay Trail, Test Scenario #1
Processor Number	Software (ms)	PixelFlow (ms)	Difference (ms)	Difference (%)
Rendering FPS	55	35
Analytics FPS	30	30
Tracking FPS	15	15
Frame Preprocessing	5.215	0.385	-4.830	-92.62
Tracking	15.484	10.411	-5.074	-32.77
Construct Pyramid	6.081	0.122	-5.985	-97.99
Recognition	28.389	15.590	-12.799	-45.09
Compute Interest Points	9.235	0.365	-8.870	-96.04
Smooth Image	7.236	0.011	0.7255	-99.85

The second scenario for Bay Trail tests the video capture resolution at 1280x720, while the analytics resolution remains at 640x460.

Test Scenario #2
Metric	Value
Number of targets	9
Capture resolution	1280x720
Analytics resolution	640x360

Performance Results – Bay Trail, Test Scenario #2
Processor Number	Software (ms)	PixelFlow (ms)	Difference (ms)	Difference (%)
Rendering FPS	12	30
Analytics FPS	30	25
Tracking FPS	8	12
Frame Preprocessing	4.865	0.408	-4.458	-91.62
Tracking	16.158	9.718	-6.440	-39.86
Construct Pyramid	5.995	0.122	-5.872	-97.96
Recognition	32.398	14.532	-17.865	-55.14
Compute Interest Points	8.864	0.376	-8.488	-95.76
Smooth Image	7.337	0.013	-7.324	-99.82

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more complete information about performance and benchmark results, visit Performance Test Disclosure

Power Analysis

After implementing GPU offload using the PixelFlow pipeline, investigations into the power savings achieved by the GPU offload yielded unexpected results; instead of achieving a significant power savings from offloading the processing to the GPU from the CPU, the power consumption of the PixelFlow implementation was on par with the CPU-only implementation. The following GPUView trace shows why this occurred.

Figure 4 –GPUView trace of the processing for a single frame

The application dispatched the work to the GPU in separate chunks: CPU setup, GPU operation, wait for completion, CPU setup, GPU operation, wait for completion, etc. This approach impacted power consumption, causing the processor package to be continually active and not allowing the processor to enter deeper sleep states.

Instead, the pipeline should consolidate GPU operations and maximize CPU/GPU concurrency. The following diagram illustrates the ideal situation to achieve maximum power savings: GPU operations consolidated into a single block, executing concurrently with CPU threads and leaving a period of inactivity that allows the processor package to achieve deeper sleep states.

Figure 5 – Ideal pattern to maximize power savings

Conclusion

Moving the key pixel processing bottlenecks of the Total Immersion AR pipeline to the GPU resulted in performance gains on Intel processors, allowing the application to use a larger input frame size for video analysis, find targets faster, track more targets, and track them more smoothly. We expect similar gains can be achieved for similar video analysis pipelines.

While achieving performance benefits using Intel IPP-A is fairly straightforward, achieving power benefits requires a careful design of the processing pipeline. The best is one that consolidates the GPU operations and maximizes CPU/GPU concurrency to allow the processor to reach deeper sleep states. Diagnostic and profiling tools that are GPU-capable, like GPUView and Intel VTune analyzer, are essential as they can help to identify power-related problems with the pipeline. Consider using these tools during development to verify the power efficiency of a pipeline and avoid having to re-architect a pipeline to address power-related issues.

The PixelFlow pipeline offloaded several of the pixel processing bottlenecks in the TI pipeline. Work remains to move additional operations to the GPU such as integral image, optical flow, FERNS, etc. Once these operations are included in PixelFlow, all of the pixel processing will occur on the GPU with these operations returning metadata to the CPU as input for higher-level operations. The success of the current PixelFlow implementation, which uses IPP-A-based GPU offload, indicates that further gains are possible with additional offloading of pixel processing operations.

Finally, power and performance optimization can go beyond just the vision processing algorithms, but can extend to other areas such as video input, codecs, and graphics output. Intel IPP-A allows for DX9-based surface sharing with related Intel technologies such as the Intel® Media SDK for codecs and the OpenGL graphics driver. Understanding the optimization opportunities with these related technologies is also important. This allows developers to create entire GPU-based processing pipelines.

Author Biographies

Michael Jeronimo is a software architect and applications engineer in Intel's Software and Solutions Division (SSG), focused on helping customers to accelerate computer vision workloads using the GPU.

Pascal Mobuchon is the VP of Engineering at Total Immersion.

References

Item	Location
Total Immersion web site	http://www.t-immersion.com/
Total Immersion Wikipedia page	http://en.wikipedia.org/wiki/Total_Immersion_(augmented_reality)
Augmented Reality – Wikipedia page	http://en.wikipedia.org/wiki/Augmented_reality
Intel® VTune™ Amplifier XE	https://software.intel.com/en-us/intel-vtune-amplifier-xe
Intel® Graphics Performance Analyzers	https://software.intel.com/en-us/vcsource/tools/intel-gpa
GPUView	http://msdn.microsoft.com/en-us/library/windows/hardware/ff570133(v=vs.85).aspx
Intel® IPP-A web site	https://software.intel.com/en-us/intel-ipp-preview

Intel® Integrated Performance Primitives

Intel® IPP

Augmented Reality Pipeline

Microsoft Windows* (XP, Vista, 7)

Microsoft Windows* 8

Intermediário

Primitivas Intel® Integrated Performance

Intel® VTune™ Amplifier XE

Processadores Intel® Atom™

Gráficos

Processadores Intel® Core™

Eficiência energética

http://legal.intel.com/Marketing/notices+and+disclaimers.htm

↧

Optimizing Cyberlink PowerDVD 10* Improves Battery Life

June 9, 2014, 2:33 pm

Latest and popular articles on Intel Technologies

≫ Next: NAMD* for Intel® Xeon Phi™ Coprocessor

≪ Previous: Optimizing an Augmented Reality Pipeline using Intel® IPP Asynchronous

Download PDF

Authors:
Manuj Sabharwal and Gael Hofemeier, Software Engineers, Software Solutions Group, Intel Corporation

Introduction

Low battery life is one of the most serious issues currently plaguing mobile devices in general and Ultrabook™ devices and tablets specifically. Users have become accustomed to streaming multimedia content to their mobile devices “on-demand” from content servers in the cloud. Because these devices have limited battery capacity, energy efficiency is important. Cyberlink PowerDVD 10* (PowerDVD*) is one of the top players in the industry for HD, and 3D movie playback. This app is often included as a pre-bundled application from OEMs. In this case study, we showcase how Intel and Cyberlink collaborated to optimize the PowerDVD* application to give best-in-class experience on Intel devices.

First, we’ll talk about the challenges that Cyberlink encountered when adding content streaming features to PowerDVD and the tools and techniques Intel used to improve the power consumption of PowerDVD.

Then, we’ll discuss the power consumption profile of a Cyberlink PowerDVD streaming media application and its impact on battery life for mobile devices. We also provide an analysis of PowerDVD behavior to identify issues such as decoding on CPU, large numbers of context switches, high interrupt rates, etc., causing increased power consumption. Finally, we’ll provide the data that shows the reduced power consumption following optimization.

The optimization was a huge success. The Intel team was able to make the following improvements to PowerDVD:

Package C0 reduced to 20% from 100% during media playback
Reduced SoC power from ~6 W to ~1.8W using Intel® Power Gadget
Intel® VTune™ analyzer reported CPU utilization of 25% down from 70%
The Windows* Performance Analyzer showed frequent wakeups (5 Msec) vs. 10 msec wake up frequency for local or streaming media playback frequency of 10%.

Definitions

Acronym	Definition
BLA	Battery Life Analyzer
GPU	Graphics processing unit
WPA	Windows Performance Analyzer
DLNA Server	Digital Living Network Alliance Server
HD	High density
SoC	System on Chip
FPS	Frames per second
SDK	Software development kit
SKU	Stock Keeping Unit

The Challenges of Optimizing Battery Life

PowerDVD offers new features for organizing, streaming media, mobile devices, and social media. In addition to functioning on a client, the latest software can turn a device into a DLNA server and stream multimedia content from a PC across a network to other devices. It can also stream content from external content servers. Adding content streaming came with a price, however. New capabilities, such as HD streaming, required running more processes, consuming much more memory and CPU cycles. This took a toll on battery life. We needed to answer the following questions:

What is the power consumption from PowerDVD during a 1080p streaming media playback?
Why was PowerDVD able to playback only an hour of media on a fully charged battery?

After two months and three iterations of analysis and validation, the engineering teams improved battery life by making the following changes:

Offloaded graphics to the GPU (using the Intel® Media SDK)
Removed the sleep loop calls from two threads
Used an overlay to reduce extra memory copies

The following describes the process and tools that resulted in the optimized version of PowerDVD.

Optimization of Cyberlink PowerDVD for Power Consumption

Test System Configuration:

4th generation Intel® Core™ i7 processor
Lenovo Yoga* 2 Pro
CPU speed : 1.4 GHz non-turbo frequency
Memory 4 GB display : 1920x1080p HD panel
Cyberlink PowerDVD 10 and Cyberlink PowerDVD 12

Validation and analysis showed:

Package C0 was pegged 100% during media playback, while we expected it to be at 20%.
Intel Power Gadget showed SoC power to be ~6 W. It should be ~1.7 W on a 4th generation Intel processor.
Intel VTune results revealed no offloading of graphics to the GPU and high CPU utilization of 70% (we expected about 10%)
The Windows Performance Analyzer tests revealed frequent wakeups (5 msec). The normal frequency is 10 msec with audio playback.

First Step - Validation

To understand and address PowerDVD’s impact on battery life, we used Intel Power Gadget and Battery Life Analyzer (BLA) to validate the application’s SoC power usage. Figure 1 shows the Intel Power Gadget’s UI on a Windows platform.

Figure 1. Intel® Power Gadget UI on Windows* Platform

As part of our validation of PowerDVD, we used Intel Power Gadget to determine power impacts during playback. Figure 2 shows the power output Intel Power Gadget recorded.

PowerDVD’s power usage was ~6 W of SoC power during playback. Intel recommends a maximum of ~2.0 W on 4th generation Intel processors (low power processors typically used in Ultrabook devices).

Figure 2. Processor Power Usage during PowerDVD* Playback

To gain deeper insight into what other activities were affecting power, we used the Battery Life Analyzer (BLA) tool to understand the impact of media playback on residencies. Understanding residency is important as changing the SoC SKU can impact power.

BLA is a power management analysis tool developed by Intel to identify issues that impact battery life. BLA helps to identify a wide range of issues during software analysis such as:

Software CPU utilization
OS timer resolution changes
Frequent C state transitions
Excessive ISR/DPC activity

Figure 3 shows package residency during 1080p HD video playback using Cyberlink PowerDVD.

Figure 3. Package Residency during 1080p HD Video Playback using PowerDVD*

The package residency includes CPU, Graphics, and UnCore events. More time in package C0 results in higher SoC power. Expected package C0 for Cyberlink PowerDVD 1080p playback is ~20% on 4th generation U-Processor. As we can see from Figure 3, package residency is far higher than it should be.

Both Intel Power Gadget and BLA confirmed higher power usage and ~4 hrs. of battery life on 42 Whr (Watt-hours) battery capacity with ~6 W SoC+3 W of display and 2+ W for other components.

Our next step was to analyze the application for power optimization.

Second Step - Analysis

For the analysis phase, we used two tools:

The following tables summarize the results of the analysis, which showed definite room for improvement.

Table 1. Intel® Power Gadget and BLA Results

Actual Results	Expected Results
Package C0 is pegged at 100% during media playback	Package C0 should be at 20% during media playback
SoC power using Intel® Power Gadget is ~6 W	SoC power should be ~1.7 W on 4th generation Intel processor

Table 2. Intel® Vtune™ and WPA Results

Analysis Tool	Observations
Intel VTune results	Since the app had no codecs, there was no offloading to graphics High CPU utilization (70% vs. the expected 10%)
Windows Performance Analyzer	Frequent wakeups (5 msec) occurred– expected frequency is 10 msec with audio playback

The next figures provide a walkthrough of some of the important screenshots from our analysis.

Intel VTune analyzer was used to validate the PowerDVD application for the presence of spin waits, the presence of hardware acceleration, and hotspots (a micro-architecture issue). Figure 4 shows the steps for collecting the graphics call stacks.

Figure 4. VTune™ UI for Analyzing DirectX* Pipeline Events

Figure 5 shows the VTune summary with significant time spent in spin loop. GPU Usage shows no codec usage. Most of the time spent in the GPU is for display and other pre-processing algorithms during playback.

Figure 5. VTune™ Summary showing Spin Loop time

Digging deeper into the analysis, Intel VTune shows high CPU utilization during media playback, and instances where VSync (the red highlights in Figure 5) and GPU software queue are not occurring every ~33 msec (30 FPS playback). This analysis shows software glitches during media playback.

Figure 6. VTune™ Summary Report

Looking at Figure 7, the summary report confirms an inconsistent frame rate over time. The FPS varies for 30 FPS movie playback between 0-60 FPS. The chart shows the total number of frames executed in an application with a specific frame rate. A high number of slow or fast frames signals a performance bottleneck. The goal is to optimize the code to keep the frame rate constant, for example, from 30 to 60 FPS.

Figure 7. VTune™ analysis of Frame Rates

Next, we used the Windows Performance Analyzer (WPA) tool to analyze the application for wakeup activities, interrupts, and context switches. Figure 8 shows using CPU-based Intel® SSE instructions for H264 decode. It is more efficient to offload this work on to the GPU than to run it on the CPU.

Figure 8. WPA Analysis of Wakeup Activities, Interrupts, and Context Switches

WPA also shows wakeup activities from PowerDVD during playback. Figure 9 displays the two PowerDVD threads, both running at 10 msec. The two threads are not coalesced, which causes the overall system to wake up at a 5 msec timer interval. Figure 10 shows the call stack with sleep loop Win32* API being called every 10 msec interval.

Figure 9. WPA thread analysis

Figure 10. WPA call stack with sleep loop analysis

Table 3 reveals significant reduction in package residency after optimization.

Table 3. Validating Package Residency after Optimization

C-state Counters	Average (%) Before Optimization	Average (%) After Optimization
PackageC0-C1	100%	20.18%
PackageC0-C2	0%	8.29%
PackageC0 C3	0% 0%	0.19%
PackageC0 C6	0%	1.91%
PackageC0 C7	0%	69.43%

Optimization Results/Validation

The following tables show the “before” and “after” results:

Table 4. Intel® Power Gadget and BLA: Before and After¹

Before Optimization	After Optimization
Package C0 is pegged at 100% during media playback	Package C0 is reduced to 20%
SoC power is ~6 W	SoC power reduced to ~1.8 W on test system

Table 5. Intel® VTune™ Amplifier and WPA Results: Before and After¹

	Before	After
Intel® VTune™ Amplifier	Since the app had no codecs, there was no offloading to graphics High CPU utilization (70% vs. the expected 10%)	Video codecs now reported CPU utilization decreased by 25%
Windows Performance Analyzer	Frequent wakeups (5 msec) – expected frequency is 10 msec with audio playback	Sleep thread removed – reduced wakeups by 2x (5 msec to 10 msec)
Battery Life Analyzer	Package residency 100%	Package residency ~20%

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

We optimized by:

Offloading to Intel® HD Graphics using Intel Media SDK
Optimizing Win32 API calls that cause periodic wakeup on CPU
Using an overlay to save one memory copy per frame

The first task was to use the Intel Media SDK for offloading decode to graphics which will provide better efficient/watt usage of Intel HD graphics. The pseudo code in Figure 11 provides an example of a simple use of Intel Media SDK to offload a stream of frame to graphics.

Figure 11. Intel® Media SDK code snippet – offloading a frame to graphics.

Once we offloaded to graphics using the Intel Media SDK, we ran PowerDVD and measured the results using Intel VTune Amplifier. Compared to Figure 5 where we didn’t see any codec usage, we now see Video Enhancement in the summary (Figure 12).

Figure 12.Intel® VTune™ Amplifier Summary result

Examining other Intel VTune graphics views, we verified that by using Intel Media SDK [to do what?] use of frame decoded on the GPU vs. on the CPU. Figure 13 shows a batch of frames being decoded after ~20 msec on GPU. Offloading the decode work to the GPU helped to reduce CPU utilization by ~25% on the test system.

Figure 13. Frame decoding after ~20 msec on the GPU

To verify our optimization of offloading graphics, we ran Intel Power Gadget. Compared to the baseline result shown in Figure 2, we saw ~2 W of power saving just by performing graphics offloading (Figure 14).

Figure 14. Power Savings resulting from Graphics Offload

We made some good progress, but ~4 W was not low enough. As stated earlier, the goal for streaming media 1080p playback is ~1.7 W of SoC/package power.

The next step was to find other CPU-based optimizations. Initial analysis showed sleep loop calls from two threads (non-coalesced) waking the CPU every 5 msec. CyberLink engineers needed to remove the sleep threads from their application. However, this was one of the most difficult changes since it required modifying the structure of the application. Figure 15 shows wakeup activities increase to 10 mse after periodic activities were removed.

Figure 15. Optimized Cyberlink PowerDVD* after removing periodic activities

Removing periodic activities revealed a ~800 mW saving. With current optimizations, 1080p HD streaming playback SoC power went from ~6 W to 2.8 W, but additional optimizations still had to be done to reach the 1.7 W goal seen in best-in-class applications.

Figure 16. Power Optimizations down to ~2.8 W

The next step was to reduce extra memory copies using an overlay. With the overlay, the overall package power was reduced by ~400 mW. Figure 17 shows power was reduced to ~1.8 W from ~6 W.

Figure 17. Cyberlink PowerDVD* at final Power Consumption (1.8 W)

With that, the most important optimization goals had been achieved, and Intel and Cyberlink engineers deemed the project a success.

Close collaboration between Cyberlink and Intel helped to complete the optimization in two months with full validation. The final product with all optimizations was released to OEMs six months from when we started.

Conclusion

The Intel and PowerDVD engineers used several tools including Intel VTune and Microsoft Windows Performance Analyzer to reach the optimum low-power playback. The collaboration included knowledge sharing on tools with weekly analysis/meetings to meet the battery life goal before the release deadline.

Several iterations were completed before the team was satisfied with their results (PowerDVD consumes ~1.8 W down from ~6 W.) Intel and Cyberlink engineers faced the challenge of keeping the quality of playback the same before and after optimization. Each optimization required a validation and analysis process before it could pass the Cyberlink team’s internal quality tests. Thus, every change was tracked and user experience metrics (power and performance) were evaluated.

The following optimizations were found to work the best for achieving the optimization goals, but as noted above, these were accomplished over several iterations:

Offloading graphics to the GPU (using the Intel Media SDK)
Removing sleep loop calls from two threads
Using an overlay to reduce extra memory copies

The combined efforts between the Intel and CyberLink PowerDVD team resulted in optimizing their streaming media playback application to reach the best-in-class goal.

About the Authors

Manuj Sabharwal is a Software Engineer in the Software Solutions Group at Intel. Manuj has been involved in exploring power enhancement opportunities for idle and active software workloads. He has significant research experience in power efficiency and has delivered tutorials and technical sessions in the industry. He also works on enabling client platforms through software optimization techniques.

Gael Hofemeier has worked for Intel since 2000 as an Application Engineer in the Software Solutions Group at Intel. Gael’s current focus is in Technology Evangelism for Business Client Apps and Technologies.

References

Windows Performance Analyzer: http://www.microsoft.com/en-us/download/details.aspx?id=30652
Battery Life Analyzer: http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=19351
Intel® Power Gadget: https://software.intel.com/en-us/articles/intel-power-gadget-20
Cyberlink PowerDVD: http://www.cyberlink.com/products/powerdvd-ultra/features_en_US.html?&r=1
Intel® Media SDK: https://software.intel.com/en-us/vcsource/tools/media-sdk-clients

Relevant Intel Links

Energy Efficient Software Development: https://software.intel.com/en-us/energy-efficient-software
Power Analysis Guide for Windows*: https://software.intel.com/en-us/articles/power-analysis-guide-for-windows
Windows 8* Software Power Optimization: https://software.intel.com/en-us/articles/windows-8-software-power-optimization
Intel processor numbers: http://www.intel.com/products/processor_number/

Notices and Disclaimers

Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others
Copyright© 2014 Intel Corporation. All rights reserved.

Analyzing Power Efficiency

optimizing applications

WPA

Battery Life analyzer

Intel® VTune™ Amplifier XE

Processamento de mídia

Eficiência energética

Laptop

Desktop

Scalable Molecular Dynamics

URL:

Intel(r) Power Gadget

Intel(r) Media SDK

↧

NAMD* for Intel® Xeon Phi™ Coprocessor

June 19, 2014, 9:38 am

Latest and popular articles on Intel Technologies

≫ Next: 竞赛获胜者将带有百科全书的增强现实整合至 ARPedia*

≪ Previous: Optimizing Cyberlink PowerDVD 10* Improves Battery Life

Purpose

This code recipe describes how to get, build, and use the NAMD* Scalable Molecular Dynamics code for the Intel® Xeon Phi™ Coprocessor.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++* parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER*, CHARMM*, and X-PLOR*.

NAMD is distributed free of charge with source code. Users can build NAMD or download binaries for a wide variety of platforms. Tutorials show how to use NAMD and VMD* for biomolecular modeling. Find out more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Code Support for Intel® Xeon Phi™ Coprocessor

NAMD 2.10 with Intel® Xeon Phi™ Coprocessor support is expected to be released in early to mid 2014. With support for Intel® many-integrated core (MIC) architecture, Intel expects to push NAMD performance and scalability to higher limits on Intel® architecture. Currently the code remains in development, but it can be compiled from nightly source code builds. Pre-built binaries are not available at this time.

NAMD code for Intel Xeon Phi Coprocessor continues to evolve. Intel developers are diligently working on known issues in order to achieve the project goals of performance and scalability on Intel Xeon Phi Coprocessor.

Code Access

To get access to the NAMD for Intel Xeon Phi Coprocessor code:

Download the original code at http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD and select Source Code under Version Nightly Build.

Build Directions

To build NAMD you also need the following libraries.

TCL (http://www.tcl.tk/);
FFTW (http://www.fftw.org/) , use fftw2 version (if you want you can try fftw3 version):

./configure --enable-float --enable-type-prefix --enable-static --prefix=<fftwBaseDirHere> --disable-fortran CC=icc
make CFLAGS=" -O2 " clean install
CHARM ++ (http://charm.cs.uiuc.edu/software/) can be built in 2 ways:
1. Infiniband (verbs-linux-x86_64-smp-iccstatic) version:
  
  ./build charm++ verbs-linux-x86_64 smp iccstatic --with-production
  
  Notes: check where your ibverbs lib is, if it is not in /opt/ofed/lib64 or /usr/local/ofed/lib64 directories you need to change [charmDir]/src/arch/verbs-linux-x86_64/conv-mach.sh file
2. MPI (mpi-linux-x86_64-smp-mpicxx) version: ./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK

NAMD build instructions for the Intel Xeon Phi Coprocessor version are essentially the same as compiling standard NAMD, with the following changes:

Note: You can obtain Intel® Composer XE Version 13 from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.

Notes: using make’s "-j" option will speedup compilation significantly.

Running NAMD Workloads on Intel Xeon Phi Coprocessor

Running NAMD on Intel Xeon Phi Coprocessor is much like running the standard NAMD code, with the following exceptions:

Source the Intel® compiler, so libraries can be found.

Setup the following extra environment variables:

export KMP_AFFINITY=granularity=fine,compact
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=granularity=fine,balanced

To execute NAMD, on the namd2 command line, add +devices xxx, where xxx is a list of devices (e.g. "0,1" for the first two devices on a node). If the user omits the "+devices xxx" option at runtime, the application will attempt to use all available devices on a given node.
The number of PE’s per node must be > number of MICs in the node, and there must be at least one patch per PE.

Host threads and PEs are part of the command line options traditionally used.

Some examples of running NAMD workloads:

Ibverbs:
$BIN_DIR/charmrun ++nodelist $NODEFILE +p $NUM_PROCS ++ppn $PPN $BIN_DIR/wrapper.sh $BIN_DIR/$BIN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 "+devices 0,1"
PPN – for best results use 1 less than the number of available cores, for example PPN=23 if you have 24 cores per node(or PPN=47 if you use hyperthreading⁵)
NUM_PROCS = $PPN * $ NODECOUNT
MPI:
mpiexec.hydra -perhost 1 -n $NODECOUNT $BIN_DIR/$BIN +ppn $PPN $WORKLOAD_DIR/$CONFIG_FILE +pemap 1-$PPN +commap 0 +devices 0,1
Notes: "+pemap 1-$PPN +commap 0" more effective than "+setcpuaffinity"

Performance Testing^2,3

The following results show performance on a single node and cluster.

Single-node Performance Testing

Note: Single-node performance uses the multi-core build of NAMD (no network layers are used).

Single-node Platform Configurations⁴

The following hardware and software were used for the above recipe and performance testing.

Server Configuration (Intel® Xeon® processor E5 V2 family):

2-socket/24 cores:
Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading⁵
Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
Memory: 64GB
Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading⁵, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 2.1.6720-15
Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

Server Configuration (Intel® Xeon® processor E5 family):

2-socket/16 cores:
Processor: Intel® Xeon® processor E5 @ 2.60GHz (8 cores) with Intel® Hyper-Threading⁵
Operating System: Red Hat Enterprise Linux* 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Memory: 64GB
Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading⁵, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 2.1.6720-13
Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

NAMD

NAMD: Linux-x64_64-icc
Charm++: multicore-linux64-icc
Configuration parameters were modified to achieve optimal performance⁴

Cluster Performance Testing^2,3

Note: Cluster results use Infiniband*.

Cluster Platform Configuration⁴

The following hardware and software were used for the above recipe and performance testing.

Endeavor Cluster Configuration:

2-socket/24 cores:
Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading⁵
Operating System: Red Hat Enterprise Linux* 2.6.32-358.6.2.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Memory: 64GB
Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading⁵, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 2.1.6720-16
Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

NAMD

NAMD: Linux-x64_64-icc
Charm++: verbs-linux-x86_64-smp-iccstatic
Configuration parameters were modified to achieve optimal performance⁴

DISCLAIMERS:

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

2. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

3. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

4. For more information go to http://www.intel.com/performance

5. Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Intel, the Intel logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

*Other names and brands may be claimed as the property of others.

Intel(R) Xeon Phi(TM) Coprocessor

Arquitetura Intel® Many Integrated Core