11 - OpenCL Fundamentals
SETTING UP OPENCL PLATFORMS
Devices and Platforms
• Devices
– AMD CPU, GPU and APU
– Intel CPU/GPU
– NVIDIA GPU
• Platforms:
– Linux
– Windows 7/8/8.1
How to run OpenCL programs?
• Make sure you have installed OpenCL
drivers
– Intel
https://software.intel.com/en-us/articles/opencl-drivers
– AMD http://support.amd.com/en-us/download
– NVIDIA http://www.nvidia.com/Download/index.aspx
• You can check whether your devices support OpenCL using "GPU Caps Viewer"
Installation Notes
• Google is your friend:
– AMD APP SDK Installation Notes
– CUDA Toolkit Installation Notes
– Intel OpenCL SDK Installation Notes
Developing OpenCL applications
• Make sure you have a machine which supports
OpenCL, as described above.
• Get the OpenCL headers and libraries included in
the OpenCL SDK from your device vendor.
• Start writing OpenCL code. That's the difficult part.
• Tell the compiler where the OpenCL headers are
located.
• Tell the linker where to find the “OpenCL.lib” files.
• Build the fabulous application.
• Run it and prepare to be amazed.
Compiling OpenCL in Linux (gcc/g++)
• To compile your OpenCL program you must tell the compiler to link against the OpenCL library with the flag -lOpenCL
• If the compiler cannot find the OpenCL header files (it usually should), specify the location of the CL/ folder with the -I (capital "i") flag
• If the linker cannot find the OpenCL runtime library (it usually should), specify the location of the lib file with the -L flag
• A typical command line is therefore something like (paths are placeholders for wherever your SDK lives): gcc vadd.c -o vadd -I/path/to/opencl/include -L/path/to/opencl/lib -lOpenCL
• Make sure you are using a recent enough version of gcc/g++ - at least v4.7 is required to use the OpenCL C++ API (which needs C++11 support)
Compiling OpenCL in Windows (Visual C++)
• Adding OpenCL headers and libraries to the project
– If using the AMD SDK or Intel SDK, replace "$(CUDA_INC_PATH)" with "$(AMDAPPSDKROOT)" or "$(INTELOCLSDKROOT)"
Lecture 2
AN OVERVIEW OF OPENCL
It's a Heterogeneous world
A modern computing platform includes:
• One or more CPUs
• One or more GPUs
• DSP processors
• Accelerators
• … other?
E.g. Samsung® Exynos 5: dual-core ARM A15 @ 1.7GHz with a Mali T604 GPU
E.g. Intel XXX with IRIS
(Figure: ATI™ RV770 with 10 cores, 16-wide SIMD; Intel® Xeon Phi™ coprocessor with 61 cores, 16-wide SIMD; NVIDIA® Tesla® C2090 with 16 cores, 32-wide SIMD)
The Heterogeneous many-core challenge:
How are we to build a software ecosystem for the
Heterogeneous many core platform?
Third party names are the property of their owners.
Industry Standards for Programming Heterogeneous Platforms
• CPUs: multiple cores driving performance increases; multi-processor programming languages and APIs, e.g. OpenMP
• GPUs: increasingly general-purpose data-parallel computing; graphics shading languages and APIs
• The intersection of the two is the emerging field of heterogeneous computing, which is where OpenCL sits
Timeline: the OpenCL 1.0 specification was released in Dec 2008, 1.1 in Jun 2010 and 1.2 in Nov 2011. During 2H09, multiple conformant implementations shipped across a diverse range of platforms.
Building Program Objects
• The program object encapsulates:
– A context
– The program kernel source or binary
– List of target devices and build options
• The C API build process to create a program object:
– clCreateProgramWithSource()
– clCreateProgramWithBinary()
OpenCL uses runtime compilation … because in general you don't know the details of the target device when you ship the program
__kernel void
horizontal_reflect(read_only image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0); // x-coord
  int y = get_global_id(1); // y-coord
  int width = get_image_width(src);
  float4 src_val = read_imagef(src, sampler,
                               (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
The same kernel source is compiled for the GPU (producing GPU code) and for the CPU (producing CPU code).
Example: vector addition
• The “hello world” program of data parallel
programming is a program to add two vectors
(Figure: a single Context containing the dp_mul programs (CPU program binary and GPU program binary), the dp_mul kernel with its arg[0], arg[1] and arg[2] values, memory objects (Buffers and Images), and in-order and out-of-order Command Queues feeding the GPU device.)

__kernel void dp_mul(global const float *a,
                     global const float *b,
                     global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
}
1. Define the platform
• Grab the first available platform:
err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);

program = clCreateProgramWithSource(context, 1,
              (const char**) &KernelSource, NULL, &err);
// build the program (this call is assumed; the error check below reads its build log)
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
  size_t len;
  char buffer[2048];
  clGetProgramBuildInfo(program, device_id,
      CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
  printf("%s\n", buffer);
}
It’s complicated, but most of this is “boilerplate” and not as bad as it looks.
Exercise 2: Running the Vadd kernel
• Goal:
– To inspect and verify that you can run an OpenCL kernel
• Procedure:
– Take the provided C Vadd program. It will run a simple kernel to
add two vectors together.
– Look at the host code and identify the API calls in the host code.
Compare them against the API descriptions on the OpenCL
reference card.
– There are some helper files which time the execution, output
device information neatly and check errors.
• Expected output:
– A message verifying that the vector addition completed successfully
Lecture 4
especially for C++ programmers…
C++ Interface: setting up the host program

cl::Buffer d_a, d_b, d_c;
cl::Context context(CL_DEVICE_TYPE_DEFAULT);

d_a = cl::Buffer(context, h_a.begin(), h_a.end(), true);
d_b = cl::Buffer(context, h_b.begin(), h_b.end(), true);
Exercise 3: Running the Vadd kernel (C++)
• Goal:
– To learn the C++ interface to OpenCL's API
• Procedure:
– Examine the provided program. It will run a simple kernel to add two vectors together
– Look at the host code and identify the API calls in the host code. Note how some of the API calls in OpenCL map onto C++ constructs
– Compare the original C with the C++ versions
– Look at the simplicity of the common API calls
• Expected output:
– A message verifying that the vector addition completed
successfully
Exercise 4: Chaining vector add kernels
(C++)
• Goal:
– To verify that you understand manipulating kernel invocations and
buffers in OpenCL
• Procedure:
– Start with a VADD program in C++
– Add additional buffer objects and assign them to vectors defined on
the host (see the provided vadd programs for examples of how to do
this)
– Chain vadds … e.g. C=A+B; D=C+E; F=D+G.
– Read back the final result and verify that it is correct
– Compare the complexity of your host code to C
• Expected output:
– A message to standard output verifying that the chain of vector
additions produced the correct result
(Sample solution is for C = A + B; D = C + E; F = D + G; return F)
Review
cl::make_kernel
  <cl::Buffer, cl::Buffer, cl::Buffer, int> vadd(program, "vadd");

Create a kernel (advanced)
• If you want to query information about a kernel, you will need to create a kernel object too:
  cl::Kernel ko_vadd(program, "vadd");
• If we set the local dimension ourselves or accept the OpenCL runtime's choice, we don't need this step
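For example (a sketch using the C++ wrapper; device is assumed to be the cl::Device you are targeting), the kernel object lets you query the maximum work-group size this kernel supports on that device:

size_t wg_size =
    ko_vadd.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);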
Lecture 5
INTRODUCTION TO OPENCL
KERNEL PROGRAMMING
OpenCL C for Compute Kernels
• Derived from ISO C99
– A few restrictions: no recursion, function
pointers, functions in C99 standard headers ...
– Preprocessing directives defined by C99 are
supported (#include etc.)
• Built-in data types
– Scalar and vector data types, pointers
– Data-type conversion functions:
• convert_type<_sat><_roundingmode>
– Image types:
• image2d_t, image3d_t and sampler_t
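For example, inside a kernel (a small sketch of the conversion built-ins mentioned above):

float4 f = (float4)(1.6f, -2.4f, 3.5f, 40000.0f);
int4   i = convert_int4(f);        // default float->int conversion truncates toward zero
uchar4 c = convert_uchar4_sat(i);  // _sat clamps out-of-range values instead of wrapping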
OpenCL C for Compute Kernels
• Built-in functions — mandatory
– Work-Item functions, math.h, read and write image
– Relational, geometric functions, synchronization
functions
– printf (v1.2 only, so not currently for NVIDIA GPUs)
• Built-in functions — optional (called
“extensions”)
– Double precision, atomics to global and local memory
– Selection of rounding mode, writes to image3d_t
surface
OpenCL C Language Highlights
• Function qualifiers
– __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued
– Kernels can call other kernel-side functions
• Address space qualifiers
– __global, __local, __constant, __private
– Pointer kernel arguments must be declared with an address space
qualifier
• Work-item functions
– get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
– Barriers - all work-items within a work-group must execute the
barrier function before any work-item can continue
– Memory fences - provides ordering between memory operations
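A minimal kernel pulling these pieces together (a sketch; the kernel and argument names are made up):

__kernel void scale(__global float *data,    // global address space pointer argument
                    __constant float *coeff, // constant address space pointer argument
                    const float a)
{
    int i = get_global_id(0);                // work-item function
    data[i] = a * coeff[0] * data[i];
}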
OpenCL C Language
Restrictions
• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but
not as an argument to a kernel invocation
• Bit-fields are not supported
• Variable length arrays and structures are not
supported
• Recursion is not supported (yet!)
• Double types are optional in OpenCL v1.1, but the keyword is reserved
(note: most implementations support double)
Worked example: Linear Algebra
• Definition:
– The branch of mathematics concerned with the study of vectors,
vector spaces, linear transformations and systems of linear
equations.
• Example: Consider the following system of linear equations
x + 2y + z = 1
x + 3y + 3z = 2
x + y + 4z = 6
– This system can be represented in terms of vectors and a matrix
as the classic “Ax = b” problem.
[ 1 2 1 ] [ x ]   [ 1 ]
[ 1 3 3 ] [ y ] = [ 2 ]
[ 1 1 4 ] [ z ]   [ 6 ]
Solving Ax=b
• LU Decomposition:
– transform a matrix into the product of a lower triangular and upper
triangular matrix. It is used to solve a linear system of equations.
[ 1  0 0 ] [ 1 2 1 ]   [ 1 2 1 ]
[ 1  1 0 ] [ 0 1 2 ] = [ 1 3 3 ]
[ 1 -1 1 ] [ 0 0 5 ]   [ 1 1 4 ]
     L          U     =     A
• We solve for x, given a problem Ax = b
– Ax = b  =>  LUx = b
– Ux = (L^-1)b  =>  x = (U^-1)(L^-1)b
So we need to be able to do matrix multiplication
Matrix multiplication: sequential code
We calculate C = AB, where all three matrices are NxN

Case                              | MFLOPS (CPU) | MFLOPS (GPU)
Sequential C (not OpenCL)         |        887.2 |          N/A

The corresponding OpenCL kernel, with one work-item per element C(i,j):

__kernel void mmul(
    const int N,
    __global float *A,
    __global float *B,
    __global float *C)
{
    int k;
    int i = get_global_id(0);
    int j = get_global_id(1);
    float tmp = 0.0f;
    for (k = 0; k < N; k++)
        tmp += A[i*N+k] * B[k*N+j];
    C[i*N+j] = tmp;
}
Matrix multiplication host program (C++ API)

Case                              | MFLOPS (CPU) | MFLOPS (GPU)
Sequential C (not OpenCL)         |        887.2 |          N/A
C(i,j) per work-item, all global  |      3,926.1 |      3,720.9

Device is Tesla® M2090 GPU from NVIDIA® with a max of 16 compute units, 512 PEs
Device is Intel® Xeon® CPU, E5649 @ 2.53GHz

C(i,j) = A(i,:) x B(:,j)
• Private Memory:
– A very scarce resource, only a few tens of 32-bit words
per Work-Item at most
– If you use too much it spills to global memory or reduces
the number of Work-Items that can be run at the same
time, potentially harming performance*
– Think of these like registers on the CPU
* Occupancy on a GPU
Local Memory*
• Tens of KBytes per Compute Unit
– As multiple Work-Groups will be running on each CU, this means only
a fraction of the total Local Memory size is available to each Work-
Group
• Assume O(1-10) KBytes of Local Memory per Work-Group
– Your kernels are responsible for transferring data between Local and
Global/Constant memories … there are optimized library functions
to help
– E.g. async_work_group_copy(), async_work_group_strided_copy(), …
• Use Local Memory to hold data that can be reused by all the
work-items in a work-group
• Access patterns to Local Memory affect performance in a
similar way to accessing Global Memory
– Have to think about things like coalescence & bank conflicts
* Typical figures for a 2013 GPU
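As a reminder of the mechanics (a sketch; the kernel and argument names are made up): a kernel can either declare a fixed-size __local array itself, or take a __local pointer argument whose size the host chooses with clSetKernelArg:

__kernel void stage(__global const float *in,
                    __global float *out,
                    __local float *tile)   // size chosen by the host
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    tile[lid] = in[gid];                   // copy global -> local
    barrier(CLK_LOCAL_MEM_FENCE);          // the whole work-group sees the tile
    out[gid] = tile[lid];
}
// host side: clSetKernelArg(kernel, 2, sizeof(float) * local_size, NULL);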
Local Memory
• Local Memory doesn’t always help…
– CPUs don’t have special hardware for it
– This can mean excessive use of Local Memory
might slow down kernels on CPUs
– GPUs now have effective on-chip caches which
can provide much of the benefit of Local
Memory but without programmer intervention
– So, your mileage may vary!
The Memory Hierarchy
(Figure: bandwidths and sizes at each level of the hierarchy. For example, Private memory: O(2-3) words/cycle per work-item of bandwidth, O(10) words per work-item of capacity.)
Speeds and feeds approx. for a high-end discrete GPU, circa 2011
Memory Consistency
• OpenCL uses a relaxed consistency memory model; i.e.
– The state of memory visible to a work-item is not guaranteed to be
consistent across the collection of work-items at all times.
• Within a work-item:
– Memory has load/store consistency to the work-item’s private view
of memory, i.e. it sees its own reads and writes correctly
• Within a work-group:
– Local memory is consistent between work-items at a barrier.
• Global memory is consistent within a work-group at a barrier,
but not guaranteed across different work-groups!!
– This is a common source of bugs!
• Consistency of memory shared between commands (e.g.
kernel invocations) is enforced by synchronization (barriers,
events, in-order queue)
Optimizing matrix multiplication
• There may be significant overhead to manage work-items
and work-groups.
• So let’s have each work-item compute a full row of C
C(i,j) = A(i,:) x B(:,j)   (one work-item per row; N = 1024, local dimension 64)

__kernel void mmul(
    const int N,
    __global float *A,
    __global float *B,
    __global float *C)
{
    int j, k;
    int i = get_global_id(0);
    float tmp;
    for (j = 0; j < N; j++) {
        tmp = 0.0f;
        for (k = 0; k < N; k++)
            tmp += A[i*N+k] * B[k*N+j];
        C[i*N+j] = tmp;
    }
}
Matrix multiplication host program (C++ API)
Changes to host program:
1. 1D NDRange set to the number of rows in the C matrix
2. Local Dimension set to 64 so the number of work-groups matches the number of compute units (16 in this case) for our order-1024 matrices

int main(int argc, char *argv[])
{
    std::vector<float> h_A, h_B, h_C; // matrices
    int Mdim, Ndim, Pdim;             // A[N][P], B[P][M], C[N][M]
    int i, err;
    int szA, szB, szC;                // num elements in each matrix
    double start_time, run_time;      // timing data
    cl::Program program;
    ...
    // Set up the buffers, initialize matrices,
    // and write them into global memory
    initmat(Mdim, Ndim, Pdim, h_A, h_B, h_C);
    cl::Buffer d_a(context, h_A.begin(), h_A.end(), true);
    cl::Buffer d_b(context, h_B.begin(), h_B.end(), true);
    cl::Buffer d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(float) * szC);
Case                              | MFLOPS (CPU) | MFLOPS (GPU)
Sequential C (not OpenCL)         |        887.2 |          N/A
C(i,j) per work-item, all global  |      3,926.1 |      3,720.9
C row per work-item, all global   |      3,379.5 |      4,195.8

C(i,j) = A(i,:) x B(:,j)

(*Actually, this is using far more private memory than we'll have and so Awrk[] will be spilled to global memory)
Matrix multiplication performance

Case                                | MFLOPS (CPU) | MFLOPS (GPU)
Sequential C (not OpenCL)           |        887.2 |          N/A
C(i,j) per work-item, all global    |      3,926.1 |      3,720.9
C row per work-item, all global     |      3,379.5 |      4,195.8
C row per work-item, A row private  |      3,385.8 |      8,584.3

C(i,j) = A(i,:) x B(:,j)

Device is Tesla® M2090 GPU from NVIDIA® with a max of 16 compute units, 512 PEs
Device is Intel® Xeon® CPU, E5649 @ 2.53GHz
SYNCHRONIZATION IN OPENCL
Consider N-dimensional domain of work-items
• Global Dimensions:
– 1024x1024 (whole problem space)
• Local Dimensions:
– 64x64 (work-group, executes together)
Synchronization between work-items possible only within work-groups: barriers and memory fences
Cannot synchronize between work-groups within a kernel
• Across work-groups
– No guarantees as to where and when a particular work-group will be executed
relative to another work-group
– Cannot exchange data, or have barrier-like synchronization between two different
work-groups! (Critical issue!)
– Only solution: finish the kernel and start another
Where might we need synchronization?
(Figure: numerical integration over x from 0.0 to 1.0, the example used by the Pi program below.)
Numerical integration source code
The serial Pi program
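A minimal sketch of the usual serial Pi program (it numerically integrates 4/(1+x^2) from 0 to 1; the step count is arbitrary):

#include <stdio.h>

static long num_steps = 100000000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   // midpoint of each strip
        sum += 4.0 / (1.0 + x * x);    // the integrand 4/(1+x^2)
    }
    printf("pi = %.12f\n", step * sum);
    return 0;
}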
HETEROGENEOUS COMPUTING
WITH OPENCL
Running on the CPU and GPU
• Kernels can be run on multiple devices at the same time
• We can exploit many GPUs and the host CPU for computation
• Simply define a context with multiple platforms, devices and queues
• We can even synchronize between queues using Events (see appendix)
• Can have more than one context
(Figure: a single Context containing a GPU and a CPU, each with its own command Queue)
Running on the CPU and GPU
1. Discover all your platforms and devices
– Look at the API for finding out Platform and Device IDs
The steps are the same in C and Python, just the API calls differ as usual
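A sketch of that discovery step in C (error handling omitted; the array sizes are arbitrary):

cl_platform_id platforms[8];
cl_uint num_platforms;
err = clGetPlatformIDs(8, platforms, &num_platforms);

// For each platform, list every device of any type
for (cl_uint p = 0; p < num_platforms; p++) {
    cl_device_id devices[8];
    cl_uint num_devices;
    err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                         8, devices, &num_devices);
    // ... build a context and one command queue per device ...
}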
Exercise 10: Heterogeneous Computing
• Goal:
– To experiment with running kernels on multiple devices
• Procedure:
– Take one of your OpenCL programs
– Investigate the Context constructors to include more than one
device
– Modify the program to run a kernel on multiple devices, each
with different input data
– Split your problem across multiple devices if you have time
– Use the examples from the SDKs to help you
• Expected output:
– Output the results from both devices and see which runs faster
Lecture 9
ENABLING PORTABLE
PERFORMANCE VIA OPENCL
Portable performance in OpenCL
• Portable performance is always a challenge, more
so when OpenCL devices can be so varied (CPUs,
GPUs, …)
Exercise 11: Optimize matrix multiplication
• Goal:
– To understand portable performance in OpenCL
• Procedure:
– Optimize a matrix multiply solution step by step, saving
intermediate versions and tracking performance improvements
– After you’ve tried to optimize the program on your own, study
the blocked solution optimized for an NVIDIA GPU. Apply these
techniques to your own code to further optimize performance
– As a final step, go back and make a single program that is
adaptive so it delivers good results on both a CPU and a GPU
• Expected output:
– A message confirming that the matrix multiplication is correct
– Report the runtime and the MFLOPS
Lecture 10
OPTIMIZING OPENCL
PERFORMANCE
Extrae and Paraver
• From Barcelona Supercomputing Center
– http://www.bsc.es/computer-sciences/performance-tools/trace-generation
– http://www.bsc.es/computer-sciences/performance-tools/paraver
• Create and analyze traces of OpenCL programs
– Also MPI, OpenMP
• Required versions:
– Extrae v2.3.5rc
Extrae and Paraver
1. Extrae instruments your application and
produces “timestamped events of
runtime calls, performance counters and
source code references”
– Allows you to measure the run times of your
API and kernel calls
• Follow the wizard, selecting the compiled binary in the File box
(you do not need to make any code or compiler modifications).
You can leave the other options as the default.
• The binary is then run and profiled and the results displayed.
Exercise 12: Profiling OpenCL programs
• Goal:
– To experiment with profiling tools
• Procedure:
– Take one of your OpenCL programs, such as matrix multiply
– Run the program in the profiler and explore the results
– Modify the program to change the performance in some way
and observe the effect with the profiler
– Repeat with other programs if you have time
• Expected output:
– Timings reported by the host code and via the profiling
interfaces should roughly match
Lecture 11
DEBUGGING OPENCL
Debugging OpenCL
• Parallel programs can be challenging to debug
• Luckily there are some tools to help
• Firstly, if your device can run OpenCL 1.2, you can printf straight from
the kernel.
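For example (a sketch; the kernel name is made up):

__kernel void debug_me(__global const float *data)
{
    if (get_global_id(0) == 0)   // only one work-item prints, to avoid a flood of output
        printf("data[0] = %f\n", data[0]);
}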
A CUDA kernel launch with dynamic shared memory, and the OpenCL equivalent (a __local kernel argument sized with clSetKernelArg):

CUDA:   func<<<num_blocks, num_threads_per_block, shared_mem_size>>>(args);
OpenCL: clSetKernelArg(kernel, 0, sizeof(int)*num_elements, NULL);
Dividing up the work
(Figure: the problem size divided into a grid of blocks / an NDRange of work-groups)

CUDA                   | OpenCL
Thread                 | Work-item
gridDim                | get_num_groups()
blockIdx               | get_group_id()
blockDim               | get_local_size()
threadIdx              | get_local_id()
__threadfence_block()  | mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
No equivalent          | read_mem_fence()
No equivalent          | write_mem_fence()
__threadfence()        | Finish one kernel and start another
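So the usual CUDA global-index calculation maps directly (a sketch):

// CUDA:   int i = blockIdx.x * blockDim.x + threadIdx.x;
// OpenCL:
int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
// ... which is what get_global_id(0) returns (when the global work offset is 0)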
Translation from CUDA to OpenCL

CUDA                        | OpenCL
GPU                         | Device (CPU, GPU etc)
Multiprocessor              | Compute Unit, or CU
Scalar or CUDA core         | Processing Element, or PE
Global or Device Memory     | Global Memory
Shared Memory (per block)   | Local Memory (per work-group)
Local Memory (registers)    | Private Memory
Thread Block                | Work-group
Thread                      | Work-item
Warp                        | No equivalent term (yet)
Grid                        | NDRange
More information
• http://developer.amd.com/Resources/hc/OpenCLZone/programming/pages/portingcudatoopencl.aspx
Exercise 13: Porting CUDA to OpenCL
• Goal:
– To port the provided CUDA/serial C program to
OpenCL
• Procedure:
– Examine the CUDA kernel and identify which parts
need changing
• Change them to the OpenCL equivalents
– Examine the Host code and port the commands to
the OpenCL equivalents
• Expected output:
– The OpenCL and CUDA programs should produce the
same output – check this!
SOME CONCLUDING REMARKS
Conclusion
• OpenCL has widespread industrial support
• OpenCL has the potential to deliver portably performant code; but it has
to be used correctly
• The latest C++ and Python APIs make developing OpenCL programs much
simpler than before
• For the latest news on SPIR and new OpenCL versions see:
– http://www.khronos.org/opencl/
Third party names are the property of their owners.
Resources:
• https://www.khronos.org/opencl/
• The OpenCL specification (surprisingly approachable for a spec!):
  https://www.khronos.org/registry/cl/
• OpenCL Forums:
– Khronos' OpenCL forums are the central place to be:
– http://www.khronos.org/message_boards/forumdisplay.php?f=61
Other OpenCL resources
• CLU: a library of useful C-level OpenCL utilities, such as program initialization, CL kernel code compilation and calling kernels with their arguments (a bit like GLUT!):
  https://github.com/Computing-Language-Utility/CLU
__m128 ramp  = _mm_setr_ps(0.5, 1.5, 2.5, 3.5); // pack 4 floats into a vector register
__m128 vstep = _mm_load1_ps(&step);             // pack step into a vector register
__m128 xvec  = _mm_mul_ps(ramp, vstep);         // multiply corresponding 32-bit
                                                // floats and assign to xvec
Vector intrinsics challenges
• Requires an assembly code style of programming:
– Load into registers
– Operate with register operands to produce values in another vector
register
• Non portable
– Change vector instruction set (even from the same vendor) and code
must be re-written. Compilers might treat them differently too
• Consequences:
– Very few programmers are willing to code with intrinsics
– Most programs only exploit vector instructions that the compiler can
automatically generate – which can be hit or miss
– Most programs grossly under-exploit the available performance.
Solution: a high level portable vector instruction set …
which is precisely what OpenCL provides.
Vector Types
• The OpenCL C kernel programming language
provides a set of vector instructions:
– These are portable between different vector instruction
sets
• These instructions support vector lengths of 2, 4, 8,
and 16 … for example:
– char2, ushort4, int8, float16, double2, …
• Properties of these types include:
– Endian safe
– Aligned at vector length
– Vector operations (elementwise) and built-in functions
Remember, double (and hence vectors
of double) are optional in OpenCL v1.1
Vector Operations
• Vector literal
   int4 vi0 = (int4) -7;            // vi0 = {-7, -7, -7, -7}
   int4 vi1 = (int4)(0, 1, 2, 3);   // vi1 = { 0,  1,  2,  3} (as shown in the figure)
• Vector components
   vi0.lo = vi1.hi;                 // vi0 = { 2,  3, -7, -7}
• Vector ops
   vi0 += vi1;                      // vi0 = { 2,  4, -5, -4}
   vi0 = abs(vi0);                  // vi0 = { 2,  4,  5,  4}
Using vector operations
• You can convert a scalar loop into a vector loop using
the following steps:
– Based on the width of your vector instruction set and your
problem, choose the number of values you can pack into a
vector register (the width):
• E.g. for a 128 bit wide SSE instruction set and float data (32 bit),
you can pack four values (128 bits =4*32 bits) into a vector register
– Unroll the loop to match your width (in our example, 4)
– Set up the loop preamble and postscript. For example, if the
number of loop iterations doesn’t evenly divide the width,
you’ll need to cover the extra iterations in a loop postscript
or pad your vectors in a preamble
– Replace instructions in the body of the loop with their vector instruction counterparts
Vector instructions example
• Scalar loop:
   for (i = 0; i < 34; i++) x[i] = y[i] * y[i];
• Width for a 128-bit SSE is 128/32 = 4
• Unroll the loop, then add postscript and preamble as needed:
   NLP = 34 + 2; x[34] = x[35] = y[34] = y[35] = 0.0f; // preamble to zero pad
   for (i = 0; i < NLP; i = i + 4) {
      x[i]   = y[i]   * y[i];   x[i+1] = y[i+1] * y[i+1];
      x[i+2] = y[i+2] * y[i+2]; x[i+3] = y[i+3] * y[i+3];
   }
• Replace the unrolled loop with the associated vector instructions, as sketched below:
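One possible SSE version of the unrolled loop (a sketch; it assumes x and y are 16-byte aligned and that <xmmintrin.h> is included):

for (i = 0; i < NLP; i += 4) {
    __m128 yv = _mm_load_ps(&y[i]);   // load 4 floats from y
    __m128 xv = _mm_mul_ps(yv, yv);   // 4 multiplies in one instruction
    _mm_store_ps(&x[i], xv);          // store 4 results into x
}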
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel        kernel,
    cl_uint          work_dim,
    const size_t    *global_work_offset,
    const size_t    *global_work_size,
    const size_t    *local_work_size,
    cl_uint          num_events_in_wait_list, // number of events this command waits on before executing
    const cl_event  *event_wait_list,         // the events being waited upon … the command queue and events must share a context
    cl_event        *event)                   // pointer to an event object generated by this command
Event: basic event usage
• Events can be used to impose order constraints on kernel execution.
• Very useful with out-of-order queues.

cl_event k_events[2];

// Enqueue two kernels that expose events
err = clEnqueueNDRangeKernel(commands, kernel1, 1,
        NULL, &global, &local, 0, NULL, &k_events[0]);
err = clEnqueueNDRangeKernel(commands, kernel2, 1,
        NULL, &global, &local, 0, NULL, &k_events[1]);

// This kernel waits to execute until the two previous events complete
err = clEnqueueNDRangeKernel(commands, kernel3, 1,
        NULL, &global, &local, 2, k_events, NULL);
OpenCL synchronization: queues & events
• Events connect command invocations. Can be used to synchronize executions inside out-of-order queues or between queues
• Example: 2 queues with 2 devices
(Figure, two timelines: without an event, Kernel 2 on the second queue starts before the results from Kernel 1 are ready; with an event, Kernel 2 waits for an event from Kernel 1 and does not start until the results are ready.)
Why Events? Won't a barrier do?
• A barrier defines a synchronization point … commands following a barrier wait to execute until all prior enqueued commands complete
   cl_int clEnqueueBarrier(cl_command_queue queue)
• Events provide fine grained control … this can really matter with an out-of-order queue.
• Events work between commands in different queues … as long as they share a context
• Events convey more information than a barrier … provide info on the state of a command, not just whether it's complete or not.
(Figure: a GPU queue and a CPU queue in one Context, linked by an Event)
Barriers between queues: clEnqueueBarrier doesn't work
(Figure: two command queues, each enqueueing its own stream of clEnqueueNDRangeKernel(), clEnqueueWriteBuffer() and clEnqueueReadBuffer() commands. Each queue calls clEnqueueBarrier() in the middle of its stream, but a barrier only orders the commands within its own queue, so the 1st and 2nd Command Queues are not synchronized with each other.)
Barriers between queues: this works!
(Figure: the same two command queues, but at the synchronization point the 1st Command Queue calls clEnqueueMarker(event) and the 2nd Command Queue calls clEnqueueWaitForEvents(event). The marker's event completes when the preceding commands in queue 1 finish, and queue 2 waits on that event before running its remaining commands, so the two queues are synchronized.)
Host generated events influencing execution of
commands: User events
• “user code” running on a host thread can
generate event objects
cl_event clCreateUserEvent(cl_context context, cl_int *errcode_ret)
• Created with value CL_SUBMITTED.
• It’s just another event to enqueued commands.
• Can set the event to one of the legal event
values
cl_int clSetUserEventStatus(cl_event event, cl_int execution_status)
• Example use case: Queue up block of
commands that wait on user input to finalize
state of memory objects before proceeding.
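A sketch of that use case (context, queue, kernel and the global size are assumed to already exist):

cl_event user_evt = clCreateUserEvent(context, &err);

// This kernel will not start until the user event is marked complete
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                             1, &user_evt, NULL);

// ... host code finalizes the state of its memory objects ...

err = clSetUserEventStatus(user_evt, CL_COMPLETE);  // now the kernel may run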
Command generated events influencing
execution of host code
• A thread running on the host can pause
waiting on a list of events to complete. This
can be done with the function:
cl_int clWaitForEvents(
    cl_uint         num_events,   // number of events to wait on
    const cl_event *event_list)   // an array of event objects

cl_int clGetEventProfilingInfo(
    cl_event          event,
    cl_profiling_info param_name,            // profiling data to query (see next slide)
    size_t            param_value_size,      // expected size of the profiling data
    void             *param_value,           // pointer to memory to hold the results
    size_t           *param_value_size_ret)  // actual size of the profiling data returned
cl_profiling_info values
• CL_PROFILING_COMMAND_QUEUED
– the device time in nanoseconds when the command is
enqueued in a command-queue by the host. (cl_ulong)
• CL_PROFILING_COMMAND_SUBMIT
– the device time in nanoseconds when the command is
submitted to compute device. (cl_ulong)
• CL_PROFILING_COMMAND_START
– the device time in nanoseconds when the command starts
execution on the device. (cl_ulong)
• CL_PROFILING_COMMAND_END
– the device time in nanoseconds when the command has
finished execution on the device. (cl_ulong)
Profiling Examples

cl_event prof_event;
cl_command_queue comm;
cl_ulong start_time, end_time;
size_t return_bytes;

comm = clCreateCommandQueue(context, device_id,
           CL_QUEUE_PROFILING_ENABLE, &err);

err = clEnqueueNDRangeKernel(comm, kernel,
           nd, NULL, global, NULL,
           0, NULL, &prof_event);

clFinish(comm);
err = clWaitForEvents(1, &prof_event);

err = clGetEventProfilingInfo(prof_event,
           CL_PROFILING_COMMAND_QUEUED,
           sizeof(cl_ulong), &start_time, &return_bytes);

err = clGetEventProfilingInfo(prof_event,
           CL_PROFILING_COMMAND_END,
           sizeof(cl_ulong), &end_time, &return_bytes);
for (j = 0; j < Mdim; j++) {
    // Start an async. copy for a row of B, returning an event to track progress
    event_t ev_cp = async_work_group_copy(
        (__local float*) Bwrk, (__global float*) B,
        (size_t) Pdim, (event_t) 0);

    // Wait for the async. copy to complete before proceeding
    wait_group_events(1, &ev_cp);

    // Compute an element of C using A from private memory and B from local memory
    for (k = 0, tmp = 0.0; k < Pdim; k++)
        tmp += Awrk[k] * Bwrk[k];
    C[i*Ndim+j] = tmp;
}
Events and the C++ interface
(for profiling)
• Enqueue the kernel with a returned event
Event event =
vadd(
EnqueueArgs(commands,NDRange(count), NDRange(local)),
a_in, b_in, c_out, count);
cl_ulong ev_start_time =
event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong ev_end_time =
event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
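Both values are device timestamps in nanoseconds, so the kernel's run time is simply their difference, e.g.:

double run_time_ms = (ev_end_time - ev_start_time) * 1.0e-6;  // ns -> ms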
Appendix C
PINNED MEMORY
Pinned Memory
• In general, the fewer transfers you can
do between host and device, the better
• But some are unavoidable
• It is possible to speed up these transfers,
by using pinned memory (also called
page-locked memory)
• If supported, can enable much faster host
<-> device communications
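One common way to get (potentially) pinned memory in OpenCL is to let the runtime allocate the host-side buffer and then map it. A sketch (context, queue and size are assumed to exist; whether the memory is actually pinned is implementation-defined):

cl_mem staging = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

float *host_ptr = (float *) clEnqueueMapBuffer(queue, staging, CL_TRUE,
        CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);

// ... fill host_ptr, then unmap before the device uses the buffer ...
err = clEnqueueUnmapMemObject(queue, staging, host_ptr, 0, NULL, NULL);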
Pinned Memory
• A regular enqueueRead/enqueueWrite
command might manage ~6GB/s
• But PCI-E Gen 3.0 can sustain transfer
rates of up to 16GB/s
• So, where has our bandwidth gone?
• The operating system
• Why? Let's consider when memory is
actually allocated…
Malloc Recap
• Consider a laptop which has 16GB of RAM.
• What is the output of the code on the right if run?

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    // 64 billion floats
    size_t len = 64UL * 1024*1024*1024;
    float *buf = malloc(len * sizeof(float)); // malloc and printf assumed from the program output below
    printf("got ptr %p\n", (void *)buf);
    return 0;
}

% ./test
got ptr 0x7f84b0c03350
Malloc Recap
• A non-NULL pointer was returned
• Both OS X and Linux will oversubscribe memory
• When will this memory actually be allocated?
(The code is the same as on the previous slide: malloc space for 64 billion floats and print the returned pointer.)
Malloc Recap
• So what happens here?
• The pointer we got back, when accessed, will trigger a page fault in the kernel.

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    size_t len = 16 * 1024*1024;
    ...
~Vector() // destructor
{
    cout << "vector destructor";
}
int getX() const { return x_; } // access member function
…
};
The keyword "const" can be applied to member functions such as getX() to state that the particular member function will not modify the internal state of the object, i.e. it will not cause any visible effects to someone owning a pointer to the said object. This allows the compiler to report errors if this is not the case, enables better static analysis, and allows uses of the object to be optimized, e.g. promoting it to a register or set of registers.
More information about constructors
• Consider the constructor from the previous slide …
Vector (int x, int y, int z): x_(x), y_(y), z_(z) {}
• C++ member data local to a class (or struct) can be initialized using the notation
  : data_name(initializer_name), ...
• Consider the following two semantically equivalent structs, in which the constructor sets the data member x_ to the input value x (sketched below):
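Presumably along these lines (shown as two alternatives, not meant to coexist in one translation unit):

struct Vector { int x_; Vector(int x) : x_(x) {} };   // initializer list
struct Vector { int x_; Vector(int x) { x_ = x; } };  // assignment in the constructor body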
{
Vector v(10,20,30);
// vector {x_ = 10, y_ = 20 , z_ = 30}
// use v
} // at this point v’s destructor would be called!
struct Functor
{
    int operator() (int x) { return x*x; }
};

// create an object of type Functor
Functor f;            // note: "Functor f();" would declare a function, not an object
int value = f(10);    // call the operator()
Template functions
• Don’t want to write the same function many times for
different types?
• Templates allow functions to be parameterized with a
type(s).
template<typename T>
T add(T x, T y) { return x+y; }
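Usage (a small sketch):

int    i = add(2, 3);      // T deduced as int
double d = add(2.5, 0.5);  // T deduced as double
// add(2, 3.5) would fail to deduce a single T; write add<double>(2, 3.5) instead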
#include <functional>
• We can define a C++ function object (e.g. a functor) and then store it in the templated class std::function

struct Functor
{
    int operator() (int x) { return x*x; }
};

std::function<int (int)> square = Functor();
C++ function template: example 1
The header <functional> just defines the template std::function. This can be used to wrap standard functions or function objects, e.g.:

int foo(int x) { return x; } // standard function
std::function<int (int)> foo_wrapper(foo);

struct Foo // function object
{
    int operator()(int x) { return x; }
};
std::function<int (int)> foo_functor = Foo();