******
OpenMP
******
`OpenMP <https://www.openmp.org>`_ provides a *directive-based*
approach to marking regions of code for parallelism.  It supports
shared-memory parallelism and offloading to accelerators.

Some nice examples are provided in the `OpenMP Application Programming
Interface Examples document <https://www.openmp.org/specifications/>`_.
.. tip::

   The `OpenMP Reference Guide <https://www.openmp.org/resources/refguides/>`_ provides
   a quick overview of the different syntax in OpenMP.
Compiler support
================
In order to build an OpenMP application, you need a compiler that
supports it. Fortunately, most recent compilers support OpenMP. For
g++, the OpenMP standards are supported up through version 5.0.
See this table of `compiler support for OpenMP <https://www.openmp.org/resources/openmp-compilers-tools/>`_.
Threads
=======
In an OpenMP application, threads are spawned as needed.
* When you start the program, there is one thread---the master thread
* When you enter a parallel region, multiple threads run concurrently
This looks like:
.. figure:: 1280px-Fork_join.svg.png
   :align: center

   (A1 / Wikipedia)
Hello, World
============
Here's a simple "Hello, World" in OpenMP. We print out the number of
threads and then enter a parallel region where each thread prints its
id separately:
.. literalinclude:: ../../examples/parallel/openmp/hello.cpp
   :language: c++
   :caption: ``hello.cpp``
When we compile this, we need to tell the compiler to interpret the OpenMP directives:
.. prompt:: bash

   g++ -fopenmp -o hello hello.cpp
A few notes:
* OpenMP directives are specified via ``#pragma omp``
* When we run this, the threads are all printing independent of one
another, so the output is all mixed up. Run it again and you'll see
a different ordering.
* There are a few library functions that we access by including ``omp.h``;
  a sketch using some of them is shown below.
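For reference, here is a minimal sketch of what a program like
``hello.cpp`` might contain, assuming it uses runtime routines such as
``omp_get_max_threads()``, ``omp_get_num_threads()``, and
``omp_get_thread_num()`` (the actual example file may differ):

.. code:: c++

   #include <iostream>
   #include <omp.h>

   int main() {

       // the maximum number of threads a parallel region can use
       std::cout << "maximum threads = " << omp_get_max_threads() << std::endl;

       // each thread executes the body of the parallel region
       #pragma omp parallel
       {
           std::cout << "Hello from thread " << omp_get_thread_num()
                     << " of " << omp_get_num_threads() << std::endl;
       }

   }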
Critical regions
================
We can use a *critical region* in an OpenMP parallel region to force
the threads to operate one at a time. For example, in the above
``hello.cpp``, we can get the prints to be done one at a time as:
.. literalinclude:: ../../examples/parallel/openmp/hello-critical.cpp
   :language: c++
   :caption: ``hello-critical.cpp``
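The key change is to wrap the print in an ``omp critical`` construct.
A minimal sketch of the idea (the actual ``hello-critical.cpp`` may
differ) is:

.. code:: c++

   #include <iostream>
   #include <omp.h>

   int main() {

       #pragma omp parallel
       {
           // only one thread at a time may execute a critical section,
           // so the output lines are no longer interleaved
           #pragma omp critical
           {
               std::cout << "Hello from thread " << omp_get_thread_num()
                         << " of " << omp_get_num_threads() << std::endl;
           }
       }

   }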
Controlling the number of threads
=================================
The easiest way to control the number of threads an OpenMP application
uses is to set the ``OMP_NUM_THREADS`` environment variable. For
instance, you can set it globally in your shell as:
.. prompt:: bash

   export OMP_NUM_THREADS=2
or just for the instance of your application as:
.. prompt:: bash

   OMP_NUM_THREADS=2 ./hello
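You can also query or request the number of threads from inside the
program with the OpenMP runtime routines ``omp_get_max_threads()`` and
``omp_set_num_threads()``.  A small sketch:

.. code:: c++

   #include <iostream>
   #include <omp.h>

   int main() {

       // this reflects OMP_NUM_THREADS if it is set in the environment
       std::cout << "max threads = " << omp_get_max_threads() << std::endl;

       // request a specific number of threads for later parallel regions
       omp_set_num_threads(2);

       #pragma omp parallel
       {
           #pragma omp critical
           std::cout << "hello from thread " << omp_get_thread_num() << std::endl;
       }

   }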
.. tip::

   Your code will still run if you specify more threads than there are
   cores in your machine.  On a Linux machine, you can do:

   .. prompt:: bash

      cat /proc/cpuinfo

   to see the information about your processor and how many cores Linux
   thinks you have.

   Note: modern processors sometimes use hyperthreading, which makes a
   single core look like two to the OS, but OpenMP may not benefit from
   this hardware threading.
Parallelizing Loops
===================
Here's a matrix-vector multiply:
.. literalinclude:: ../../examples/parallel/openmp/matmul.cpp
   :language: c++
   :caption: ``matmul.cpp``
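The essential idea is to put an ``omp parallel for`` directive on the
outer loop so the rows of the product are divided among the threads.  A
minimal sketch (the actual ``matmul.cpp`` may be organized differently):

.. code:: c++

   #include <iostream>
   #include <vector>

   int main() {

       const int N = 1000;

       // matrix A (stored row-major) and vectors x and b
       std::vector<double> A(N * N, 1.0);
       std::vector<double> x(N, 2.0);
       std::vector<double> b(N, 0.0);

       // each row i is independent of the others, so the iterations
       // of the outer loop can be split across the threads
       #pragma omp parallel for
       for (int i = 0; i < N; ++i) {
           for (int j = 0; j < N; ++j) {
               b[i] += A[i * N + j] * x[j];
           }
       }

       std::cout << "b[0] = " << b[0] << std::endl;

   }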
.. warning::

   There is an overhead associated with spawning threads, and some
   regions might not have enough work to offset that overhead.  Some
   experimentation may be necessary with your application.
.. tip::

   We cannot put the ``{`` on the same line as the ``#pragma``, since the
   ``#pragma`` is dealt with by the preprocessor.  So we do:

   .. code:: c++

      #pragma omp parallel
      {
         ...
      }

   and not

   .. code:: c++

      #pragma omp parallel {
         ...
      }
One thing we want is for the performance to scale with the number of
cores---if you double the number of cores, does the code run twice as
fast?
Reductions
==========
Reductions (e.g., summing, min/max) are trickier, since each thread
will be updating its local sum or min/max, but then at the end of the
parallel region, a reduction over the threads needs to be done.
A reduction clause takes the form:
.. code:: c++

   #pragma omp parallel reduction(operator : variable)
Each thread will have its own local copy of ``variable`` and they will
be reduced into a single quantity at the end of the parallel region.
The possible operators are listed here: https://www.openmp.org/spec-html/5.0/openmpsu107.html
and include:
* ``+`` for summation
* ``-`` for subtraction
* ``*`` for multiplication
* ``min`` for finding the global minimum
* ``max`` for finding the global maximum
Here's an example where we construct the sum:
.. math::

   S = \sum_{i = 0}^{N-1} \left [ e^{i \% 5} - 2 e^{i \% 7} \right ]
.. note::

   This will give slightly different answers depending on the number of threads
   because of different roundoff behavior.
.. literalinclude:: ../../examples/parallel/openmp/reduce.cpp
   :language: c++
   :caption: ``reduce.cpp``
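A minimal sketch of how this sum might be computed with a reduction
(the actual ``reduce.cpp`` may differ):

.. code:: c++

   #include <cmath>
   #include <iostream>

   int main() {

       const int N = 10000;

       double sum = 0.0;

       // each thread accumulates into its own private copy of sum;
       // the copies are combined with "+" at the end of the loop
       #pragma omp parallel for reduction(+:sum)
       for (int i = 0; i < N; ++i) {
           sum += std::exp(i % 5) - 2.0 * std::exp(i % 7);
       }

       std::cout << "sum = " << sum << std::endl;

   }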