Speeding Up C++ With OpenMP

Even though C++11 has a new <thread> library, you might not be comfortable using it. But never fear: you can still use threads, and speed up your program, through parallelization APIs such as OpenMP. Parallelization only pays off on a multi-core (or multi-processor) machine, but considering those are pretty much ubiquitous, that shouldn't be an issue. Parallelization of code shouldn't be confused with vectorization, a special case of parallelization (using SIMD instructions) that's been baked into pretty much every CPU for the past decade.

Fortunately, you have a number of choices when it comes to parallelizing. Intel Threading Building Blocks (TBB) is a free, cross-platform template library that's built into Intel's C++ compilers; note, though, that it's only suitable for C++, not C. A second option from Intel, Cilk Plus (don't confuse it with TBB), supports both C and C++ and offers an easier way to achieve parallelism for both data and tasks. It also comes with Intel's C++ compilers, and there's an open-source version that's part of GCC and available in a Clang fork.

OpenMP

Another popular option is OpenMP, an open standard offered as an extension to C, C++ and Fortran. Developers rely on OpenMP as a portable and scalable way to develop parallel applications on a variety of devices, and they like it so much that some version of the API has been in use for more than 15 years. With OpenMP, a master thread spins up a team of worker threads as needed; each thread runs independently. Using OpenMP is pretty straightforward: in Visual Studio, it's a compiler option in the Property Pages under C/C++ > Language (OpenMP Support), or /openmp on the command line. For more information on which compilers support it, check out this list.
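
To get a feel for the fork-join model, here's a minimal sketch (my own example, not part of the sample program discussed below) that prints one line from each thread in a parallel region:

[cpp]
#include <cstdio>
#include <omp.h>

int main() {
    // The master thread forks a team of worker threads for this block,
    // then joins them back together at the closing brace.
    #pragma omp parallel
    {
        std::printf("Hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
[/cpp]

Build it with /openmp under the Microsoft compiler, or with -fopenmp under GCC and Clang; without the flag, the pragma is simply ignored and the program runs single-threaded.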

Sample Program

I created a simple C++ program that initializes two large arrays with random values, and then multiplies each pair of values and stores the result in a third array. The arrays each hold 20 million integers. I used a high-resolution timer to measure how much time it took. To make the timing more accurate, I ran the calculation ten times in a loop, timing each run, and worked out the average time of runs 2-10. (The first run always took a bit longer, possibly due to memory allocation, so I omitted it from the average.) The figures below are the run times before I added OpenMP. I built the program with Visual C++ 2013 from the Visual Studio 2013 Community edition; note that you will need to check that the Auto-Parallelization option (/Qpar) is disabled, or the compiler will try to do the parallelization for you. I tried the various optimization settings and got these average times, in seconds, rounded to two decimal places.

  • Without optimization: 1.57
  • With whole program optimization: 1.53
  • With auto-parallelization: 1.53

(If you want to investigate Microsoft's auto-parallelization and see which loops do or don't get auto-parallelized, add /Qpar-report:2 to the C/C++ command line in the Property Pages. When I compiled with this option, it reported that no loops were auto-parallelized; the thread count in Task Manager always showed as one.)
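
The timings were taken with a simple stopwatch class. The CStopWatch implementation isn't shown in this article, so here is a minimal stand-in sketch using std::chrono that exposes the same interface as the listings below use; the real class could just as easily wrap QueryPerformanceCounter.

[cpp]
#include <chrono>

// Minimal stand-in for the CStopWatch class used in the later listings;
// the original implementation isn't shown, so this std::chrono version
// is only an assumption that matches the same interface.
class CStopWatch {
    std::chrono::steady_clock::time_point start, stop;
public:
    void startTimer() { start = std::chrono::steady_clock::now(); }
    void stopTimer()  { stop  = std::chrono::steady_clock::now(); }
    double getElapsedTime() const {   // elapsed time in seconds
        return std::chrono::duration<double>(stop - start).count();
    }
};
[/cpp]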

Using OpenMP

The bulk of this program’s work happens in two loops, which are prime candidates for parallelizing. OpenMP has five restrictions regarding loops that can be threaded. (When I say “constant” below, I mean either a literal number or a variable that is unchanged by the loop.)

  • The loop variable must be a signed integer; don't use unsigned ints.
  • The loop comparison must be in the form "var op constant," e.g. i < 5. Op can be <, <=, >, or >=.
  • The increment part of the “for” loop must be integer addition or integer subtraction by a constant amount, e.g. i++ or i += 2.
  • The loop variable must increment or decrement on each iteration, in the direction implied by the comparison, e.g. incrementing for < and <=.
  • Any jumps (e.g., goto) must stay within the loop; you can't break out of it early. The only exception is exit(), which terminates the whole application. (A couple of non-conforming loops are sketched after this list.)
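
For example, neither of these loops qualifies (hypothetical fragments; n and found() are not part of the sample program):

[cpp]
// Not parallelizable under these rules: unsigned loop variable.
for (unsigned int i = 0; i < n; i++) { /* ... */ }

// Not parallelizable: jumps out of the loop early.
for (int i = 0; i < n; i++) {
    if (found(i)) break;   // early exit isn't allowed
}
[/cpp]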

It took just two pragmas to speed up this code:

[cpp]
// constants
const int loopsize = 20000000;   // 20 million
const short int loopcount = 10;

// variables
int a[loopsize], b[loopsize], c[loopsize];

void doCalcs()
{
    // fill b and c with random values, in parallel
    #pragma omp parallel for
    for (int i = 0; i < loopsize; i++) {
        b[i] = rand() % loopsize;
        c[i] = rand() % loopsize;
    }

    // multiply each pair, in parallel
    #pragma omp parallel for
    for (int i = 0; i < loopsize; i++) {
        a[i] = b[i] * c[i];
    }
}
[/cpp]

When built and run with OpenMP support, the average execution time dropped from 1.53 to 0.36 seconds, over four times faster. The thread count in Task Manager now shows as eight. It's possible to cap the number of threads used with the line of code below. (With four threads the average time was 0.45; with two it was 0.78.)

[cpp]
omp_set_num_threads(8);
[/cpp]

It's not always so simple to speed things up. What effect do you think adding #pragma omp parallel for to the ten-times loop in the main function would have?

[cpp]
int main(int argc, char* argv[])
{
    omp_set_num_threads(8);
    CStopWatch elapsed;
    elapsed.startTimer();
    float times = 0.0f;
    CStopWatch stopWatch;
    #pragma omp parallel for   // <-- very bad! Remove
    for (int loop = 0; loop < loopcount; loop++) {
        stopWatch.startTimer();
        doCalcs();
        stopWatch.stopTimer();
        if (loop > 0) {   // skip the first run when averaging
            times += (float)stopWatch.getElapsedTime();
            std::cout << stopWatch.getElapsedTime() << std::endl;
        }
    }
    elapsed.stopTimer();
    std::cout << "Average run time = " << times / (loopcount - 1) << std::endl;
    std::cout << "Total time = " << elapsed.getElapsedTime() << std::endl;
    return 0;
}
[/cpp]

It broke things in two ways. First, std::cout isn't thread-safe, so the output from the different threads was garbled together. Worse, it got slower: the run times increased to 2.40 seconds for all but the last two runs, which came in at 1.58.

Conclusion

OpenMP has considerably more to offer than what I've shown here: data can be private to a thread or shared between threads; blocks of code can be run on different threads; and critical sections ensure that only one thread at a time executes a given block. For some reason, Microsoft only supports OpenMP 2.0, even though the current version, already a year old, is 4.0, with 4.1 on the way. This is a shame, as version 2.0 doesn't support the simd clause, which would allow vectorization as well. Nonetheless, you should give OpenMP a shot.
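
As a taste of those features, here's a minimal sketch (my own example, not from the sample program) using shared data and a critical section:

[cpp]
#include <iostream>
#include <omp.h>

int main()
{
    int total = 0;                         // shared between all threads
    #pragma omp parallel shared(total)
    {
        int local = omp_get_thread_num();  // declared inside the region, so private
        // Only one thread at a time may enter a critical section, so the
        // update to 'total' and the output don't interleave.
        #pragma omp critical
        {
            total += local;
            std::cout << "thread " << local << " checked in" << std::endl;
        }
    }
    std::cout << "total = " << total << std::endl;
    return 0;
}
[/cpp]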