Comparing C++ Compilers' Parallel-Programming Performance
Many of today's C++ compilers can be used for parallel programming. There are two main ways to program in parallel:
  • Multicore programming
  • Vectorized programming
Multicore programming is just what it sounds like: Your code runs simultaneously on multiple cores. There are different ways to make this happen, whereby you specify directives or use extensions to the language so that a loop can, for example, run its iterations simultaneously on as many cores as it can obtain from the processor.

Vectorized programming takes place on a single core. The processor core has multiple registers, which are storage areas built into the processor. You can store numbers in the registers and perform mathematical operations on them. But these registers are much larger than the numbers you're storing. For example, a single-precision floating-point number is typically 32 bits. (The size can vary, but there is an IEEE standard that most compiler vendors try to adhere to, which states that single-precision numbers get 32 bits.) The register, however, depending on the processor, might be 256 bits; that means you can store eight separate single-precision numbers next to each other within a single register. Then, using a single assembler instruction, you can perform a mathematical operation on all eight numbers together. This is known as "SIMD," or Single Instruction, Multiple Data. For example, you might put the numbers 10, 20, 30, and 40 inside a register, and your operation might be to double each number. With a single operation, then, you would double all four numbers to get 20, 40, 60, and 80.

To accomplish vectorized programming, you carefully code your algorithms so that the code inside the loops can be executed as vectorized statements. That means sticking primarily to mathematical operations that the processor supports at the vectorization level. Then you can let the compiler "autovectorize" whichever loops it sees fit, or you can specify exactly which loops you want vectorized, although the compiler will skip vectorizing any such loops when it just isn't possible. And you can, of course, use both multicore and vectorized programming at the same time.

The compiler can take your C++ loops and create vectorized assembly code for you. But just how good is the vectorized code? And do you have much control over the generated code? First, it's obviously important that you "RTFM" and fully understand the compiler options (especially since the defaults may not be what you want or think you're getting). But even then, do you trust that the compiler is generating the best code for you? That's what I'm going to find out here. I'm going to look at the actual generated assembly code and share it with you so we can see exactly what we're getting.
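As a quick illustration of the kind of code that vectorizes well, consider a loop like the following. This is a sketch of my own, not one of the test functions examined below; the function name is just for illustration. Every iteration does the same arithmetic on an independent element, so the compiler is free to process several elements per SIMD instruction.

#include <cstddef>

// Every iteration doubles one element, independently of the others,
// so the compiler can pack several doubles into one register and
// double them all with a single SIMD instruction.
void double_all(double *x, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        x[i] *= 2.0;
    }
}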

Prior Tests From Other People

We don't want to re-invent the wheel here, so I decided to pick up where some earlier tests left off. An excellent example is the LockLess article on when g++ creates vectorized code, when it doesn't, and how to coax it into doing so. In that article, the author digs into the generated assembly code to see whether vectorization is occurring. One important take-home message from the piece is that, even if you think you're going to get vectorized code, you still might not. There are two things you can do: First, the compilers can actually tell you whether a function is vectorized (it's an option, and you'll want to turn it on). Second, you still might want to look at the assembly code, not only to make sure you're getting vectorized code, but to make sure the code you're getting is actually better than the serial version. So for our tests, I'm going to take several of the author's functions and try them out on the Intel compiler, then compare the resulting code, as well as what we have to do to get vectorized code.

Initial Tests on Intel Compiler

For the first test, I went with the first example presented in the LockLess article, compiled with the Intel compiler:
#include <stdlib.h>
#include <math.h>

#define SIZE      (1L << 16)

void test1(double *a, double *b) {
      int i;
      for (i = 0; i < SIZE; i++) {
            a[i] += b[i];
      }
}

int main() {
      double *a = new double[SIZE];
      double *b = new double[SIZE];
      test1(a, b);
}
After turning on the diagnostics to actually show me what's vectorized,
icc -S -vec-report2 test1.cpp
I can see that the loop was indeed vectorized with this message:
LOOP WAS VECTORIZED
(The -S tells the compiler to generate a file containing the assembly code.) But how good is the vectorization? Since we passed -S, we can look at the resulting assembler code. For space reasons, I won't list the whole thing here, just the code inside the loop:
movsd     (%rsi,%rdx,8), %xmm0
movsd     16(%rsi,%rdx,8), %xmm1
movsd     32(%rsi,%rdx,8), %xmm2
movsd     48(%rsi,%rdx,8), %xmm3
movhpd    8(%rsi,%rdx,8), %xmm0
movhpd    24(%rsi,%rdx,8), %xmm1
movhpd    40(%rsi,%rdx,8), %xmm2
movhpd    56(%rsi,%rdx,8), %xmm3
addpd     (%rdi,%rdx,8), %xmm0
addpd     16(%rdi,%rdx,8), %xmm1
addpd     32(%rdi,%rdx,8), %xmm2
addpd     48(%rdi,%rdx,8), %xmm3
movaps    %xmm0, (%rdi,%rdx,8)
movaps    %xmm1, 16(%rdi,%rdx,8)
movaps    %xmm2, 32(%rdi,%rdx,8)
movaps    %xmm3, 48(%rdi,%rdx,8)
We're working with double-precision numbers here, which are 64 bits. But how big are our registers? Well, we didn't specify a SIMD architecture, so by default we're getting 128-bit registers. The first eight lines of this code are juggling our numbers around, and then there are four actual addition operations, each performed with the ADDPD opcode. There are four of these additions, one each for registers xmm0, xmm1, xmm2, and xmm3. That's because something separate from vectorization is also happening here, something called loop unrolling. I'll discuss that shortly; it's sufficient for now to say it's an optimization that is independent of the vectorization.

Let's see if we can tweak the compiler a bit. I want to target two different generations of SIMD: SSE4.2 and the newer AVX. We can do so using the -x option. Without changing the C++ code, and adding the option -xSSE4.2, we end up with this assembler code:
movups    (%rsi,%rdx,8), %xmm0
movups    16(%rsi,%rdx,8), %xmm1
movups    32(%rsi,%rdx,8), %xmm2
movups    48(%rsi,%rdx,8), %xmm3
addpd     (%rdi,%rdx,8), %xmm0
addpd     16(%rdi,%rdx,8), %xmm1
addpd     32(%rdi,%rdx,8), %xmm2
addpd     48(%rdi,%rdx,8), %xmm3
movups    %xmm0, (%rdi,%rdx,8)
movups    %xmm1, 16(%rdi,%rdx,8)
movups    %xmm2, 32(%rdi,%rdx,8)
movups    %xmm3, 48(%rdi,%rdx,8)
And with the option -xAVX we get this:
vmovupd   (%rsi,%rax,8), %xmm0
vmovupd   32(%rsi,%rax,8), %xmm3
vmovupd   64(%rsi,%rax,8), %xmm6
vmovupd   96(%rsi,%rax,8), %xmm9
vinsertf128 $1, 48(%rsi,%rax,8), %ymm3, %ymm4
vinsertf128 $1, 16(%rsi,%rax,8), %ymm0, %ymm1
vinsertf128 $1, 80(%rsi,%rax,8), %ymm6, %ymm7
vinsertf128 $1, 112(%rsi,%rax,8), %ymm9, %ymm10
vaddpd    (%rdi,%rax,8), %ymm1, %ymm2
vaddpd    32(%rdi,%rax,8), %ymm4, %ymm5
vaddpd    64(%rdi,%rax,8), %ymm7, %ymm8
vaddpd    96(%rdi,%rax,8), %ymm10, %ymm11
vmovupd   %ymm2, (%rdi,%rax,8)
vmovupd   %ymm5, 32(%rdi,%rax,8)
vmovupd   %ymm8, 64(%rdi,%rax,8)
vmovupd   %ymm11, 96(%rdi,%rax,8)
Without spending too much time explaining all this, the AVX instructions are operating with twice the register size (the 256-bit ymm registers), and the opcodes start with a "v" to signify AVX. But in these cases, we're dealing with unaligned data; we can see that with the letter "u" in the move operations, and we can get better performance by aligning the data. The author of the LockLess article found the same to be true with g++. In fact, we can use the same "restrict" keyword, provided we add the restrict compiler option. Also, we can turn up our diagnostics report to let us know whether our data is aligned. By using -vec-report6, we can see status reports such as:
reference a has aligned access
And if we forget to align one of our variables, we'll see in the report:
reference b has unaligned access
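To reproduce these reports, the compile command ends up looking roughly like this. (I'm assuming -restrict is the Intel option that enables the restrict keyword mentioned above; -vec-report6 turns up the diagnostics.)

icc -S -restrict -vec-report6 test1.cpp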
To align the data for Intel, we can use code similar to what the LockLess article used, except with a different intrinsic:
void test1(double * restrict a, double * restrict b) {
      int i;
      __assume_aligned(a, 64);
      __assume_aligned(b, 64);
      for (i = 0; i < SIZE; i++) {
            a[i] += b[i];
      }
}
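One caveat worth flagging (my note, not something from the LockLess article): __assume_aligned only promises the compiler that the pointers are aligned; the memory you pass in actually has to be aligned, or aligned moves like MOVAPS can fault at run time. One way to guarantee that is to allocate with _mm_malloc instead of new, along these lines:

#include <xmmintrin.h>   // brings in _mm_malloc / _mm_free

#define SIZE      (1L << 16)

void test1(double * restrict a, double * restrict b);  // as defined above

int main() {
      // Allocate on a 64-byte boundary so the promise made with
      // __assume_aligned actually holds at run time.
      double *a = (double *) _mm_malloc(SIZE * sizeof(double), 64);
      double *b = (double *) _mm_malloc(SIZE * sizeof(double), 64);
      test1(a, b);
      _mm_free(a);
      _mm_free(b);
}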
And now inspecting the assembler, we see that we're using MOVAPS instead of MOVUPS. The “A” means aligned:
movaps    (%rdi,%rax,8), %xmm0
movaps    16(%rdi,%rax,8), %xmm1
movaps    32(%rdi,%rax,8), %xmm2
movaps    48(%rdi,%rax,8), %xmm3
addpd     (%rsi,%rax,8), %xmm0
addpd     16(%rsi,%rax,8), %xmm1
addpd     32(%rsi,%rax,8), %xmm2
addpd     48(%rsi,%rax,8), %xmm3
movaps    %xmm0, (%rdi,%rax,8)
movaps    %xmm1, 16(%rdi,%rax,8)
movaps    %xmm2, 32(%rdi,%rax,8)
movaps    %xmm3, 48(%rdi,%rax,8)
But then I encountered a surprise. If any readers can provide an explanation here, I welcome it (and I might put in a call to Intel on this one): if I keep the code as-is, with the alignment all set, and then target the AVX, I get the opcodes that start with V—but they have a “U” for unaligned, instead of an “A” for aligned:
vmovupd   (%rdi,%rax,8), %ymm0
vmovupd   32(%rdi,%rax,8), %ymm2
vmovupd   64(%rdi,%rax,8), %ymm4
vmovupd   96(%rdi,%rax,8), %ymm6
vaddpd    (%rsi,%rax,8), %ymm0, %ymm1
vaddpd    32(%rsi,%rax,8), %ymm2, %ymm3
vaddpd    64(%rsi,%rax,8), %ymm4, %ymm5
vaddpd    96(%rsi,%rax,8), %ymm6, %ymm7
vmovupd   %ymm1, (%rdi,%rax,8)
vmovupd   %ymm3, 32(%rdi,%rax,8)
vmovupd   %ymm5, 64(%rdi,%rax,8)
vmovupd   %ymm7, 96(%rdi,%rax,8)
...even though the -vec-report6 stated that the data was aligned. The only reference I can find is a discussion forum on intel.com where an Intel employee states, “The compiler never uses VMOVAPD even though it would be valid.” From the comments in my previous articles, it's clear some of you really know your stuff, so if you know what's up, feel free to chime in.

Loop Unrolling

When we turned up the diagnostic level to 6, we also saw a message:
unroll factor set to 4
This refers to the fact that, in addition to vectorizing our code, the compiler is also using four separate registers to do four sets of additions, which is why we see four ADDPD instructions in our code. This is called loop unrolling, and it's a completely separate issue from the vectorization. I was surprised to see that the Intel compiler did this by default without me asking for it. We can turn off the option with -loopunroll0. Then we see three lines of vectorized addition, which are identical to those in the LockLess report:
movaps    (%rdi,%rax,8), %xmm0
addpd     (%rsi,%rax,8), %xmm0
movaps    %xmm0, (%rdi,%rax,8)
Because this loop unrolling is a separate issue from vectorization, I won't use it to say that Intel's generated vectorized code is somehow “better” because of it. The g++ compiler also supports unrolling under certain conditions.
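To make the idea concrete, here's roughly what an unroll factor of four looks like if you write it out by hand in C++. This is only a sketch of the concept (the function name is mine); the compiler does this at the assembly level and also handles any leftover iterations when the trip count isn't a multiple of four.

void test1_unrolled_by_hand(double * restrict a, double * restrict b) {
      int i;
      // Four additions per trip through the loop. SIZE is a power of
      // two here, so it divides evenly by 4 and no cleanup loop is needed.
      for (i = 0; i < SIZE; i += 4) {
            a[i]     += b[i];
            a[i + 1] += b[i + 1];
            a[i + 2] += b[i + 2];
            a[i + 3] += b[i + 3];
      }
}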

Embedded Loops

The LockLess article tackles the issue of embedded loops, which gave the g++ compiler some troubles. Here's some code similar to theirs:

void test5(double * restrict a, double * restrict b)
{
        int i,j;
        __assume_aligned(a, 64);
        __assume_aligned(b, 64);
        for (j = 0; j < SIZE; j++) {
                for (i = 0; i < SIZE; i++)
                {
                        a[i + j * SIZE] += b[i + j * SIZE];
                }
        }
}
I get a diagnostic report that the loop can't be vectorized. However, the author of the LockLess piece didn't tackle trying to just get the inner loop to vectorize. With the Intel compiler, we can do so with a pragma:

void test1(double * restrict a, double * restrict b)
{
        int i,j;
        __assume_aligned(a, 64);
        __assume_aligned(b, 64);
        for (j = 0; j < SIZE; j++)
        {
#pragma simd
                for (i = 0; i < SIZE; i++)
                {
                        a[i + j * SIZE] += b[i + j * SIZE];
                }
        }
}
This works; the inner loop gets vectorized. The author instead combined the loops into a single loop, like so:
for (i = 0; i < SIZE * SIZE; i++) {
    x[i] += y[i];
}
which in the end is the same as any other loop, just a bigger upper limit. He then declares:
“So, a rule of thumb might be; two loops bad, one loop good.”
That's a common problem in parallel programming, and there are different ways you can tackle it. According to Intel, the default is to try to vectorize the innermost loop; in my test, though, I didn't end up with a vectorized inner loop. The compiler tried, but it decided there was a loop dependency and backed off. Only after adding the pragma did it vectorize. But what about g++? Did the LockLess author miss something? In that article, the author was using g++ 4.7; I'm using 4.8.1. And it turns out I can force the inner loop to vectorize if I add optimization level 3, like so:
g++ -S -O3 gcc5.cpp
(The -S gives me the assembly output in a file.) I can see the vectorized statements in the assembly output:
movhpd  8(%rcx,%rax), %xmm0
addpd   (%r8,%rax), %xmm0
movapd  %xmm0, (%r8,%rax)
And I can also see it in a report if I turn on the diagnostics like so:
g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp
which gives me:
Vectorizing loop at gcc5.cpp:16
(Line 16 is the first line of the inner loop.) So did the 4.7 not allow it? If you're writing code that uses advanced techniques like vectorization, you're generally going to want to make sure you're running the latest version of the compiler. However, it turns out the 4.7 can vectorize the inner loop. I installed the 4.7 compiler and compiled the above with the same options:
g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp
and indeed the inner loop did vectorize:
16: LOOP VECTORIZED.
gcc5.cpp:7: note: vectorized 1 loops in function.
So it appears the inner loop can get vectorized, with just a tiny bit of coaxing. (Incidentally, the assembly code generated by the 4.7 and 4.8.1 compilers is a good bit different in ways aside from the vectorization, which is interesting. I don't have space to put the entire listings here. Perhaps you might want to explore that yourself; or if there's interest here I can do an analysis in a future article to find out why and whether the differences could potentially impact your development projects.)

More Operations

A simple add operation isn't all that useful on its own. The processor's vector instruction set includes a large number of operations from which you can build more sophisticated computations. The LockLess article describes an operation that looks like this:
for (i = 0; i < SIZE; i++) {
    x[i] = ((y[i] > z[i]) ? y[i] : z[i]);
}
Compiling this with the Intel compiler results in the following vectorized assembly:
movaps    X2(,%rax,8), %xmm0
movaps    16+X2(,%rax,8), %xmm1
movaps    32+X2(,%rax,8), %xmm2
movaps    48+X2(,%rax,8), %xmm3
maxpd     X3(,%rax,8), %xmm0
maxpd     16+X3(,%rax,8), %xmm1
maxpd     32+X3(,%rax,8), %xmm2
maxpd     48+X3(,%rax,8), %xmm3
movaps    %xmm0, X(,%rax,8)
movaps    %xmm1, 16+X(,%rax,8)
movaps    %xmm2, 32+X(,%rax,8)
movaps    %xmm3, 48+X(,%rax,8)
Aside from the unrolled loops, this code is basically the same as what g++ produced. The next step is a bit more complex. It looks like a small change, but it's significant. The author of the LockLess article changed the algorithm to include the left item, x[i], inside the operation itself, like so:

x[i] = ((y[i] > z[i]) ? x[i] : z[i]);

With this one, we see something different from the g++ experiments. The LockLess report states that this didn't vectorize with g++. But here in my tests with the Intel compiler, it did just fine. The code is a bit long due to the complexity and the loop unrolling, but it has several operations that use the packed notation, and the diagnostics stated that vectorization took place. As a final test, the LockLess author performed this operation:
double test21(double * restrict a) {
      size_t i;
      double *x = __builtin_assume_aligned(a, 16);
      double y = 0;
      for (i = 0; i < SIZE; i++)
      {
            y += x[i];
      }
      return y;
}
When I see this code, an alarm goes off in my mind. It's actually fine, and it will indeed vectorize. But if we try to go another step and add multicore to it, we're going to have a problem with the y variable and some race conditions. These race conditions will require what's called a reducer. For now, since we're not trying to implement a multicore solution and are only using vectorization on a single core, we're fine; and indeed, this code vectorized just fine as well. What's interesting is that not only did it vectorize, but I saw the same elegant results as with the g++ compiler in the LockLess report. The compiler ultimately did what's called a horizontal operation, using an UNPCKHPD opcode.
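To show what I mean about the reducer, here's a hedged sketch (mine, not from the LockLess article) of how the same sum could be spread across cores with OpenMP. The reduction clause gives each thread its own private copy of y and combines the partial sums at the end, avoiding the race condition; the function name is just for illustration.

#include <cstddef>

// Compile with OpenMP enabled (for example, g++ -fopenmp).
double sum_parallel(const double *x, std::size_t n) {
    double y = 0;
    // Each thread accumulates into a private copy of y; OpenMP adds
    // the partial sums together when the loop finishes.
    #pragma omp parallel for reduction(+:y)
    for (long i = 0; i < (long)n; i++) {
        y += x[i];
    }
    return y;
}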

Conclusions

In general, both compilers give you plenty of options for generating vectorized code. Sometimes you have to do a little work, though, to get the compiler to actually vectorize your code. But it was clear that the Intel compiler was more willing to vectorize the code. The g++ compiler resisted at times, requiring different command-line options and occasionally a bit of reworking of the code. This is actually no surprise, since one might argue that Intel is staying a step ahead with its compilers: Intel can build the compilers for its processors before the processors are even released. The g++ team, however, can only work with what's available to them. (That, and they don't have huge amounts of dollars to dump into it like Intel does.) Either way, the g++ compiler held up well against the Intel compiler.

I was troubled by how different the generated assembly code was between the 4.7 and 4.8.1 compilers, not just in the vectorization but throughout the code. But that's another comparison for another day.

Image: sergign/Shutterstock.com