Friday, 31 October 2014

Adventures with ARM GCC Auto-Vectorization

I have been experimenting with how well gcc (4.8.2) compiles a simple dot product of two 8-element vectors. I have discovered some interesting things (whether they are bugs or quirks I don't know) and I wanted to share my findings...

The first thing to know is how to generate assembly with gcc. Simply add -S to the gcc or g++ command. I am using Eclipse CDT and I have created several build configurations (some of which generate assembly for inspection).

I have been experimenting with the Eigen library since it apparently supports ARM NEON. What I found is that it works extremely well with x86 SIMD instructions but not so well with ARM NEON. It seems that well-coded C++ is better for NEON than Eigen...

x86 SSE2 with Eigen

movaps      32(%rdi), %xmm0
movaps      16(%rdi), %xmm1
mulps       32(%rsi), %xmm0
mulps       16(%rsi), %xmm1
addps       %xmm0, %xmm1
movaps      %xmm1, %xmm0
movhlps     %xmm1, %xmm0
addps       %xmm1, %xmm0
movaps      %xmm0, %xmm1
shufps      $1, %xmm0, %xmm1
addss       %xmm1, %xmm0

x86 SSE2 without Eigen

movq      (%rdi), %rdx
movq      (%rsi), %rax
movss     (%rdx), %xmm1
mulss     (%rax), %xmm1
movss     4(%rdx), %xmm0
mulss     4(%rax), %xmm0
addss     .LC0(%rip), %xmm1
addss     %xmm0, %xmm1
movss     8(%rdx), %xmm0
mulss     8(%rax), %xmm0
addss     %xmm0, %xmm1
movss     12(%rdx), %xmm0
mulss     12(%rax), %xmm0
addss     %xmm0, %xmm1
movss     16(%rdx), %xmm0
mulss     16(%rax), %xmm0
addss     %xmm0, %xmm1
movss     20(%rdx), %xmm0
mulss     20(%rax), %xmm0
addss     %xmm0, %xmm1

ARM NEON-VFPv4 with Eigen

flds        s0, .L2
ldr         r3, [r1]
ldr         r2, [r0]
flds        s15, [r3]
flds        s14, [r2]
vfma.f32    s0, s14, s15
flds        s6, [r2, #4]
flds        s7, [r3, #4]
flds        s8, [r2, #8]
flds        s9, [r3, #8]
flds        s10, [r2, #12]
flds        s11, [r3, #12]
flds        s12, [r2, #16]
flds        s13, [r3, #16]
flds        s14, [r2, #20]
flds        s15, [r3, #20]
vfma.f32    s0, s6, s7
vfma.f32    s0, s8, s9
vfma.f32    s0, s10, s11
vfma.f32    s0, s12, s13
vfma.f32    s0, s14, s15

ARM NEON-VFPv4 without Eigen

ldr     r3, [r0]
vmov.f32    q8, #0.0  @ v4sf
ldr     r2, [r1]
vld1.32     {q9}, [r3]!
vld1.32     {q10}, [r2]!
vmul.f32    q10, q10, q9
vst1.64     {d20-d21}, [sp:64]
vld1.32     {q9}, [r3]
vld1.32     {q11}, [r2]
vmov        q12, q10  @ v4sf
vfma.f32    q12, q11, q9
vadd.f32    d18, d24, d25
vpadd.f32   d16, d18, d18
vmov.32     r3, d16[0]

It took quite some effort to get the non-Eigen ARM code to be better than the Eigen code. The "naive" version with a simple dot-product for-loop (shown below) was similar to what Eigen produced. The a and b variables are declared __restrict__ and are pointers to aligned memory.

for (int i = 0; i < 8; i++)
{
    out += a[i] * b[i];
}
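
For anyone who wants to reproduce this, here is a minimal sketch of how that loop might sit inside a complete function. The function name dot8_naive, the 16-byte alignment, and the compile flags in the comment are my assumptions for illustration and are not taken from the original project. __builtin_assume_aligned is a GCC builtin that passes the alignment promise on to the optimizer.

```cpp
#include <cassert>

// Hypothetical wrapper around the naive loop (name chosen for illustration).
// One possible way to inspect the generated assembly (flags are an assumption,
// adjust them for your target):
//   g++ -O3 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -S dot8.cpp
float dot8_naive(const float* __restrict__ a, const float* __restrict__ b)
{
    assert(a != nullptr && b != nullptr);

    // Assumption: the caller guarantees 16-byte alignment for both arrays.
    const float* pa = static_cast<const float*>(__builtin_assume_aligned(a, 16));
    const float* pb = static_cast<const float*>(__builtin_assume_aligned(b, 16));

    float out = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        out += pa[i] * pb[i];   // multiply and accumulate in the same loop
    }
    return out;
}
```

The caller has to keep the alignment promise, for example by declaring the arrays with alignas(16) (C++11) or allocating them with an aligned allocator.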

The results are not what I would expect: every multiply-accumulate is done with scalar instructions. I decided to split the multiply and the accumulate into two separate loops, and I got the NEON version shown above!

float prods[8];

for (int i = 0; i < 8; i++)
{
    prods[i] = a[i] * b[i];
}
for (int i = 0; i < 8; i++)
{
    out += prods[i];
}
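
As a self-contained sketch, the two-loop version might be wrapped in a function like the one below. The name dot8_split and the 16-byte alignment are assumptions I have made for illustration; the loop bodies are the ones from the post. Whether the compiler actually emits NEON for it still depends on the compiler version and flags.

```cpp
#include <cassert>

// Hypothetical wrapper around the split-loop version (name chosen for illustration).
float dot8_split(const float* __restrict__ a, const float* __restrict__ b)
{
    assert(a != nullptr && b != nullptr);

    // Assumption: the caller guarantees 16-byte alignment for both arrays.
    const float* pa = static_cast<const float*>(__builtin_assume_aligned(a, 16));
    const float* pb = static_cast<const float*>(__builtin_assume_aligned(b, 16));

    // Temporary storage for the element-wise products, aligned like the inputs.
    alignas(16) float prods[8];

    // Loop 1: element-wise multiply only, with no dependency between iterations.
    for (int i = 0; i < 8; i++)
    {
        prods[i] = pa[i] * pb[i];
    }

    // Loop 2: reduction only, summing the eight products.
    float out = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        out += prods[i];
    }
    return out;
}
```

The first loop has no loop-carried dependency, so it is an easy target for the vectorizer; the second loop is a plain reduction.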

I should also add that without -funsafe-math-optimizations, the auto-vectorization doesn't work. I'm going to keep working on it to see if I can shed a few more instructions, but so far so good!
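
A likely reason for the flag requirement: the GCC documentation notes that NEON does not fully implement IEEE 754 (denormals are treated as zero), so the auto-vectorizer will not generate NEON floating-point code unless -funsafe-math-optimizations is given. Separately, vectorizing a float reduction means adding the terms in a different order, and float addition is not associative. The sketch below illustrates that second point with made-up values; the function names are hypothetical, and the asserted results assume strict single-precision evaluation (no -ffast-math, no x87 extended precision).

```cpp
#include <cassert>

// Sums three floats left to right: (x + y) + z.
float sum_left(float x, float y, float z)  { return (x + y) + z; }

// Same three values, regrouped: x + (y + z).
float sum_right(float x, float y, float z) { return x + (y + z); }

// Checks that regrouping changes the result for these illustrative values.
void check_regrouping()
{
    const float big = 1.0e8f;   // exactly representable; neighbouring floats are 8 apart
    const float small = 1.0f;

    // (big + -big) + small: the large terms cancel first, so small survives.
    assert(sum_left(big, -big, small) == 1.0f);

    // big + (-big + small): small is lost when rounded against -big.
    assert(sum_right(big, -big, small) == 0.0f);
}
```

Because a reordered sum can give a different answer, the compiler only reorders it when it has been told that this is acceptable.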