The first thing to know is how to generate assembly with GCC. Simply add -S to the gcc or g++ command; the compiler then stops after the compilation stage and writes an assembly (.s) file instead of an object file. I am using Eclipse CDT and have created several build configurations, some of which generate assembly for inspection.
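As a minimal sketch of that workflow, here is a small translation unit to feed to the compiler. The file name dot.cpp and the function name dot8 are hypothetical, chosen only for this illustration; the command lines in the comments use standard GCC options.

```cpp
// dot.cpp (hypothetical file name, used only for this illustration)
//
// Emit assembly instead of an object file, with optimization enabled:
//   g++ -O3 -S dot.cpp -o dot.s
// Optionally annotate the listing with source-level names:
//   g++ -O3 -S -fverbose-asm dot.cpp -o dot.s

// Straightforward 8-element dot product (dot8 is a hypothetical name).
float dot8(const float* a, const float* b)
{
    float out = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        out += a[i] * b[i];
    }
    return out;
}
```

Opening dot.s afterwards shows which instructions the compiler chose for the loop, which is how listings like the ones in this post can be produced.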
I have been experimenting with the Eigen library since it apparently supports ARM NEON. What I found is that it works extremely well with x86 SIMD instructions but not so well with ARM NEON. It seems that well-coded C++ is better for NEON than Eigen...
x86 SSE2 with Eigen
movaps 32(%rdi), %xmm0
movaps 16(%rdi), %xmm1
mulps 32(%rsi), %xmm0
mulps 16(%rsi), %xmm1
addps %xmm0, %xmm1
movaps %xmm1, %xmm0
movhlps %xmm1, %xmm0
addps %xmm1, %xmm0
movaps %xmm0, %xmm1
shufps $1, %xmm0, %xmm1
addss %xmm1, %xmm0
x86 SSE2 without Eigen
movq (%rdi), %rdx
movq (%rsi), %rax
movss (%rdx), %xmm1
mulss (%rax), %xmm1
movss 4(%rdx), %xmm0
mulss 4(%rax), %xmm0
addss .LC0(%rip), %xmm1
addss %xmm0, %xmm1
movss 8(%rdx), %xmm0
mulss 8(%rax), %xmm0
addss %xmm0, %xmm1
movss 12(%rdx), %xmm0
mulss 12(%rax), %xmm0
addss %xmm0, %xmm1
movss 16(%rdx), %xmm0
mulss 16(%rax), %xmm0
addss %xmm0, %xmm1
movss 20(%rdx), %xmm0
mulss 20(%rax), %xmm0
addss %xmm0, %xmm1
ARM NEON-VFPv4 with Eigen
flds s0, .L2
ldr r3, [r1]
ldr r2, [r0]
flds s15, [r3]
flds s14, [r2]
vfma.f32 s0, s14, s15
flds s6, [r2, #4]
flds s7, [r3, #4]
flds s8, [r2, #8]
flds s9, [r3, #8]
flds s10, [r2, #12]
flds s11, [r3, #12]
flds s12, [r2, #16]
flds s13, [r3, #16]
flds s14, [r2, #20]
flds s15, [r3, #20]
vfma.f32 s0, s6, s7
vfma.f32 s0, s8, s9
vfma.f32 s0, s10, s11
vfma.f32 s0, s12, s13
vfma.f32 s0, s14, s15
ARM NEON-VFPv4 without Eigen
ldr r3, [r0]
vmov.f32 q8, #0.0 @ v4sf
ldr r2, [r1]
vld1.32 {q9}, [r3]!
vld1.32 {q10}, [r2]!
vmul.f32 q10, q10, q9
vst1.64 {d20-d21}, [sp:64]
vld1.32 {q9}, [r3]
vld1.32 {q11}, [r2]
vmov q12, q10 @ v4sf
vfma.f32 q12, q11, q9
vadd.f32 d18, d24, d25
vpadd.f32 d16, d18, d18
vmov.32 r3, d16[0]
It took quite some effort to get the non-Eigen ARM code to be better than the Eigen code. The "naive" version with a simple dot-product for-loop (shown below) produced code similar to what Eigen produced. The a and b variables are declared __restrict__ and are pointers to aligned memory.
for (int i = 0; i < 8; i++)
{
    out += a[i] * b[i];
}
The results are not what I would expect. I decided to split the two operations (the multiply and the accumulate) into two separate loops, and I got the NEON version shown above!
float prods[8];
for (int i = 0; i < 8; i++)
{
    prods[i] = a[i] * b[i];
}
for (int i = 0; i < 8; i++)
{
    out += prods[i];
}
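For anyone who wants to try this, here is a self-contained sketch of the two-loop version wrapped in a function. The function name dot8_split, the 16-byte alignment figure, and the use of __builtin_assume_aligned (a GCC/Clang builtin) are my assumptions for the illustration; the original code may have expressed the alignment differently.

```cpp
#include <cassert>
#include <cstdint>

// Two-pass dot product over 8 floats: multiply first, then accumulate.
// dot8_split is a hypothetical name; 16-byte alignment is an assumption.
float dot8_split(const float* __restrict__ a_in, const float* __restrict__ b_in)
{
    // Precondition assumed by this sketch: both inputs are 16-byte aligned.
    assert(reinterpret_cast<std::uintptr_t>(a_in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b_in) % 16 == 0);

    // GCC/Clang builtin that passes the alignment on to the optimizer.
    const float* a = static_cast<const float*>(__builtin_assume_aligned(a_in, 16));
    const float* b = static_cast<const float*>(__builtin_assume_aligned(b_in, 16));

    // First loop: element-wise products only.
    float prods[8];
    for (int i = 0; i < 8; i++)
    {
        prods[i] = a[i] * b[i];
    }

    // Second loop: reduce the products to a single sum.
    float out = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        out += prods[i];
    }
    return out;
}
```

A plausible way to inspect the ARM output would be something like g++ -O3 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -S, though the exact target flags depend on the toolchain and are an assumption here. Whether the compiler emits the same NEON sequence will also vary with the GCC version.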
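I should also add that without -funsafe-math-optimizations, the auto-vectorization doesn't happen. The GCC documentation notes that floating-point operations are not auto-vectorized for NEON unless this flag is given, because NEON does not fully implement IEEE 754 (denormal values are treated as zero). I'm going to keep working on it to see if I can shed a few more instructions, but so far so good!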
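One general reason a relaxed-math flag matters for a reduction like this is that floating-point addition is not associative, so summing the products in a different order (which is what a vectorized reduction does) can change the result. The sketch below illustrates that with two hypothetical helper functions; it is not taken from the code above.

```cpp
// Two groupings of the same three-term sum (function names are hypothetical).
float sum_left(float x, float y, float z)
{
    return (x + y) + z;   // add the first two terms, then the third
}

float sum_right(float x, float y, float z)
{
    return x + (y + z);   // add the last two terms, then the first
}
```

With IEEE single precision and no relaxed-math flags, sum_left(1e8f, -1e8f, 1.0f) gives 1 while sum_right(1e8f, -1e8f, 1.0f) gives 0, because 1.0f is lost when added to -1e8f. A compiler that respects strict semantics therefore cannot reorder the accumulation on its own.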