Thursday, 19 December 2013

FFTW Benchmarks on Cortex-A7

The FFT algorithm has many scientific uses. The most obvious uses are in radio astronomy, for the frequency analysis of signals and is vital to Software Defined Radio (SDR) which is used extensively in the Square Kilometer Array (SKA). In line with the goals of the MAC Project, I am curious about how well an ARM processor (specifically the Cortex-A7) can do FFT - which leads to these benchmarks.

I discovered some existing benchmarks of FFTW done on the Cortex-A8 and A9 by Vesperix here. I used their modified FFTW 3.2.2 for ARM NEON and also ran benchmarks using the latest official version of FFTW: 3.3.3. Both sets of results are presented below with a short discussion afterwards.

I was unable to get the FFTW 3.3.3 NEON version working. I was repeatedly hit by a segmentation fault which I think is due to different memory alignment in the newer NEON and VPPv4 FPU's. I will post these specific benchmarks when the error is resolved.

System Specifications

The tests were run on a Cubieboard2 with the following specifications:
  • Allwinner A20 Dual-Core Cortex-A7 SoC @ ~1GHz
  • VFPv4 and NEONv2 FPU
  • 256 kB L2 Cache
  • 8GB Class 10 MicroSD Card
  • sunxi kernel 3.4.67+
  • Linaro 13.04 (with GCC 4.7.3)

Benchmark Methodology

I am only presenting the results for a complex 1D FFT with powers of two and non-powers of two. These are the types of FFTs that are most useful to radio astronomy since signal phase and amplitude are represented as a complex number. I ran several sets of benchmarks with various optimisations for comparison, each of which I will describe below.

I first tested the Vesperix FFTW 3.2.2 and then the FFTW 3.3.3. In all cases I used the following configure flags for single precision and the only available timer on the ARM processor:

--enable-single --with-slow-timer

I ran non-SIMD (no NEON) test without any extra flags, and NEON SIMD tests with the flag below:


I also tried out the fused multiply-add flag since the Cortex-A7 has this instruction in the VFPv4 FPU but I found that this flag actually caused performance to decrease! A short description of why this is can be found in the FFTW 3.3.3 tests section.


In all cases I modified the configure script to optimise for the CPU with the '-mcpu=cortex-a7' flag. I also modified the configure script to try out different GCC FPU options where appropriate, but in general I am only presenting the fastest results in this post. The options I tried are listed below for reference:


I repeated the tests with NEON on a threaded version of FFTW to see at which point multiple threads (on multiple cores) makes a difference and by how much. To enable the threaded version, FFTW must be recompiled with the flag below. Note that FFTW can be compiled once with this flag and used in both a threaded or unthreaded way.


I plan on running an MPI version with more threads (4 to 16) on our Cubieboard and Wandboard clusters at a later stage.

I used the script provided by Vesperix to automate the benchmarks. For the threaded tests I modified the script to contain the '-onthreads=2'. The number can be adjusted to suit the number of cores available on the system. The modified script is shown below.

for TYPE in 'c'; do
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (2 threads)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench -onthreads=2 $OPTS ${PLACE}${TYPE}${SIZE}
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (1 thread)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench $OPTS ${PLACE}${TYPE}${SIZE}

MFLOPS Result Interpretation:

The result provided by FFTW is 'MFLOPS': this is not true MFLOPS. It is estimated by FFTW based on an assumption of algorithmic complexity for the standard Cooley-Tukey FFT algorithm:

Although not necessarily totally accurate in the classic FLOPS sense, it is calculated the same way in all cases, so it works as a way to compare between runs. For comparison to other algorithms, I would rather use the actual time the algorithm takes to run on a specific FFT size (N).

FFTW 3.2.2 (Vesperix):

Please examine the various graphs below. Clearly, NEON makes quite a large difference and is a 'no-brainer' for any application. I am showing one set of non-power of two benchmarks to illustrate why they should not be used.

I ran a test of the threaded version of FFTW 3.2.2 and the results are promising for a scaled-up system. The Cubieboard2 is only a dual-core system but I plan on running MPI tests with more cores at a future date.

FFTW 3.3.3 (Official):

I was unable to get the NEON version of FFTW 3.3.3 working. I was able to run benchmarks of the scalar version of the code which shows a performance improvement over the 3.2.2 scalar results. I compiled one graph comparing all the different scalar versions, with FMA instructions and without.

Note how the FMA versions have slightly lower performance. In the Benchmark Methodology section I mentioned that the --enable-fma flag actually causes performance to decrease. The reason for this is not intuitive as one would think that a Fused Multiply Add (FMA) instruction would save cycles as it replaces separate Multiply and Add instructions. In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not conducive solving this. This is what happens when you use the NEON FMA:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Notice that we have to use up two move instructions for initially setting t0 and t1. It turns out that in this specific case it's faster to just use Multiplies and Adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version does 2 Moves, 2 FMA's. The optimal version does 1 Multiply and 2 Adds. It's a small difference, one which the compiler may or may not take note of and optimise, but when done a significant number of times it makes a difference.


The results from this set of benchmarks are very similar to those attained by Vesperix on Cortex-A9 boards. The multi-threaded version is also significantly better for larger FFT sizes. The results at different FFT sizes are very dependant on the processor implementation details such as cache sizes and memory access times. With smaller FFT's the overhead associated with calculating the FFT is a large factor and this is clearly visible up to sizes of 128.

The scalar results for FFTW 3.3.3 are better than those from 3.2.2 so it is logical to assume that the newer version's NEON performance will be better as well.

Since FFTW works by creating a 'plan' before actually calculating the FFT, it chooses to not use more than one thread in the multi-threaded version before a certain FFT size. This is clearly visible as it chooses to use multi-threading at greater than size 128. The overhead associated with doing this causes the result to be poor at size 256 and based on the results, multi threading should only be enabled for sizes over 1024.

The power usage of the Cortex-A7 processor is lower than that of the Cortex-A9 and so if a large cluster of these devices is used for a computational task such as radio astronomy, one could speculate that it may be worthwhile to use more Cortex-A7's over fewer Cortex-A9's since the performance is similar. 

Friday, 6 December 2013

Current Measurement Board

Since we are interested in power measurements for the different ARM platforms, I decided to quickly design and build a simple current measurement board that we can connect an oscilloscope to to plot current (and hence calculate power with a corresponding voltage measurement on the second channel).

The concept is based on Ohms law: the voltage across a resistor is equal to the resistance multiplied by the current through it. The board design in the schematic caters for a known 0.01 Ohm resistor with a 1% tolerance and a gain of 100 with a 1500 Hz low pass filter. The gain of 100 results in a voltage output that is proportional to the current: 1 A current gives 1 V output. This is so that we can use the oscilloscopes built in multiplication to see power in real-time. The low pass filter is there so that we don't get too much noise on the current measurement, but still enough response to see a spike when we start benchmarks.

I have posted images of the schematic and photos of the finished board. If you would like the Cadsoft Eagle design files, I'm happy to share them - just put a request in the comments below. Something I should also mention is the op-amp is pretty high end, unnecessarily! To be honest, it is a free sample that I gratefully received from Maxim, so I used it...

Input Current: ~50 mA - 5 A
Output Voltage: ~50 mV - 5 V (dependant on how close to the negative / ground rail the op-amp can go)
Frequency Response: -3 dB @ 1591 Hz
Supply Voltage: 5 V (dependant on the op-amp)
Power Loss in 0.01 Ohm Resistor @ 5 A: 0.25 W (one can use a spreadsheet to compensate for this error)

Wednesday, 4 December 2013

Wandboard PCI-Express Adapter: Preliminary PCB 3D Images

I have been designing an adapter board to connect two Wandboards via their PCI-Express ports. The design will be finished by the end of this week, after which it will be sent in for manufacturing! This post is simply to show off two quick 3D renderings of the boards so far.

We will use this board for testing and benchmarking the potential throughput of the Freescale i.MX6 processor. The gigabit ethernet on the i.MX6 is limited to ~400 Mbps according to the datasheet, but we should be able to attain close to 5 Gbps with extremely low latency using the PCI-Express Gen 2 x1 port!

More details will follow when the board has been manufactured and tests are complete!